Tehran University of Medical Sciences

Science Communicator Platform

Machine Learning–Based Prediction of Breast Cancer in Women: Insights From Feature Selection of Clinical and Lifestyle Data



Authors: Allahqoli L; Behzadi MH; Aghamohammadi SZ; Hakimi S; Salehiniya H; Fallahi A; Rahmani A; Shahabinia Z; Mazidimoradi A; Momenimovahed Z; Ghiyasvand M

Source: International Journal of Breast Cancer. Published: 2026


Abstract

Objective: This study is aimed at developing and evaluating a machine learning–based model for breast cancer classification using integrated clinical, demographic, reproductive, and lifestyle data.

Methods: A retrospective machine learning framework was developed using data from a case–control study conducted in Tehran. The dataset included demographic, clinical, reproductive, lifestyle, and screening-related variables. Data preprocessing was performed within machine learning pipelines to ensure data quality and prevent data leakage. Duplicate records and noninformative variables were removed. Missing values were imputed using median values for numerical variables and the most frequent value for categorical variables. Categorical features were encoded appropriately, and Min–Max normalization was applied where required. Feature selection was conducted using mutual information (MI) and analysis of variance (ANOVA) within a stratified cross-validation framework. The data were split into training (80%) and test (20%) sets. Several supervised learning algorithms, including Gaussian Naive Bayes (GNB), K-nearest neighbors (KNN), decision tree (DT), random forest (RF), support vector machine (SVM), logistic regression (LR), and artificial neural network (ANN), were trained and evaluated. Model performance was assessed using accuracy, precision, recall (sensitivity), F1-score, and receiver operating characteristic–area under the curve (ROC-AUC), with stratified 5-fold cross-validation and final evaluation on an independent test set.

Results: Significant differences were observed between breast cancer patients and healthy controls across multiple demographic and clinical variables. Patients were generally older and more likely to be widowed, belong to higher socioeconomic classes, and be housewives, whereas higher education levels and employment were more frequent among healthy individuals (p < 0.001). Reproductive factors, including age at first marriage and breastfeeding duration, also showed significant differences. Feature selection reduced 414 initial variables to 40 key predictors. The most influential features included genetic factors (BRCA1/2 mutations and family history), reproductive and hormonal characteristics (age at menarche, menopause, and infertility), lifestyle behaviors (dietary patterns and physical activity), anthropometric measures (BMI and weight at age 30), and screening-related variables (mammography, ultrasound, and biopsy). All models demonstrated strong and stable performance with minimal differences between cross-validation and test results, indicating good generalization. RF achieved the highest performance (accuracy: 0.9897, precision: 0.9946, recall: 0.9840, F1-score: 0.9892), followed by SVM and LR, whereas ANN showed the lowest overall performance.

Conclusion: Machine learning models can effectively classify breast cancer using multidimensional patient data. Ensemble methods, particularly RF, demonstrated superior accuracy and robustness, highlighting their ability to capture complex nonlinear relationships. The identified predictors are consistent with established clinical and epidemiological risk factors, supporting the validity of the proposed models. These findings suggest that machine learning approaches hold strong potential for personalized risk assessment and early detection of breast cancer; however, external validation across diverse populations is necessary to confirm generalizability.

Copyright © 2026 Leila Allahqoli et al. International Journal of Breast Cancer published by John Wiley & Sons Ltd.
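
The evaluation metrics named in the abstract are related by simple formulas, so the reported RF figures can be cross-checked: the F1-score is the harmonic mean of precision and recall. The following is a minimal sketch for illustration, not the authors' code. The function names `f1_from_precision_recall` and `metrics_from_counts` are hypothetical, and the confusion-matrix counts in the usage line are invented for demonstration, since the abstract does not report the study's confusion matrix.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall (sensitivity) and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# Cross-check using the RF precision and recall reported in the abstract.
rf_f1 = f1_from_precision_recall(0.9946, 0.9840)
print(f"Recomputed RF F1: {rf_f1:.4f} (reported: 0.9892)")

# Usage with made-up counts (NOT from the study), purely to show the call.
example = metrics_from_counts(tp=8, fp=2, fn=2, tn=8)
print(example)
```

The recomputed F1 agrees with the reported value to within the rounding of the four-decimal precision and recall, which suggests the reported RF metrics are internally consistent.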