Tehran University of Medical Sciences

Science Communicator Platform

Predicting Zygosity Based on Physical Similarity of Twin Pairs With the Aid of Machine Learning Methods



Authors: Abtahi H; Edalatifard M; Gholamzadeh M; Khakvatan E

Source: BMC Medical Informatics and Decision Making. Published: 2026


Abstract

Background: Distinguishing identical from non-identical twins is important in twin research, but genetic testing is costly. Estimating zygosity from self-reported measures could therefore substantially reduce the cost of twin studies. This study develops a machine learning framework to predict zygosity from physical similarity features, offering a scalable alternative.
Methods: Data were retrieved from the Iranian School-aged Twin Registry for 2018 to 2021. After preprocessing and cleaning the raw data, class imbalance (60:40 ratio) was addressed using the Synthetic Minority Over-sampling Technique (SMOTE). Eight machine learning (ML) algorithms were developed to predict zygosity: K-nearest neighbors (KNN), support vector machine (SVM), logistic regression (LR), random forest (RF), decision tree (DT), and three boosting techniques (Gradient Boosting Classifier, XGBoost, AdaBoost Classifier). Models were trained on an 80:20 train-test split. Hyperparameters were optimized with GridSearchCV, and performance was evaluated using multiple metrics. Finally, the most influential features were identified using SHAP (SHapley Additive exPlanations).
Results: Using data from 5,077 Iranian twin pairs, machine learning models were developed to predict zygosity (monozygotic vs. dizygotic) from 11 physical similarity traits. Class imbalance was addressed with SMOTE oversampling. Among the eight algorithms evaluated, XGBoost performed best after optimization, achieving 85.57% accuracy and an F1-score of 86.72%. XGBoost, logistic regression, and Gradient Boosting performed comparably, with no statistically significant differences; XGBoost was selected for its marginal lead and further analyzed using SHAP to interpret the importance of each physical trait in zygosity prediction.
Conclusion: This study establishes XGBoost with SMOTE oversampling as an optimal framework for predicting twin zygosity from physical similarity questionnaires. In future work, given sufficient funding for genetic zygosity testing, the model will be externally validated and its predictions compared with genetic test results. © The Author(s) 2025.
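The oversampling step named in the Methods can be illustrated with a minimal pure-Python sketch of the SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. This is not the authors' code. The toy data, the function names `smote`, `make_rows`, and `split`, the choice of k = 5, and the decision to oversample only the training portion are all assumptions made for illustration; the abstract does not say which class is the minority or in what order splitting and oversampling were done.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def make_rows(n, centre, rng, n_features=11):
    """Hypothetical toy rows: 11 numeric similarity features per twin pair."""
    return [tuple(rng.gauss(centre, 1.0) for _ in range(n_features)) for _ in range(n)]


def split(rows):
    """80:20 split of one class, using integer arithmetic for the cut point."""
    cut = len(rows) * 8 // 10
    return rows[:cut], rows[cut:]


rng = random.Random(42)
majority = make_rows(60, 0.0, rng)  # toy 60:40 imbalance, as in the abstract
minority = make_rows(40, 1.5, rng)

maj_train, maj_test = split(majority)
min_train, min_test = split(minority)

# Assumption: oversample the training minority only, leaving the test set untouched.
extra = smote(min_train, n_new=len(maj_train) - len(min_train))
balanced_minority = min_train + extra

print(len(maj_train), len(balanced_minority))  # 48 48
```

In practice a library implementation such as the one in imbalanced-learn would normally be used instead of hand-written code; the sketch only shows what the interpolation does to the class counts.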