Abstract: Diabetes is a chronic disease that can lead to severe health complications such as stroke, heart attack, and chronic kidney disease. The main objective of this research was to develop ensemble machine learning (ML) learners for predicting diabetes. Logistic regression (LR), decision tree (DT), random forest (RF), bagging classifier (BC), boosting classifier (AdaBoost), gradient boosting decision tree (GBDT), and support vector machine (SVM) classifiers were used to detect diabetic cases and to determine the most important attributes related to Type 2 diabetes mellitus (T2DM). Performance was evaluated by 5-fold cross-validation, reporting accuracy, sensitivity, precision, area under the curve (AUC), and other indices. The study used the Shahedieh cohort dataset from Yazd province, collected from 2014 to 2016 and comprising 9398 participants. The dataset consisted of 1697 diabetic and 7701 non-diabetic cases with 13 features, and the synthetic minority oversampling technique (SMOTE) was applied to remove the class imbalance. Among the ML methods, LR performed best on the original data, and AdaBoost achieved the highest accuracy and AUC, 86.2% and 94%, respectively. Based on the AdaBoost model, age (years), family history of diabetes, triglycerides, and non-HDL cholesterol were the most important features, in that order. The study indicated that the AdaBoost classifier was the best ML-based learner for diabetes prediction and that family history of diabetes was among the most important predictors.
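The oversampling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' implementation, so this is not their code: the function name `smote`, its parameters, and the toy data are hypothetical, and a real analysis would more likely use a maintained library such as imbalanced-learn. The sketch shows only the core idea of SMOTE, which is to create each synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority-class neighbours. For scale, fully balancing 1697 diabetic against 7701 non-diabetic cases would require 7701 - 1697 = 6004 synthetic samples, assuming a 1:1 target ratio.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Hypothetical helper: create n_new synthetic minority points.

    Each point is interpolated between a randomly chosen minority sample
    and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (not the Shahedieh cohort): 3 minority points vs 8 majority points.
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4]]
majority_count = 8
new_points = smote(minority, n_new=majority_count - len(minority), k=2)
balanced_minority = minority + new_points
print(len(balanced_minority), majority_count)
```

Because every synthetic point lies on a segment between two real minority samples, the method adds variety without simply duplicating rows. Oversampling is normally applied to training folds only, so that the validation folds keep the original class ratio.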
Keywords and phrases: machine learning, AdaBoost, GBDT, logistic regression, DT, random forest.
Received: July 5, 2024; Revised: August 19, 2024; Accepted: September 18, 2024; Published: December 10, 2024
How to cite this article: Elham Khaledi, Farnoosh Ghomi, Azam Ghanei, Abass Meidany and Farimah Shamsi, Machine learning ensemble classifiers to predict Type 2 diabetes in a cohort study, JP Journal of Biostatistics 25(1) (2025), 79-93. https://doi.org/10.17654/0973514325004
This open access article is licensed under a Creative Commons Attribution 4.0 International License.