Machine Learning Based Early Prediction of Diabetes Mellitus Using Clinical Data: A Comparative Study

Halima Sadia; Shahzadi Saba; Tony T. Williams; Aishath Mala; Azzah Khadim Hussain; Naima Ibrahim Joo

Authors

Halima Sadia, Faculty of Medicine, Université Laval, Canada
Shahzadi Saba, NMC Health Care UAE, Oxford Medical Centre, Abu Dhabi
Tony T. Williams, Department of Health and Human Services, Ashford University - UAGC
Aishath Mala, Faculty of Health Sciences, Villa College, Malé, Maldives
Azzah Khadim Hussain, University of Central Punjab
Naima Ibrahim Joo, Department of Computer Science (Artificial Intelligence), MSCS, National University of Technology, Islamabad

Keywords:

Diabetes Mellitus; Machine Learning; Early Prediction; Comparative Study; Gradient Boosting; Random Forest; Clinical Data; SMOTE; Feature Importance; AUC-ROC

Abstract

Background: Diabetes mellitus is one of the major public health problems of the twenty-first century, affecting more than 537 million adults worldwide and projected to affect more than 780 million by 2045. Early and accurate diagnosis of type 2 diabetes is essential for preventive intervention, patient stratification, and optimization of healthcare resources. Machine learning (ML) can address the challenges of building predictive models from heterogeneous clinical data.

Objective: This study systematically compares and evaluates seven popular machine learning classifiers, namely Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), Gradient Boosting (GB), and Artificial Neural Network (ANN), for early prediction of diabetes mellitus from structured clinical and demographic data.

Methods: The Pima Indians Diabetes Dataset (PIDD) and secondary data from a hospital with a patient count of 1,523 were used. Preprocessing handled missing values, outlier detection, and feature scaling, and class imbalance was corrected using the Synthetic Minority Over-sampling Technique (SMOTE). Models were evaluated using accuracy, sensitivity, specificity, precision, F1 score, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC-ROC). Friedman and Wilcoxon signed-rank tests were used to determine statistical significance.
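
As an illustration of the class-imbalance step, the core idea of SMOTE can be sketched in a few lines: each synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority-class neighbours. This is a minimal pure-Python sketch, not the implementation used in the study; the function name `smote_oversample`, the parameter defaults, and the example feature values are all hypothetical.

```python
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples from a list of minority-class
    feature vectors by SMOTE-style interpolation (illustrative only)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours, excluding base itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sq_dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical scaled (glucose, BMI) values for four minority-class patients.
minority_samples = [[0.8, 0.6], [0.9, 0.7], [0.7, 0.9], [0.85, 0.75]]
new_samples = smote_oversample(minority_samples, n_new=5)
```

Oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the data used for evaluation.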

Results: Gradient Boosting achieved the best classification performance, with an accuracy of 91.4%, AUC-ROC of 0.958, F1 score of 0.913, and MCC of 0.819. Random Forest ranked second with an accuracy of 89.7% and AUC-ROC of 0.943. Logistic Regression was the most interpretable model and performed well with 79.6% accuracy. Feature importance analysis consistently identified plasma glucose concentration, body mass index (BMI), age, and diabetes pedigree function as the most predictive variables.
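
For readers less familiar with the reported measures, the threshold-based ones (accuracy, sensitivity, specificity, precision, F1 score, and MCC) can all be derived from the four cells of a binary confusion matrix. The sketch below shows those standard definitions; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the study's results.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from
    confusion-matrix counts (true/false positives and negatives)."""

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall on the diabetic class
    specificity = ratio(tn, tn + fp)   # recall on the non-diabetic class
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the classes are imbalanced.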

Conclusions: Ensemble methods, especially Gradient Boosting and Random Forest, outperformed the other classifiers in predictive performance for early diagnosis of diabetes and show strong clinical potential. The results highlight the importance of rigorous preprocessing and correction of class imbalance. Future studies should explore federated learning approaches and real-time integration with clinical decision support.


PJMCR