Machine Learning Based Early Prediction of Diabetes Mellitus Using Clinical Data: A Comparative Study

Authors

  • Halima Sadia, Faculty of Medicine, Université Laval, Canada
  • Shahzadi Saba, NMC Health Care UAE, Oxford Medical Centre, Abu Dhabi
  • Tony T. Williams, Department of Health and Human Services, Ashford University - UAGC
  • Aishath Mala, Faculty of Health Sciences, Villa College Male, Maldives
  • Azzah Khadim Hussain, University of Central Punjab
  • Naima Ibrahim Joo, Department of Computer Science (Artificial Intelligence), MSCS, National University of Technology, Islamabad

Keywords:

Diabetes Mellitus; Machine Learning; Early Prediction; Comparative Study; Gradient Boosting; Random Forest; Clinical Data; SMOTE; Feature Importance; AUC-ROC

Abstract

Background: Diabetes mellitus is one of the major public health problems of the twenty-first century, affecting more than 537 million adults worldwide and projected to affect more than 780 million by 2045. Early and accurate diagnosis of type 2 diabetes is essential for preventive intervention, patient stratification, and optimization of healthcare resources. Machine learning (ML) can help address the challenges of building predictive models from heterogeneous clinical data.

Objective: This study systematically compares and evaluates seven widely used machine learning classifiers, namely Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), Gradient Boosting (GB), and Artificial Neural Network (ANN), for early prediction of diabetes mellitus using structured clinical and demographic data.

Methods: The Pima Indians Diabetes Dataset (PIDD) and a secondary hospital dataset of 1,523 patients were used. Preprocessing addressed missing values, outliers, and feature scaling, and class imbalance was corrected using the Synthetic Minority Over-sampling Technique (SMOTE). Models were evaluated on accuracy, sensitivity, specificity, precision, F1 score, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC-ROC). Friedman and Wilcoxon signed-rank tests were used to assess statistical significance.

Results: Gradient Boosting achieved the best classification performance, with an accuracy of 91.4%, AUC-ROC of 0.958, F1 score of 0.913, and MCC of 0.819. Random Forest ranked second, with an accuracy of 89.7% and AUC-ROC of 0.943. Logistic Regression, the most interpretable model, still performed well, with 79.6% accuracy. Feature importance analysis consistently identified plasma glucose concentration, body mass index (BMI), age, and diabetes pedigree function as the most predictive variables.

Conclusions: Ensemble methods, especially Gradient Boosting and Random Forest, outperformed the other classifiers in predicting early-stage diabetes and show strong clinical potential. The results highlight the importance of rigorous preprocessing and correction of class imbalance. Future studies should explore federated learning approaches and real-time integration with clinical decision support systems.

Published

2026-06-30

How to Cite

Machine Learning Based Early Prediction of Diabetes Mellitus Using Clinical Data: A Comparative Study. (2026). Pakistan Journal of Medical & Cardiological Review, 5(2), 4780-4793. https://pakjmcr.com/index.php/1/article/view/1405