Machine Learning-Based Diabetes Classification Using Vital Signs and Clinical Information from the MIMIC-IV Dataset

Authors

  • Huy Huynh, Saigon University, Viet Nam
  • Thanh Cao, Saigon University, Viet Nam
  • Hai Tran, Nguyen Tat Thanh University, Viet Nam

DOI:

https://doi.org/10.64539/sjcs.v2i2.2026.439

Keywords:

Diabetes Classification, MIMIC-IV, Natural Language Processing, Ensemble Learning, Clinical Decision Support, Machine Learning, Vital Signs

Abstract

Diagnosing diabetes from clinical data is increasingly important as the number of people with diabetes grows worldwide. This study uses machine learning models to classify diabetes from a variety of clinical data. The MIMIC-IV dataset was used, which contains both structured and unstructured data: the structured data include vital signs, demographics, and laboratory tests, and the unstructured data include medical notes, chief complaints, and medication lists. The models tested included Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machine, and XGBoost. Model performance was measured with Accuracy, Precision, Recall, F1-score, and AUC-ROC. Adding the unstructured text data to the experiments substantially improved performance: accuracy increased from approximately 68% to as much as 87% across models. The best-performing models achieved AUC-ROC values above 0.95, with Random Forest and XGBoost performing best. These results indicate that semantic mining of clinical notes is a key part of making medical decision support systems more reliable.
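For readers unfamiliar with the five evaluation metrics named in the abstract, the following is a minimal sketch of how they can be computed for a binary classifier. It is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and the AUC-ROC is computed with the pairwise (Mann-Whitney) formulation rather than any particular library routine.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def auc_roc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Compute accuracy, precision, recall, F1-score, and AUC-ROC.

    The threshold (assumed 0.5 here) turns scores into hard predictions;
    AUC-ROC is threshold-free and uses the raw scores.
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc_roc": auc_roc(y_true, y_score),
    }


# Hypothetical toy data (not from MIMIC-IV): 1 = diabetes, 0 = no diabetes.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.7, 0.1]

metrics = classification_metrics(y_true, y_score)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On this toy data the threshold-based metrics and the AUC-ROC can differ noticeably, which is one reason a study may report both: accuracy depends on the chosen cut-off, while AUC-ROC summarizes ranking quality across all cut-offs.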

Published

2026-03-29

How to Cite

Huynh, H., Cao, T., & Tran, H. (2026). Machine Learning-Based Diabetes Classification Using Vital Signs and Clinical Information from the MIMIC-IV Dataset. Scientific Journal of Computer Science, 2(2), 148–159. https://doi.org/10.64539/sjcs.v2i2.2026.439
