TY - GEN
T1 - Effective diagnosis of heart disease imposed by incomplete data based on fuzzy random forest
AU - Zeinulla, Elzhan
AU - Bekbayeva, Karina
AU - Yazici, Adnan
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/7
Y1 - 2020/7
AB - This study presents data preprocessing and imputation techniques for building a model from medical sensor data. We aim to create a framework for diagnosing heart disease from incomplete and dirty data, which is common in the medical domain. Medical datasets are often small and imbalanced, and they contain many missing, false, or inaccurate values. In this study, we use the synthetic minority oversampling technique (SMOTE) in combination with Tomek links to increase the size of the dataset and eliminate its imbalance. We performed a number of experiments and measurements on the Cleveland dataset and conducted a comparative study of various prediction models against recent algorithms in the literature. To process additional data from Budapest, Zurich and Basel, we apply semi-supervised pseudo-labelling: labels are predicted for the unlabelled records, and the resulting pseudo-labelled records are combined with the labelled data. The same algorithm used for the Cleveland dataset was then applied to the entire dataset. A Fuzzy Random Forest was implemented as the main classifier. The proposed approach achieves a final accuracy of 93.4%, with specificity and sensitivity of 96.92% and 89.99%, respectively, which is superior to previous models reported in the literature.
KW - Data Preparation
KW - Fuzzy Random Forest
KW - Heart Disease
KW - Multiple Imputation by Chained Equations (MICE)
KW - Pseudo-labelling
KW - Semi-Supervised Learning
KW - SMOTE
KW - Tomek
UR - http://www.scopus.com/inward/record.url?scp=85090497312&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090497312&partnerID=8YFLogxK
U2 - 10.1109/FUZZ48607.2020.9177531
DO - 10.1109/FUZZ48607.2020.9177531
M3 - Conference contribution
AN - SCOPUS:85090497312
T3 - IEEE International Conference on Fuzzy Systems
BT - 2020 IEEE International Conference on Fuzzy Systems, FUZZ 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE International Conference on Fuzzy Systems, FUZZ 2020
Y2 - 19 July 2020 through 24 July 2020
ER -