TY - JOUR
T1 - OUCH
T2 - Oversampling and Undersampling Cannot Help Improve Accuracy in Our Bayesian Classifiers That Predict Preeclampsia
AU - Parrales-Bravo, Franklin
AU - Caicedo-Quiroz, Rosangela
AU - Tolozano-Benitez, Elena
AU - Gómez-Rodríguez, Víctor
AU - Cevallos-Torres, Lorenzo
AU - Charco-Aguirre, Jorge
AU - Vasquez-Cevallos, Leonel
N1 - Publisher Copyright:
© 2024 by the authors.
PY - 2024/11
Y1 - 2024/11
N2 - Unbalanced data can have an impact on the machine learning (ML) algorithms that build predictive models. This manuscript studies the influence of oversampling and undersampling strategies on the learning of the Bayesian classification models that predict the risk of suffering preeclampsia. Given the properties of our dataset, only the oversampling and undersampling methods that operate with numerical and categorical attributes will be taken into consideration. In particular, synthetic minority oversampling techniques for nominal and continuous data (SMOTE-NC), SMOTE—Encoded Nominal and Continuous (SMOTE-ENC), random oversampling examples (ROSE), random undersampling examples (UNDER), and random oversampling techniques (OVER) are considered. According to the results, when balancing the class in the training dataset, the accuracy percentages do not improve. However, in the test dataset, both positive and negative cases of preeclampsia were accurately classified by the models, which were built on a balanced training dataset. In contrast, models built on the imbalanced training dataset were not good at detecting positive cases of preeclampsia. We can conclude that while imbalanced training datasets can be addressed by using oversampling and undersampling techniques before building prediction models, an improvement in model accuracy is not always guaranteed. Despite this, the sensitivity and specificity percentages improve in binary classification problems in most cases, such as the one we are dealing with in this manuscript.
KW - ROSE
KW - SMOTE-ENC
KW - SMOTE-NC
KW - Bayesian network classifiers
KW - class imbalance
KW - oversampling
KW - preeclampsia
KW - undersampling
UR - https://www.scopus.com/pages/publications/85208442966
U2 - 10.3390/math12213351
DO - 10.3390/math12213351
M3 - Article
AN - SCOPUS:85208442966
SN - 2227-7390
VL - 12
JO - Mathematics
JF - Mathematics
IS - 21
M1 - 3351
ER -