SMOTE-ENN-LR: leveraging machine learning for breast cancer classification in microarray gene expression with explainable AI

Citation

Abdul Aziz, Md Faisal and Nazri, Azree and Evamoni, Fatematuz Zuhura and Yaakob, Razali and Mohd Aris, Teh Noranis and Sekawi, Zamberi and Mahmud, Tanjim and Agbolade, Olalekan and Syed, Wajid and Al Arifi, Mohamed N. (2025) SMOTE-ENN-LR: leveraging machine learning for breast cancer classification in microarray gene expression with explainable AI. Publications de l'Institut Mathematique, 118 (132). pp. 190-206. ISSN 0350-1302

Abstract

Breast cancer continues to be a major public health issue worldwide, ranking as the second leading cause of cancer-related deaths among women. Effective early detection and classification are crucial for improving survival rates, yet they are complicated by the challenges posed by imbalanced datasets in microarray gene expression analysis. These imbalances can significantly affect the predictive power and reliability of traditional classification models, underscoring the need for more sophisticated analytical techniques. This study introduces the SMOTE-ENN-LR method, which combines the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbors (ENN) for noise removal and Logistic Regression (LR) to classify breast cancer from microarray data. SMOTE over-samples the minority cases in the dataset, thereby addressing the issue of underrepresentation. ENN is then employed to clean the data by removing mislabeled instances and noise, which are often prevalent in over-sampled datasets. The cleaned dataset is used to train an LR model, optimizing its ability to discern between cancerous (Abnormal) and non-cancerous (Normal) gene expression profiles. Our evaluation shows that the SMOTE-ENN-LR method attained a classification accuracy of 97.14%, outperforming contemporary state-of-the-art methods. This gain in accuracy highlights the potential of combining advanced data preprocessing techniques with robust statistical learning models to tackle the inherent challenges of microarray data analysis. Further, we employ Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) to offer insight into our model’s decision-making process, enhancing the transparency and interpretability of its predictions.
Moreover, the success of the SMOTE-ENN-LR method in this study paves the way for its application in other areas of medical diagnostics where similar data imbalances may impair the accuracy and effectiveness of disease classification. These results substantiate the effectiveness of the SMOTE-ENN-LR approach in managing the complexities of imbalanced microarray gene expression data and suggest a promising path for future research in medical bioinformatics and precision medicine.
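
The three stages described in the abstract (over-sample the minority class, edit out noisy samples, then fit a logistic regression) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the two-feature toy data, the class labels 0 and 1, the neighbour count `k=3`, the majority-vote editing rule, the learning rate, and every function name (`smote`, `enn`, `fit_lr`, `predict`) are assumptions made for illustration only. The LIME and SHAP steps are not covered.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(i, X, k, pool):
    """Indices of the k samples in `pool` closest to X[i], excluding i itself."""
    return sorted((j for j in pool if j != i), key=lambda j: dist(X[i], X[j]))[:k]

def smote(X, y, minority, k=3, rng=None):
    """Add synthetic minority samples until both classes have equal size."""
    rng = rng or random.Random(0)
    X, y = [list(row) for row in X], list(y)
    mino = [i for i, label in enumerate(y) if label == minority]
    for _ in range(max(len(y) - 2 * len(mino), 0)):
        i = rng.choice(mino)
        j = rng.choice(nearest(i, X, k, mino))  # an original minority neighbour of i
        gap = rng.random()                      # position on the segment from i to j
        X.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y.append(minority)
    return X, y

def enn(X, y, k=3):
    """Keep only samples whose label matches the majority of their k neighbours."""
    everyone = range(len(X))
    keep = []
    for i in everyone:
        votes = [y[j] for j in nearest(i, X, k, everyone)]
        if votes.count(y[i]) * 2 > len(votes):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_lr(X, y, lr=0.1, epochs=1000):
    """Logistic regression fitted by batch gradient descent (assumed settings)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Toy imbalanced data (hypothetical): 40 majority samples vs 8 minority samples.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]   # label 0
X += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(8)]   # label 1
y = [0] * 40 + [1] * 8

X_s, y_s = smote(X, y, minority=1, rng=rng)   # step 1: balance the classes
X_c, y_c = enn(X_s, y_s)                      # step 2: remove noisy samples
model = fit_lr(X_c, y_c)                      # step 3: train the classifier

# Training-set accuracy only: a sanity check, not a held-out evaluation.
acc = sum(predict(model, x) == t for x, t in zip(X_c, y_c)) / len(y_c)
print(len(y), len(y_s), len(y_c), round(acc, 3))
```

In practice the resampling would be applied to the training split only, so that synthetic samples never leak into the data used for evaluation. Libraries such as imbalanced-learn provide a combined `SMOTEENN` resampler that would normally replace the hand-written `smote` and `enn` functions above.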

Official URL or Download Paper: https://doiserbia.nb.rs/Article.aspx?ID=0350-13022...

Additional Metadata

Item Type:	Article
Subject:	Mathematics (all)
Divisions:	Faculty of Computer Science and Information Technology; Faculty of Medicine and Health Science
DOI Number:	https://doi.org/10.2298/PIM2532025S
Publisher:	Mathematical Institute of the Serbian Academy of Sciences and Arts
Keywords:	Breast cancer; Classification; Gene expression; Logistic regression; Machine learning
Sustainable Development Goals (SDGs):	SDG 3: Good Health and Well-being, SDG 9: Industry, Innovation and Infrastructure, SDG 10: Reduced Inequalities
Depositing User:	Ms. Siti Radziah Mohamed@mahmod
Date Deposited:	29 Apr 2026 09:09
Last Modified:	29 Apr 2026 09:09
Altmetrics:	http://www.altmetric.com/details.php?domain=psasir.upm.edu.my&doi=10.2298/PIM2532025S
URI:	http://psasir.upm.edu.my/id/eprint/125044