Abstract
Breast cancer remains a major public health issue worldwide, ranking as the second leading cause of cancer-related deaths among women. Early detection and classification are crucial for improving survival rates, yet they are complicated by imbalanced datasets in microarray gene expression analysis. These imbalances can significantly reduce the predictive power and reliability of traditional classification models, underscoring the need for more sophisticated analytical techniques. This study introduces the SMOTE-ENN-LR method, which combines the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbors (ENN) for noise removal and Logistic Regression (LR) to classify breast cancer from microarray data. SMOTE over-samples the minority cases in the dataset, addressing their underrepresentation. ENN is then applied to clean the data by removing mislabeled instances and noise, which are often prevalent in over-sampled datasets. The resulting cleaned dataset is used to train an LR model, improving its ability to distinguish cancerous (Abnormal) from non-cancerous (Normal) gene expression profiles. Our evaluation shows that the SMOTE-ENN-LR method attained a classification accuracy of 97.14%, outperforming contemporary state-of-the-art methods. This gain in accuracy highlights the potential of combining advanced data preprocessing techniques with robust statistical learning models to tackle the inherent challenges of microarray data analysis. Further, we employ Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) to provide insight into the model’s decision-making process, enhancing the transparency and interpretability of its predictions.
Moreover, the success of the SMOTE-ENN-LR method in this study paves the way for its application in other areas of medical diagnostics where similar data imbalances may reduce the accuracy and effectiveness of disease classification. These results support the effectiveness of the SMOTE-ENN-LR approach in handling imbalanced microarray gene expression data and suggest a promising direction for future research in medical bioinformatics and precision medicine.
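The three stages named in the abstract (over-sample with SMOTE, clean with ENN, fit LR) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the abstract gives no hyperparameters or code, so the neighbour count `k=3`, the majority-vote form of ENN, the gradient-descent settings, the two-class assumption, and every function name below are assumptions chosen for illustration. The two-feature synthetic data merely stands in for gene expression profiles.

```python
import math
import random


def nearest(points, query, k, skip=None):
    """Indices of the k points closest to `query` (Euclidean), ignoring index `skip`."""
    candidates = [i for i in range(len(points)) if i != skip]
    candidates.sort(key=lambda i: math.dist(points[i], query))
    return candidates[:k]


def smote(X, y, minority, k=3, seed=0):
    """Oversample `minority` by interpolating between minority neighbours (two classes assumed)."""
    rng = random.Random(seed)
    rows = [x for x, label in zip(X, y) if label == minority]
    n_new = max(len(X) - 2 * len(rows), 0)  # majority count minus minority count
    X_out, y_out = list(X), list(y)
    for _ in range(n_new):
        i = rng.randrange(len(rows))
        j = rng.choice(nearest(rows, rows[i], k, skip=i))
        gap = rng.random()  # position along the segment between the two rows
        X_out.append([a + gap * (b - a) for a, b in zip(rows[i], rows[j])])
        y_out.append(minority)
    return X_out, y_out


def enn(X, y, k=3):
    """Keep only rows whose label matches the majority of their k nearest neighbours."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        votes = [y[j] for j in nearest(X, x, k, skip=i)]
        if votes.count(label) * 2 > len(votes):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit binary logistic regression by full-batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, target in zip(X, y):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - target
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return class 1 if the linear score is positive, else class 0."""
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


# Hypothetical toy data: 40 majority rows (label 0) and 8 minority rows (label 1).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)] for _ in range(40)]
X += [[data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)] for _ in range(8)]
y = [0] * 40 + [1] * 8

X_bal, y_bal = smote(X, y, minority=1)   # step 1: balance the classes
X_clean, y_clean = enn(X_bal, y_bal)     # step 2: drop rows that disagree with their neighbours
w, b = fit_logistic(X_clean, y_clean)    # step 3: train the classifier on the cleaned data

print("class counts after SMOTE:", y_bal.count(0), y_bal.count(1))
print("rows kept after ENN:", len(X_clean), "of", len(X_bal))
```

In practice a maintained implementation such as the `SMOTEENN` resampler in imbalanced-learn, combined with scikit-learn's `LogisticRegression`, would normally be preferred. Resampling should be applied to the training split only, so that synthetic rows never leak into the evaluation data.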
Download File
Official URL or Download Paper: https://doiserbia.nb.rs/Article.aspx?ID=0350-13022...
Additional Metadata
| Item Type: | Article |
|---|---|
| Subject: | Mathematics (all) |
| Divisions: | Faculty of Computer Science and Information Technology; Faculty of Medicine and Health Science |
| DOI Number: | https://doi.org/10.2298/PIM2532025S |
| Publisher: | Mathematical Institute of the Serbian Academy of Sciences and Arts |
| Keywords: | Breast cancer; Classification; Gene expression; Logistic regression; Machine learning |
| Sustainable Development Goals (SDGs): | SDG 3: Good Health and Well-being, SDG 9: Industry, Innovation and Infrastructure, SDG 10: Reduced Inequalities |
| Depositing User: | Ms. Siti Radziah Mohamed@mahmod |
| Date Deposited: | 29 Apr 2026 09:09 |
| Last Modified: | 29 Apr 2026 09:09 |
| Altmetrics: | http://www.altmetric.com/details.php?domain=psasir.upm.edu.my&doi=10.2298/PIM2532025S |
| URI: | http://psasir.upm.edu.my/id/eprint/125044 |