SMOTE-ENN-LR: LEVERAGING MACHINE LEARNING FOR BREAST CANCER CLASSIFICATION IN MICROARRAY GENE EXPRESSION WITH EXPLAINABLE AI

Authors

  • Md Faisal Bin Abdul Aziz Department of Computer Science, Universiti Putra Malaysia, Serdang, Malaysia
  • Azree Nazri Department of Computer Science, Universiti Putra Malaysia, Serdang, Malaysia
  • Fatematuz Zuhura Evamoni Dept. of Biotechnology and Genetic Engineering, Noakhali Science and Technology University, Bangladesh
  • Razali Yaakob Department of Computer Science, Universiti Putra Malaysia, Serdang, Malaysia
  • Teh Noranis Mohd Aris Department of Computer Science, Universiti Putra Malaysia, Serdang, Malaysia
  • Zamberi Sekawi Department of Medical Microbiology, Universiti Putra Malaysia, Serdang, Malaysia
  • Tanjim Mahmud Department of Computer Science and Engineering, Rangamati Science and Technology University, Bangladesh
  • Olalekan Agbolade Department of Computer Science, Universiti Putra Malaysia, Serdang, Malaysia
  • Wajid Syed Department of Clinical Pharmacy, College of Pharmacy, King Saud University, Saudi Arabia
  • Mohamed N Al Arifi Department of Clinical Pharmacy, College of Pharmacy, King Saud University, Saudi Arabia

DOI:

https://doi.org/10.22452/mjcs.vol38no2.4

Keywords:

Breast cancer, Gene expression, Machine learning, Logistic Regression, Classification, Explainable AI

Abstract

Breast cancer continues to be a major public health issue worldwide, ranking as the second leading cause of cancer-related deaths among women. Effective early detection and classification are crucial for improving survival rates, yet they are complicated by the class imbalance typical of microarray gene expression datasets. This imbalance can significantly reduce the predictive power and reliability of traditional classification models, underscoring the need for more sophisticated analytical techniques. This study introduces the SMOTE-ENN-LR method, which combines the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbors (ENN) for noise removal and Logistic Regression (LR) to classify breast cancer from microarray data. SMOTE over-samples the minority class in the dataset, addressing its underrepresentation. ENN then cleans the data by removing mislabeled instances and noise, which are often prevalent in over-sampled datasets. The cleaned, rebalanced dataset is used to train an LR model that distinguishes cancerous (Abnormal) from non-cancerous (Normal) gene expression profiles. Our evaluation shows that the SMOTE-ENN-LR method attained a classification accuracy of 97.14%, outperforming contemporary state-of-the-art methods. This improvement in accuracy highlights the potential of combining data preprocessing techniques with robust statistical learning models to tackle the inherent challenges of microarray data analysis. Further, we employ Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) to provide insight into our model's decision-making process, enhancing the transparency and interpretability of its predictions.
Moreover, the success of the SMOTE-ENN-LR method in this study paves the way for its application in other areas of medical diagnostics where similar data imbalances may impair the accuracy and effectiveness of disease classification. These results substantiate the effectiveness of the SMOTE-ENN-LR approach in managing the complexities of imbalanced microarray gene expression data and suggest a promising direction for future research in medical bioinformatics and precision medicine.
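
The three stages named in the abstract (SMOTE over-sampling, ENN cleaning, LR training) can be illustrated with a minimal, standard-library-only Python sketch. This is not the authors' implementation: the two-feature synthetic data, the function names, the neighbour count k=3, the majority-vote ENN rule, and the gradient-descent settings are all assumptions chosen for illustration. In practice, a library such as imbalanced-learn (which provides a SMOTEENN resampler) would normally be used on the real expression matrix.

```python
import math
import random

def nearest(i, X, k, candidates):
    """Return the k indices in `candidates` (excluding i) closest to X[i]."""
    others = [j for j in candidates if j != i]
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]

def smote(X, y, minority, k=3, rng=None):
    """Add interpolated minority samples until the two classes are the same size."""
    rng = rng or random.Random(0)
    X, y = [list(x) for x in X], list(y)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    # Number of synthetic samples = majority count minus minority count.
    for _ in range(len(y) - 2 * len(min_idx)):
        i = rng.choice(min_idx)
        j = rng.choice(nearest(i, X, k, min_idx))
        gap = rng.random()  # position on the segment between the two samples
        X.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y.append(minority)
    return X, y

def enn(X, y, k=3):
    """Keep only samples whose label matches the majority of their k neighbours."""
    everyone = range(len(X))
    keep = []
    for i in everyone:
        votes = [y[j] for j in nearest(i, X, k, everyone)]
        if votes.count(y[i]) * 2 > len(votes):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression with plain batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical imbalanced toy data: 40 samples of class 0, 8 of class 1.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(8)]
y = [0] * 40 + [1] * 8

X_bal, y_bal = smote(X, y, minority=1, rng=rng)   # step 1: over-sample
X_clean, y_clean = enn(X_bal, y_bal)              # step 2: remove noisy samples
w, b = fit_logistic(X_clean, y_clean)             # step 3: train the classifier

# Accuracy on the original toy samples (a sanity check, not a held-out evaluation).
accuracy = sum(predict(w, b, x) == t for x, t in zip(X, y)) / len(y)
print(len(y), len(y_bal), len(y_clean), round(accuracy, 3))
```

On real data, resampling should be applied to the training split only, so that synthetic samples never leak into the test set used to report accuracy.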

Published

2025-06-30

How to Cite

Abdul Aziz, M. F. B., Nazri, A., Evamoni, F. Z., Yaakob, R., Aris, T. N. M., Sekawi, Z., Mahmud, T., Agbolade, O., Syed, W., & Arifi, M. N. A. (2025). SMOTE-ENN-LR: LEVERAGING MACHINE LEARNING FOR BREAST CANCER CLASSIFICATION IN MICROARRAY GENE EXPRESSION WITH EXPLAINABLE AI. Malaysian Journal of Computer Science, 38(2), 190–206. https://doi.org/10.22452/mjcs.vol38no2.4
