UPM Institutional Repository

A model for enhancing pattern recognition in clinical narrative datasets through text-based feature selection and SHAP technique


Citation

Dalhatu, Sirajo Muhammad and Azmi Murad, Masrah Azrifah (2024) A model for enhancing pattern recognition in clinical narrative datasets through text-based feature selection and SHAP technique. International Journal on Informatics Visualization, 8 (4). pp. 2287-2296. ISSN 2549-9610; eISSN: 2549-9904

Abstract

Clinical narratives contain crucial patient information for predicting cardiac failure. Accurate and timely cardiac failure recognition (CFR) significantly affects patient outcomes but faces challenges such as limited dataset sizes, feature space sparsity, and underutilization of vital sign data. This study addresses these issues by developing a methodology to improve CFR accuracy and interpretability within clinical narratives. Four datasets (the Framingham Heart Study, Heart Disease from Kaggle, Cleveland Heart Disease, and Heart Failure Clinical Records) undergo preprocessing, which includes handling missing values, removing duplicates, scaling, encoding categorical variables, and transforming unstructured data using natural language processing (NLP). Various feature selection methods (Chi-Squared, Forward Selection, L1 Regularization) are used to identify influential features for CFR, and the SHapley Additive exPlanations (SHAP) technique is integrated to improve interpretability. Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF) models are trained and evaluated using accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC). Results indicate that L1 Regularization with LR and Chi-Squared with RF perform best for specific datasets. The final model, which combines all datasets with Forward Selection and RF, achieves high accuracy (91%), precision (87%), recall (97%), F1-score (91%), and AUC-ROC (94%). The study concludes that advanced text-based feature selection and SHAP interpretability significantly enhance CFR model accuracy and transparency, aiding clinical decision-making. Future research should incorporate more diverse datasets, explore advanced NLP techniques, and validate models in various clinical settings to enhance robustness and applicability.
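As an illustration of the text-based Chi-Squared feature selection named in the abstract, the following is a minimal sketch and not the authors' implementation. It scores each term by the textbook Pearson chi-squared statistic on a 2x2 table (term present or absent versus positive or negative label) and keeps the top-ranked terms. The toy notes, the labels, and the function names `chi_squared_2x2`, `score_terms`, and `select_top_k` are all hypothetical; the paper's actual tokenization, datasets, and chi-squared variant are not specified on this page.

```python
from collections import Counter


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 contingency table.

    a: term present, label 1    b: term present, label 0
    c: term absent,  label 1    d: term absent,  label 0
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero marginal means the term or label never varies.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def score_terms(notes):
    """Map each whitespace-delimited term to its chi-squared score."""
    n_pos = sum(1 for _, label in notes if label == 1)
    n_neg = len(notes) - n_pos
    present_pos = Counter()  # notes with label 1 that contain the term
    present_neg = Counter()  # notes with label 0 that contain the term
    for text, label in notes:
        # set() counts a term once per note (presence, not frequency).
        for term in set(text.lower().split()):
            if label == 1:
                present_pos[term] += 1
            else:
                present_neg[term] += 1
    scores = {}
    for term in set(present_pos) | set(present_neg):
        a = present_pos[term]
        b = present_neg[term]
        scores[term] = chi_squared_2x2(a, b, n_pos - a, n_neg - b)
    return scores


def select_top_k(notes, k):
    """Return the k highest-scoring terms; ties break alphabetically."""
    scores = score_terms(notes)
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


# Hypothetical toy notes: (text, label), label 1 = cardiac failure.
TOY_NOTES = [
    ("patient reports dyspnea edema", 1),
    ("dyspnea elevated bnp", 1),
    ("edema fatigue noted", 1),
    ("routine visit no complaints", 0),
    ("patient reports mild headache", 0),
    ("routine follow up visit", 0),
]

if __name__ == "__main__":
    print(select_top_k(TOY_NOTES, 4))
```

In this toy setup, terms that appear equally often in both classes (such as "patient") score zero and are ranked last, while terms concentrated in one class rank first. A production pipeline would more likely rely on an established library implementation and feed the selected features to a classifier such as RF.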


Download File
118176.pdf - Published Version
Available under License Creative Commons Attribution Share Alike.

Additional Metadata

Item Type: Article
Divisions: Faculty of Computer Science and Information Technology
DOI Number: https://doi.org/10.62527/joiv.8.4.3664
Publisher: Politeknik Negeri Padang
Keywords: Cardiac failure recognition; Clinical narratives; Predictive modelling; Shapley additive explanations (SHAP)
Depositing User: Ms. Zaimah Saiful Yazan
Date Deposited: 26 Jun 2025 04:58
Last Modified: 26 Jun 2025 04:58
Altmetrics: http://www.altmetric.com/details.php?domain=psasir.upm.edu.my&doi=10.62527/joiv.8.4.3664
URI: http://psasir.upm.edu.my/id/eprint/118176