UPM Institutional Repository

Impact of feature set size on the performance of machine learning models in Cross-Project Defect Prediction (CPDP)


Citation

Bala, Yahaya Zakariyau and Abdul Samat, Pathiah and Sharif, Khaironi Yatim and Manshor, Noridayu (2025) Impact of feature set size on the performance of machine learning models in Cross-Project Defect Prediction (CPDP). International Journal on Advanced Science, Engineering and Information Technology, 15 (4). pp. 1353-1360. ISSN 2088-5334; eISSN: 2460-6952

Abstract

Software defect prediction is a vital area in software engineering that helps developers detect potential faults before software is deployed. Cross-Project Defect Prediction (CPDP) is particularly valuable, as it enables the use of defect data from one project to predict errors in another, making it beneficial in cases where project-specific defect data is insufficient. However, the effectiveness of CPDP largely depends on how well the machine learning models are trained, and a key factor influencing their performance is the size of the feature set used. This study focuses on evaluating the impact of feature set size on the performance of two widely used machine learning models, Random Forest (RF) and Support Vector Machine (SVM), in the context of CPDP. We used defect datasets from the AEEEM repository, which consists of multiple real-world software projects. An outlier detection technique was applied to select the number of features in the training and testing data, ensuring a systematic analysis of their impact on model performance. The F1-score was used as the primary evaluation metric, as it provides a balance between precision and recall, making it a reliable measure of defect prediction accuracy. Our findings suggest that the size of the feature set plays a crucial role in determining the effectiveness of both RF and SVM models. Too many features introduce noise, reducing predictive accuracy, while too few cause underfitting, leading to the missed detection of defect patterns. Identifying an optimal feature set size improves model performance, providing practical insights for enhancing CPDP. Optimizing feature selection can lead to more accurate predictions, thereby aiding software maintenance and enhancing overall software quality.
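
The abstract names the F1-score as the primary metric because it balances precision and recall. As a minimal illustration (not the authors' evaluation code), the sketch below computes all three values for the defective class in plain Python. The function name `f1_score` and the label lists are hypothetical and are not taken from the AEEEM data.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the positive (defective) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # defects caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarms
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # defects missed

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for eight modules: 1 = defective, 0 = clean.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f1 = f1_score(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a model scores well only when both are high. That property matters for defect data, where defective modules are typically a minority and plain accuracy can be misleading.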


Download File

125328.pdf - Published Version
Available under License Creative Commons Attribution Share Alike.


Additional Metadata

Item Type: Article
Subject: Computer Science (all)
Subject: Agricultural and Biological Sciences (all)
Subject: Engineering (all)
Divisions: Faculty of Computer Science and Information Technology
DOI Number: https://doi.org/10.18517/ijaseit.15.4.20535
Publisher: Insight Society
Keywords: Cross-project; Defect prediction; Feature selection; Software
Sustainable Development Goals (SDGs): SDG 9: Industry, Innovation and Infrastructure, SDG 16: Peace, Justice and Strong Institutions, SDG 8: Decent Work and Economic Growth
Depositing User: Ms. Nur Faseha Mohd Kadim
Date Deposited: 07 May 2026 03:13
Last Modified: 07 May 2026 03:13
Altmetrics: http://www.altmetric.com/details.php?domain=psasir.upm.edu.my&doi=10.18517/ijaseit.15.4.20535
URI: http://psasir.upm.edu.my/id/eprint/125328
