UPM Institutional Repository

K-gen phishguard: an ensemble approach for phishing detection with k-means and genetic algorithm


Citation

Al-Hafiz, Ali Raheem and Jabir, Adnan J. and Subramaniam, Shamala (2025) K-gen phishguard: an ensemble approach for phishing detection with k-means and genetic algorithm. Al-Khwarizmi Engineering Journal, 21 (2). pp. 117-135. ISSN 1818-1171; eISSN: 2312-0789

Abstract

Phishing detection is a critical problem in cybersecurity, and a central challenge is pairing machine learning with an efficient feature selection method so that malicious websites are identified precisely. This research presents a two-phase phishing detection system that combines unsupervised feature selection with supervised classification. In the first phase, a genetic algorithm (GA) identifies the best set of features, which the K-means clustering algorithm then uses to divide the dataset into groups with similar traits. In the second phase, the GA identifies the best set of features within each group to enhance classification. Finally, a voting ensemble combines Support Vector Machine (SVM), Random Forest (RF), Extreme Gradient Boosting (XGBoost) and Adaptive Boosting (AdaBoost) models, and their predictions are aggregated using a soft voting mechanism. The research uses the web page phishing detection dataset, which consists of 11,430 URLs with 87 features. The voting ensemble achieves an accuracy of 99% with feature selection, compared with 77.3% without it. GA-optimised feature selection significantly improves model performance by reducing computational complexity and improving key metrics such as accuracy, precision and F1-score. Additionally, the performance across four clusters demonstrates the positive impact of K-means clustering on classification accuracy for specific data groups. The results indicate that integrating feature selection with ensemble learning is effective for phishing detection and that such a solution is scalable and efficient in real-world applications.
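
As an illustration of the soft voting step mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes each of the four models outputs a probability per class and that all models are weighted equally; the abstract does not state the weighting. The function name `soft_vote` and the probability values are hypothetical.

```python
from typing import List, Sequence, Tuple


def soft_vote(probabilities: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Average per-class probabilities across models and return (label, averages).

    Assumption: equal weight per model. The abstract only says "soft voting".
    """
    if not probabilities:
        raise ValueError("need probabilities from at least one model")
    n_classes = len(probabilities[0])
    if any(len(p) != n_classes for p in probabilities):
        raise ValueError("every model must report the same number of classes")

    n_models = len(probabilities)
    # Mean probability of each class over all models.
    averaged = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    # The predicted label is the class with the highest mean probability.
    label = max(range(n_classes), key=lambda c: averaged[c])
    return label, averaged


# Hypothetical outputs for a single URL: [P(legitimate), P(phishing)].
model_outputs = {
    "svm": [0.40, 0.60],
    "random_forest": [0.55, 0.45],
    "xgboost": [0.20, 0.80],
    "adaboost": [0.45, 0.55],
}

label, averaged = soft_vote(list(model_outputs.values()))
print(label, averaged)
```

In this made-up case the Random Forest leans towards "legitimate", but the averaged probabilities still favour the phishing class. A hard (majority) vote would count only each model's label, whereas soft voting lets a confident model outweigh a hesitant one. In scikit-learn, `VotingClassifier(voting="soft")` provides this kind of aggregation; whether the paper uses it is not stated in the abstract.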


Download File

121042.pdf - Published Version (824kB)

Additional Metadata

Item Type: Article
Divisions: Faculty of Computer Science and Information Technology
DOI Number: https://doi.org/10.22153/kej.2025.04.011
Publisher: University of Baghdad
Keywords: Adaboost; Ensemble learning; Feature selection; Genetic algorithm; K-means clustering; Machine learning; Phishing detection
Depositing User: Ms. Nuraida Ibrahim
Date Deposited: 23 Oct 2025 00:42
Last Modified: 23 Oct 2025 00:42
Altmetrics: http://www.altmetric.com/details.php?domain=psasir.upm.edu.my&doi=10.22153/kej.2025.04.011
URI: http://psasir.upm.edu.my/id/eprint/121042