UPM Institutional Repository

An empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability


Citation

Bouke, Mohamed Aly and Abdullah, Azizol (2023) An empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability. Expert Systems with Applications, 230. pp. 1-9. ISSN: 0957-4174; ESSN: 1873-6793

Abstract

In this paper, we investigate the impact of pattern leakage during data preprocessing on the reliability of Machine Learning (ML) based intrusion detection systems (IDS). Data leakage, also known as pattern leakage, occurs during data preprocessing when information from the testing set is used in training, leading to overfitting and inflated accuracy scores. Our study uses three well-known intrusion detection datasets: NSL-KDD, UNSW-NB15, and KDDCUP99. We preprocess the data to create versions with and without pattern leakage, then train and test six ML models: Decision Tree (DT), Gradient Boosting (GB), K-Nearest Neighbours (KNN), Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). Our results show that IDS models built with data leakage report higher accuracy, but that accuracy is unreliable. Additionally, we find that some algorithms are more sensitive to data leakage than others, as shown by the drop in accuracy when the models are built without leakage. To address this problem, we provide suggestions for mitigating data leakage in the training process and for analyzing the sensitivity of different algorithms. Overall, our study emphasizes the importance of addressing data leakage in the training process to ensure the reliability of ML-based IDS models.
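The abstract does not say which preprocessing steps were examined, so the following is only a minimal sketch of one common way leakage can arise: fitting a min-max scaler on the full dataset before splitting, rather than on the training portion alone. The toy data, the 70/30 ordered split, and the helper names `fit_min_max` and `apply_min_max` are assumptions made for illustration and are not taken from the paper.

```python
def fit_min_max(rows):
    """Return per-column (min, max) computed from the given rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, stats):
    """Scale each column using previously fitted (min, max) statistics."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, stats)]
        for row in rows
    ]


# Hypothetical toy dataset: two numeric features, ten records.
data = [[float(i), float(i % 7)] for i in range(10)]

# Ordered 70/30 split (assumed for illustration).
train, test = data[:7], data[7:]

# Leaky variant: scaling statistics are fitted on train AND test together,
# so the test set's value range influences the training features.
leaky_stats = fit_min_max(data)
leaky_train = apply_min_max(train, leaky_stats)
leaky_test = apply_min_max(test, leaky_stats)

# Leak-free variant: statistics are fitted on the training rows only,
# then reused unchanged to transform the test rows.
clean_stats = fit_min_max(train)
clean_train = apply_min_max(train, clean_stats)
clean_test = apply_min_max(test, clean_stats)

print("leaky stats:", leaky_stats)
print("clean stats:", clean_stats)
print("max scaled test value (clean):", max(row[0] for row in clean_test))
```

In the leak-free variant, scaled test values can fall outside the [0, 1] range because the scaler has never seen them. That is the expected behaviour: the model is evaluated on data whose statistics were genuinely unknown at training time.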


Additional Metadata

Item Type: Article
Divisions: Faculty of Computer Science and Information Technology
DOI Number: https://doi.org/10.1016/j.eswa.2023.120715
Publisher: Elsevier B.V.
Keywords: Machine learning; Intrusion detection; Data leakage; Model performance; Data preprocessing; Industry; Innovation and infrastructure
Depositing User: Ms. Che Wa Zakaria
Date Deposited: 03 Oct 2024 04:25
Last Modified: 03 Oct 2024 04:25
Altmetrics: http://www.altmetric.com/details.php?domain=psasir.upm.edu.my&doi=10.1016/j.eswa.2023.120715
URI: http://psasir.upm.edu.my/id/eprint/106552