Twofold Integer Programming Model for Improving Rough Set Classification Accuracy in Data Mining.
Saeed, Walid (2005) Twofold Integer Programming Model for Improving Rough Set Classification Accuracy in Data Mining. PhD thesis, Universiti Putra Malaysia.
The rapid growth of databases has created strong demand for tools that can analyze data to discover new knowledge and patterns. Such tools, collectively known as Data Mining (DM), are intended to close the gap between the steady growth of information and the escalating demand to understand it and extract its value. One aim of DM is to discover decision rules that capture meaningful knowledge. These rules consist of conditions over attribute-value pairs, called descriptions, together with decision attributes. Generating a good decision or classification model is therefore a major component of much data mining research. The classification approach produces a function that maps a data item to one of several predefined classes: it takes a training dataset as input and builds a model of the class attribute from the remaining attributes.
This research undertakes three main tasks. The first is to introduce a new rough set model for minimum reduct selection and default rule generation, known as Twofold Integer Programming (TIP). The second is to enhance rule accuracy based on the first task, and the third is to classify new objects or cases. The TIP model translates the discernibility relation of a Decision System (DS) into an Integer Programming (IP) model, which is solved with the branch and bound search method to generate the full reduct of the DS. The TIP model is then applied to the reduct to generate the default rules, which in turn are used to classify unseen objects with satisfactory accuracy. Apart from introducing the TIP model, this research also addresses missing values, discretization and the extraction of minimum rules. Missing values and discretization are handled during the preprocessing stage.
Minimum rules are extracted after the default rules have been generated, in order to obtain the most useful discovered rules. Eight datasets from machine learning repositories and domain theories are tested with the TIP model. The total number of rules, rule length and rule accuracy are recorded for the generated rules. The rule and classification accuracy of the TIP method is compared with that of other methods: Standard Integer Programming (SIP) and Decision Related Integer Programming (DRIP) from Rough Set, Genetic Algorithm (GA), Johnson reducer, Holte1R method, Multiple Regression (MR), Neural Network (NN), Induction of Decision Tree Algorithm (ID3) and Base Learning Algorithm (C4.5), all of which are commonly used in classification tasks. In the experiments, the TIP approach successfully performed the rule generation and classification tasks required in a classification operation. The considerably good accuracy is attributed mainly to the selection of relevant attributes. The results indicate that the TIP method can handle different kinds of datasets and yields a good rough classification model, with promising results compared with other commonly used classifiers. This research opens a wide range of future work, including applying the proposed method in other areas such as web mining, text mining or multimedia mining, and extending the approach to run on parallel computing platforms for data mining.
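The reduct-selection idea described in the abstract (translate the discernibility relation of a decision system into a 0-1 program, then solve it by branch and bound) can be illustrated with a minimal sketch. This is not the thesis's TIP model, whose exact formulation is not given here; it is the generic covering formulation, in which one binary variable per condition attribute is minimized subject to every pair of objects with different decisions being discerned by at least one selected attribute. The function names and the toy decision table are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def discernibility_clauses(rows):
    """For each pair of objects with different decisions, collect the set of
    condition-attribute indices on which they differ.
    Assumption: the decision value is the last element of each row."""
    clauses = set()
    for r1, r2 in combinations(rows, 2):
        if r1[-1] != r2[-1]:
            diff = frozenset(i for i in range(len(r1) - 1) if r1[i] != r2[i])
            if diff:  # inconsistent pairs (no differing attribute) are skipped
                clauses.add(diff)
    return clauses


def minimum_reduct(clauses):
    """Select the fewest attributes such that every clause contains at least
    one selected attribute, using a simple depth-first branch and bound."""
    clauses = sorted(clauses, key=len)
    # Selecting every attribute that occurs in some clause is always feasible.
    best = set().union(*clauses) if clauses else set()

    def search(chosen, remaining):
        nonlocal best
        if not remaining:
            if len(chosen) < len(best):
                best = set(chosen)
            return
        # Bound: at least one more attribute is needed, so prune any branch
        # that cannot end up smaller than the best solution found so far.
        if len(chosen) + 1 >= len(best):
            return
        # Branch: some attribute of the first uncovered clause must be chosen.
        for a in sorted(remaining[0]):
            search(chosen | {a}, [c for c in remaining if a not in c])

    search(frozenset(), clauses)
    return best


# Hypothetical toy decision table: three condition attributes, then the decision.
rows = [
    (1, 0, 0, "yes"),
    (1, 1, 0, "no"),
    (0, 0, 1, "no"),
    (0, 1, 1, "yes"),
]
clauses = discernibility_clauses(rows)
reduct = minimum_reduct(clauses)
print(sorted(reduct))
```

In this toy table, attribute 1 alone separates two of the conflicting pairs, while the other two pairs differ only on attributes 0 and 2, so any minimum reduct needs attribute 1 plus one of the other two. A production implementation would more likely hand the same constraints to a dedicated integer programming solver than rely on this plain recursive search.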