Citation
Abu Bakar, Azuraliza
(2002)
Propositional satisfiability method in rough classification modeling for data mining.
Doctoral thesis, Universiti Putra Malaysia.
Abstract
A fundamental question in data mining is whether all of the available information is
always necessary to represent the information system (IS). The goal of data mining is to
find rules that model the world sufficiently well. These rules consist of conditions over
attribute-value pairs, called the description, and a classification of the decision attribute. However,
the set of all decision rules generated from all condition attributes can be too large and
can contain many chaotic rules that are not appropriate for classifying unseen objects.
Therefore, a search for the best rules must be performed, because it is not possible to
determine the quality of every rule generated from the information system. In the rough set
approach to data mining, the set of interesting rules is determined using the notion of a
reduct. Rules are generated from a reduct by binding the condition attribute values
of the object class from which the reduct originated to the corresponding attributes. It is
important for the reducts to be minimal in size. Minimal reducts decrease the
number of condition attributes used to generate rules. Shorter rules are
expected to classify new cases more accurately because they have larger support in the data, and in
some sense the most stable and frequently appearing reducts give the best decision
rules.
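The role of reducts described above can be illustrated with a small sketch. It is not the thesis's method: it uses a brute-force search over attribute subsets, and the decision table, the attribute names `a`, `b`, `c`, `d`, and all function names are hypothetical. It builds the discernibility entries of a toy decision table, finds the smallest attribute subsets that still separate objects with different decisions, and binds attribute values to one such subset to form rules.

```python
from itertools import combinations

# Hypothetical toy decision table: condition attributes a, b, c and decision d.
TABLE = [
    {"a": 1, "b": 0, "c": 1, "d": "yes"},
    {"a": 1, "b": 1, "c": 0, "d": "no"},
    {"a": 0, "b": 0, "c": 1, "d": "no"},
    {"a": 0, "b": 1, "c": 1, "d": "yes"},
]
CONDITIONS = ["a", "b", "c"]
DECISION = "d"


def discernibility_entries(table, conditions, decision):
    """For every pair of objects with different decisions, collect the set of
    condition attributes on which the two objects differ."""
    entries = []
    for x, y in combinations(table, 2):
        if x[decision] != y[decision]:
            diff = frozenset(a for a in conditions if x[a] != y[a])
            if diff:  # pairs that no attribute can separate are skipped
                entries.append(diff)
    return entries


def minimal_reducts(table, conditions, decision):
    """Brute force: return all smallest attribute subsets that intersect
    every discernibility entry."""
    entries = discernibility_entries(table, conditions, decision)
    for size in range(1, len(conditions) + 1):
        found = [
            set(subset)
            for subset in combinations(conditions, size)
            if all(entry & set(subset) for entry in entries)
        ]
        if found:
            return found
    return []


def rules_from_reduct(table, reduct, decision):
    """Bind each object's values on the reduct attributes to its decision."""
    rules = set()
    for row in table:
        condition = tuple((a, row[a]) for a in sorted(reduct))
        rules.add((condition, row[decision]))
    return sorted(rules)


reducts = minimal_reducts(TABLE, CONDITIONS, DECISION)
for condition, outcome in rules_from_reduct(TABLE, reducts[0], DECISION):
    lhs = " AND ".join(f"{a}={v}" for a, v in condition)
    print(f"IF {lhs} THEN {DECISION}={outcome}")
```

In this toy table the rules use two attributes instead of three, which is the effect the abstract attributes to minimal reducts. The exhaustive subset search is exponential in the number of attributes and is shown only to make the definitions concrete.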
The main work of the thesis is the generation of a classification model that contains
a smaller number of rules, shorter rules, and good accuracy. A propositional
satisfiability method for rough classification modeling is proposed in this thesis. Two
models, Standard Integer Programming (SIP) and Decision Related Integer
Programming (DRIP), are proposed to represent the minimal reduct computation
problem. The models involve a theoretical formalization of the discernibility relation of a
decision system (DS) as an Integer Programming (IP) model. The proposed models
were embedded within the default rules generation framework, yielding a new rough
classification method. An improved branch and bound strategy that prunes part of the
search is proposed to solve the SIP and DRIP models. The proposed
strategy uses a conflict analysis procedure to remove unnecessary attribute
assignments and to determine the branch level to which the search backtracks in a
non-chronological manner.
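As an illustration of how minimal reduct computation can be posed as a 0-1 integer program, the sketch below assumes the common set-cover style formulation: one binary variable per condition attribute, one constraint per discernibility entry requiring at least one of its attributes to be selected, and the objective of minimizing the number of selected attributes. The abstract does not give the exact SIP or DRIP formulations, so this formulation is an assumption. The solver is a plain depth-first branch and bound with chronological backtracking; it does not implement the conflict analysis or non-chronological backtracking of the proposed strategy. All names are hypothetical.

```python
def solve_min_cover(entries):
    """Minimize the number of selected attributes subject to every entry
    (a set of attributes) containing at least one selected attribute.
    Returns a smallest covering set, or None if some entry is empty."""
    all_attributes = set().union(*entries) if entries else set()
    best = {"size": len(all_attributes) + 1, "chosen": None}

    def branch(chosen, remaining):
        # Bound: this partial assignment cannot beat the incumbent.
        if len(chosen) >= best["size"]:
            return
        # All constraints satisfied: record the new incumbent.
        if not remaining:
            best["size"] = len(chosen)
            best["chosen"] = set(chosen)
            return
        # Branch on the tightest unsatisfied constraint: one of its
        # attributes must be set to 1.
        entry = min(remaining, key=len)
        for attribute in sorted(entry):
            still_open = [e for e in remaining if attribute not in e]
            branch(chosen | {attribute}, still_open)

    branch(frozenset(), list(entries))
    return best["chosen"]


# Hypothetical discernibility entries over attributes a, b, c.
ENTRIES = [
    frozenset({"b", "c"}),
    frozenset({"a"}),
    frozenset({"a", "c"}),
    frozenset({"b"}),
]

print(solve_min_cover(ENTRIES))
```

Branching on the smallest unsatisfied constraint keeps the tree narrow, and the size bound discards any branch that already selects as many attributes as the best solution found so far.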
Five data sets from the UCI machine learning repository and domain theories were
used in the experiments. The total number of rules generated for the best classification
model was recorded, where 30% of the data were used for training and 70% were kept as test data. The
classification accuracy, the number of rules, and the maximum rule length obtained
from the SIP/DRIP method were compared with those of other rough set methods such as the Genetic
Algorithm (GA), Johnson, Holte's 1R, Dynamic, and Exhaustive methods. Four of the
datasets were then chosen for further experiments. The improved search strategy
implements a non-chronological backtracking search that potentially prunes a large
portion of the search space. The experimental results showed that the proposed SIP/DRIP
method is successful in rough classification modeling. The outstanding feature
of this method is the reduced number of rules in all classification models. SIP/DRIP
generated shorter rules than the other methods on most datasets. The results for the proposed search
strategy indicated that the best performance can be achieved at a lower level, or shorter
path, of the search tree. The SIP/DRIP method also showed promising results against other
commonly used classifiers such as neural networks and statistical methods. This model is
expected to be able to represent the knowledge of the system efficiently.