Frequent Lexicographic Algorithm for Mining Association Rules

Citation

Mustapha, Norwati (2005) Frequent Lexicographic Algorithm for Mining Association Rules. Doctoral thesis, Universiti Putra Malaysia.

Abstract

The recent progress in computer storage technology have enable many organisations to collect and store a huge amount of data which is lead to growing demand for new techniques that can intelligently transform massive data into useful information and knowledge. The concept of data mining has brought the attention of business community in finding techniques that can extract nontrivial, implicit, previously unknown and potentially useful information from databases. Association rule mining is one of the data mining techniques which discovers strong association or correlation relationships among data. The primary concept of association rule algorithms consist of two phase procedure. In the first phase, all frequent patterns are found and the second phase uses these frequent patterns in order to generate all strong rules. The common precision measures used to complete these phases are support and confidence. Having been investigated intensively during the past few years, it has been shown that the first phase involves a major computational task. Although the second phase seems to be more straightforward, it can be costly because the size of the generated rules are normally large and in contrast only a small fraction of these rules are typically useful and important. As response to these challenges, this study is devoted towards finding faster methods for searching frequent patterns and discovery of association rules in concise form. An algorithm called Flex (Frequent lexicographic patterns) has been proposed in obtaining a good performance of searching li-equent patterns. The algorithm involved the construction of the nodes of a lexicographic tree that represent frequent patterns. Depth first strategy and vertical counting strategy are used in mining frequent patterns and computing the support of the patterns respectively. The mined frequent patterns are then used in generating association rules. Three models were applied in this task which consist of traditional model, constraint model and representative model which produce three kinds of rules respectively; all association rules, association rules with 1-consequence and representative rules. As an additional utility in the representative model, this study proposed a set-theoretical intersection to assist users in finding duplicated rules. Four datasets from UCI machine learning repositories and domain theories except the pumsb dataset were experimented. The Flex algorithm and the other two existing algorithms Apriori and DIC under the same specification are tested toward these datasets and their extraction times for mining frequent patterns were recorded and compared. The experimental results showed that the proposed algorithm outperformed both existing algorithms especially for the case of long patterns. It also gave promising results in the case of short patterns. Two of the datasets were then chosen for further experiment on the scalability of the algorithms by increasing their size of transactions up to six times. The scale-up experiment showed that the proposed algorithm is more scalable than the other existing algorithms. The implementation of an adopted theory of representative model proved that this model is more concise than the other two models. It is shown by number of rules generated from the chosen models. Besides a small set of rules obtained, the representative model also having the lossless information and soundness properties meaning that it covers all interesting association rules and forbid derivation of weak rules. It is theoretically proven that the proposed set-theoretical intersection is able to assist users in knowing the duplication rules exist in representative model.

Download File

Text
FSKTM_2005_9 IR.pdf
Download (1MB)

Additional Metadata

Item Type:	Thesis (Doctoral)
Subject:	Data mining
Subject:	Lexicography - Data processing
Call Number:	FSKTM 2005 9
Chairman Supervisor:	Associate Professor Md. Nasir Sulaiman, PhD
Divisions:	Faculty of Computer Science and Information Technology
Depositing User:	Nur Izyan Mohd Zaki
Date Deposited:	05 May 2010 08:28
Last Modified:	06 Jan 2022 07:21
URI:	http://psasir.upm.edu.my/id/eprint/5857
Statistic Details:	View Download Statistic

Actions (login required)

View Item