Frequent Lexicographic Algorithm for Mining Association Rules
Mustapha, Norwati (2005) Frequent Lexicographic Algorithm for Mining Association Rules. PhD thesis, Universiti Putra Malaysia.
Recent progress in computer storage technology has enabled many organisations to collect and store huge amounts of data, which has led to growing demand for new techniques that can intelligently transform massive data into useful information and knowledge. The concept of data mining has drawn the attention of the business community to techniques that can extract nontrivial, implicit, previously unknown and potentially useful information from databases. Association rule mining is one of the data mining techniques; it discovers strong association or correlation relationships among data. Association rule algorithms are based on a two-phase procedure. In the first phase, all frequent patterns are found; the second phase uses these frequent patterns to generate all strong rules. The measures commonly used to complete these phases are support and confidence. Intensive investigation over the past few years has shown that the first phase involves the major computational task. Although the second phase seems more straightforward, it can be costly because the number of generated rules is normally large, whereas only a small fraction of these rules is typically useful and important. In response to these challenges, this study is devoted to finding faster methods for searching frequent patterns and for discovering association rules in concise form. An algorithm called Flex (Frequent lexicographic patterns) is proposed to achieve good performance in searching for frequent patterns. The algorithm constructs the nodes of a lexicographic tree that represent frequent patterns. A depth-first strategy is used to mine the frequent patterns, and a vertical counting strategy is used to compute their support. The mined frequent patterns are then used to generate association rules.
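The combination of a lexicographic tree, depth-first traversal and vertical counting can be illustrated with a minimal Python sketch. This is not the Flex implementation from the thesis; it is a generic illustration that assumes each tree node stores the set of transaction ids (a "tidset") of its pattern, so that support is obtained by set intersection. The function name `mine_frequent_patterns`, the absolute `min_support` threshold and the toy transactions are hypothetical.

```python
from collections import defaultdict


def mine_frequent_patterns(transactions, min_support):
    """Return {pattern: support count}; patterns are tuples in lexicographic order.

    Illustrative sketch only (assumed design, not the thesis's Flex algorithm).
    """
    # Vertical layout: map each item to the ids of the transactions containing it.
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)

    # Frequent single items, in lexicographic order, form the first tree level.
    roots = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )

    frequent = {}

    def expand(pattern, tids, later_items):
        # Visit one node of the lexicographic tree, then descend depth-first.
        frequent[pattern] = len(tids)
        for i, (item, item_tids) in enumerate(later_items):
            # Vertical counting: support of the extension is the size of the intersection.
            new_tids = tids & item_tids
            if len(new_tids) >= min_support:
                expand(pattern + (item,), new_tids, later_items[i + 1:])

    for i, (item, tids) in enumerate(roots):
        expand((item,), tids, roots[i + 1:])
    return frequent


# Toy data (hypothetical): four transactions, minimum support of two transactions.
example = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
patterns = mine_frequent_patterns(example, min_support=2)
```

Each pattern is extended only with items that come later in lexicographic order, so every candidate is visited at most once, and an infrequent node is never expanded because none of its supersets can be frequent.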
Three models were applied in this task: the traditional model, the constraint model and the representative model, which produce, respectively, all association rules, association rules with 1-consequence and representative rules. As an additional utility in the representative model, this study proposes a set-theoretical intersection to assist users in finding duplicated rules. Experiments were run on four datasets which, except for the pumsb dataset, were taken from the UCI machine learning repositories and domain theories. The Flex algorithm and two existing algorithms, Apriori and DIC, were tested on these datasets under the same specification, and their extraction times for mining frequent patterns were recorded and compared. The experimental results showed that the proposed algorithm outperformed both existing algorithms, especially in the case of long patterns. It also gave promising results in the case of short patterns. Two of the datasets were then chosen for a further experiment on the scalability of the algorithms, in which the number of transactions was increased up to six times. The scale-up experiment showed that the proposed algorithm is more scalable than the other two algorithms. The implementation of an adopted theory of the representative model showed that this model is more concise than the other two models, as indicated by the number of rules generated by each model. Besides yielding a small set of rules, the representative model also has the lossless information and soundness properties, meaning that it covers all interesting association rules and forbids the derivation of weak rules. It is theoretically proven that the proposed set-theoretical intersection is able to assist users in identifying the duplicated rules that exist in the representative model.
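The difference in rule counts between the traditional model and the constraint model can be seen in a short Python sketch. It assumes the standard definition of confidence, support(X ∪ Y) / support(X), and a complete table of frequent patterns with their support counts. The function `generate_rules`, the `single_consequent` flag and the toy support table are hypothetical, and the representative model is not covered here.

```python
from itertools import combinations


def generate_rules(frequent, min_confidence, single_consequent=False):
    """Return rules as (antecedent, consequent, support, confidence) tuples.

    `frequent` maps sorted item tuples to support counts and is assumed to
    contain every subset of each pattern it lists.
    """
    rules = []
    for pattern, support in frequent.items():
        if len(pattern) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        # Constraint model: exactly one item in the consequent.
        # Traditional model: every non-empty proper subset may be the antecedent.
        sizes = [len(pattern) - 1] if single_consequent else range(1, len(pattern))
        for k in sizes:
            for antecedent in combinations(pattern, k):
                consequent = tuple(i for i in pattern if i not in antecedent)
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, support, confidence))
    return rules


# Toy support table (hypothetical), consistent with the five transactions
# {a,b,c}, {a,b,c}, {a}, {b}, {c}.
frequent = {
    ("a",): 3, ("b",): 3, ("c",): 3,
    ("a", "b"): 2, ("a", "c"): 2, ("b", "c"): 2,
    ("a", "b", "c"): 2,
}
all_rules = generate_rules(frequent, min_confidence=0.6)
one_consequent_rules = generate_rules(frequent, min_confidence=0.6, single_consequent=True)
```

On this toy table the unconstrained run yields 12 rules and the single-consequent run yields 9, which shows on a small scale how constraining the rule form reduces the size of the output.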