UPM Institutional Repository

Term frequency and inverse document frequency with position score and mean value for mining web content outliers


Citation

Wan Zulkifeli, Wan Rusila (2013) Term frequency and inverse document frequency with position score and mean value for mining web content outliers. Masters thesis, Universiti Putra Malaysia.

Abstract

In the past few years, activity in the Web Content Mining area has expanded rapidly. However, the focus has been on technical aspects, visual design and frequent web content patterns, while less frequent web content patterns, called outliers, have been undervalued. Mining Web Content Outliers is used to detect irrelevant web content within a web portal. Detecting outliers is especially important when a web portal has been hacked. To date, only a few approaches to Mining Web Content Outliers have been suggested, such as the Signed-with-Weight technique and mining through a mathematical approach. The mathematical approach is based on two-way rectangular representations and a correlation method. However, these approaches do not take advantage of position score and a stemmed domain dictionary. Position score and a stemmed domain dictionary are very useful in mining web content outliers because they may affect the reduction in the relevance of documents. Therefore, this study addresses the problems in Mining Web Content Outliers by combining the strengths of word-based techniques, a position score weighting technique and a stemmed domain dictionary.

The existing weighting technique was transformed into the Term Frequency and Inverse Document Frequency with Position Score and Mean Value (TF.IDF.PSM) weighting technique by bringing a standard weighting technique from Information Retrieval, Term Frequency and Inverse Document Frequency (TF.IDF), and a weighting technique from Text Categorization, Term Frequency and Relevance Frequency (TF.RF), into Web Content Mining. The technique starts by extracting the web pages, preprocessing them and generating the full word profile. Depending on the character length of each word, the respective index in the stemmed domain dictionary is searched. If the word is present in both the dictionary and the document, the positive count is incremented by one. Then the word frequency within a web page and across all web pages, and the position score, are counted. Finally, the dissimilarity measure is computed to determine outliers. In the dissimilarity measure, TF.IDF.PSM is used not only to calculate and analyze the relevant words but also to account for the importance of irrelevant words by assigning weights based on word position in a page. A statistical measure, the mean, is added to balance the weight of the position score.

The technique was tested on 431 web pages from the Course folder of the University of Wisconsin, provided by the World Wide Knowledge Base, while the 43-document benchmark dataset is from the Science Medical folder of the 20 Newsgroups dataset. The TF.IDF weighting technique from Information Retrieval (IR) and the TF.RF weighting technique from Text Categorization are used during the experimental phase, and the results are evaluated using two measures: accuracy and F1-measure. The experimental results show that the TF.IDF.PSM weighting technique achieves up to 98.95% accuracy, about 3.21% higher than the Signed-with-Weight technique. It also achieves an F1-measure of up to 94.19%, an 18.12% improvement over the Signed-with-Weight technique.
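
The abstract does not give the TF.IDF.PSM formula, so the following Python sketch is only an illustration of the general idea, not the thesis's method. It computes standard TF.IDF weights, scales them by an assumed position score (earlier words score higher) divided by the page's mean position score, and measures how much of the resulting weight falls on words outside a stemmed domain dictionary. The dictionary contents, the position score definition, the way the mean is applied, and all function names are assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical stemmed domain dictionary (illustrative entries only).
DOMAIN_DICTIONARY = {"cours", "lectur", "exam", "homework", "syllabu"}


def tf_idf(pages):
    """Return one {term: tf * idf} dict per page (standard TF.IDF)."""
    n_pages = len(pages)
    # Number of pages in which each term occurs at least once.
    doc_freq = Counter(term for page in pages for term in set(page))
    weights = []
    for page in pages:
        counts = Counter(page)
        weights.append({
            term: (count / len(page)) * math.log(n_pages / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def position_scores(page):
    """Assumed position score: an earlier first occurrence scores higher, in (0, 1]."""
    scores = {}
    for index, term in enumerate(page):
        scores.setdefault(term, 1.0 - index / len(page))
    return scores


def dissimilarity(page, weights):
    """Share of position-adjusted weight carried by terms outside the dictionary."""
    if not page:
        return 0.0
    scores = position_scores(page)
    mean_score = sum(scores.values()) / len(scores)
    # Assumption: dividing by the mean balances the influence of position.
    adjusted = {t: w * scores[t] / mean_score for t, w in weights.items()}
    total = sum(adjusted.values())
    if total == 0:
        return 0.0
    outside = sum(w for t, w in adjusted.items() if t not in DOMAIN_DICTIONARY)
    return outside / total


# Toy, already-stemmed pages; the last one is deliberately off-topic.
pages = [
    ["cours", "lectur", "exam"],
    ["cours", "homework", "exam"],
    ["viagra", "pill", "cheap"],
]
for page, weights in zip(pages, tf_idf(pages)):
    print(page, round(dissimilarity(page, weights), 3))
```

In this toy run the off-topic page receives the highest dissimilarity, because none of its weighted words appear in the dictionary. A real pipeline would also need the extraction, preprocessing and stemming steps the abstract mentions, which are omitted here.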


Download File

FSKTM 2013 8 IR.pdf


Additional Metadata

Item Type: Thesis (Masters)
Subject: Data mining
Subject: Web databases
Subject: Outliers (Statistics)
Call Number: FSKTM 2013 8
Chairman Supervisor: Norwati Mustapha, PhD
Divisions: Faculty of Computer Science and Information Technology
Depositing User: Hasimah Adam
Date Deposited: 07 Apr 2016 01:24
Last Modified: 07 Apr 2016 01:24
URI: http://psasir.upm.edu.my/id/eprint/39114