UPM Institutional Repository

Classic term weighting technique for mining web content outliers


Wan Zulkifeli, Wan Rusila and Mustapha, Norwati and Mustapha, Aida (2012) Classic term weighting technique for mining web content outliers. In: International Conference on Computational Techniques and Artificial Intelligence (ICCTAI'2012), 11-12 Feb. 2012, Penang, Malaysia. (pp. 271-275).


Outlier analysis has become a popular topic in data mining, but there has been less work on detecting outliers in web content. Web content outlier mining is used to detect irrelevant content within a web portal. Term Frequency (TF) techniques from Information Retrieval (IR) have been used to measure the relevance of a term in a web document. However, when document length varies, relative frequency is preferred. This study used maximum frequency normalization and applied Inverse Document Frequency (IDF) weighting, a traditional term weighting method in IR, to exploit terms that occur in fewer documents, which are considered more discriminative than frequent terms. The dataset is the 20 Newsgroups dataset. TF.IDF is used in the dissimilarity measure, and the result achieves up to 91.10% accuracy, which is about 17.77% higher than the previous technique.
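The weighting described in the abstract can be illustrated with a minimal Python sketch. It assumes the common textbook forms, tf(t, d) = f(t, d) / max f(t', d) for maximum frequency normalization and idf(t) = log(N / df(t)); the abstract does not state the exact formulas or the dissimilarity measure used in the paper, so the function name `tf_idf`, the natural logarithm, and the toy documents are all illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Assumed formulas (not taken from the paper):
      tf(t, d) = count of t in d / count of the most frequent term in d
      idf(t)   = ln(number of documents / number of documents containing t)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values(), default=1)
        weights.append({
            term: (count / max_count) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus: two on-topic documents and one off-topic document.
docs = [
    ["hockey", "game", "hockey", "team"],
    ["hockey", "team", "score"],
    ["recipe", "team", "oven"],
]
weights = tf_idf(docs)
```

In this toy corpus, "team" occurs in every document and therefore receives weight 0, while "recipe" occurs in only one document and receives the largest IDF. This is the sense in which less frequent terms are treated as more discriminative.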

Additional Metadata

Item Type: Conference or Workshop Item (Paper)
Divisions: Faculty of Computer Science and Information Technology
Publisher: Planetary Scientific Research Center
Keywords: Information retrieval; Outliers; Term weighting; Web content
Depositing User: Nabilah Mustapa
Date Deposited: 30 Dec 2016 05:31
Last Modified: 30 Dec 2016 05:31
URI: http://psasir.upm.edu.my/id/eprint/49837