UPM Institutional Repository

Position score weighting technique for mining web content outliers.


Citation

Mustapha, Norwati and Mustapha, Aida (2013) Position score weighting technique for mining web content outliers. International Journal of Applied Mathematics and Statistics, 36 (6). pp. 77-86. ISSN 0973-7545

Abstract

Existing methods for mining web content outliers use a stemming algorithm to preprocess the web documents but leave the domain dictionary in root-word form. A stemming algorithm reduces derived words to their stem, base, or root form. However, the stem that remains after suffixes are removed is sometimes not a real word, which makes it difficult to match words in the full word profile against the domain dictionary. This study therefore uses a stemmed domain dictionary together with the Term Frequency with Position Score (TF.PS) weighting technique, which is derived from the TF.IDF weighting technique from Information Retrieval (IR), in the dissimilarity measure phase, and assesses how well this combination determines outliers in web content. The dataset is the 20 Newsgroups dataset. The stemmed domain dictionary with TF.PS weighting achieves up to 98.19% accuracy and 90% F1-measure, which is higher than previous techniques.
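
The matching problem described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical toy suffix stripper, toy_stem, which is not the stemmer used in the paper, and a made-up three-word dictionary. It only shows why stemmed document words fail to match an unstemmed dictionary and why stemming the dictionary restores the matches.

```python
# Toy illustration only: toy_stem is a hypothetical, crude suffix stripper,
# not the stemming algorithm used in the paper.
SUFFIXES = ("ing", "ies", "ed", "es", "e", "y", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical document words and domain dictionary (assumed for illustration).
document_words = ["computing", "studies", "encrypted"]
domain_dictionary = ["compute", "study", "encrypt"]

# Preprocessing stems the document words; some stems are not real words.
document_stems = {toy_stem(w) for w in document_words}

# Matching stemmed words against the unstemmed dictionary misses most entries.
matches_raw = document_stems & set(domain_dictionary)

# Stemming the dictionary with the same function makes both sides comparable.
stemmed_dictionary = {toy_stem(w) for w in domain_dictionary}
matches_stemmed = document_stems & stemmed_dictionary

print(sorted(matches_raw))
print(sorted(matches_stemmed))
```

In this sketch only "encrypt" matches the unstemmed dictionary, because "comput" and "stud" are not real words, whereas all three stems match once the dictionary is stemmed in the same way.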


Additional Metadata

Item Type: Article
Divisions: Faculty of Computer Science and Information Technology
Publisher: CESER Publications
Keywords: Information retrieval; Outliers; Web content; Weighting technique.
Depositing User: Ms. Nida Hidayati Ghazali
Date Deposited: 14 Jul 2014 07:47
Last Modified: 08 Oct 2015 06:52
URI: http://psasir.upm.edu.my/id/eprint/30631