UPM Institutional Repository

Fake review annotation model and classification through reviewers' writing style


Shojaee, Somayeh (2019) Fake review annotation model and classification through reviewers' writing style. Doctoral thesis, Universiti Putra Malaysia.


Over the last decade, online product reviews have become the main source of information in customers' decision-making and businesses' purchasing processes. Unfortunately, fraudsters deliberately produce untruthful reviews for profit or publicity. Their activities mislead organizations into reshaping their businesses, prevent customers from making the best decisions, and keep opinion mining techniques from reaching accurate conclusions. One of the major challenges in spam review detection is the lack of an available labeled gold-standard dataset of real-life product reviews. Manually labeling product reviews as fake or real is one approach to this problem. However, it is very difficult to tell whether a review is fake or real only by reading its content, because spammers can easily craft a fake review that reads just like any real review. To address this problem, we improve inter-annotator agreement in the manual labeling approach by proposing a model for annotating product reviews as fake or real. This is the first contribution of this research study. The proposed annotation model was designed, implemented, and made accessible online. Our crawled reviews were labeled by three annotators who were trained and paid to complete the labeling through our system. A spamicity score was calculated for each review, and each review was assigned a label based on its spamicity score. Fleiss's kappa calculated for the three annotators is 0.89, which indicates "almost perfect agreement" between them. The labeled real-life product review dataset is the second contribution of this study. To test the accuracy of our model, we also re-labeled a portion of the available Yelp.com dataset through our system and calculated the disagreement with the actual labels produced by Yelp.com's filtering system. We found that only 7% of the reviews were labeled differently.
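As an illustration of the agreement statistic reported above, the following is a minimal sketch of the standard Fleiss's kappa formula for a fixed number of raters per item. The function name `fleiss_kappa`, the label strings, and the toy ratings are assumptions made for this example; they are not taken from the thesis and do not reproduce its data or its 0.89 result.

```python
from collections import Counter


def fleiss_kappa(ratings, categories=("fake", "real")):
    """Fleiss's kappa for items that were each labeled by the same number of raters.

    ratings: list of per-item label lists, e.g. [["fake", "fake", "real"], ...]
    (hypothetical input layout chosen for this sketch).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    # counts[i][j] = number of raters who assigned category j to item i
    counts = [[Counter(item)[c] for c in categories] for item in ratings]

    # Observed agreement: mean over items of the share of agreeing rater pairs
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")

    return (p_bar - p_e) / (1.0 - p_e)


# Toy example with three raters per review (made-up labels)
toy = [
    ["fake", "fake", "fake"],
    ["real", "real", "real"],
    ["fake", "fake", "real"],
    ["real", "real", "real"],
]
kappa = fleiss_kappa(toy)
```

Values above 0.80 are conventionally read as "almost perfect agreement", which is the band the reported 0.89 falls in.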
Another open problem in fake product review classification is the lack of feature sets that do not depend on historic knowledge. Most feature-based fake review detection techniques are applicable only to a specific product domain, or they need historic knowledge to extract their features. To address this problem, this study presents a set of domain- and historic-knowledge-independent features, namely writing style and readability, which can be applied to almost any review hosting site. The feature set is the third contribution of this study. Writing style here refers to the linguistic aspects that distinguish fake reviewers from real ones. Fake reviewers try hard to write reviews that sound genuine, which affects their writing style and, consequently, the readability of their fake reviews. The method independently detects reviewers' writing style before spamming can hurt a product or a business. Evaluating our features yields an accuracy of 90.7% on the only available crowdsourced labeled gold-standard dataset and 98.9% on our proposed dataset, which suggests significant differences between fake and real reviews in writing style and readability level.
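The abstract does not list the individual features, so the following is only a sketch of what content-only writing style and readability features could look like. The function names, the chosen features (average sentence length, average word length, type-token ratio), and the vowel-group syllable heuristic are assumptions made for illustration; the Flesch Reading Ease formula is the standard published one, but nothing here is claimed to be the thesis's actual feature set.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups (heuristic, an assumption)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def style_features(text):
    """Compute a few features from the review text alone (no reviewer history)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    n_words = len(words)
    n_sentences = len(sentences)
    n_syllables = sum(count_syllables(w) for w in words)

    return {
        "avg_sentence_length": n_words / n_sentences,
        "avg_word_length": sum(len(w) for w in words) / n_words,
        # Share of distinct words, a simple vocabulary richness measure
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        # Standard Flesch Reading Ease formula
        "flesch_reading_ease": 206.835
        - 1.015 * (n_words / n_sentences)
        - 84.6 * (n_syllables / n_words),
    }


features = style_features("Great phone. I love it!")
```

Because every value is derived from the text of a single review, a feature vector like this can be computed on any review hosting site without product-specific or historical data, which is the property the abstract emphasizes.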


Additional Metadata

Item Type: Thesis (Doctoral)
Subject: Computer networks - Security measures
Subject: Security systems
Call Number: FSKTM 2020 3
Chairman Supervisor: Assoc. Prof. Masrah Azrifah Azmi Murad, PhD
Divisions: Faculty of Computer Science and Information Technology
Depositing User: Mas Norain Hashim
Date Deposited: 27 Sep 2021 03:38
Last Modified: 27 Sep 2021 03:38
URI: http://psasir.upm.edu.my/id/eprint/90777