Citation
Shojaee, Somayeh (2019) Fake review annotation model and classification through reviewers' writing style. Doctoral thesis, Universiti Putra Malaysia.
Abstract
In the last decade, online product reviews have become a main source of
information in customers' decision-making and businesses' purchasing processes.
Unfortunately, fraudsters produce untruthful reviews intentionally, for profit
or publicity. Their activities mislead organizations into reshaping their
businesses, prevent customers from making the best decisions, and keep opinion
mining techniques from reaching accurate conclusions.
One of the big challenges in spam review detection is the lack of a labeled
gold-standard real-life product review dataset. Manually labeling product
reviews as fake or real is one approach to this problem. However, recognizing
whether a review is fake or real is very difficult from the review content
alone, because spammers can easily craft a fake review that reads just like
a genuine one.
To address this problem, we improve inter-annotator agreement in the manual
labeling approach by proposing a model for annotating product reviews as
fake or real. This is the first contribution of this research. The proposed
annotation model was designed, implemented, and made accessible online. Our
crawled reviews were labeled by three annotators who were trained and paid
to complete the labeling through our system. A spamicity score was calculated
for each review, and a label was assigned to every review based on its
spamicity score. Fleiss's kappa for the three annotators is 0.89, which
indicates "almost perfect agreement" between them. The labeled real-life
product review dataset is the second contribution of this study. To test the
accuracy of our model, we also re-labeled a portion of the available Yelp.com
dataset through our system and measured the disagreement with the original
labels assigned by Yelp.com's filtering system. We found that only 7% of the
reviews were labeled differently.
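As a rough illustration of the reported agreement measure, the following minimal Python sketch computes Fleiss's kappa from a matrix of per-review label counts; the function and the toy data are hypothetical and not taken from the thesis.

import numpy as np

def fleiss_kappa(ratings):
    # ratings: (n_items, n_categories) matrix of label counts per review,
    # e.g. columns = [votes for "fake", votes for "real"] from 3 annotators.
    n_items, _ = ratings.shape
    n_raters = ratings.sum(axis=1)[0]                    # raters per item (e.g. 3)
    p_cat = ratings.sum(axis=0) / (n_items * n_raters)   # overall category proportions
    # Per-item agreement: fraction of annotator pairs that agree on the item.
    p_item = ((ratings ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()                                # observed agreement
    p_e = (p_cat ** 2).sum()                             # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 5 reviews, columns = [fake, real] votes from 3 annotators.
counts = np.array([[3, 0], [0, 3], [2, 1], [3, 0], [0, 3]])
print(round(fleiss_kappa(counts), 2))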
The other open problem in fake product review classification is the lack of
feature sets that are independent of historic knowledge. Most feature-based
fake review detection techniques are applicable only to a specific product
domain, or historic knowledge is needed to extract their features. To address
this problem, this study presents a set of domain-independent and
historic-knowledge-independent features, namely writing style and readability,
which can be applied to almost any review hosting site. This feature set is
the third contribution of this study.
this study. Writing style here refers to linguistic aspects that identify fake
and real reviewers. Fake reviewers try hard to write a review that sounds
like genuine, hence it affects their writing style and also readability of their
fake reviews consequently. The method dependently detects reviewers' writing
style before spamming can hurt a product or a business. The evaluation
results of our features on the only available crowdsourced labeled gold standard
dataset, with the accuracy of 90.7%, and on our proposed dataset with
the accuracy of 98.9%, suggest significant differences between fake and real
reviews on writing style and readability level.
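The abstract does not enumerate the individual features, so the following Python sketch only illustrates the general idea under assumptions of my own: a few plausible writing-style and readability features (Flesch Reading Ease, average word length, type-token ratio) are extracted from raw review text and passed to a generic classifier. All function names, features, and data here are hypothetical, not the thesis's actual feature set.

import re
import numpy as np
from sklearn.linear_model import LogisticRegression

def count_syllables(word):
    # Crude vowel-group heuristic, sufficient for illustration only.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def style_features(text):
    # Assumed features: Flesch Reading Ease, average word length, type-token ratio.
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words, n_sents = max(len(words), 1), max(len(sentences), 1)
    syllables = sum(count_syllables(w) for w in words)
    flesch = 206.835 - 1.015 * (n_words / n_sents) - 84.6 * (syllables / n_words)
    avg_word_len = sum(len(w) for w in words) / n_words
    type_token_ratio = len({w.lower() for w in words}) / n_words
    return [flesch, avg_word_len, type_token_ratio]

# Toy data with invented labels: 1 = fake, 0 = real.
reviews = ["Best product ever!!! Amazing, perfect, wonderful, buy it now!",
           "The battery lasts about two days and the case scratches easily."]
labels = [1, 0]
X = np.array([style_features(r) for r in reviews])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))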