Citation
Alshari, Eissa Mohammed Mohsen
(2018)
Feature extraction based on word embeddings and opinion lexicals for sentiment analysis.
Doctoral thesis, Universiti Putra Malaysia.
Abstract
Sentiment analysis has become an important research area in natural language
processing due to the exponential increase of user reviews and comments
online. The goal of sentiment analysis is to classify the polarity of a review
text as either positive or negative. Many techniques rely on generic
opinion lexicons such as SentiWordNet to construct features for the sentiment
classification task. The lexicons consist of words with positive or negative polarity,
and sometimes with assigned scores reflecting the degree of the sentiment
polarity. The presence of opinion words from the lexicon in a text indicates the
overall sentiment of the text. Lexicon-based sentiment analysis sums the polarity
scores that the lexicon assigns to the words in the text to determine its polarity,
while in the supervised learning task feature vectors are constructed from the
opinion words and their scores for use by machine learning classifiers.
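The two uses of a lexicon described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the toy lexicon, its scores, the whitespace tokenizer, and the function names are all assumptions made for the example.

```python
# Toy lexicon: hypothetical words and scores, not taken from SentiWordNet.
TOY_LEXICON = {"good": 0.7, "great": 0.9, "bad": -0.6, "boring": -0.8}


def lexicon_score(text, lexicon):
    """Sum the polarity scores of the opinion words found in the text."""
    return sum(lexicon.get(tok, 0.0) for tok in text.lower().split())


def lexicon_polarity(text, lexicon):
    """Label the text positive when the summed score is above zero."""
    return "positive" if lexicon_score(text, lexicon) > 0 else "negative"


def feature_vector(text, vocabulary, lexicon):
    """One slot per vocabulary word; non-opinion words stay at zero."""
    tokens = set(text.lower().split())
    return [lexicon.get(w, 0.0) if w in tokens else 0.0 for w in vocabulary]
```

In `feature_vector`, any word that is absent from the lexicon contributes a zero even when it occurs in the text, which is the limitation the next paragraph addresses.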
Firstly, in this context, the features used for classification are limited to
the opinion words present in the text, while the non-opinion words in the text
are neglected (they are assigned zero values in the vector). This has become a
limiting factor for the effectiveness of sentiment analysis. It is assumed that the
collection of features should be enriched by including the non-opinion words in
the text as features. In this thesis, the Dic2vec model is proposed to learn the
polarity of non-opinion words based on Word2vec. As such, the features for
sentiment analysis are enriched by combining opinion words and non-opinion
words.
Secondly, many feature extraction techniques have been proposed to alleviate the
data density and sparsity issue by means of feature clustering. Such methods often
reduce the vector dimension and assign a more effective weighting
scheme to improve the efficiency and effectiveness of sentiment analysis. One of
the feature clustering methods used for sentiment analysis computes the
semantic orientation of words in the labeled corpus and groups those words based
on predefined ranges of semantic orientation scores. The score is measured based
on the Pointwise Mutual Information (PMI) of words in the positive and negative
review datasets. As a result, clusters of words are derived and used as features.
The main disadvantage of this feature clustering method is that the strength of
the polarity of words is underrepresented in the vector. Two or more words
with similar but high scores are represented only by a binary value of 1, which
is equal to the value for any two or more words with similar but lower scores.
As such, the effect of the significant words on the classification is diminished.
In this thesis, the Senti2vec model is proposed to discover polarity clusters from
the corpus to be used as features. The aim is to group non-opinion words around
opinion words to produce a more effective weighting scheme for the features in
the sentiment analysis task.
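The PMI-based clustering baseline criticized above can be sketched as below. This shows only that baseline under stated assumptions, not Senti2vec: the document-level counts, the smoothing constant, the range edges, and the function names are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def semantic_orientation(pos_docs, neg_docs, smoothing=1.0):
    """SO(w) = PMI(w, positive) - PMI(w, negative), from document counts.

    The P(w) terms cancel, leaving log2(P(w | positive) / P(w | negative)).
    The smoothing (an assumption) keeps the ratio finite for unseen words.
    """
    pos_df = Counter(w for d in pos_docs for w in set(d.lower().split()))
    neg_df = Counter(w for d in neg_docs for w in set(d.lower().split()))
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    scores = {}
    for w in set(pos_df) | set(neg_df):
        p_pos = (pos_df[w] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg_df[w] + smoothing) / (n_neg + 2 * smoothing)
        scores[w] = math.log2(p_pos / p_neg)
    return scores


def cluster_id(score, edges=(-1.0, 0.0, 1.0)):
    """Map a score to a cluster index using predefined range edges."""
    return sum(score >= e for e in edges)


def cluster_features(text, scores, edges=(-1.0, 0.0, 1.0)):
    """Binary vector: slot k is 1 if any word from cluster k occurs."""
    vec = [0] * (len(edges) + 1)
    for w in set(text.lower().split()):
        if w in scores:
            vec[cluster_id(scores[w], edges)] = 1
    return vec
```

Because each slot is binary, a review containing several strongly polar words from one cluster produces the same vector as a review containing a single such word, which is the loss of polarity strength described above.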
Finally, the thesis focuses on the problem of generating domain-dependent opinion
lexicons through semi-supervised learning. It is based on the assumption that
generic opinion lexicons such as SentiWordNet are unable to capture the specific
characteristics of the domain in order to discriminate among classes. The problem
can be defined as assigning the polarity of target words based on a given set
of opinion words as the seeds. A recent method proposed for this problem
constructs a graph in which nodes correspond to subjective words and the edges
reflect the similarity between those words. The similarity is measured by the
co-occurrence of word pairs within the same linguistic unit, such as an n-gram,
phrase or sentence. Given that the polarity of a seed word is known, the polarity
of a target word is derived from the strength of the edges between the seed
word and the target word. It is argued that Word2vec is far superior in
representing the distributional semantics among words in a language. As such, in
this thesis a semi-supervised learning method is proposed to learn the polarity of
words from seed opinion words by using Word2vec.
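The general idea of deriving a target word's polarity from seed words in an embedding space can be sketched as follows. The abstract does not specify the thesis's actual scoring rule, so the rule used here (mean cosine similarity to positive seeds minus mean cosine similarity to negative seeds) is an assumption, and the two-dimensional toy vectors are hypothetical stand-ins for trained Word2vec embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def seed_polarity(word, vectors, pos_seeds, neg_seeds):
    """Mean similarity to positive seeds minus mean similarity to negative seeds."""
    target = vectors[word]
    pos = sum(cosine(target, vectors[s]) for s in pos_seeds) / len(pos_seeds)
    neg = sum(cosine(target, vectors[s]) for s in neg_seeds) / len(neg_seeds)
    return pos - neg


# Hypothetical toy embeddings; real ones would come from a trained model.
TOY_VECTORS = {
    "good": [0.9, 0.1],
    "bad": [0.1, 0.9],
    "gripping": [0.8, 0.2],
    "tedious": [0.2, 0.8],
}
```

With these toy vectors, `seed_polarity("gripping", TOY_VECTORS, ["good"], ["bad"])` is positive and the same call for `"tedious"` is negative, so both target words inherit a polarity from the seeds without appearing in the seed lexicon.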
All proposed methods and models in this thesis are evaluated using a labeled
dataset of 50,000 movie reviews. Based on the experiments,
the performance of the Dic2vec model is about 2.5% to 6% better than the baseline.
In addition, the Senti2vec model shows an improvement of up to 6.5%
compared to the baseline. Finally, the proposed semi-supervised method for learning
opinion lexicons is better than the recent co-occurrence graph method by more
than 12%.