UPM Institutional Repository

Feature extraction based on word embeddings and opinion lexicals for sentiment analysis


Citation

Alshari, Eissa Mohammed Mohsen (2018) Feature extraction based on word embeddings and opinion lexicals for sentiment analysis. Doctoral thesis, Universiti Putra Malaysia.

Abstract

Sentiment analysis has become an important research area in natural language processing due to the exponential increase of user reviews and comments online. The goal of sentiment analysis is to determine the polarity orientation of a review text as either positive or negative. Many techniques rely on generic opinion lexicons such as SentiWordNet to construct features for the sentiment classification task. The lexicons consist of words with positive or negative polarity, sometimes with assigned scores reflecting the degree of the sentiment polarity. The presence of opinion words from the lexicon in a text indicates the overall sentiment of the text. Lexicon-based sentiment analysis sums the polarity scores of all opinion words in the text to indicate its polarity, while in the supervised learning task, feature vectors are constructed from the opinion words and their scores for use by machine learning classifiers. Firstly, in this context, the features used for classification are limited to the opinion words present in the text, while non-opinion words are neglected (assigned zero values in the vector). This has become a limiting factor for the effectiveness of sentiment analysis. It is assumed that the collection of features should be enriched by including non-opinion words in the text as features. In this thesis, the Dic2vec model is proposed to learn the polarity of non-opinion words based on Word2vec. As such, the features for sentiment analysis are enriched by the combination of opinion words and non-opinion words. Secondly, many feature extraction techniques have been proposed to alleviate the data density and sparsity issue by means of feature clustering. Such methods often reduce the vector dimension and assign a more effective weighting scheme to improve the efficiency and effectiveness of sentiment analysis.
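The lexicon-based scoring and feature construction described above can be sketched as follows. This is a minimal illustration only: the lexicon entries and scores are invented for the example (they are not taken from SentiWordNet), the function names are hypothetical, and the "neutral" label for a zero sum is an added assumption.

```python
# Toy opinion lexicon. Words and scores are illustrative assumptions,
# not values from SentiWordNet or from the thesis.
LEXICON = {"good": 0.75, "great": 1.0, "bad": -0.75, "boring": -0.5}


def lexicon_polarity(tokens, lexicon=LEXICON):
    """Sum the polarity scores of the opinion words found in the text."""
    total = sum(lexicon.get(token, 0.0) for token in tokens)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"  # assumption: a zero sum is reported as neutral


def feature_vector(tokens, vocabulary, lexicon=LEXICON):
    """Build one feature per vocabulary word.

    Opinion words present in the text carry their lexicon score;
    every other word, including all non-opinion words, is set to 0.0.
    """
    present = set(tokens)
    return [lexicon.get(word, 0.0) if word in present else 0.0
            for word in vocabulary]


tokens = "the plot was boring but the acting was great".split()
vocabulary = ["acting", "bad", "boring", "good", "great", "plot"]

label = lexicon_polarity(tokens)
vector = feature_vector(tokens, vocabulary)
```

In this sketch the words "acting" and "plot" occur in the review but still receive zero weight, which is the limitation that motivates enriching the features with non-opinion words.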
One of the feature clustering methods used for sentiment analysis computes the semantic orientation of words in a labeled corpus and groups those words based on predefined ranges of semantic orientation scores. The score is measured based on the Pointwise Mutual Information (PMI) of words in the positive and negative review datasets. As a result, clusters of words are derived and used as features. The main disadvantage of this feature clustering method is that the strength of the polarity of words is under-represented in the vector. Two or more words with similar but high scores will only be represented by a binary value of 1, which is equal to the value given to any two or more words with similar but lower scores. As such, the effect of the significant words in the classification is diminished. In this thesis, the Senti2vec model is proposed to discover polarity clusters from the corpus to be used as features. The aim is to group non-opinion words around opinion words to produce a more effective weighting scheme for the features in the sentiment analysis task. Finally, the thesis focuses on the problem of generating domain-dependent opinion lexicons through semi-supervised learning. It is based on the assumption that generic opinion lexicons such as SentiWordNet are unable to capture the specific characteristics of the domain in order to discriminate among classes. The problem can be defined as assigning polarity to target words based on a given set of opinion words as the seeds. A recent method proposed for this problem constructs a graph where nodes correspond to subjective words and the edges reflect the similarity between those words. The similarity is measured by the co-occurrence of word pairs within the same linguistic unit, such as an n-gram, phrase or sentence. Given that the polarity of a seed word is known, the polarity of target words is derived based on the strength of the edges between the seed word and the target word.
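The PMI-based feature clustering described above can be sketched as follows. This is one common formulation, not necessarily the exact one used in the thesis: semantic orientation is taken as PMI(w, positive) minus PMI(w, negative) over document frequencies, and the add-one smoothing, the range boundaries and all function names are assumptions made for the example.

```python
import bisect
import math
from collections import Counter


def semantic_orientation(pos_docs, neg_docs, smoothing=1.0):
    """Return SO(w) = PMI(w, pos) - PMI(w, neg) for every word.

    With document frequencies the difference simplifies to
    log2((df_pos * N_neg) / (df_neg * N_pos)). The additive smoothing
    is an assumption that avoids division by zero.
    """
    df_pos = Counter(w for doc in pos_docs for w in set(doc))
    df_neg = Counter(w for doc in neg_docs for w in set(doc))
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    scores = {}
    for word in set(df_pos) | set(df_neg):
        p = df_pos[word] + smoothing
        n = df_neg[word] + smoothing
        scores[word] = math.log2((p * n_neg) / (n * n_pos))
    return scores


def cluster_features(tokens, scores, edges=(-1.0, 0.0, 1.0)):
    """Binary vector with one slot per predefined score range.

    A slot is set to 1 if any word of the text falls into that range,
    regardless of how strong the word's score is within the range.
    """
    vector = [0] * (len(edges) + 1)
    for token in tokens:
        if token in scores:
            vector[bisect.bisect_right(edges, scores[token])] = 1
    return vector


pos_docs = [["great", "film"], ["great", "acting"]]
neg_docs = [["boring", "film"], ["bad", "acting"]]
scores = semantic_orientation(pos_docs, neg_docs)
features = cluster_features(["great", "film"], scores)
```

Because each slot only records whether some word fell into its range, two words with very different scores inside the same range produce the same feature value, which is the loss of polarity strength noted above.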
It is argued that Word2vec is far superior in representing the distributional semantics among words in a language. As such, in this thesis a semi-supervised learning method is proposed to learn the polarity of words from seed opinion words by using Word2vec. All proposed methods and models in this thesis are evaluated on a labeled dataset of 50,000 movie reviews. Based on the experiments, the performance of the Dic2vec model is about 2.5% to 6% better than the baseline. In addition, the Senti2vec model shows an improvement of up to 6.5% compared to the baseline. Finally, the proposed semi-supervised method for learning opinion lexicons is better than the recent co-occurrence graph method by more than 12%.
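The general idea of learning word polarity from seed opinion words in an embedding space can be sketched as follows. This is not the method of the thesis, whose details are not given in the abstract: the two-dimensional vectors are hand-made stand-ins for Word2vec embeddings, and the scoring rule (mean cosine similarity to positive seeds minus mean cosine similarity to negative seeds) and the function names are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def induce_polarity(embeddings, pos_seeds, neg_seeds):
    """Score each non-seed word by its closeness to the seed sets.

    Assumed rule: mean similarity to the positive seeds minus
    mean similarity to the negative seeds.
    """
    seeds = set(pos_seeds) | set(neg_seeds)
    scores = {}
    for word, vec in embeddings.items():
        if word in seeds:
            continue
        pos = sum(cosine(vec, embeddings[s]) for s in pos_seeds) / len(pos_seeds)
        neg = sum(cosine(vec, embeddings[s]) for s in neg_seeds) / len(neg_seeds)
        scores[word] = pos - neg
    return scores


# Hand-made toy vectors standing in for trained Word2vec embeddings.
EMBEDDINGS = {
    "good": [1.0, 0.1],
    "bad": [-1.0, 0.1],
    "gripping": [0.8, 0.5],
    "tedious": [-0.7, 0.6],
}

polarity = induce_polarity(EMBEDDINGS, pos_seeds=["good"], neg_seeds=["bad"])
```

With these toy vectors, "gripping" receives a positive score and "tedious" a negative one, so both could be added to a domain-dependent lexicon with the corresponding polarity.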



Additional Metadata

Item Type: Thesis (Doctoral)
Subject: Computational linguistics
Subject: Public opinion - Data processing
Call Number: FSKTM 2018 78
Chairman Supervisor: Azreen Azman, PhD
Divisions: Faculty of Computer Science and Information Technology
Depositing User: Ms. Nur Faseha Mohd Kadim
Date Deposited: 28 Oct 2020 11:06
Last Modified: 07 Jan 2022 08:34
URI: http://psasir.upm.edu.my/id/eprint/83235