Citation
Alksher, Mostafa Ahmed
(2018)
Modeling lexical semantics of terms based on synword identification for idea mining in information retrieval.
Doctoral thesis, Universiti Putra Malaysia.
Abstract
The exponential accumulation of digital information in the form of business or
public data has brought with it great challenges about how to extract more value
from data. Individuals and organizations can no longer rely on human review
and extraction of useful data or ideas from huge volumes of digital data because
it is time-consuming to identify useful ideas within a large amount of textual
information. The idea is an important component of information retrieval
and plays a key role in Idea Mining (IM) from unstructured text. An idea has
been defined as a problem-solution pair (or a means-end pair) within
the same context. IM is introduced as an automatic process of mining new and
innovative ideas from unstructured text by using text-mining tools. Nowadays,
many companies have invested in Text Mining (TM) technology to discover hidden
valuable information from unstructured text, which is very important for decision-making.
Though there is no doubt about great ideas hidden within the huge public and
business data, technically speaking, the major challenge is the idea characterization
and reasoning. The traditional formulation identifies an individual idea
either as an unknown solution to a known problem or as a known solution to an
unknown problem. The idea mining identifier of this model then makes a textual
comparison between a new text (i.e., the input query) and the collection
documents. The output of the comparison takes the form of known words and
unknown words. Known words are the terms that appear in both the new text and
the collection documents, while unknown words are the terms that appear only
in the new text and have no match in the document collection.
Identification of ideas is then made according to the balance between
known and unknown words.
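The known/unknown comparison described above can be sketched as a set operation. This is only a minimal illustration, not the thesis implementation: the function names, the simple regex tokenizer, and the sample texts are assumptions made here, and the balancing measure itself is not reproduced because the abstract does not specify it.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def split_known_unknown(new_text, collection):
    """Split the terms of new_text into known and unknown terms.

    Known terms appear in both the new text and the collection;
    unknown terms appear only in the new text.
    """
    query_terms = tokenize(new_text)
    collection_terms = set()
    for doc in collection:
        collection_terms |= tokenize(doc)
    known = query_terms & collection_terms
    unknown = query_terms - collection_terms
    return known, unknown


# Hypothetical sample data, for illustration only.
known, unknown = split_known_unknown(
    "solar panels reduce cooling costs",
    ["cooling costs are rising", "panels for roofs"],
)
print(sorted(known), sorted(unknown))
```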
However, this existing approach models the problem as an information retrieval
problem, which relies on retrieving part of a text that potentially contains the
pair of the unknown solution to a known problem (or known solution to the unknown
problem). In other words, this existing approach of idea characterization is
syntactical, and it lacks characterization of semantic relationships between terms
in the new text and collection documents. We believe that considering the semantic
dimension of examined words would contribute to improving the degree of
balancing between known and unknown words. This is accomplished by the proposed
balancing model that relies on characterizing the text as a triple of known,
SynWord, and unknown terms.
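A rough sketch of the triple is given below, assuming that a SynWord term is one that occurs only in the new text but is semantically related to some term in the collection. The `RELATED` lookup table, the function name, and the sample terms are hypothetical stand-ins; the abstract does not say which lexical resource or relatedness test the thesis uses.

```python
def classify_terms(query_terms, collection_terms, related):
    """Partition query terms into (known, synword, unknown) sets.

    related maps a term to the set of terms assumed to be
    semantically related to it (a hypothetical lookup table).
    """
    known, synword, unknown = set(), set(), set()
    for term in query_terms:
        if term in collection_terms:
            # Exact match with a collection term.
            known.add(term)
        elif related.get(term, set()) & collection_terms:
            # No exact match, but a related term occurs in the collection.
            synword.add(term)
        else:
            unknown.add(term)
    return known, synword, unknown


# Hypothetical relatedness table and terms, for illustration only.
RELATED = {"automobile": {"car", "vehicle"}, "cheap": {"inexpensive"}}

known, synword, unknown = classify_terms(
    {"automobile", "battery", "graphene"},
    {"car", "battery", "engine"},
    RELATED,
)
print(known, synword, unknown)
```

Under a purely syntactic comparison, "automobile" would be counted as unknown here; the third category moves it out of the unknown set, which changes the balance between the categories.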
The main aim of this research is to propose an idea mining model using a syntactic
approach to extract the overlapping relations between terms that do not
appear in the matching process. It works by comparing part of the abstract
with other text as a context text to find pairs of similar texts from the abstract
and the context text. The (known, unknown, and SynWord) model is proposed
to consider the semantic balancing between candidate text and description text.
SynWord terms in the proposed model are terms that occur only in
the query and are not syntactically detected in the documents being searched, but
that have a semantic relation to terms in the target
documents. The processing of the standard idea mining framework is modified
according to the proposed balancing model. In contrast to previous research,
characterizing the SynWord attribute would help to characterize more
candidate ideas effectively. Mean average precision (MAP) is used as the idea mining
measurement, and the model achieved an overall MAP of 0.967 for identifying
ideas, which is comparatively better than the other approaches.
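For reference, mean average precision can be computed as in the sketch below. It uses the common definition in which each ranking is a list of relevance flags and average precision is normalized by the number of relevant items found in that ranking; whether the thesis uses exactly this variant is an assumption, and the sample rankings are invented for illustration.

```python
def average_precision(relevance):
    """Average precision of one ranked list of boolean relevance flags."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the average precision over all ranked lists."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Hypothetical rankings: True marks a text that contains an idea.
example = [[True, False, True], [False, True]]
map_score = mean_average_precision(example)
print(round(map_score, 3))
```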
Furthermore, this research seeks to identify the pairs of text with similar and
redundant content at higher ranks. Thus, this thesis attempts to improve the
performance of the model by incorporating a dissimilarity measure in the idea mining
measurement to discriminate redundancy in the text. The effectiveness of
the measure is evaluated and the result is promising, showing that the proposed
model can be more effective.
In addition, this research assumes that text position within the abstract has the
potential to be an effective feature for mining ideas. Therefore, this study investigates
the impact of text position on the effectiveness of the idea mining method.
In particular, a text position measure is modeled by modifying the
existing approaches to incorporate a position weighting method in the idea mining
measurements. The proposed model enables calculating the importance of the
position of the candidate idea based on the derived rules. Based on the observed
results, applying the rules in the SynWord model achieved a MAP score of 0.967, which
showed that the conclusion section of the abstract has a higher chance of containing
the idea than the introduction and body sections.