UPM Institutional Repository

Modeling lexical semantics of terms based on synword identification for idea mining in information retrieval


Citation

Alksher, Mostafa Ahmed (2018) Modeling lexical semantics of terms based on synword identification for idea mining in information retrieval. Doctoral thesis, Universiti Putra Malaysia.

Abstract

The exponential accumulation of digital information in the form of business and public data has brought great challenges in extracting more value from data. Individuals and organizations can no longer rely on human review to extract useful data or ideas from huge volumes of digital data, because identifying useful ideas within a large amount of textual information is time-consuming. The idea is an important component in information retrieval and plays a key role in Idea Mining (IM) from unstructured text. An idea has been defined as a pair of problem and solution (or a pair of means and end) within the same context. IM is introduced as an automatic process of mining new and innovative ideas from unstructured text by using text-mining tools. Nowadays, many companies have invested in Text Mining (TM) technology to discover hidden valuable information in unstructured text, which is very important for decision-making. Although great ideas are undoubtedly hidden within the huge volume of public and business data, the major technical challenge is idea characterization and reasoning. The traditional formation of ideas relies on identifying an individual idea as a pair of either an unknown solution to a known problem or a known solution to an unknown problem. The idea mining identifier of this model then makes a textual comparison between a new text (i.e., the input query) and the collection documents. The output of the comparison takes the form of known words and unknown words. Known words are the terms that appear in both the new text and the collection documents, while unknown words are the terms that appear only in the new text and have no matches in the document collection. Ideas are then identified according to the balance between known and unknown words.
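The known/unknown comparison described above can be illustrated with a minimal Python sketch. The tokenizer, the function names, and the sample texts are assumptions made for illustration only; the abstract does not specify how terms are extracted, and a real pipeline would likely add stop-word removal and stemming.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def split_known_unknown(new_text, collection):
    """Split the terms of new_text into known and unknown sets.

    known:   terms that also occur somewhere in the collection
    unknown: terms that occur only in new_text
    """
    query_terms = tokenize(new_text)
    collection_terms = set()
    for doc in collection:
        collection_terms |= tokenize(doc)
    known = query_terms & collection_terms
    unknown = query_terms - collection_terms
    return known, unknown


# Hypothetical sample data, not taken from the thesis.
known, unknown = split_known_unknown(
    "solar panels reduce energy cost",
    ["energy cost is rising", "wind turbines generate power"],
)
```

With this sample input, `known` holds the terms shared with the collection and `unknown` holds the rest. How the two sets are balanced to score an idea is not given in the abstract, so the sketch stops at the split.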
However, this existing approach models the problem as an information retrieval problem, which relies on retrieving the part of a text that potentially contains the pair of an unknown solution to a known problem (or a known solution to an unknown problem). In other words, the existing approach to idea characterization is syntactic, and it does not characterize the semantic relationships between terms in the new text and the collection documents. We believe that considering the semantic dimension of the examined words would improve the degree of balance between known and unknown words. This is accomplished by the proposed balancing model, which characterizes the text as a triple of known, SynWord, and unknown terms. The main aim of this research is to propose an idea mining model that uses a semantic approach to extract the overlapping relations between terms that do not appear in the matching process. It works by comparing part of the abstract with another text, used as a context text, to find pairs of similar texts from the abstract and the context text. The (known, unknown, SynWord) model is proposed to consider the semantic balance between the candidate text and the description text. SynWord terms in the proposed model are terms that exist only in the query and are not syntactically detected in the documents being searched, but that have a semantic relation to terms in the target documents. The processing of the standard idea mining framework is modified according to the newly proposed balancing model. In contrast to previous research, characterizing the SynWord attribute helps to characterize more candidate ideas effectively. Mean average precision (MAP) is used as the idea mining measure, and the model achieved an overall MAP of 0.967 for identifying ideas, which is comparatively better than the other approaches.
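As a rough illustration of the known, SynWord, and unknown triple, the sketch below classifies each query term against a document's terms. The `SYNONYMS` table is a hypothetical, hand-written stand-in for whatever lexical resource the thesis uses (the abstract does not name one), and `classify_terms` is an illustrative name, not the author's implementation.

```python
# Hypothetical synonym table (an assumption for illustration only).
# Each query term maps to terms considered semantically related to it.
SYNONYMS = {
    "price": {"cost"},
    "cut": {"reduce", "lower"},
}


def classify_terms(query_terms, document_terms, synonyms=SYNONYMS):
    """Split query terms into (known, synword, unknown) sets.

    known:   the term itself appears in the document
    synword: the term is absent, but a related term appears in the document
    unknown: neither the term nor any related term appears
    """
    known, synword, unknown = set(), set(), set()
    for term in query_terms:
        if term in document_terms:
            known.add(term)
        elif synonyms.get(term, set()) & document_terms:
            synword.add(term)
        else:
            unknown.add(term)
    return known, synword, unknown


# Hypothetical sample data, not taken from the thesis.
known, synword, unknown = classify_terms(
    {"solar", "cut", "energy", "price"},
    {"energy", "cost", "reduce", "wind"},
)
```

Here "cut" and "price" would count as unknown under a purely syntactic match, but they move into the SynWord set because related terms occur in the document.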
Furthermore, this research seeks to identify pairs of text with similar and redundant content at higher ranks. Thus, this thesis attempts to improve the performance of the model by incorporating a dissimilarity measure into the idea mining measurement to discriminate redundancy in the text. The effectiveness of the measure is evaluated and the result is promising, showing that the proposed model can be more effective. In addition, this research assumes that the position of text within the abstract has the potential to be an effective feature for mining ideas. Therefore, this study investigates the impact of text position on the effectiveness of the idea mining method. In particular, a text position measure is modeled by modifying the existing approaches to incorporate a position weighting method into the idea mining measurements. The proposed model enables calculating the importance of the position of a candidate idea based on the derived rules. Based on the observed results, applying the rules in the SynWord model achieved a MAP score of 0.967, which showed that the conclusion section of the abstract has a higher chance of containing the idea than the introduction and body sections.
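For readers unfamiliar with the metric, the sketch below shows one common way to compute mean average precision over ranked result lists. It assumes every relevant item appears somewhere in each ranked list, and the function names and sample rankings are illustrative; the thesis's exact evaluation setup is not described in this abstract.

```python
def average_precision(ranked_relevance):
    """Average precision for one ranked list of booleans (True = relevant).

    Assumes all relevant items are present in the list, so the sum of
    precision values at relevant ranks is divided by the number of hits.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over several ranked lists."""
    if not runs:
        return 0.0
    return sum(average_precision(run) for run in runs) / len(runs)


# Hypothetical rankings, not results from the thesis.
example_map = mean_average_precision([
    [True, False, True],   # relevant items at ranks 1 and 3
    [False, True],         # relevant item at rank 2
])
```

A MAP close to 1 means that relevant candidates are consistently ranked near the top of each list.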



Additional Metadata

Item Type: Thesis (Doctoral)
Subject: Data mining
Subject: Information retrieval - Computer programs
Subject: Semantics - Mathematical models
Call Number: FSKTM 2018 88
Chairman Supervisor: Azreen Azman, PhD
Divisions: Faculty of Computer Science and Information Technology
Depositing User: Ms. Nur Faseha Mohd Kadim
Date Deposited: 19 Oct 2020 11:04
Last Modified: 04 Jan 2022 08:01
URI: http://psasir.upm.edu.my/id/eprint/83758