UPM Institutional Repository

Enhanced normalization approach to address stop-word complexity in compound-word schema labels


Hossain, Jafreen (2014) Enhanced normalization approach to address stop-word complexity in compound-word schema labels. Masters thesis, Universiti Putra Malaysia.


An extensive review of existing research in the field of schema matching uncovers the significance of semantics in this subject. Both the structural and semantic aspects of schema matching have been topics of research for many years, and strong references are available for both. However, an in-depth analysis of the available approaches suggests there is further scope for improvement in the field of semantic schema matching. Normalization and lexical annotation methods using WordNet have been proposed in several studies. However, the results show comparatively poor accuracy due to the presence of stop-words in schema labels. Stop-words have previously been ignored in most studies, resulting in false-negative conclusions. This research proposes NORMSTOP (NORMalizer of schemata having STOP-words), an improved schema normalization approach that addresses the complexity of stop-words (e.g. 'by', 'at', 'and', 'or') in Compound Word (CW) schema labels. NORMSTOP isolates these labels during the preprocessing stage and resets the base form to a relevant WordNet term, or to an annotatable compound noun, using a combined set of WordNet features such as Attributes, Derivationally Related Forms, and LexNames. When tested on the same real dataset used for the earlier approach, NORMS (NORMalizer of Schemata), NORMSTOP shows up to 13% improvement in annotation recall. This improvement takes the overall schema matching process one step closer to perfect accuracy; without it, results fall short of expectations, especially in today's databases, where stop-words are abundant.
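The isolation step mentioned in the abstract can be pictured with a minimal sketch. This is not the thesis implementation: the function names `split_label` and `isolate_stop_word_labels`, the tokenization rules (camelCase, underscores, hyphens), and the sample labels are all assumptions made for illustration, and the stop-word list contains only the four examples the abstract names. The later step of resetting the base form through WordNet features is not shown.

```python
import re

# Only the four stop-words given as examples in the abstract;
# a real list would be longer (assumption: membership test on lowercase tokens).
STOP_WORDS = {"by", "at", "and", "or"}


def split_label(label):
    """Split a compound-word schema label into lowercase tokens.

    Hypothetical tokenizer: breaks on camelCase boundaries,
    underscores, hyphens, and whitespace.
    """
    # Insert a space at each lower/digit-to-upper boundary: "orderedBy" -> "ordered By".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    return [t.lower() for t in re.split(r"[\s_\-]+", spaced) if t]


def isolate_stop_word_labels(labels):
    """Return the labels that contain at least one stop-word.

    Maps each flagged label to (all tokens, stop-word tokens) so that a
    later normalization stage could treat these labels separately.
    """
    flagged = {}
    for label in labels:
        tokens = split_label(label)
        hits = [t for t in tokens if t in STOP_WORDS]
        if hits:
            flagged[label] = (tokens, hits)
    return flagged
```

For example, calling `isolate_stop_word_labels(["orderedBy", "customer_name", "price_and_tax"])` flags `orderedBy` and `price_and_tax` and leaves `customer_name` for ordinary normalization.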

Additional Metadata

Item Type: Thesis (Masters)
Subject: Data integration (Computer science)
Call Number: FSKTM 2014 26
Chairman Supervisor: Nor Fazlida Mohd Sani, PhD
Divisions: Faculty of Computer Science and Information Technology
Depositing User: Haridan Mohd Jais
Date Deposited: 08 May 2018 03:23
Last Modified: 08 May 2018 03:23
URI: http://psasir.upm.edu.my/id/eprint/60506
