UPM Institutional Repository

A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage


Citation

Ektefa, Mohammadreza (2011) A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage. Masters thesis, Universiti Putra Malaysia.

Abstract

Because integrated data carry richer information, integrating different data sources is a key step in most data warehousing and data mining projects. One of the principal challenges in integrating databases is duplication: the same entity may appear in different formats in different databases, so combining those databases produces duplicate records. Record linkage is a technique for detecting and matching the duplicate records generated during data integration. A variety of record linkage models, with differing steps, have been developed to detect such duplicates, and string similarity measures are widely used in these studies to compare record pairs. However, considering the semantic relatedness between two records, in addition to their string similarity, can also help in detecting duplicate records. Existing record linkage models do not take this into account. To determine how much semantic similarity improves the effectiveness of duplicate detection, this study proposes a similarity measure that combines string and semantic similarity measures. The combination uses a threshold-based method that considers the semantic similarity of each field of the dataset; the threshold determines how much influence semantic similarity has in the final combination algorithm. The combined similarity measure is evaluated on two real-world datasets, Restaurant and Cora, and its effectiveness is measured with several standard evaluation metrics. The experimental results indicate that the combined measure outperforms the string and semantic similarity measures used individually, with an F-measure of 99.1% on the Restaurant dataset and 88.3% on the Cora dataset. Therefore, based on the experimental results, semantic similarity should be taken into account in addition to string similarity in order to detect duplicate records more effectively in record linkage.
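
The abstract does not give the exact combination rule, so the following Python is only a minimal sketch of one way a per-field, threshold-based combination could work. Everything in it is an assumption for illustration: the rule itself (use the semantic score only when it reaches the threshold, otherwise fall back to the string score), the stand-in string measure (`difflib.SequenceMatcher`), the toy synonym table `SYNONYM_GROUPS` used in place of a real semantic resource, and all function names and the default threshold of 0.8. It is not the thesis's algorithm.

```python
from difflib import SequenceMatcher

# Hypothetical toy synonym groups standing in for a real semantic resource.
SYNONYM_GROUPS = [
    {"cafe", "coffeehouse", "coffee shop"},
    {"street", "st", "road"},
]


def string_similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1] (stand-in string measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def semantic_similarity(a: str, b: str) -> float:
    """Toy semantic score: 1.0 for identical or synonymous values, else 0.0."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    for group in SYNONYM_GROUPS:
        if a in group and b in group:
            return 1.0
    return 0.0


def combined_field_similarity(a: str, b: str, threshold: float = 0.8) -> float:
    """Assumed rule: semantic similarity only counts once it reaches the threshold."""
    s_str = string_similarity(a, b)
    s_sem = semantic_similarity(a, b)
    if s_sem >= threshold:
        return max(s_str, s_sem)
    return s_str


def record_similarity(r1: dict, r2: dict, threshold: float = 0.8) -> float:
    """Average the combined per-field scores over the fields both records share."""
    fields = r1.keys() & r2.keys()
    if not fields:
        return 0.0
    total = sum(combined_field_similarity(r1[f], r2[f], threshold) for f in fields)
    return total / len(fields)
```

Under these assumptions, a pair such as "cafe" and "coffeehouse" scores low on string similarity alone but is treated as a match once the semantic score passes the threshold, while unrelated values keep their plain string score.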


Download File

Full text not available from this repository.

Additional Metadata

Item Type: Thesis (Masters)
Subject: Semantic computing
Subject: Semantic integration (Computer systems)
Subject: Data warehousing
Call Number: FSKTM 2011 8
Chairman Supervisor: Fatimah Sidi, PhD
Divisions: Faculty of Computer Science and Information Technology
Depositing User: Ms. Nida Hidayati Ghazali
Date Deposited: 30 Jun 2014 07:17
Last Modified: 30 Jun 2014 07:17
URI: http://psasir.upm.edu.my/id/eprint/19638
