Citation
Mahdi, Osamah Abdul Sattar (2014) A new approach for instance-based schema matching. Masters thesis, Universiti Putra Malaysia.
Abstract
Schema matching is a crucial phase in data integration that aims to find correspondences between schema attributes by utilizing schema information. However, this information is not always available or useful, since it may consist of abbreviations. Consequently, instances can serve as an alternative to schema information. Various instance-based schema matching approaches have been proposed to discover correspondences between schema attributes, but they treat all instances, including numeric ones, as strings. This prevents discovering common patterns or performing statistical computations on the numeric instances. As a consequence, matches go unidentified, especially for attributes with numeric instances, which further reduces the quality of the match results.
This thesis proposes an efficient approach that identifies attribute matches between schemas by fully exploiting the instances. The approach uses pattern recognition to determine attribute matches for numeric and mixed instances, which is achieved by automatically creating regular expressions from the instances. For alphabetic instances, the approach calculates a semantic similarity score using Google similarity to capture the semantic relationships between instances. The proposed approach consists of five main phases: (i) analysing instances, (ii) classifying schema attributes, (iii) extracting the optimal sample size, (iv) identifying instance similarity, and (v) identifying the match.
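To make the regular-expression idea concrete, the following is a minimal sketch of how patterns might be derived from numeric and mixed instances and then used to score a candidate attribute pair. It is an illustration under stated assumptions, not the algorithm from the thesis: the function names (`char_class`, `instance_pattern`, `match_score`), the three character classes, and the sample phone values are all invented for this example.

```python
import re
import string


def char_class(ch):
    """Map one character to a regex class (assumed classes: digit, letter, literal)."""
    if ch in string.digits:
        return r"\d"
    if ch in string.ascii_letters:
        return "[A-Za-z]"
    return re.escape(ch)  # punctuation and spaces are kept as literals


def instance_pattern(value):
    """Generalise one instance into a regex, collapsing runs of the same class."""
    parts = []  # list of [class, run_length]
    for ch in value:
        cls = char_class(ch)
        if parts and parts[-1][0] == cls:
            parts[-1][1] += 1
        else:
            parts.append([cls, 1])
    return "".join(f"{cls}{{{n}}}" if n > 1 else cls for cls, n in parts)


def match_score(source, target):
    """Fraction of target instances that fully match a pattern learned from source."""
    if not target:
        return 0.0
    patterns = [re.compile(p) for p in {instance_pattern(v) for v in source}]
    hits = sum(1 for v in target if any(p.fullmatch(v) for p in patterns))
    return hits / len(target)


# Hypothetical instances of two phone-number attributes from different schemas.
phone_a = ["310-246-1501", "213-467-1108"]
phone_b = ["818-762-1221", "310/859-8744"]
print(instance_pattern("90210"))      # \d{5}
print(match_score(phone_a, phone_b))  # 0.5
```

A high score suggests that the two attributes share a value format, which treating the instances purely as strings would not reveal. A threshold on this score would still be needed to declare a match, and the sketch does not attempt to choose one.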
Three analyses were designed and conducted on two data sets, (i) Restaurant and (ii) Census, with respect to precision (P), recall (R), and F-measure (F). The first analysis identifies the optimal sample size of tuples to be used during the phase of extracting the optimal sample size. Identifying the optimal sample size reduces the number of comparisons between instances, which in turn reduces the processing time of the matching operation. This analysis showed that the optimal sample size is 50% of the actual table size for both data sets. The second analysis aims to show that combining Google similarity and regular expressions, as in our proposed approach, achieves higher accuracy than using Google similarity or regular expressions separately. The results showed that our proposed approach achieved P, R, and F in the range of 93% to 99% for both data sets, whereas Google similarity and regular expressions applied separately achieved P, R, and F in the range of 36% to 74%. The third analysis compares the performance of our proposed approach with that of previous approaches. The results showed that our proposed approach outperformed the previous approaches even though it uses only a sample of the instances, rather than all instances as in previous work.
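For readers unfamiliar with the three measures, the sketch below shows how precision, recall, and F-measure are conventionally computed for a set of proposed attribute matches against a reference set of correct matches. The function name `evaluate` and the attribute pairs are hypothetical and are not taken from the Restaurant or Census data sets.

```python
def evaluate(found, reference):
    """Return (precision, recall, F-measure) for proposed vs. correct matches."""
    found, reference = set(found), set(reference)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical attribute pairs: three proposed matches, four correct ones.
found = {("phone", "tel"), ("city", "town"), ("zip", "name")}
reference = {("phone", "tel"), ("city", "town"),
             ("zip", "postcode"), ("addr", "address")}

p, r, f = evaluate(found, reference)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.67 R=0.50 F=0.57
```

Here two of the three proposed matches are correct (P = 2/3) and two of the four correct matches are found (R = 1/2), so the F-measure, the harmonic mean of the two, is 4/7.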