UPM Institutional Repository

Effective query structuring with ranking using named entity categories for XML retrieval


Roko, Abubakar (2016) Effective query structuring with ranking using named entity categories for XML retrieval. Doctoral thesis, Universiti Putra Malaysia.


A large number of documents on the web are now represented and stored using an XML document structure. Thus, there is a need for effective and user-friendly systems for searching XML documents. Users largely rely on query languages to compose structured queries that extract data from XML documents. However, expressing queries in a query language proves difficult for most users, since it requires learning the language and knowing the underlying data schema. On the other hand, the success of Web search engines has made many users familiar with keyword search, and they therefore prefer a keyword query interface for searching XML data. Keyword queries are inherently ambiguous, and it is difficult for users to state their intentions clearly, so keyword search systems inevitably return irrelevant results, which makes them less effective. Therefore, more effective keyword search systems are needed. Query structuring systems are one class of keyword search system recently used for effective retrieval of XML documents. These systems focus on user query representation, user search intention identification and ranking algorithms to improve keyword search. However, existing systems have several shortcomings. Firstly, they return wrong query representations because they do not take keyword query ambiguity into consideration during query pre-processing. For example, none of the systems consider the following ambiguities: (i) a query term can appear as the text value of different XML nodes with different semantics; (ii) a query term can appear both as a tag name and as part of the text content of some node. Secondly, the systems identify the wrong user search intention. Specifically, they return irrelevant predicates as well as noninformative entity nodes. Thirdly, the systems fail to generate and select the structured query that best matches a user's keyword query. Finally, the systems' ranking functions do not take the semantics of XML tags into account, which leads to irrelevant results.
These problems are addressed as follows. Firstly, an enrichment method has been proposed to investigate whether enriching document content with semantic tags improves the performance of keyword queries. The method employs a Semantic Tags Extraction (STSE) algorithm to extract the semantic tags of an element and an Element Enrichment (EERM) algorithm to enrich the elements. Secondly, an XML Keyword Query Structuring System (XKQSS) has been developed to take over from the user the task of generating structured queries, while retaining the simple keyword query interface that allows users to submit a schema-independent keyword query. The XKQSS uses a Semantic Aware Index scheme (SAIS) to record the proportion of Named Entity Categories (NECs) and an Entity based Query Segmentation (EBQS) method to interpret the user query as a list of keywords and named entities, which resolves ambiguity. Furthermore, it employs a Predicates Identification Algorithm (PIA) and an Entity Identification Algorithm (EIA) to identify the user's search intention. Finally, the system utilizes a query formulation algorithm (QRYF) to select the structured queries that best interpret the user query. Thirdly, a modification of XKQSS called the Ranking Aware XML Keyword Query Structuring System (RAXKQSS) has been developed to return a ranked list of elements as the answer to a user query. The RAXKQSS first introduces an Improved SAIS (ISAIS), which records in the inverted index the Named Entity Category (NEC) of each indexed term, in addition to the usual information such as term frequencies, term positions and the element that contains the term. Then, the system uses a ranking function, rk_BM25TOPF, to assign relevance scores to XML fragments with respect to a query, and an N-gram based Query Segmentation (NBQS) method to interpret the user query as a list of N-grams, which resolves ambiguity. Next, it introduces an Improved PIA (IPIA) and a Compute Return Node Algorithm (CRNA) to return relevant predicates and the return node, respectively. Finally, the system employs a query formulation via node algorithm (QRYFv) to improve the selection of structured queries that best match the user query.
Experiments have been conducted to evaluate the performance of the proposed enrichment method, XKQSS and RAXKQSS. The experimental results show that the enrichment method yields only an insignificant improvement over the baseline in terms of Mean Average Precision (MAP). The results also demonstrate that the proposed XKQSS outperforms XReal and StruX in terms of precision. Moreover, the proposed RAXKQSS achieves higher precision than StruX and SLCA. These results show that the enrichment method is ineffective in improving retrieval performance, while the proposed systems XKQSS and RAXKQSS are effective compared with StruX and SLCA in terms of retrieval performance.
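The two keyword ambiguities named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's pre-processing method: the sample document, the function name find_occurrences and the whitespace-based word matching are all assumptions made for illustration. The sketch only reports where a query term occurs as a tag name and where it occurs as a word in a node's text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
SAMPLE_XML = """
<library>
  <book>
    <title>Jane Eyre</title>
    <author>Charlotte Bronte</author>
  </book>
  <article>
    <title>Keyword search</title>
    <author>Jane Title</author>
  </article>
</library>
"""


def find_occurrences(xml_text, term):
    """Return the node paths where `term` is a tag name and where it is a text word."""
    root = ET.fromstring(xml_text.strip())
    term = term.lower()
    as_tag, as_text = [], []

    def walk(node, path):
        here = f"{path}/{node.tag}"
        # Ambiguity (ii): the term may match a tag name.
        if node.tag.lower() == term:
            as_tag.append(here)
        # Ambiguity (i): the term may appear in the text of nodes with different meanings.
        words = (node.text or "").lower().split()
        if term in words:
            as_text.append(here)
        for child in node:
            walk(child, here)

    walk(root, "")
    return {"tag": as_tag, "text": as_text}


if __name__ == "__main__":
    # "jane" occurs in a title and in an author: same word, different semantics.
    print(find_occurrences(SAMPLE_XML, "jane"))
    # "title" occurs as a tag name and inside an author's text value.
    print(find_occurrences(SAMPLE_XML, "title"))
```

In this toy document, the query term "jane" matches the text of both a title node and an author node, and "title" matches two tag names as well as one author's text value. These are the kinds of cases a query structuring system would need to disambiguate before building a structured query.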

Additional Metadata

Item Type: Thesis (Doctoral)
Subject: XML (Document markup language)
Subject: Information retrieval
Call Number: FSKTM 2016 18
Chairman Supervisor: Associate Professor Shyamala Doraisamy, PhD
Divisions: Faculty of Computer Science and Information Technology
Depositing User: Ms. Nur Faseha Mohd Kadim
Date Deposited: 10 Jul 2019 03:48
Last Modified: 10 Jul 2019 03:48
URI: http://psasir.upm.edu.my/id/eprint/69351