UPM Institutional Repository

Improved feature extraction and lexicon reduction methods classified by support vector machine for Farsi handwritten word recognition system


Citation

Akbarpour, Shahin (2011) Improved feature extraction and lexicon reduction methods classified by support vector machine for Farsi handwritten word recognition system. PhD thesis, Universiti Putra Malaysia.

Abstract

Automatic word recognition has been an intensive research subject for many languages in recent decades, but for some languages it remains far from solved. Word recognition is divided into two types: online and offline. The current research focuses on offline Farsi handwritten word recognition (FHWR). An offline handwritten word recognition system comprises many stages, all of which should be improved in order to enhance the accuracy of the system. In addition, one of the most significant current discussions on enhancing the accuracy of handwritten word recognition concerns reducing the lexicon size. Many studies have been carried out so far, but FHWR has not been researched as thoroughly as Latin or Chinese handwritten systems. Several attempts have been made to address FHWR, most of which focus on image preprocessing and segmentation. Some studies have also addressed feature extraction, classification and lexicon reduction methods. The latest and most successful prior studies used a feature extraction method, a lexicon reduction method, and a hidden Markov model (HMM). However, their recognition rate is limited because the feature extraction method could not adequately describe the Farsi word. Moreover, HMM has some limitations, and several segmentation errors occurred in their lexicon reduction. The current research addresses these problems by improving the recognition rate of FHWR through new feature extraction and lexicon reduction methods and a suitable classifier. In this regard, some special attributes of Farsi manuscripts are considered, such as the stroke directions, the non-unique distribution of black pixels in the binary image of the word, and the number of sub-words and dots in the word.
In addition, several classification methods are tested in order to determine which one, other than HMM, gives the best recognition accuracy. We developed two word recognizer systems to cater for different applications based on lexicon size. For small lexicons, the word recognizer system consists of a new feature extraction method and a classifier; for medium and large lexicons, the system includes new feature extraction and lexicon reduction methods and a classifier. For the performance evaluation of the proposed methods, we use four Farsi handwritten datasets: Farshids' Legal amount, 198-Cities, Iranshahr, and IFN-AUT, which contain 45, 198, 503, and 1080 class-words, respectively. To compare the obtained results with previous works, we need the datasets used by prior researchers, namely AUT and IFN-AUT. The AUT dataset, which included 198 class-words, was not available, so a similar dataset, 198-Cities, was created by random selection of 198 class-words from the Iranshahr dataset. In order to conduct more experiments across different lexicon sizes, the proposed methods were also run on the Farshids' Legal amount and Iranshahr datasets. Moreover, we re-implemented the existing word recognizer and lexicon reduction method so that they could be compared on the same datasets, 198-Cities and IFN-AUT. It may be concluded that our methods, which consist of new feature extraction and lexicon reduction methods and the classifier, perform better than the latest works.
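
The abstract does not spell out how the lexicon reduction works. As a rough illustration only, the sketch below shows one way holistic attributes such as the number of sub-words and dots could be used to prune a lexicon before classification. The names `LexiconEntry` and `reduce_lexicon`, the `tolerance` parameter, and the placeholder entries are all assumptions made for this example; they are not taken from the thesis and should not be read as the author's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    """Hypothetical lexicon record: a class-word and its expected counts."""
    word: str
    subwords: int  # expected number of sub-words (connected parts)
    dots: int      # expected number of dots


def reduce_lexicon(lexicon, subwords, dots, tolerance=1):
    """Keep only entries whose counts are within `tolerance` of the
    counts estimated from the query image (assumed to be given)."""
    return [
        entry for entry in lexicon
        if abs(entry.subwords - subwords) <= tolerance
        and abs(entry.dots - dots) <= tolerance
    ]


# Placeholder entries; real counts would come from the lexicon's word shapes.
lexicon = [
    LexiconEntry("word_a", subwords=2, dots=3),
    LexiconEntry("word_b", subwords=4, dots=1),
    LexiconEntry("word_c", subwords=3, dots=2),
    LexiconEntry("word_d", subwords=1, dots=0),
]

# Counts estimated from a query image: 2 sub-words, 3 dots.
candidates = reduce_lexicon(lexicon, subwords=2, dots=3)
print([entry.word for entry in candidates])
```

In a pipeline of this shape, only the surviving candidates would be passed to the classifier. The tolerance exists because sub-word and dot counts estimated from handwriting can be off by one or more when strokes touch or dots merge.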


Download File

FSKTM 2011 21R.pdf


Additional Metadata

Item Type: Thesis (PhD)
Subject: Support vector machines
Subject: Persian language - Written Persian
Subject: APT (Computer program language)
Call Number: FSKTM 2011 21
Chairman Supervisor: Associate Professor Md. Nasir bin Sulaiman
Divisions: Faculty of Computer Science and Information Technology
Depositing User: Haridan Mohd Jais
Date Deposited: 14 May 2015 07:24
Last Modified: 14 May 2015 07:24
URI: http://psasir.upm.edu.my/id/eprint/26987
