Development of an Isolated Digit Speech Recognition Based on Multilayer Perceptron Model
Mohamad Hussin, Ummu Salmah (2004) Development of an Isolated Digit Speech Recognition Based on Multilayer Perceptron Model. PhD thesis, Universiti Putra Malaysia.
Automatic speech recognition (ASR) has become one of the leading areas of speech technology. ASR research has long emphasized man-machine communication that promises greater ease of use than the traditional keyboard and mouse. Recognizing speech is simple for a human but a very complex process for a machine. Various methods have been introduced to develop efficient ASR systems. The neural network (NN) approach is one of the best known and most widely used in this field, and the multilayer perceptron (MLP) is a popular NN model for ASR. In this study, an MLP with the back-propagation learning algorithm is implemented to perform isolated digit speech recognition for the Malay language. However, one of the current problems faced by the MLP and most NN models in ASR is the long learning time. In addition, the requirement for a high recognition rate in an MLP-based isolated digit speech recognition system is not trivial, because such systems are widely used in many applications. This study therefore focuses on improving the learning time and recognition rate of the MLP for a Malay isolated digit speech recognition system. It proposes three new methods to meet this objective, with improvements made in the preprocessing and recognition phases. In the preprocessing phase, a new endpoint detection method, known as the variance method, is proposed. It is introduced to overcome the disadvantages of the conventional method, which is unstable and whose threshold is difficult to set during silence detection, resulting in a poor recognition rate. Another contribution in the preprocessing phase concerns normalization. Three normalization methods are introduced to normalize the speech data before it is propagated to the NN: exponent, hybrid I and hybrid II.
These methods are compared with four widely used conventional normalization methods: range I, range II, simple and variance. The conventional methods have two limitations. First, some of them, such as the variance and range I methods, produce a good recognition rate but are very slow in the learning phase. Second, others, such as the simple and range II methods, are very fast in the learning phase but produce a low recognition rate. The new normalization methods are therefore proposed to accelerate learning while producing a high recognition rate. In the recognition phase, a simple novel approach is introduced to increase the recognition rate: an adaptive sigmoid function. A typical, fixed sigmoid function is used in the learning phase, whereas in the recognition phase an adaptive sigmoid function is employed, in which the slope of the activation function is adjusted to obtain the highest recognition rate. This study uses 10 Malay words, “sifar” to “sembilan” (“0” to “9”). All utterances were recorded from a single male speaker, and each word was repeated 100 times, giving a data set of 1000 utterances. Four hundred utterances were used in the learning phase and the remaining 600 in the recognition phase. The TI46 standard data set was also used to evaluate the performance of all the proposed methods, using 10 English words, “zero” to “nine” (“0” to “9”). Eight male and female speakers uttered each word 8 times. Hence, the total data set is 1600 utterances for the male and female speakers combined. The male and female data sets are trained separately: 400 male utterances were used during the learning phase, while the other 400 were kept as test data. The same approach is used in the learning and recognition phases for the female data set.
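The adaptive sigmoid idea described above can be illustrated with a minimal sketch. The abstract does not state how the slope is adjusted, so the candidate sweep in `pick_slope` is an assumption made for illustration, not the thesis's procedure. The function names, the bias-free one-hidden-layer network and the toy weights are likewise hypothetical.

```python
import math

def sigmoid(x, slope=1.0):
    """Logistic sigmoid with an adjustable slope (gain), computed stably."""
    z = slope * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(x, w_hidden, w_out, slope):
    """Forward pass of a one-hidden-layer MLP (bias terms omitted for brevity)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)), slope)
              for row in w_hidden]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden)), slope)
            for row in w_out]

def predict(x, w_hidden, w_out, slope):
    """Index of the output unit with the largest activation."""
    out = forward(x, w_hidden, w_out, slope)
    return max(range(len(out)), key=out.__getitem__)

def accuracy(samples, w_hidden, w_out, slope):
    """Fraction of (input, label) pairs classified correctly."""
    correct = sum(predict(x, w_hidden, w_out, slope) == label
                  for x, label in samples)
    return correct / len(samples)

def pick_slope(samples, w_hidden, w_out, candidates):
    """Assumed strategy: keep the trained weights fixed and choose the
    recognition-time slope that gives the highest recognition rate."""
    return max(candidates,
               key=lambda s: accuracy(samples, w_hidden, w_out, s))
```

In this sketch the weights would come from training with a fixed slope (for example 1.0), and only the slope used at recognition time is varied.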
Linear Predictive Coding (LPC) is implemented as the feature extraction method to represent the speech data. The experimental results show that the proposed endpoint detection (variance method) produced promising results in terms of learning time and recognition rate. The proposed normalization methods showed excellent results across all experiments. The adaptive sigmoid function also increased the recognition rate in most of the experiments. Overall, the highest recognition rate for the Malay data set is 99.83% with a convergence time of 82 s. For the TI46 data set, the convergence times for the female and male data sets are 55 s and 111 s, with recognition rates of 96.75% and 94.75% respectively.
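As an illustration of LPC feature extraction, the sketch below computes predictor coefficients for a single speech frame using the autocorrelation method with the Levinson-Durbin recursion. This is one common way to compute LPC and is assumed here. The abstract does not specify the thesis's LPC variant, frame length, windowing or model order, and all function names are hypothetical.

```python
def autocorrelation(frame, order):
    """Autocorrelation values r[0..order] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (a, err), where a[k-1] is the coefficient of x[n-k] in the
    prediction x[n] ~ sum_k a[k-1] * x[n-k], and err is the residual energy.
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err == 0.0:  # silent frame: nothing left to predict
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

def lpc_coefficients(frame, order):
    """LPC predictor coefficients of a frame (no windowing applied here)."""
    coeffs, _ = levinson_durbin(autocorrelation(frame, order), order)
    return coeffs
```

In a recognizer of this kind, the coefficients of successive frames would typically be concatenated or otherwise combined into the input vector of the network; how the thesis does this is not stated in the abstract.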