Pani Prithvi Raj
Natural barriers necessitate technological support. The user-adaptive verbal calculator (UAVC) is an attempt to let differently abled students and researchers perform calculations with their voices. It uses automatic speech recognition in real time and adapts to users' accents by retraining the speech models with data collected during usage.
To test the functionality, the application was initially trained on eight speakers with a very small vocabulary and then tested on a user with a heavy accent. On first use, it produced a 75.83% word error rate (WER). After retraining, it was tested on several utterances from the same person that were disjoint from the training data, and the WER improved to 26.47%. Retraining with a few more utterances reduced the WER to a remarkable 3.76%. The important takeaways are the application's adaptability to the speaker and ambiance and its on-the-fly training while decoding.
With the evolution of machine learning techniques, machines are learning to interact with humans as naturally as possible, and speech recognition is one of the most widely addressed problems in this domain. Speech recognition essentially transforms a speech signal into representational features, to which speech models are applied to find the most probable utterance.
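As a concrete illustration of the front end, the following is a minimal sketch assuming the librosa library and 13 mel-frequency cepstral coefficients (MFCCs) per frame; the article's actual feature set and toolkit are not specified.

```python
# A minimal front-end sketch, assuming librosa; the UAVC's real feature
# extraction pipeline is not published, so these choices are illustrative.
import librosa

def extract_features(wav_path):
    # load audio at 16 kHz, a typical sampling rate for speech recognition
    y, sr = librosa.load(wav_path, sr=16000)
    # 13 mel-frequency cepstral coefficients per analysis frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # delta features capture how the spectrum changes over time
    delta = librosa.feature.delta(mfcc)
    return mfcc, delta  # each of shape (n_coeffs, n_frames)
```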
Words that a typical speaker utters with little effort may demand much greater effort from persons with speech-related disabilities. This article describes a user-adaptive and, more specifically, disability-adaptive implementation of a voice-based calculator. Although the focus here is the calculator, the idea is highly extensible to other applications and fields as well.
The key contributions of this article are as follows:
Despite the current overwhelming success of neural networks, I stuck to conventional techniques because of the complexity of training a neural network on the fly, given the small size of the application in the present work. Overfitting is also far less of a concern with conventional speech recognition techniques.
As with any machine learning pipeline, speech recognition consists of two sections. The front end, known as feature extraction, converts the speech signal into a distinctive numerical representation. The back end, Viterbi decoding, is a widely used solution to the decoding problem of hidden Markov models (HMMs) and is used extensively in speech decoding. It is a dynamic programming method that finds the single best state sequence among many possibilities by keeping, at each step, only the best-scoring (maximum-probability or minimum-cost) path into each state. A simple depiction is given in Fig. 1.
Fig 1 The Viterbi decoding concept. A set of the most probable paths is developed based on the probabilities $\delta_t(i)$ of transitioning to a particular state $S_j$ at time $t+1$ from any of the states at time $t$. mpp: most probable path; Prob: probability.
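In standard HMM notation, with transition probabilities $a_{ij}$ and observation likelihoods $b_j(o_t)$, the recurrence sketched in Fig. 1 takes the textbook form

$$\delta_{t+1}(j) = \Big[\max_{1 \leq i \leq N} \delta_t(i)\, a_{ij}\Big]\, b_j(o_{t+1}), \qquad \psi_{t+1}(j) = \arg\max_{1 \leq i \leq N} \delta_t(i)\, a_{ij},$$

where $\psi_{t+1}(j)$ stores the best predecessor of state $S_j$ for the final traceback.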
The toy example depicted in Fig. 2 illustrates Viterbi decoding in a simple manner. Given a sequence of features, the underlying sequence of binary numbers has to be determined. Starting from an empty state, the next possible states are determined at each time step based on the trained models and the incoming feature.
Fig 2 A Viterbi decoding example. Beginning from an empty state, the possible next states are evaluated and scored. Because each state keeps only its single best predecessor, a set of the most probable paths develops over time. At the end of the utterance, the best-cost state is selected and traced back to obtain the underlying sequence.
For each state, we add the cost of the transition into that state and the cost of observing the feature from that state to the cost accumulated so far. For instance, the cost of state A is the sum of the start-state cost (0), the transition cost (4), and the observation cost of the feature vector (2). This continues until the end of the utterance, where the state with the minimum cost is identified and its past states are traced back to recover the underlying pattern of binary numbers.
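The following is a minimal Python sketch of this minimum-cost Viterbi pass; the state space, cost tables, and array layout are illustrative assumptions rather than the UAVC's actual code.

```python
# Minimum-cost Viterbi over additive costs, mirroring the toy example:
# keep one best predecessor per state, then trace back from the cheapest
# final state. All cost values supplied by the caller are illustrative.
import numpy as np

def viterbi_min_cost(init_cost, trans_cost, obs_cost):
    """init_cost: (S,), trans_cost: (S, S), obs_cost: (T, S) arrays."""
    T, S = obs_cost.shape
    delta = np.empty((T, S))             # best accumulated cost per state
    back = np.zeros((T, S), dtype=int)   # best predecessor per state
    delta[0] = init_cost + obs_cost[0]
    for t in range(1, T):
        for j in range(S):
            cand = delta[t - 1] + trans_cost[:, j]  # cost via each parent
            back[t, j] = np.argmin(cand)            # keep the unique parent
            delta[t, j] = cand[back[t, j]] + obs_cost[t, j]
    # trace back from the minimum-cost final state
    path = [int(np.argmin(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]   # e.g., the underlying 0/1 sequence
```

Working with additive costs (negative log probabilities) avoids numerical underflow and matches the minimum-cost view used in Fig. 2.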
Most of the words in the vocabulary shown in Table 1 are commonplace in a commercial scientific calculator. However, there are several additional words. "Open" and "close" are used for giving matrix inputs to the UAVC. "Point" is used for speaking fractional numbers. The word "base" is used with logarithms, and the number "ten" is included for entering exponents in scientific notation. The word "equals" is significant here, as it marks the end of a calculation input and triggers the computation that follows the decoding.
Table 1. The UAVC vocabulary.
There are two sets of similar-sounding words in the vocabulary that are likely to drive up the error rate. They are
One way the issue of similar pronunciations could be resolved is by training more on these specific words. Nonetheless, a possibility of incorrect decoding remains, which is, admittedly, a minor limitation of this work. However, the model is flexible enough to accommodate these words along with constants, unit conversions, and so on.
As depicted in Fig. 3, the UAVC takes in spoken standard math input, performs the computation, and returns the answer. As the decoder figures out the words, once it finds the word "equals" in the decoded text, it sends the words decoded so far to a module that deciphers the meaning of each decoded word, performs the respective math operation, and returns the answer.
Fig 3 The entire UAVC processing flow.
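A sketch of that computation module is given below. The word-to-symbol table and the spoken operator words ("into" and "by") are my illustrative assumptions; only the role of "equals" as the trigger comes from the design described above.

```python
# Hypothetical word-to-symbol table; the real Table 1 mapping may differ.
WORD2SYM = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "plus": "+", "minus": "-", "into": "*", "by": "/", "point": ".",
}

def compute(decoded_words):
    """Accumulate symbols until 'equals', then evaluate the expression."""
    expr = []
    for word in decoded_words:
        if word == "equals":                 # end-of-input trigger
            return eval("".join(expr))       # sketch only; a real parser is safer
        expr.append(WORD2SYM[word])
    return None                              # still waiting for 'equals'

print(compute("two plus three equals".split()))  # -> 5
```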
The adaptability and robustness of the speech recognition come mainly from the fact that it is retrained in the context of a person and usage. As the application receives data for decoding, it simultaneously stores those data for retraining purposes. However, the user may choose to disable this feature.
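A minimal sketch of this collect-while-decoding step follows; the file layout, the soundfile dependency, and the opt-out flag are assumptions for illustration.

```python
# Save each decoded utterance with its transcription for later retraining.
# Directory structure and naming here are illustrative assumptions.
import soundfile as sf
from pathlib import Path

def store_for_retraining(audio, sr, transcript, user, collect=True):
    if not collect:                          # honor the user's opt-out
        return
    user_dir = Path("retrain_data") / user
    user_dir.mkdir(parents=True, exist_ok=True)
    idx = len(list(user_dir.glob("*.wav")))  # next free index
    sf.write(str(user_dir / f"{idx:04d}.wav"), audio, sr)        # raw audio
    (user_dir / f"{idx:04d}.txt").write_text(transcript)         # verified label
```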
A tiny prototype of this idea was built to obtain a working system for demonstration; therefore, the example considered here differs slightly from the intended use. The application is trained on eight speakers with clean speech, an extremely small training set, merely to establish an initial setup. The digits (zero to nine) and the four basic math operations (+, –, ×, and /) form the demonstration vocabulary; however, the application extends seamlessly to the intended number of words.
For example, let the user name be ramu. If the user utters "two plus three equals," the decoder attempts to decode it with the existing trained model. There are two possible cases: the decoding result is correct and the user confirms it (Fig. 4), or the result is incorrect and the user inputs the correct words (Fig. 5). In both cases, the utterance and its verified transcription become available for retraining.
Fig 4 A UAVC demonstration when the decoding result is correct.
Fig 5 A UAVC demonstration when the decoding result is incorrect.
This process is repeated several times. The collected data are then used to retrain the model, after which the application is adapted to assist that particular speaker.
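The following is a minimal sketch of per-word retraining, assuming the hmmlearn library and one Gaussian HMM per vocabulary word; the article's actual toolchain and model topology are not specified.

```python
# Per-word HMM retraining sketch; 5 states and diagonal covariances are
# illustrative choices, not the article's. Feature arrays are expected as
# (n_frames, n_coeffs), i.e., the transpose of librosa's MFCC output.
import numpy as np
from hmmlearn import hmm

def retrain_word_model(feature_seqs):
    """feature_seqs: list of (n_frames, n_coeffs) MFCC arrays for one word."""
    X = np.vstack(feature_seqs)                    # stack all utterances
    lengths = [f.shape[0] for f in feature_seqs]   # per-utterance frame counts
    model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)                          # Baum-Welch reestimation
    return model

def recognize_word(models, features):
    """Pick the word whose model gives the highest log-likelihood."""
    return max(models, key=lambda w: models[w].score(features))
```

Keeping one small model per word is what makes on-the-fly retraining fast: only the models for freshly collected words need to be refit.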
The WER is a negative measure of accuracy (accuracy = 100% − WER), so the two are reported interchangeably here. It measures how many words are decoded incorrectly across a set of test utterances.
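Concretely, with $S$ substitutions, $D$ deletions, and $I$ insertions measured against a reference transcript of $N$ words, the standard definition is

$$\mathrm{WER} = \frac{S + D + I}{N} \times 100\%,$$

so, for example, the 75.83% WER reported earlier corresponds to a 24.17% accuracy.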
A new user with a speech disability was chosen for testing. With the aforementioned initial setup, the application achieved an accuracy of just 24.17%. It was then trained with 25 utterances from the same speaker recorded under various ambiances and tested again, under different conditions, on a test set mutually exclusive with the training data. The accuracy rose to 73.53%.
When trained on 50 such utterances (again collected during testing), the application reached 96.24% accuracy, and, with 75 such utterances, it gave an almost 0% WER (Fig. 6). Because HMM-based speech decoding was used, the system did not exhibit the overfitting issues that can arise with neural networks.
Fig 6 The WER trend with the number of user samples.
The most important takeaway from this work is that this application can be successfully trained to personalize it for a single or multiple speakers as required. This idea works across any user without a loss of generality since no prerequisites are assumed from the user. Therefore, although the accuracy and WER with respect to testing on one person are presented, the trend will remain more or less the same, irrespective of the user’s accent, language, or disabilities.
The UAVC learns the words as well as the speaker, which is useful for several applications. For instance, this technique could be used in voice-and-words-based security systems, where access is granted only when a particular authorized person speaks a specific utterance.
This application was developed and demonstrated to speech professionals as well as the student community, and it was highly appreciated. One expert suggested building speech models for combinations of numbers and operators; however, since the retraining needs to be very fast, I resorted to individual word models. The person with a heavy accent discussed in the "Results and Inferences" section was also a tester. Despite initial discomfort at having to input the correct words most of the time, the user felt comfortable after the system was trained on several of his own utterances.
In this article, speech recognition is used for a novel application. The UAVC, an assistive technology based on speech recognition, is demonstrated as an interface to the scientific calculator for researchers and educators with physical disabilities. On-the-fly training with user data is exploited, and the idea is extensible to other techniques, languages, and so on. If implemented widely, this technology can become a great aid for the scientific community. The complete algorithm is presented with a flowchart, and a demonstrated example validates its utility. Feedback from experts and a user is also provided.
Pani Prithvi Raj (paniprithviraj@smail.iitm.ac.in) is pursuing his doctoral studies at the Indian Institute of Technology (IIT) Madras, Chennai 600042, India. His research interests include developing hardware solutions for various paradigms of speech technology, beginning with speech recognition. He is with the Integrated Circuits and Systems Group of the Department of Electrical Engineering at IIT Madras. He has been a Graduate Student Member of the IEEE Madras Section since 2017.
Digital Object Identifier 10.1109/MPOT.2020.3027245