Pani Prithvi Raj
Natural barriers necessitate technological support. The user-adaptive verbal calculator (UAVC) is an attempt to let differently abled students and researchers perform calculations with their voices. It uses automatic speech recognition in real time and adapts to users' accents by retraining the speech models with data collected during usage.
To test the functionality, the application was initially trained on eight speakers with a very small vocabulary and then tested on a user with a heavy accent. On first use, it produced a 75.83% word error rate (WER). After retraining, it was tested on several utterances from the same person that were disjoint from the training data, and the WER improved to 26.47%. Retraining with a few more utterances reduced the WER to a remarkable 3.76%. The important takeaways are the application's adaptability to the speaker and ambiance and its on-the-fly training while decoding.
With the evolution of machine learning techniques, machines are learning to interact with humans as naturally as possible, and speech recognition is one of the most widely addressed problems in this domain. Speech recognition essentially transforms a speech signal into representational features, to which speech models are applied to find the most probable utterance.
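As a concrete illustration of the front end, the following is a minimal sketch assuming the librosa library and 13 mel-frequency cepstral coefficients (MFCCs) per frame; the article's actual feature set and toolkit are not specified.

```python
# A minimal front-end sketch, assuming librosa; the UAVC's real feature
# extraction pipeline is not published, so these choices are illustrative.
import librosa

def extract_features(wav_path):
    # load audio at 16 kHz, a typical sampling rate for speech recognition
    y, sr = librosa.load(wav_path, sr=16000)
    # 13 mel-frequency cepstral coefficients per analysis frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # delta features capture how the spectrum changes over time
    delta = librosa.feature.delta(mfcc)
    return mfcc, delta  # each of shape (n_coeffs, n_frames)
```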
Words that a typical speaker utters with little effort may demand much greater effort from persons with speech-related disabilities. This article describes a user-adaptive and, more specifically, disability-adaptive implementation of a voice-based calculator. Although the focus here is the calculator, the idea is highly extensible to other applications and fields as well.
The key contributions of this article are as follows:
Despite the current overwhelming success of neural networks, I stuck to conventional techniques because of the complexity of training a neural network on the fly, given the small size of the application in the present work. Overfitting is also far less of a concern with conventional speech recognition techniques.
As with any machine learning pipeline, speech recognition consists of two sections. The front end, known as feature extraction, converts the speech signal into a distinctive numerical representation. The back end, Viterbi decoding, is a widely used solution to the decoding problem of hidden Markov models (HMMs) and is used extensively in speech decoding. It is a dynamic programming method that finds the single best state sequence among many possibilities by keeping, at each step, only the best-scoring (maximum-probability or minimum-cost) path into each state. A simple depiction is given in Fig. 1.
Fig 1 The Viterbi decoding concept. A set of the most probable paths is developed based on the probabilities $\delta_t(i)$ of transitioning to a particular state $S_j$ at time $t+1$ from any of the states at time $t$. mpp: most probable path; Prob: probability.
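In standard HMM notation, with transition probabilities $a_{ij}$ and observation likelihoods $b_j(o_t)$, the recurrence sketched in Fig. 1 takes the textbook form

$$\delta_{t+1}(j) = \Big[\max_{1 \leq i \leq N} \delta_t(i)\, a_{ij}\Big]\, b_j(o_{t+1}), \qquad \psi_{t+1}(j) = \arg\max_{1 \leq i \leq N} \delta_t(i)\, a_{ij},$$

where $\psi_{t+1}(j)$ stores the best predecessor of state $S_j$ for the final traceback.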
The toy example depicted in Fig. 2 illustrates Viterbi decoding in a simple manner. Given a sequence of features, the underlying sequence of binary numbers has to be determined. Starting from an empty state, the next possible states are determined at each time step based on the trained models and the incoming feature.
Fig 2 A Viterbi decoding example. Beginning from an empty state, the possible next states are evaluated and scored. Because each state keeps only its single best predecessor, a set of the most probable paths develops over time. At the end of the utterance, the best-cost state is selected and traced back to obtain the underlying sequence.
For each state, we add the cost of the transition into that state and the cost of observing the feature from that state to the cost accumulated so far. For instance, the cost of state A is the sum of the start-state cost (0), the transition cost (4), and the observation cost of the feature vector (2). This continues until the end of the utterance, where the state with the minimum cost is identified and its past states are traced back to recover the underlying pattern of binary numbers.
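The following is a minimal Python sketch of this minimum-cost Viterbi pass; the state space, cost tables, and array layout are illustrative assumptions rather than the UAVC's actual code.

```python
# Minimum-cost Viterbi over additive costs, mirroring the toy example:
# keep one best predecessor per state, then trace back from the cheapest
# final state. All cost values supplied by the caller are illustrative.
import numpy as np

def viterbi_min_cost(init_cost, trans_cost, obs_cost):
    """init_cost: (S,), trans_cost: (S, S), obs_cost: (T, S) arrays."""
    T, S = obs_cost.shape
    delta = np.empty((T, S))             # best accumulated cost per state
    back = np.zeros((T, S), dtype=int)   # best predecessor per state
    delta[0] = init_cost + obs_cost[0]
    for t in range(1, T):
        for j in range(S):
            cand = delta[t - 1] + trans_cost[:, j]  # cost via each parent
            back[t, j] = np.argmin(cand)            # keep the unique parent
            delta[t, j] = cand[back[t, j]] + obs_cost[t, j]
    # trace back from the minimum-cost final state
    path = [int(np.argmin(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]   # e.g., the underlying 0/1 sequence
```

Working with additive costs (negative log probabilities) avoids numerical underflow and matches the minimum-cost view used in Fig. 2.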
Most of the words in the vocabulary shown in Table 1 are commonplace in a commercial scientific calculator. However, there are several additional words. "Open" and "close" are used for giving matrix inputs to the UAVC. "Point" is used for speaking fractional numbers. The word "base" is used with logarithms, and the number "ten" is included for entering exponents in scientific notation. The word "equals" is significant here, as it marks the end of a calculation input and triggers the computation that follows the decoding.
Table 1. The UAVC vocabulary.
There are two sets of similar-sounding words in the vocabulary that are likely to drive up the error rate. They are
One way the issue of similar pronunciations could be resolved is by training more on these specific words. Nonetheless, a possibility of incorrect decoding remains, which is, admittedly, a minor limitation of this work. However, the model is flexible enough to accommodate these words along with constants, unit conversions, and so on.
As depicted in Fig. 3, the UAVC takes in spoken standard math input, performs the computation, and returns the answer. As the decoder figures out the words, once it finds the word "equals" in the decoded text, it sends the words decoded so far to a module that deciphers the meaning of each decoded word, performs the respective math operation, and returns the answer.
Fig 3 The entire UAVC processing flow.
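A sketch of that computation module is given below. The word-to-symbol table and the spoken operator words ("into" and "by") are my illustrative assumptions; only the role of "equals" as the trigger comes from the design described above.

```python
# Hypothetical word-to-symbol table; the real Table 1 mapping may differ.
WORD2SYM = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "plus": "+", "minus": "-", "into": "*", "by": "/", "point": ".",
}

def compute(decoded_words):
    """Accumulate symbols until 'equals', then evaluate the expression."""
    expr = []
    for word in decoded_words:
        if word == "equals":                 # end-of-input trigger
            return eval("".join(expr))       # sketch only; a real parser is safer
        expr.append(WORD2SYM[word])
    return None                              # still waiting for 'equals'

print(compute("two plus three equals".split()))  # -> 5
```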
The adaptability and robustness of the speech recognition come mainly from the fact that it is retrained in the context of a person and usage. As the application receives data for decoding, it simultaneously stores those data for retraining purposes. However, the user may choose to disable this feature.
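A minimal sketch of this collect-while-decoding step follows; the file layout, the soundfile dependency, and the opt-out flag are assumptions for illustration.

```python
# Save each decoded utterance with its transcription for later retraining.
# Directory structure and naming here are illustrative assumptions.
import soundfile as sf
from pathlib import Path

def store_for_retraining(audio, sr, transcript, user, collect=True):
    if not collect:                          # honor the user's opt-out
        return
    user_dir = Path("retrain_data") / user
    user_dir.mkdir(parents=True, exist_ok=True)
    idx = len(list(user_dir.glob("*.wav")))  # next free index
    sf.write(str(user_dir / f"{idx:04d}.wav"), audio, sr)        # raw audio
    (user_dir / f"{idx:04d}.txt").write_text(transcript)         # verified label
```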
A tiny prototype of this idea was built to obtain a working system for demonstration; therefore, the example considered here differs slightly from the intended use. The application is trained on eight speakers with clean speech, an extremely small training set, merely to establish an initial setup. The digits (zero to nine) and the four basic math operations (+, –, ×, and /) form the demonstration vocabulary; however, the application extends seamlessly to the intended number of words.
For example, let the user name be ramu. If the user utters "two plus three equals," the decoder attempts to decode it with the existing trained model. There are two possible cases: the decoding result is correct and the user confirms it (Fig. 4), or the result is incorrect and the user inputs the correct words (Fig. 5). In both cases, the utterance and its verified transcription become available for retraining.
Fig 4 A UAVC demonstration when the decoding result is correct.
Fig 5 A UAVC demonstration when the decoding result is incorrect.
This process is repeated several times. The collected data are then used to retrain the model, after which the application is adapted to assist that particular speaker.
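The following is a minimal sketch of per-word retraining, assuming the hmmlearn library and one Gaussian HMM per vocabulary word; the article's actual toolchain and model topology are not specified.

```python
# Per-word HMM retraining sketch; 5 states and diagonal covariances are
# illustrative choices, not the article's. Feature arrays are expected as
# (n_frames, n_coeffs), i.e., the transpose of librosa's MFCC output.
import numpy as np
from hmmlearn import hmm

def retrain_word_model(feature_seqs):
    """feature_seqs: list of (n_frames, n_coeffs) MFCC arrays for one word."""
    X = np.vstack(feature_seqs)                    # stack all utterances
    lengths = [f.shape[0] for f in feature_seqs]   # per-utterance frame counts
    model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)                          # Baum-Welch reestimation
    return model

def recognize_word(models, features):
    """Pick the word whose model gives the highest log-likelihood."""
    return max(models, key=lambda w: models[w].score(features))
```

Keeping one small model per word is what makes on-the-fly retraining fast: only the models for freshly collected words need to be refit.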
The WER is a negative measure of accuracy (accuracy = 100% − WER), so the two are reported interchangeably here. It measures how many words are decoded incorrectly across a set of test utterances.
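Concretely, with $S$ substitutions, $D$ deletions, and $I$ insertions measured against a reference transcript of $N$ words, the standard definition is

$$\mathrm{WER} = \frac{S + D + I}{N} \times 100\%,$$

so, for example, the 75.83% WER reported earlier corresponds to a 24.17% accuracy.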
A new user with a speech disability was chosen for testing. With the aforementioned initial setup, the application achieved an accuracy of just 24.17%. It was then trained with 25 utterances from the same speaker recorded under various ambiances and tested again, under different conditions, on a test set mutually exclusive with the training data. The accuracy rose to 73.53%.
When trained on 50 such utterances (again collected during testing), the application reached 96.24% accuracy, and, with 75 such utterances, it gave an almost 0% WER (Fig. 6). Because HMM-based speech decoding was used, the system did not exhibit the overfitting issues that can arise with neural networks.
Fig 6 The WER trend with the number of user samples.
The most important takeaway from this work is that this application can be successfully trained to personalize it for a single or multiple speakers as required. This idea works across any user without a loss of generality since no prerequisites are assumed from the user. Therefore, although the accuracy and WER with respect to testing on one person are presented, the trend will remain more or less the same, irrespective of the user’s accent, language, or disabilities.
The UAVC learns the words as well as the speaker, which is useful for several applications. For instance, this technique could be used in voice-and-words-based security systems, where access is granted only when a particular authorized person speaks a specific utterance.
This application was developed and demonstrated to speech professionals as well as the student community, and it was highly appreciated. One expert suggested building speech models for combinations of numbers and operators; however, since the retraining needs to be very fast, I resorted to individual word models. The person with a heavy accent discussed in the "Results and Inferences" section was also a tester. Despite initial discomfort at having to input the correct words most of the time, the user felt comfortable after the system was trained on several of his own utterances.
In this article, speech recognition is used for a novel application. The UAVC, an assistive technology based on speech recognition, is demonstrated as an interface to the scientific calculator for researchers and educators with physical disabilities. On-the-fly training with user data is exploited, and the idea is extensible to other techniques, languages, and so on. If implemented widely, this technology can become a great aid for the scientific community. The complete algorithm is presented with a flowchart, and a demonstrated example validates its utility. Feedback from experts and a user is also provided.
Pani Prithvi Raj (paniprithviraj@smail.iitm.ac.in) is pursuing his doctoral studies at the Indian Institute of Technology (IIT) Madras, Chennai 600042, India. His research interests include developing hardware solutions for various paradigms of speech technology, beginning with speech recognition. He is with the Integrated Circuits and Systems Group of the Department of Electrical Engineering at IIT Madras. He has been a Graduate Student Member of the IEEE Madras Section since 2017.
Digital Object Identifier 10.1109/MPOT.2020.3027245