This learning resource is about the automatic conversion of spoken language into text, which can be stored as documents or processed as commands to control devices, e.g. for people with disabilities or elderly people, or, in a commercial setting, to order goods and services by voice. The learning resource is based on the Open Community Approach, so the tools used are Open Source to ensure that learners have access to them.
(Applications of Speech Recognition) Analyse the possible applications of speech recognition and identify the challenges of these applications!
(Human Speech Recognition) Compare human comprehension of speech with the algorithmic speech recognition approach. What are the similarities and differences between human and algorithmic speech recognition?
(Speech and Detection of Emotions) Speech contains more information than the text it encodes. Is it possible to detect emotions in speech with methods developed in computer science?
What are the similarities and differences between text recognition and emotion recognition in speech analysis?
What are possible application areas in digital assistants for both speech recognition and emotion recognition?
Analyze the different types of information systems and identify areas of application for speech recognition; include mobile devices in your considerations!
(History) Analyse the history of speech recognition and compare the steps of development with current applications. Identify the major development steps that were required for today's applications of speech recognition!
(Risk Literacy) Identify possible areas of risk and possible risk mitigation strategies when speech recognition is implemented in mobile devices, or as voice control for the Internet of Things in general. What capacity building measures are required for business, research and development?
(Commercial Data Harvesting) Apply the concept of speech recognition to commercial data harvesting. What are the potential benefits of generating tailored advertisements for users according to their generated profile? How does speech recognition contribute to the user profile? What is the difference between offline and online speech recognition systems with respect to what is submitted to remote servers: locally recognized text or the raw audio itself (see the offline/online sketch below the learning tasks)?
(Context Awareness of Speech Recognition) The word "Fire" spoken with a candle in your hand or with a burning house in the background creates a different context and different expectations in the people listening to what you are going to say. Explain why context awareness can be helpful to optimize recognition correctness. How can a speech recognition system detect the context of the speech by itself, i.e. without a user setting that switches to a dictation mode, e.g. for medical reports on X-ray images?
(Performance) Explain why the performance and accuracy of speech recognition are relevant in many applications. Discuss applications in cars or in vehicles in general. Which voice commands can be applied in a traffic situation, and which commands, if not accurately recognized, could cause trouble or even an accident for the driver? Order the theoretical applications of speech recognition (e.g. "turn right at the crossing", "switch music on/off", ...) in terms of the performance and accuracy required of currently available technologies to perform the command in an acceptable way.
Explain how the concept of speech recognition can support people with disabilities[1] in navigating a WebApp or an offline AppLSAC in digital learning environments.
(Size of Vocabulary) Explain how the size of the recognized vocabulary determines the precision of recognition (see the perplexity note below the learning tasks).
(People with Disabilities)[2] Explore the available Open Source frameworks and offline infrastructure for speech recognition that work without sending audio streams to a remote server for processing. Identify options for controlling robots, or for voice control in the context of Ambient Assisted Living[3].
Collaborative development of the Open Source code base of the speech recognition infrastructure,
collaborative development of a domain-specific vocabulary for speech recognition in specific application scenarios,
creation of Open Educational Resources that support learners in using speech recognition and Open Source developers in integrating Open Source frameworks into learning environments.
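For the vocabulary-size task above, a useful starting point is the notion of perplexity, the standard language-model quantity from the speech recognition literature (e.g. Jelinek). It can be read as the average number of word candidates the recognizer has to choose between at each position; a larger recognized vocabulary typically raises perplexity and thus the number of acoustically confusable hypotheses, which lowers recognition precision. The definition below is the textbook one and is not tied to any particular system.

```latex
% Perplexity of a language model P evaluated on a word sequence W = w_1 ... w_N.
% It can be interpreted as the average branching factor the recognizer faces.
PP(W) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}}
      = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots, w_N)}}
```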
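The sketch below illustrates the offline/online distinction raised in the data harvesting and disability tasks above: with an offline engine the audio never leaves the device, while an online service receives the raw audio on a remote server. It is a minimal example assuming the Python SpeechRecognition package with PocketSphinx installed; the input file name is hypothetical and any other offline engine could be substituted.

```python
# Minimal sketch: offline vs. online recognition of the same WAV file.
# Assumes: pip install SpeechRecognition pocketsphinx
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.AudioFile("command.wav") as source:   # hypothetical input file
    audio = recognizer.record(source)         # read the whole file into memory

# Offline path: PocketSphinx decodes locally, no audio leaves the device.
try:
    print("offline:", recognizer.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("offline: could not understand audio")

# Online path: the audio is sent to the Google Web Speech API on a remote
# server, which is exactly the privacy trade-off of the data harvesting task.
try:
    print("online:", recognizer.recognize_google(audio))
except (sr.UnknownValueError, sr.RequestError) as error:
    print("online:", error)
```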
Speech recognition is the interdisciplinary subfield of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research from the fields of linguistics, computer science, and electrical engineering.
Some speech recognition systems require "training" (also called "enrollment") where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker independent"[4] systems. Systems that use training are called "speaker dependent".
Speech recognition applications include voice user interfaces such as voice dialing (e.g. "call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, search (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g. a radiology report), determining speaker characteristics,[5] speech-to-text processing (e.g., word processors, emails, and generating a string-searchable transcript from an audio track), and aircraft (usually termed direct voice input).
The term voice recognition[6][7][8] or speaker identification[9][10] refers to identifying the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice or it can be used to authenticate or verify the identity of a speaker as part of a security process.
From the technology perspective, speech recognition has a long history with several waves of major innovations. Most recently, the field has benefited from advances in deep learning and big data. The advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the worldwide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems.
For language learning, speech recognition can be useful for learning a second language. It can teach proper pronunciation, in addition to helping a person develop fluency with their speaking skills.[11]
Students who are blind (see Blindness and education) or have very low vision can benefit from using the technology to convey words and then hear the computer recite them, as well as use a computer by commanding with their voice, instead of having to look at the screen and keyboard.[12]
Students who are physically disabled or suffer from repetitive strain injury or other injuries to the upper extremities can be relieved from having to worry about handwriting, typing, or working with a scribe on school assignments by using speech-to-text programs. They can also utilize speech recognition technology to freely enjoy searching the Internet or using a computer at home without having to physically operate a mouse and keyboard.[12]
Speech recognition can allow students with learning disabilities to become better writers. By saying the words aloud, they can increase the fluidity of their writing, and be alleviated of concerns regarding spelling, punctuation, and other mechanics of writing.[13] Also, see Learning disability.
Use of voice recognition software, in conjunction with a digital audio recorder and a personal computer running word-processing software, has proven positive for restoring damaged short-term memory capacity in stroke and craniotomy patients.
Popular speech recognition conferences held each year or two include SpeechTEK and SpeechTEK Europe, ICASSP, Interspeech/Eurospeech, and the IEEE ASRU. Conferences in the field of natural language processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. Important journals include the IEEE Transactions on Speech and Audio Processing (later renamed IEEE Transactions on Audio, Speech and Language Processing and since Sept 2014 renamed IEEE/ACM Transactions on Audio, Speech and Language Processing—after merging with an ACM publication), Computer Speech and Language, and Speech Communication.
Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner can be useful to acquire basic knowledge, but may not be fully up to date (1993). Other good sources are "Statistical Methods for Speech Recognition" by Frederick Jelinek and "Spoken Language Processing" (2001) by Xuedong Huang et al. More up to date are "Computer Speech" by Manfred R. Schroeder (second edition, 2004) and "Speech Processing: A Dynamic and Optimization-Oriented Approach" (2003) by Li Deng and Doug O'Shaughnessy. The recently updated textbook "Speech and Language Processing" (2008) by Jurafsky and Martin presents the basics and the state of the art for ASR. Speaker recognition also uses the same features, most of the same front-end processing, and the same classification techniques as speech recognition. A more recent comprehensive textbook, "Fundamentals of Speaker Recognition", is an in-depth source for up-to-date details on the theory and practice.[16] A good insight into the techniques used in the best modern systems can be gained by paying attention to government-sponsored evaluations such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).
A good and accessible introduction to speech recognition technology and its history is provided by the general audience book "The Voice in the Machine. Building Computers That Understand Speech" by Roberto Pieraccini (2012).
The most recent book on speech recognition is Automatic Speech Recognition: A Deep Learning Approach (Publisher: Springer) written by D. Yu and L. Deng and published near the end of 2014, with highly mathematically oriented technical detail on how deep learning methods are derived and implemented in modern speech recognition systems based on DNNs and related deep learning methods.[17] A related book, published earlier in 2014, "Deep Learning: Methods and Applications" by L. Deng and D. Yu provides a less technical but more methodology-focused overview of DNN-based speech recognition during 2009–2014, placed within the more general context of deep learning applications including not only speech recognition but also image recognition, natural language processing, information retrieval, multimodal processing, and multitask learning.[18]
In terms of freely available resources, Carnegie Mellon University's Sphinx toolkit is one place to start both to learn about speech recognition and to start experimenting. Another resource (free but copyrighted) is the HTK book (and the accompanying HTK toolkit). For more recent and state-of-the-art techniques, the Kaldi toolkit can be used.[19] In 2017 Mozilla launched the open source project Common Voice[20] to gather a big database of voices to help build the free speech recognition project DeepSpeech (available free on GitHub)[21], which uses Google's open source platform TensorFlow[22].
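As a concrete starting point for experimenting with DeepSpeech, the sketch below transcribes a WAV file entirely offline. It assumes the deepspeech Python package plus a pre-trained model and scorer downloaded from the project's GitHub releases; the file names shown are examples from one release, and the input file name is hypothetical.

```python
# Minimal sketch: offline transcription with Mozilla DeepSpeech.
# Assumes: pip install deepspeech numpy, a released .pbmm model and .scorer file,
# and a 16 kHz, 16-bit mono WAV as input.
import wave
import numpy as np
import deepspeech

model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")       # example model file
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")   # optional language model

with wave.open("recording.wav", "rb") as wav:                  # hypothetical input file
    frames = wav.readframes(wav.getnframes())

audio = np.frombuffer(frames, dtype=np.int16)                  # raw 16-bit samples
print(model.stt(audio))                                        # decoding runs locally
```

For live input, the same model object also offers a streaming interface (createStream with feedAudioContent and finishStream), which is what latency-sensitive applications such as the in-vehicle commands from the performance task would rely on.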
A demonstration of an on-line speech recognizer is available on Cobalt's webpage.[23]
↑Pacheco-Tallaj, Natalia M., and Claudio-Palacios, Andrea P. "Development of a Vocabulary and Grammar for an Open-Source Speech-driven Programming Platform to Assist People with Limited Hand Mobility". Research report submitted to Keyla Soto, UHS Science Professor.
↑Stodden, Robert A., and Kelly D. Roberts. "The Use Of Voice Recognition Software As A Compensatory Strategy For Postsecondary Education Students Receiving Services Under The Category Of Learning Disabled." Journal Of Vocational Rehabilitation 22.1 (2005): 49–64. Academic Search Complete. Web. 1 Mar. 2015.
↑Zaman, S., & Slany, W. (2014). Smartphone-Based Online and Offline Speech Recognition System for ROS-Based Robots. Information Technology and Control, 43(4), 371-380.
↑"Speaker Identification (WhisperID)". Microsoft Research. Microsoft. Archived from the original on 25 February 2014. Retrieved 21 February 2014. When you speak to someone, they don't just recognize what you say: they recognize who you are. WhisperID will let computers do that, too, figuring out who you are by the way you sound.
↑Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., ... & Vesely, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding (No. CONF). IEEE Signal Processing Society.
Pieraccini, Roberto (2012). The Voice in the Machine. Building Computers That Understand Speech. The MIT Press. ISBN 978-0262016858.
Woelfel, Matthias; McDonough, John (2009-05-26). Distant Speech Recognition. Wiley. ISBN 978-0470517048.
Karat, Clare-Marie; Vergo, John; Nahamoo, David (2007). "Conversational Interface Technologies". In Sears, Andrew; Jacko, Julie A. The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies, and Emerging Applications (Human Factors and Ergonomics). Lawrence Erlbaum Associates Inc. ISBN 978-0-8058-5870-9.
Cole, Ronald; Mariani, Joseph; Uszkoreit, Hans; et al., eds. (1997). Survey of the state of the art in human language technology. Cambridge Studies in Natural Language Processing. XII–XIII. Cambridge University Press. ISBN 978-0-521-59277-2.
Junqua, J.-C.; Haton, J.-P. (1995). Robustness in Automatic Speech Recognition: Fundamentals and Applications. Kluwer Academic Publishers. ISBN 978-0-7923-9646-8.
Pirani, Giancarlo, ed. (2013). Advanced algorithms and architectures for speech understanding. Springer Science & Business Media. ISBN 978-3-642-84341-9.