The tutorial for Windows Speech Recognition in Windows Vista depicting the selection of text in WordPad for deletion.

Developer(s) | Microsoft |
---|---|
Operating system | Windows Vista and later; Windows Server 2008 and later |
Type | Speech recognition |
Windows Speech Recognition (WSR) is a speech recognition component developed by Microsoft for Windows Vista that enables voice commands to control the desktop user interface; dictate text in electronic documents and email; navigate websites; perform keyboard shortcuts; and operate the mouse cursor. It also supports the creation of custom macros to perform additional or supplementary tasks.
WSR is a locally processed speech recognition platform; it does not rely on cloud computing for accuracy, dictation, or recognition, but adapts based on contexts, grammars, speech samples, training sessions, and vocabularies. It provides a personal dictionary that allows users to include or exclude words or expressions from dictation and to optionally record pronunciations to increase recognition accuracy. With Windows Search,[1] it can optionally analyze and collect text in documents, email, and handwritten tablet PC input to contextualize and disambiguate terms to further adapt the recognizer.[2] Custom language models that adapt the recognizer to the specific contexts, phonetics, and terminologies of users in particular occupational fields such as legal or medical are also supported.[3]
WSR was developed to be integrated into Windows Vista, as Windows previously only supported speech recognition features exclusive to applications such as Windows Media Player. Microsoft Office XP introduced speech recognition, but it was mainly limited to Internet Explorer and Office. With the release of Windows Vista, Office 2007 and later versions of Office rely on WSR, replacing the separate Office speech recognition.[4] The majority of integrated applications in Windows Vista can be controlled through speech.[5] WSR is present in Windows 7,[6] Windows 8,[7] Windows 8.1,[8] Windows RT,[8] and Windows 10.[9]
Microsoft was involved in speech recognition and speech synthesis research for many years before WSR. In 1993, Microsoft hired Xuedong Huang from Carnegie Mellon University to lead its speech development efforts; the company's research led to the development of the Speech API introduced in 1994.[10] Speech recognition had also been used in previous Microsoft products. Office XP and Office 2003 provided speech recognition capabilities among Internet Explorer and Office applications;[11] it also enabled limited speech functionality in Windows 98, Windows ME, Windows NT 4.0, and Windows 2000.[12] Windows XP Tablet PC Edition 2002 included speech recognition capabilities with the Tablet PC Input Panel,[13][14] and the Microsoft Plus! for Windows XP expansion package enabled voice commands to be used in Windows Media Player.[15] However, this required installation of speech recognition as an additional component (with support primarily limited to individual applications); before Windows Vista, Windows did not include extensive or integrated speech recognition capabilities.[14]
At the 2002 Windows Hardware Engineering Conference (WinHEC 2002) Microsoft announced that Windows Vista (then codenamed "Longhorn") would include advances in speech recognition and in features such as microphone array support;[16] these features were part of the company's goal to "provide a consistent quality audio infrastructure for natural (continuous) speech recognition and (discrete) command and control."[17] Bill Gates stated during the 2003 Professional Developers Conference (PDC 2003) that Microsoft would "build speech capabilities into the system -- a big advance for that in 'Longhorn,' in both recognition and synthesis, real-time";[18][19] and pre-release builds throughout the development of Windows Vista included a speech engine with training features.[20] A PDC 2003 developer presentation stated that Windows Vista would also include a user interface for microphone feedback and control, and user configuration and training features.[21] Microsoft later clarified the extent to which speech recognition would be integrated when it stated in a pre-release software development kit that "the common speech scenarios, like speech-enabling menus and buttons, will be enabled system-wide."[22]
During WinHEC 2004, Microsoft listed WSR as part of its "Longhorn" mobile PC strategy to improve productivity.[23][24] At WinHEC 2005, Microsoft emphasized accessibility, new mobility scenarios, and improvements to the speech user experience. Unlike the speech support included in Windows XP, which was integrated with the Tablet PC Input Panel and required switching between separate Commanding and Dictation modes, Windows Vista would introduce a dedicated interface for speech input on the desktop and unify the separate speech modes;[25] users previously could not speak a command after dictating or vice versa without first switching between these two modes.[26] Microsoft also stated that Windows Vista would improve dictation accuracy and support additional languages;[25] a demonstration emphasized email dictation,[25] and a presentation about microphone arrays was also shown.[27] Windows Vista Beta 1 included an integrated speech recognition application.[28] To incentivize company employees to analyze WSR for software glitches and provide feedback during its development, Microsoft offered an opportunity for testers to win a Premium model of the Xbox 360.[29]
During a demonstration by Microsoft on July 27, 2006, before Windows Vista's release to manufacturing (RTM), a notable incident involving WSR occurred: several attempts to dictate led to consecutive output errors, producing the unintended output "Dear aunt, let's set so double the killer delete select all";[30][31] the incident was a subject of significant derision among analysts and journalists in the audience.[32][33] Microsoft later revealed that these issues were due to an audio gain glitch that caused the speech recognizer to distort the dictated words;[34] the glitch was fixed before Windows Vista's release.[34]
In early 2007, reports surfaced that WSR might be vulnerable to an attack that could allow attackers to play audio through a computer's speakers, thereby using speech recognition to perform undesired user operations on a target computer;[35][36] it was the first vulnerability discovered after Windows Vista's general availability.[37] While Microsoft stated that such an attack is theoretically possible, it would have to meet a number of prerequisites to be successful: the target system would have to have the speech recognition feature properly configured and activated; speakers and microphone(s) connected to the targeted system would need to be turned on; and the exploit would require the software to interpret commands without a user noticing—an unlikely scenario as the affected system would perform visible interface operations and produce audible feedback. Mitigating factors include dictation clarity and microphone feedback and placement. Because of User Account Control, an exploit of this nature also would not be able to perform privileged operations for users or protected administrators without explicit consent.[38]
With Windows 7, Microsoft introduced several changes to improve the user experience. The recognizer was updated to use Microsoft UI Automation, substantially enhancing its performance, and the recognition engine now uses the WASAPI audio stack, which enables support for echo cancellation. The document harvester, which optionally analyzes and collects text in email and documents to contextualize and disambiguate user terms, has improved performance and has been updated to run periodically in the background instead of only after recognizer startup. Sleep mode has also seen performance improvements and, to address security issues, Windows 7 introduces a new "voice activation" option (enabled by default) that turns the recognizer off after users speak "stop listening" instead of putting the recognizer to sleep. Windows 7 also introduces an option to submit speech training data to Microsoft to improve future recognizer versions.[39]
Windows 7 introduced an optional dictation scratchpad interface that functions as a temporary document into which users can dictate or type text for insertion into applications that are not compatible with the Text Services Framework.[39] WSR previously provided an "enable dictation everywhere" option in Windows Vista.[40]
WSR can be used to control the Metro user interface in Windows 8, Windows 8.1, and Windows RT with commands to open the Charms bar ("Press Windows C"); to dictate or display commands in Metro-style apps ("Press Windows Z"); to perform tasks in apps (e.g., "Change to Celsius" in MSN Weather); and to display all installed apps listed by the Start screen ("Apps").[8]
WSR is featured in the Settings application starting with the Windows 10 April 2018 Update (Version 1803); the change first appeared in Insider Preview Build 17083.[41] The April 2018 Update also introduces a new ⊞ Win+Ctrl+S keyboard shortcut to activate WSR.[42]
WSR allows a user to control a computer, including the operating system desktop user interface, through voice commands. Applications, including most of those bundled with Windows, can also be controlled through voice commands.[5] By using speech recognition, users can dictate text within documents, email, and forms; control the operating system user interface; perform keyboard shortcuts; and move the mouse cursor.[43]
WSR uses a local speech profile to store information about a user's voice.[2] Accuracy of speech recognition increases through use, which helps the feature adapt to a user's grammar, speech patterns, vocabulary, and word usage.[44][2] Speech recognition also includes a tutorial to improve accuracy,[44] and can optionally review a user's personal documents—including email—to improve its command and dictation accuracy.[45] Individual speech profiles can be created on a per-user basis,[2] and backups of profiles can be performed via Windows Easy Transfer.[46] WSR supports the following languages: Chinese (Traditional), Chinese (Simplified), English (U.S.), English (U.K.), French, German, Japanese, and Spanish.[44] WSR relies on the Speech API developed by Microsoft,[10] and third-party applications must support the Text Services Framework.[5]
The WSR interface consists of a status area that displays instructions, information about commands (e.g., if a command is not heard by the recognizer), and the status of the recognizer; a voice meter displays visual feedback about volume levels. The status area represents the current state of WSR in one of three modes: listening, sleeping, or off.
Colors of the recognizer listening mode button denote its various modes of operation: blue when listening; blue-gray when sleeping; gray when turned off; and yellow when the user switches context (e.g., from the desktop to the taskbar) or when a voice command is misinterpreted. The status area can also display custom user information as part of Windows Speech Recognition Macros.[47][48]
An alternates panel disambiguation interface displays a list of items interpreted as being relevant to a user's spoken word(s); if the word or phrase that a user desired to insert into an application is listed among results, a user can speak the corresponding number of the word or phrase in the results and confirm this choice by speaking "OK" to insert it within the application.[49] The alternates panel will also appear when launching applications or speaking commands that refer to more than one item (e.g., speaking "Start Internet Explorer" may list the web browser and a version of it with browser add-ons disabled). However, an ExactMatchOverPartialMatch Windows Registry entry can limit commands to items with exact names if there is more than one instance included in results.[50]
Listed below are common WSR commands. Words in italics indicate a word that can be substituted for a desired item (e.g., the word "direction" in the "scroll direction" command can be substituted with the word "down").[43] A "start typing" command enables WSR to interpret all dictation commands as keyboard shortcuts.[49]
A mousegrid command enables users to control the mouse cursor by overlaying numbers across nine regions on the screen; these regions gradually narrow as a user speaks the number(s) of the region on which to focus until the desired interface element is reached. The regions with which a user can interact are based on commands including "Click number of region," which moves the mouse cursor to the desired region and then clicks it; and "Mark number of region", which allows an item (such as a computer icon) in a region to be selected, which can then be clicked with the previous click command. A user can also simultaneously interact with multiple regions of the mousegrid.[43]
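The narrowing behavior of the mousegrid can be illustrated with a minimal Python sketch. It is not WSR's implementation: the function names, the rectangle representation, and the assumption that the nine regions form a 3×3 grid numbered left to right, top to bottom are all illustrative.

```python
from typing import Iterable, Tuple

# A rectangle as (left, top, width, height). This representation is an
# assumption made for illustration only.
Rect = Tuple[float, float, float, float]


def subregion(rect: Rect, number: int) -> Rect:
    """Return cell `number` (1-9) of a 3x3 grid laid over `rect`.

    Cells are assumed to be numbered left to right, top to bottom.
    """
    if not 1 <= number <= 9:
        raise ValueError("region number must be between 1 and 9")
    left, top, width, height = rect
    row, col = divmod(number - 1, 3)
    cell_w, cell_h = width / 3, height / 3
    return (left + col * cell_w, top + row * cell_h, cell_w, cell_h)


def narrow(screen: Rect, spoken_numbers: Iterable[int]) -> Rect:
    """Apply each spoken region number in turn, shrinking the active region."""
    region = screen
    for number in spoken_numbers:
        region = subregion(region, number)
    return region


def center(rect: Rect) -> Tuple[float, float]:
    """Point where a hypothetical 'Click' command would place the cursor."""
    left, top, width, height = rect
    return (left + width / 2, top + height / 2)


if __name__ == "__main__":
    screen = (0, 0, 900, 900)
    # Speaking "1" and then "9" narrows the grid twice.
    target = narrow(screen, [1, 9])
    print(target, center(target))
```

Each spoken number reduces the active region to one ninth of its previous area, so a small number of spoken digits is enough to reach a single interface element on a typical screen.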
Applications and interface elements that do not present identifiable commands can still be controlled by asking the system to overlay numbers on top of them through a show numbers command. Once active, speaking the overlaid number selects that item so a user can open it or perform other operations.[43] Show numbers was designed so that users could interact with items that are not readily identifiable.[52]
WSR enables dictation of text in the operating system and applications. If a dictation mistake occurs, it can be corrected by speaking "Correct word" or "Correct that"; the alternates panel then appears and provides suggestions for correction, which can be selected by speaking the number of the suggestion in the list and then speaking "OK." If the desired item is not listed among suggestions, a user can speak it so that it might appear. Alternatively, users can speak "Spell it" or "I'll spell it myself" to speak the desired item on a per-letter basis; users can use their personal alphabet or the NATO phonetic alphabet when spelling. Multiple words in a sentence can be corrected simultaneously (for example, if a user speaks "dictating" but the recognizer interprets this word as "the thing," a user can state "correct the thing" to correct both words). In the English language over 100,000 words are recognized by default.[3]
WSR includes a personal dictionary that allows users to include or exclude certain words or expressions from dictation.[3] When a user adds a word beginning with a capital letter to the dictionary, a user can specify whether it should always be capitalized or if capitalization depends on the context in which the word is spoken. Users can also record pronunciations for words added to the dictionary to increase recognition accuracy; words written via a stylus on a tablet PC for the Windows handwriting recognition feature are also stored. Most of the information stored within a dictionary is included as part of a user's speech profile.[2]
WSR supports custom macros through a supplementary application by Microsoft that enables additional natural language commands.[53][54] As an example of this functionality, an email macro released by Microsoft enables a natural language command where a user can state "send email to contact about subject," which opens Microsoft Outlook to compose a new message with the designated contact and subject automatically inserted.[55] Microsoft has also released sample macros for the speech dictionary,[56] for Windows Media Player,[57] for Microsoft PowerPoint,[58] for speech synthesis,[59] to switch between multiple microphones,[60] to customize various aspects of audio device configuration such as volume levels,[61] and for general natural language queries such as "What is the weather forecast?"[62] "What time is it?"[59] and "What's the date?"[59] Answers to these queries are spoken via a speech synthesizer.
Users and developers can create their own macros that can be based on text transcription and substitution; application execution (with support for command-line arguments); keyboard shortcuts; emulation of existing voice commands; or a combination of these items. XML, JScript and VBScript are supported.[49] Macros can be limited to individual applications if desired[63] and rules for macros can be defined programmatically.[55] For a macro to load, it must be stored in a Speech Macros folder within the current user's Documents directory. All macros are digitally signed by default if a user certificate is available, to ensure that commands are not corrupted or loaded by third-parties; if one is not available, an administrator can create a certificate for use.[64] The macros utility also includes security levels to prohibit unsigned macros from being loaded; to prompt users to sign macros; and to load unsigned macros.[63]
As of 2017, WSR uses Microsoft Speech Recognizer 8.0, which has not been changed since Windows Vista. Mark Hachman, a Senior Editor of PC World, found it to be 93.6% accurate for dictation without training, a rate lower than that of competing software. According to Microsoft, the rate of accuracy when trained is 99%. Hachman commented that Microsoft does not publicly discuss WSR, attributing this to the 2006 incident during development of Windows Vista, with few users knowing that documents could be dictated within Windows before the introduction of Cortana.[65]
Original source: https://en.wikipedia.org/wiki/Windows Speech Recognition.