Speech verification

Short description: Use of speech recognition to verify pronunciation

Automatic pronunciation assessment is the use of speech recognition to verify the correctness of pronounced speech,^[1] as distinguished from manual assessment by an instructor or proctor.^[2] The technology is also called speech verification. Its main application is computer-aided pronunciation teaching (CAPT), in which it is combined with computer-aided instruction for computer-assisted language learning (CALL) or speech remediation. Pronunciation assessment does not determine unknown speech (as in dictation or automatic transcription). Instead, knowing the expected word(s) in advance, it attempts to verify the correctness of the learner's pronunciation and, ideally, their intelligibility to listeners,^[3]^[4] sometimes along with aspects of prosody such as intonation, pitch, tempo, rhythm, and stress, which are often inconsequential.^[5] Pronunciation assessment is also used in reading tutoring, for example in products such as Microsoft Teams^[6] and those from Amira Learning.^[7] It can also be used to help diagnose and treat speech disorders such as apraxia.^[8]
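As a rough illustration of verifying speech against a known target, the sketch below compares the phonemes a recognizer reports with the phonemes expected for the prompted word and converts the edit distance into a score. The function names, the ARPAbet-style phoneme labels, and the scoring formula are assumptions made for this example; they are not the method of any product or paper cited here.

```python
from typing import List


def phoneme_edit_distance(expected: List[str], observed: List[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # Row for the empty prefix of `expected`: j insertions reach observed[:j].
    prev = list(range(len(observed) + 1))
    for i, exp in enumerate(expected, start=1):
        curr = [i]  # i deletions turn expected[:i] into the empty sequence
        for j, obs in enumerate(observed, start=1):
            cost = 0 if exp == obs else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def pronunciation_score(expected: List[str], observed: List[str]) -> float:
    """Return 1.0 for an exact match, falling toward 0.0 as errors accumulate."""
    if not expected:
        raise ValueError("expected phoneme sequence must not be empty")
    errors = phoneme_edit_distance(expected, observed)
    return max(0.0, 1.0 - errors / len(expected))


# Hypothetical example: the learner says "sink" where "think" was expected.
expected = ["TH", "IH", "NG", "K"]
observed = ["S", "IH", "NG", "K"]
score = pronunciation_score(expected, observed)
```

A score of this kind measures only closeness to the expected phonemes. By itself it says nothing about whether listeners would actually understand the word.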

The earliest work on pronunciation assessment avoided measuring genuine listener intelligibility.^[9] That shortcoming was corrected in 2011 at the Toyohashi University of Technology,^[10] and intelligibility measurement has since been included in the Versant high-stakes English fluency assessment from Pearson^[11] and in mobile apps from 17zuoye Education & Technology,^[12] but it was still missing in 2023 from Google Search^[13] and from products by Microsoft,^[14] Educational Testing Service,^[15] Speechace,^[16] and ELSA.^[17] Assessing authentic listener intelligibility is essential for avoiding inaccuracies from accent bias, especially in high-stakes assessments;^[18]^[19]^[20] from words with multiple correct pronunciations;^[21] and from phoneme coding errors in machine-readable pronunciation dictionaries.^[22] In the Common European Framework of Reference for Languages (CEFR) assessment criteria for "overall phonological control", intelligibility outweighs formally correct pronunciation at all levels.^[23]
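The problem of words with multiple correct pronunciations can be made concrete with a small sketch. Assuming a hypothetical lexicon that lists every accepted variant of a word, a checker that consults all variants accepts a pronunciation that a single-entry dictionary would wrongly reject. The lexicon contents, phoneme labels, and function name below are illustrative assumptions only.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical mini-lexicon: each word maps to every pronunciation treated as correct.
LEXICON: Dict[str, List[Tuple[str, ...]]] = {
    "either": [("IY", "DH", "ER"), ("AY", "DH", "ER")],
    "data": [("D", "EY", "T", "AH"), ("D", "AE", "T", "AH")],
}


def is_accepted(word: str, observed: Sequence[str]) -> bool:
    """True if the observed phonemes match any listed variant of the word."""
    variants = LEXICON.get(word.lower(), [])
    return tuple(observed) in variants


# Both common pronunciations of "either" pass; an unlisted one does not.
first_variant_ok = is_accepted("either", ["IY", "DH", "ER"])
second_variant_ok = is_accepted("either", ["AY", "DH", "ER"])
unlisted_ok = is_accepted("either", ["EH", "DH", "ER"])
```

Even a complete variant list still checks form rather than intelligibility, which is the distinction the paragraph above draws.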

Although there are as yet no industry-standard benchmarks for evaluating pronunciation assessment accuracy, researchers occasionally release evaluation speech corpora for others to use in improving assessment quality.^[24]^[25] Such evaluation databases often emphasize formally unaccented pronunciation to the exclusion of genuine intelligibility as evidenced by blinded listener transcriptions.^[4] Promising areas of improvement under development in 2023 include articulatory feature extraction^[26] and transfer learning to attenuate unnecessary corrections.^[27] Other advances under development include augmented reality interfaces for mobile devices that provide pronunciation training on text found in the user's environment.^[28]^[29]
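One simple way to turn blinded listener transcriptions into an intelligibility figure is to take the fraction of listeners who wrote down the intended word. The sketch below assumes that measure, along with a hypothetical normalization step (lowercasing and whitespace cleanup); it is not drawn from any of the corpora cited above.

```python
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as misses."""
    return " ".join(text.lower().split())


def word_intelligibility(intended: str, transcriptions: Iterable[str]) -> float:
    """Fraction of listener transcriptions that match the intended word."""
    heard = [normalize(t) for t in transcriptions]
    if not heard:
        raise ValueError("at least one listener transcription is required")
    target = normalize(intended)
    hits = sum(1 for h in heard if h == target)
    return hits / len(heard)


# Hypothetical data: four listeners transcribe a learner's attempt at "ship".
listener_transcriptions = ["ship", "sheep", "ship", "Ship "]
rate = word_intelligibility("ship", listener_transcriptions)
```

Because the listeners do not know the intended word, a figure like this reflects what was actually understood rather than how closely the speech matched a reference accent.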

References

[1] Ehsani, Farzad; Knodt, Eva (July 1998). "Speech technology in computer-aided language learning: Strengths and limitations of a new CALL paradigm" (in en). Language Learning & Technology (University of Hawaii National Foreign Language Resource Center; Michigan State University Center for Language Education and Research) 2 (1): 54–73. https://www.lltjournal.org/item/10125-25032/. Retrieved 11 February 2023.
[2] Isaacs, Talia; Harding, Luke (July 2017). "Pronunciation assessment" (in en). Language Teaching 50 (3): 347–366. doi:10.1017/S0261444817000118. ISSN 0261-4448. https://www.cambridge.org/core/journals/language-teaching/article/pronunciation-assessment/AA9DD6C481AB7E5C52C5490CCD857089. Retrieved 26 February 2023.
[3] Loukina, Anastassia et al. (September 6, 2015), "Pronunciation accuracy and intelligibility of non-native speech", INTERSPEECH 2015, Dresden, Germany: International Speech Communication Association, pp. 1917–1921, https://www.isca-speech.org/archive/pdfs/interspeech_2015/loukina15_interspeech.pdf, "only 16% of the variability in word-level intelligibility can be explained by the presence of obvious mispronunciations."
[4] O’Brien, Mary Grantham et al. (31 December 2018). "Directions for the future of technology in pronunciation research and teaching" (in en). Journal of Second Language Pronunciation 4 (2): 182–207. doi:10.1075/jslp.17001.obr. ISSN 2215-1931. https://www.jbe-platform.com/content/journals/10.1075/jslp.17001.obr. Retrieved 11 February 2023. "pronunciation researchers are primarily interested in improving L2 learners’ intelligibility and comprehensibility, but they have not yet collected sufficient amounts of representative and reliable data (speech recordings with corresponding annotations and judgments) indicating which errors affect these speech dimensions and which do not. These data are essential to train ASR algorithms to assess L2 learners’ intelligibility."
[5] Eskenazi, Maxine (January 1999). "Using automatic speech processing for foreign language pronunciation tutoring: Some issues and a prototype" (in en). Language Learning & Technology 2 (2): 62–76. https://www.lltjournal.org/item/10125-25043/. Retrieved 11 February 2023.
[6] Tholfsen, Mike (9 February 2023). "Reading Coach in Immersive Reader plus new features coming to Reading Progress in Microsoft Teams" (in en). Techcommunity Education Blog (Microsoft). https://techcommunity.microsoft.com/t5/education-blog/reading-coach-in-immersive-reader-plus-new-features-coming-to/ba-p/3734079.
[7] "Amira Learning". https://amiralearning.com/.
[8] Hair, Adam et al. (19 June 2018). "Apraxia world: a speech therapy game for children with speech sound disorders". Proceedings of the 17th ACM Conference on Interaction Design and Children: 119–131. doi:10.1145/3202185.3202733. https://psi.engr.tamu.edu/wp-content/uploads/2018/04/hair2018idc.pdf.
[9] Bernstein, Jared et al. (November 18, 1990), "Automatic Evaluation and Training in English Pronunciation", First International Conference on Spoken Language Processing (ICSLP 90), Kobe, Japan: International Speech Communication Association, pp. 1185–1188, https://www.isca-speech.org/archive_v0/archive_papers/icslp_1990/i90_1185.pdf, retrieved 11 February 2023, "listeners differ considerably in their ability to predict unintelligible words.... Thus, it seems the quality rating is a more desirable... automatic-grading score." (Section 2.2.2.)
[10] Hiroshi, Kibishi; Nakagawa, Seiichi (August 28, 2011), "New feature parameters for pronunciation evaluation in English presentations at international conferences", INTERSPEECH 2011, Florence, Italy: International Speech Communication Association, pp. 1149–1152, https://www.isca-speech.org/archive_v0/archive_papers/interspeech_2011/i11_1149.pdf, retrieved 11 February 2023, "we investigated the relationship between pronunciation score / intelligibility and various acoustic measures, and then combined these measures.... As far as we know, the automatic estimation of intelligibility has not yet been studied."
[11] Bonk, Bill (25 August 2020). "New innovations in assessment: Versant's Intelligibility Index score". Pearson English. https://www.english.com/blog/intelligibility-index-score-versant/. "you don’t need a perfect accent, grammar, or vocabulary to be understandable. In reality, you just need to be understandable with little effort by listeners."
[12] Gao, Yuan et al. (May 25, 2018), "Spoken English Intelligibility Remediation with PocketSphinx Alignment and Feature Extraction Improves Substantially over the State of the Art", 2nd IEEE Advanced Information Management, Communication, Electronic and Automation Control Conference (IMCEC 2018), doi:10.1109/IMCEC.2018.8469649
[13] Snir, Tal (14 November 2019). "How do you pronounce quokka? Practice with Search" (in en-us). Google. https://blog.google/products/search/how-do-you-pronounce-quokka-practice-search/.
[14] "Pronunciation assessment tool". Microsoft. https://speech.microsoft.com/portal/pronunciationassessmenttool.
[15] Chen, Lei et al. (December 2018) (in en). Automated Scoring of Nonnative Speech: Using the SpeechRater v. 5.0 Engine. ETS Research Report Series. 2018. Princeton, NJ: Educational Testing Service. pp. 1–31. doi:10.1002/ets2.12198. https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12198. Retrieved 11 February 2023.
[16] Alnafisah, Mutleb (September 2022), "Technology Review: Speechace", Proceedings of the 12th Pronunciation in Second Language Learning and Teaching Conference (Virtual PSLLT), no. 40, 12, St. Catharines, Ontario, ISSN 2380-9566, https://www.iastatedigitalpress.com/psllt/article/14315/galley/14157/view, retrieved 14 February 2023
[17] Gorham, Jon; et al. (March 10, 2022). Speech Recognition for English Language Learning (video). Technology in Language Teaching and Learning. Education Solutions. Retrieved 2023-02-14.
[18] "Computer says no: Irish vet fails oral English test needed to stay in Australia". The Guardian. Australian Associated Press. 8 August 2017. https://www.theguardian.com/australia-news/2017/aug/08/computer-says-no-irish-vet-fails-oral-english-test-needed-to-stay-in-australia.
[19] Ferrier, Tracey (9 August 2017). "Australian ex-news reader with English degree fails robot's English test" (in en). The Sydney Morning Herald. https://www.smh.com.au/technology/australian-exnews-reader-with-english-degree-fails-robots-english-test-20170809-gxsjv2.html.
[20] Main, Ed; Watson, Richard (9 February 2022). "The English test that ruined thousands of lives". BBC News. https://www.bbc.com/news/uk-60264106.
[21] Joyce, Katy Spratte (January 24, 2023). "13 Words That Can Be Pronounced Two Ways". Reader's Digest. https://www.rd.com/list/words-that-can-be-pronounced-two-ways/.
[22] E.g., CMUDICT, "The CMU Pronouncing Dictionary". http://www.speech.cs.cmu.edu/cgi-bin/cmudict. Compare "four" given as "F AO R" with the vowel AO as in "caught," to "row" given as "R OW" with the vowel OW as in "oat."
[23] Common European framework of reference for languages learning, teaching, assessment: Companion volume with new descriptors. Language Policy Programme, Education Policy Division, Education Department, Council of Europe. February 2018. p. 136. OCLC 1090351600. https://rm.coe.int/cefr-companion-volume-with-new-descriptors-2018/1680787989.
[24] Vidal, Jazmín et al. (15 September 2019), "EpaDB: A Database for Development of Pronunciation Assessment Systems", Interspeech 2019, pp. 589–593, doi:10.21437/Interspeech.2019-1839, https://www.isca-speech.org/archive/pdfs/interspeech_2019/vidal19_interspeech.pdf, retrieved 19 February 2023
[25] Zhang, Junbo et al. (30 August 2021), "speechocean762: An Open-Source Non-Native English Speech Corpus for Pronunciation Assessment", Interspeech 2021, pp. 3710–3714, doi:10.21437/Interspeech.2021-1259, https://www.isca-speech.org/archive/pdfs/interspeech_2021/zhang21x_interspeech.pdf, retrieved 19 February 2023
[26] Wu, Peter; et al. (14 February 2023), "Speaker-Independent Acoustic-to-Articulatory Speech Inversion", arXiv:2302.06774 [eess.AS]
[27] Sancinetti, Marcelo et al. (23 May 2022). "A Transfer Learning Approach for Pronunciation Scoring". 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022): 6812–6816. doi:10.1109/ICASSP43922.2022.9747727. ISBN 978-1-6654-0540-9.
[28] Che Dalim, Che Samihah et al. (February 2020). "Using augmented reality with speech input for non-native children's language learning". International Journal of Human-Computer Studies 134: 44–64. doi:10.1016/j.ijhcs.2019.10.002. https://core.ac.uk/download/pdf/233047951.pdf. Retrieved 28 February 2023.
[29] Tolba, Rahma M. et al. (2023). "Mobile Augmented Reality for Learning Phonetics: A Review (2012–2022)" (in en). Extended Reality and Metaverse (Springer International Publishing): 87–98. doi:10.1007/978-3-031-25390-4_7. https://link.springer.com/chapter/10.1007/978-3-031-25390-4_7. Retrieved 28 February 2023.

External links

International Speech Communication Association (ISCA) Special Interest Group on Speech and Language Technologies in Education (SLaTE)
