OCR in Indian languages

Optical character recognition (OCR) is the process of converting an image of text into machine-readable text. OCR for Latin characters is still not 100% accurate, but it achieves a relatively high degree of accuracy. Comparable accuracy has not yet been achieved for Indo-Aryan languages. This is due in part to the writing systems of Indo-Aryan languages, as well as a lack of standard representation, encoding, and support among operating systems and keyboards. The Centre for Development of Advanced Computing (C-DAC), the premier R&D organisation of the Ministry of Electronics and Information Technology (MeitY) of India, and the Technology Development for Indian Languages (TDIL) programme have carried out many projects relating to OCR. Their projects include OCR for Malayalam, Odia, Punjabi, Telugu and the Devanagari script.

Properties of Indo-Aryan writing systems

There are 22 officially recognised languages in India. Of these, Hindi, Bengali and Punjabi are the most widely spoken Indo-Aryan languages and are also the fourth, seventh and tenth most widely spoken languages in the world respectively.^[1] Two or more languages can be written with the same script. For example, Devanagari is used to write Hindi, Marathi, Rajasthani, Sanskrit, Bhojpuri and others, while the Bengali script is used to write Bengali, Manipuri and others.

Apart from basic characters such as consonants and vowels, most Indic languages combine two or more basic characters to form compound characters. The shape of a compound character is more complex than that of its constituent basic characters. Some Indo-Aryan languages (including Hindi and Punjabi) have a horizontal line over the characters, while other languages (including Gujarati) and Dravidian languages (Malayalam, Kannada, Tamil, and Telugu) do not. These are some of the main challenges in creating a single OCR system for all Indo-Aryan languages.^[2]
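The two properties described above can be illustrated with a minimal Python sketch. It is not taken from any of the OCR systems named in this article. The first part lists the Unicode code points behind one Devanagari compound character, showing that a single visual shape is stored as several basic characters. The second part applies a simple horizontal projection (counting ink pixels per row) to a made-up bitmap to locate a candidate headline row. The names `describe`, `headline_row`, and `TOY_WORD` are hypothetical and introduced only for illustration.

```python
import unicodedata

# Devanagari compound character: two consonants joined by a virama.
# Stored as three code points, but rendered as one combined shape.
KSSA = "\u0915\u094D\u0937"  # KA + VIRAMA + SSA


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def headline_row(bitmap):
    """Return the index of the row containing the most ink pixels (1 = ink).

    For scripts written with a horizontal line over the characters,
    the densest row is a plausible candidate for that line.
    """
    row_sums = [sum(row) for row in bitmap]
    return row_sums.index(max(row_sums))


# Toy bitmap (hypothetical data, not a real scan): the full-width stroke
# on row 1 stands in for the headline above two character bodies.
TOY_WORD = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
]

if __name__ == "__main__":
    for code, name in describe(KSSA):
        print(code, name)
    print("candidate headline row:", headline_row(TOY_WORD))
```

A script without a headline, such as Gujarati, would not necessarily show one dominant row in this projection, which is one reason a single segmentation strategy does not transfer directly across scripts.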

The concept of upper/lower case is absent in Indo-Aryan languages. Apart from Urdu, Indo-Aryan languages are written from left to right.
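Both statements can be checked against Unicode character properties. The sketch below uses only Python's standard library. The sample characters and the helper names `has_case` and `bidi_class` are assumptions chosen for illustration. The Arabic-script letter stands in for Urdu, which is written in a right-to-left script.

```python
import unicodedata


def has_case(ch):
    """True if the character changes under upper- or lower-casing."""
    return ch.upper() != ch or ch.lower() != ch


def bidi_class(ch):
    """Return the Unicode bidirectional class, e.g. 'L' for left-to-right."""
    return unicodedata.bidirectional(ch)


# Illustrative sample characters (an assumption, not an exhaustive survey).
SAMPLES = {
    "Latin a": "a",
    "Devanagari ka": "\u0915",
    "Bengali ka": "\u0995",
    "Arabic-script alef (as used for Urdu)": "\u0627",
}

if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(f"{label}: has_case={has_case(ch)}, bidi={bidi_class(ch)}")
```

In practice this means an OCR pipeline for these scripts needs no case-folding step, but it does need to handle reading order per script.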

Examples

SanskritOCR - OCR software for Sanskrit, Hindi and other Indo-Aryan languages based on the Devanagari script.
E-aksharayan - Optical character recognition engine for Indo-Aryan languages.
Chitrankan - This technology was developed by ISI, Kolkata, and transferred to C-DAC. It processes printed Hindi text from a scanner or from an image.

OCR in use

OCR has been used for Wikisource and other projects.^[3]^[4]^[5]

References

[1] GmbH, Lesson Nine. "The 10 Most Spoken Languages In The World" (in en). The Babbel Magazine. https://www.babbel.com/en/magazine/the-10-most-spoken-languages-in-the-world.
[2] Pal, U.; Chaudhuri, B.B. (2004-09-01). "Indian script character recognition: a survey" (in en). Pattern Recognition 37 (9): 1887–1899. doi:10.1016/j.patcog.2004.02.003. ISSN 0031-3203.
[3] Prabhu, S. (2020-06-04). "Pazhur Patasala — a revival story" (in en-IN). The Hindu. ISSN 0971-751X. https://www.thehindu.com/society/history-and-culture/pazhur-patasala-a-revival-story/article31748137.ece. "An OCR (Optical Character Recognition) for Sanskrit has created an offline corpus that includes over 3,000 books."
[4] "Digitisation going on at brisk pace: Vice-Chancellor Prof V Muralidhara Sharma" (in en). 2019-03-20. https://www.thehansindia.com/news/cities/tirupathi/digitisation-going-on-at-brisk-pace-vice-chancellor-prof-v-muralidhara-sharma-513769.
[5] Dikshit, Ashish (2016-10-27). "Who Says Sanskrit Is Dead? It's Rocking the Wiki World" (in en). https://www.thequint.com/news/india/who-says-sanskrit-is-dead-its-rocking-the-wiki-world-wikipedia-wikisource-sanskrit-bharati-shubha.

"Multilingual Computing & Heritage Computing" (in en-US). https://www.cdac.in/index.aspx?id=mlingual_heritage.
Singh, Rustam (2016-04-16). "The Magic of OCR & Augmented Reality Translates text in Indian Languages, Real Time – Without Internet" (in en). Entrepreneur. https://www.entrepreneur.com/article/274217.
"Indian Language Technology Proliferation and Deployment Centre - Home" (in en-gb). http://www.tdil-dc.in/index.php?lang=en.

