OCR in Indian languages

Optical character recognition (OCR) is the process of converting an image of text into machine-readable text. OCR for English and other European languages achieves high conversion accuracy, but OCR for Indian languages has not reached the same level. This is mostly due to the complexity of Indian scripts and the lack of standard representation and encoding, operating system support, and keyboard support. The Centre for Development of Advanced Computing (C-DAC), the premier R&D organisation of the Ministry of Electronics and Information Technology (MeitY) of India, and Technology Development for Indian Languages (TDIL) have carried out many OCR projects. These include OCR for Malayalam, Odia, Punjabi, Telugu and the Devanagari script.

Properties of Indian Scripts

India has 22 officially recognized languages. Among these, Hindi, Bengali and Punjabi are the most widely spoken in India, and rank fourth, seventh and tenth among the most spoken languages in the world.[1] Two or more languages can be written in the same script. For example, Devanagari is used to write Hindi, Marathi, Rajasthani, Bhojpuri and many others, while the Bengali script is used to write Sanskrit, Manipuri and other languages.
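
Because several languages share one script, an OCR system usually has to reason about the script before the language. The following is a minimal sketch, not taken from any of the projects mentioned here, that guesses the script of a string from the first word of each character's Unicode name. The function name scripts_used and the sample words are illustrative assumptions.

```python
import unicodedata

def scripts_used(text):
    """Return the set of script names found in text.

    Assumption: the first word of a character's Unicode name
    (e.g. 'DEVANAGARI LETTER KA') is a good enough script label
    for Indic text. This is a heuristic, not a full script lookup.
    """
    scripts = set()
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts

hindi = "नमस्ते"    # Hindi word, Devanagari script
marathi = "मराठी"   # Marathi word, also Devanagari script
bengali = "বাংলা"   # Bengali word, Bengali script

for word in (hindi, marathi, bengali):
    print(word, scripts_used(word))
```

The Hindi and Marathi words report the same script, which illustrates why identifying the script alone does not identify the language.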

Apart from basic characters such as consonants and vowels, most Indian languages combine two or more basic characters to form compound characters. The shape of a compound character is more complex than that of its constituent basic characters. Some Indian languages (Hindi, Punjabi, etc.) have a horizontal line over the characters, while others (such as Gujarati and Tamil) do not. These are some of the main challenges in creating a single OCR system for all Indian languages.[2]
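
The compound characters described above can be illustrated with Unicode, where a Devanagari conjunct is encoded as a sequence of basic characters joined by a virama (halant) sign. The sketch below is only an illustration of the encoding, not of any particular OCR engine. The helper name describe is an assumption made for this example.

```python
import unicodedata

KA = "\u0915"      # Devanagari letter KA
VIRAMA = "\u094D"  # Devanagari sign VIRAMA (halant), suppresses the inherent vowel
SSA = "\u0937"     # Devanagari letter SSA

# Displayed as a single compound shape, but stored as three code points.
conjunct = KA + VIRAMA + SSA

def describe(text):
    """List the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

print(conjunct, len(conjunct))
for name in describe(conjunct):
    print(" ", name)
```

An OCR system therefore has to map one complex shape on the page to a sequence of several encoded characters, rather than to a single character as is typical for English.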

Indian languages have no concept of upper-case and lower-case characters. As in English, text is written from left to right, except in Urdu.
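
Both properties can be checked against the Unicode character database. The following is a small sketch using Python's standard library. The helper names has_case and direction are assumptions made for this example, and the sample letters are arbitrary choices.

```python
import unicodedata

def has_case(ch):
    """True if the character has distinct upper-case and lower-case forms."""
    return ch.lower() != ch.upper()

def direction(ch):
    """Unicode bidirectional class: 'L' is left-to-right, 'AL' is Arabic-script right-to-left."""
    return unicodedata.bidirectional(ch)

samples = {
    "Latin a": "a",
    "Devanagari KA": "\u0915",
    "Tamil TA": "\u0BA4",
    "Arabic-script ALEF (used in Urdu)": "\u0627",
}

for label, ch in samples.items():
    print(label, "| has case:", has_case(ch), "| direction:", direction(ch))
```

Only the Latin letter has case, and only the Arabic-script letter used for Urdu has a right-to-left class.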

Examples

  1. SanskritOCR - OCR software for Sanskrit, Hindi and other languages of India written in the Devanagari script.
  2. E-aksharayan - Optical character recognition engine for Indian languages.
  3. Chitrankan - Developed by ISI, Kolkata, with the technology transferred to C-DAC. It processes printed Hindi text either directly from a scanner or from an image.

References

  1. GmbH, Lesson Nine. "The 10 Most Spoken Languages In The World". The Babbel Magazine. Retrieved 2018-03-20.
  2. "Indian script character recognition: a survey". Pattern Recognition. 37 (9): 1887–1899. 2004-09-01. doi:10.1016/j.patcog.2004.02.003. ISSN 0031-3203.
  • "Multilingual Computing & Heritage Computing". www.cdac.in. Retrieved 2017-02-12.
  • Singh, Rustam (2016-04-16). "The Magic of OCR & Augmented Reality Translates text in Indian Languages, Real Time – Without Internet". Entrepreneur. Retrieved 2017-02-12.
  • "Indian Language Technology Proliferation and Deployment Centre - Home". www.tdil-dc.in. Retrieved 2017-02-12.
  • "SanskritOCR - Optical Text Recognition for Sanskrit Documents".
  • "C-DAC: GIST - Products - Chitrankan". cdac.in. Retrieved 2017-02-12.


This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.