Heritage: An OCR Perspective

Shikha Magotra, Dr. Ajay Kaul
Ever wondered how an Optical Character Reader (OCR) works, and how important and beneficial it is in uplifting and preserving our cultural heritage? Let’s visualise the OCR perspective by bringing the magic of Alice’s adventures to our Duggarland, Jammu.
Browsing the oldest bookshelf of an ancient library in Duggarland, Jammu, Alice found an old, somewhat torn book written in a strange script. She tried to read the title but couldn’t make out even a part of it. So she glanced through the book and wondered at how difficult it was to make sense of its text. She asked the librarian, who told her that the book was written in Namay Dogra Akhhar, the original script developed for writing the Dogri language. Fascinated to know the script’s name, she googled its character set and learned it.
Then, confidently, she opened the book again. To her surprise, it was still strange to her. She kept trying harder and harder, and when she grew tired of trying, she let herself fall deeper and deeper down the rabbit hole, and there she found vast knowledge of the script. She kept studying the whole literature, and it gave her a deeper understanding of the script’s features, usage, ambiguities, origin, drawbacks, matra formation and so on.
She imagined herself trying to read the book again, but couldn’t. She wondered how Optical Character Reader (OCR) software can read so many varieties of scripts instantly. Thinking about all this, she went deeper and deeper, and reached the time when the first OCR was made. She learnt the very basics of the three phases in which any OCR works. She wondered how the human brain identifies a character for the first time, as a toddler perhaps, out of big documents. Yes! It first extracts the text from the scanned image of the document, using the human eye as a scanner, and then recognises it. An OCR works the same way.
Then she climbed up out of the mud again, now more insightful and self-realised. Calmly, she opened the book again and started extracting the text from the first page. She focussed closely on the text alone, ignoring all the background details and the colour of the page. Some of the text appeared skewed, as if in a dancing posture, leaning in either direction. Some characters seemed broken at different points, or faded away. She tried to focus even more on the text, and suddenly the whole page was converted to a completely white background with only the text highlighted in black. It seemed as if her brain had erased all the unnecessary background and turned it white. The dancing text was straightened, and the faded characters were rewritten with their broken ends joined. All this was done using pre-processing programs, the first stage of OCR.
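For readers curious about what such a pre-processing program might look like, here is a minimal sketch in Python of just one step, turning a greyish page into pure black ink on a white background. It assumes the scanned page is already a grid of grey levels from 0 (black) to 255 (white), and it picks the dividing grey level with Otsu's method, a common choice that is our assumption here, not necessarily what any particular OCR uses. The function names are illustrative, and straightening skewed text or repairing broken strokes is not covered.

```python
def otsu_threshold(pixels):
    """Return the grey level that best separates dark ink from light paper."""
    # Count how many pixels have each grey level.
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_score = 0, -1.0
    count_dark, sum_dark = 0, 0
    for level in range(256):
        count_dark += hist[level]
        sum_dark += level * hist[level]
        count_light = total - count_dark
        if count_dark == 0 or count_light == 0:
            continue  # one side is empty, so nothing is being separated
        mean_dark = sum_dark / count_dark
        mean_light = (sum_all - sum_dark) / count_light
        # Between-class variance: large when the two groups are far apart.
        score = count_dark * count_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_level, best_score = level, score
    return best_level


def binarize(pixels, threshold):
    """Mark ink as 1 and paper as 0."""
    return [[1 if value <= threshold else 0 for value in row] for row in pixels]


# A tiny made-up "page": light paper (around 200) with a few dark ink pixels.
page = [
    [200, 210, 40, 205],
    [198, 35, 30, 202],
    [205, 207, 45, 199],
]
clean = binarize(page, otsu_threshold(page))
for row in clean:
    print(row)
```

In the cleaned grid every paper pixel becomes 0 and every ink pixel becomes 1, which is the "white page with black text" that Alice saw.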
Now she looked deeply at the text alone, recalling all that she had learned about text extraction by OCR. The document image turned into a jigsaw puzzle, with each character form in its own rectangle. She started breaking the text of the document into smaller images, first into multiple line images. She then took those line images one by one and broke each line into multiple word images. Further, she took those word images one by one and broke each of them into multiple small character images using the Connected Component Segmentation approach.
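The breaking-up that Alice performed can be sketched in a few lines of Python. The sketch below assumes a cleaned page stored as a grid of 1s (ink) and 0s (paper). It finds text lines by looking for rows that contain no ink, and then finds the rectangle around each connected blob of ink in a line. The function names and the tiny sample page are illustrative only, the word-level step is left out for brevity, and real documents need far more care (touching lines, noise, skew).

```python
from collections import deque


def split_lines(binary):
    """Return (top, bottom) row ranges of text lines; blank rows separate lines."""
    lines, start = [], None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # the current line has ended
            start = None
    if start is not None:
        lines.append((start, len(binary) - 1))
    return lines


def connected_components(binary):
    """Return boxes (top, left, bottom, right) of 8-connected ink blobs, left to right."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Flood-fill outward from this ink pixel, growing its bounding box.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return sorted(boxes, key=lambda box: box[1])


# A made-up page with two text lines; the first line holds two separate blobs.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
for top, bottom in split_lines(page):
    line_image = page[top:bottom + 1]
    print((top, bottom), connected_components(line_image))
```

Each box is one rectangular jigsaw piece. Note that the boxes are reported relative to the line image, not the whole page.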
Now she was able to recognise some of the characters. Still, many of the rectangular jigsaw pieces were strange to her. She wondered how such complex forms had arisen. She thought about and visualised each of those jigsaw pieces: these were not single letters but combinations of more than one character joined into one, like ligatures in Urdu or matras joined with consonants in Devanagari. So she started breaking up those ligatures individually, focussing on each of them. She looked deeper at the character strokes, their widths and curves. The stroke widths became great walls around her, and she felt caught in a maze. Running from one direction to another, she started looking for the way out. Slowly, she began observing those walls minutely, down to every single brick, and moving slowly along them, she found the minutest crack through which to break out of the maze. She started digging there, and soon she was out. She was delighted to see that she had successfully segmented those ligatures into single character forms by developing newer programs for Segmentation, the second stage of OCR.
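One simple way to picture the "minutest crack" is the column where two joined characters touch through the least ink. The Python sketch below is only an illustration of that idea under our own assumptions: it counts the ink in every column of a joined piece and cuts at the thinnest interior column. It is not the segmentation method of any particular OCR, the names are hypothetical, and real ligatures usually need much subtler rules.

```python
def thinnest_column(glyph, margin=1):
    """Return the interior column with the least ink (leftmost on ties).

    The glyph must be wider than 2 * margin, otherwise there is no interior
    column to choose from and min() raises ValueError.
    """
    width = len(glyph[0])
    ink_per_column = [sum(row[x] for row in glyph) for x in range(width)]
    candidates = range(margin, width - margin)
    return min(candidates, key=lambda x: ink_per_column[x])


def split_at(glyph, cut):
    """Split a glyph into the part left of the cut column and the rest."""
    left = [row[:cut] for row in glyph]
    right = [row[cut:] for row in glyph]
    return left, right


# A made-up joined piece: two shapes touching through a single pixel in column 3.
joined = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1],
]
cut = thinnest_column(joined)
left, right = split_at(joined, cut)
print(cut)
print(left)
print(right)
```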
She was now in the amazing world of the Namay Dogra Akhhar script, surrounded on all sides by an enormous number of small character images. She felt as if she were diving into the script’s characters. Then, very calmly, she took the images one by one and sorted them into the script’s standard alphabet slots, which she had learned in the beginning. She did this by mapping each image to the basic characters of the Namay Dogra Akhhar script, as is done by template matching programs or neural networks in Classification, the third stage of OCR. She imagined how modern machine learning and deep learning programs have made classification more accurate, even with lakhs of images, and in minimal time.
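Template matching, the simpler of the two approaches mentioned above, can be sketched in Python as follows. The sketch assumes every character image has already been scaled to the same size as the stored templates, and it scores a match by the fraction of pixels that agree. The two templates, "ring" and "bar", are made-up shapes for illustration, not real Namay Dogra Akhhar characters.

```python
def similarity(a, b):
    """Fraction of pixels on which two equally sized binary images agree."""
    total = len(a) * len(a[0])
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def classify(glyph, templates):
    """Return (label, score) for the template that matches the glyph best."""
    scored = ((label, similarity(glyph, image)) for label, image in templates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical 3x3 templates standing in for a script's basic characters.
templates = {
    "ring": [[1, 1, 1], [1, 0, 1], [1, 1, 1]],
    "bar":  [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
}

# A "ring" with one faded corner pixel, as might come out of an old book.
unknown = [[1, 1, 1], [1, 0, 1], [1, 1, 0]]
label, score = classify(unknown, templates)
print(label, round(score, 2))
```

The faded piece still lands in the "ring" slot because eight of its nine pixels agree with that template. A neural network replaces this fixed pixel comparison with features learned from many example images.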
Yes! She had decoded the script. “Now I can join these characters again to form words, and recognise those too,” she thought.
What an amazing world of text extraction by OCR it is! She now understood how a computer recognises any script and translates it into some other script. Yes! It can do all this only after an OCR for the script has been developed. OCR is the initial step towards the digital heritage era. She wondered how drastically it would uplift our cultural heritage if an OCR for the Namay Dogra Akhhar script could be developed. All over the country, almost every Indian state has developed OCRs for its regional scripts. Bangla, Telugu, Tamil, Oriya, Gujarati and even our script’s sibling, Gurmukhi, have their own OCRs. People of all these states became much more connected to their origins and culture after OCR development, as they came to know more and more about it. We could likewise have a large warehouse of our script’s characters, written in different styles and fonts, in the form of small images. People all over the world could use it for research and academic purposes.
She thought of how the universities in Jammu are already developing it. Research scholars in the Computer Science departments of these universities are developing an OCR for Namay Dogra Akhhar, the script of our Dogri language. She wondered how far they had reached in the development process, and hoped it might be completed soon.
“Oh! I have had such a curious dream,” said Alice.
(The authors are a Research Scholar and an Associate Professor at SMVDU.)