What is OCR?

OCR stands for Optical Character Recognition. In computer science, the term names a procedure that digitizes a text captured with a scanner.

What makes OCR possible is that, when a text is passed through a scanning device, the system recognizes its characters as part of an alphabet. As a result, the scanned document can be edited with a word processor, because it is stored as text rather than as an image.
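As a rough illustration of recognizing characters as part of an alphabet, the sketch below compares each scanned glyph with a tiny set of stored templates and picks the closest one. The 3x3 bitmaps, the three-letter alphabet and the names TEMPLATES, pixel_distance and recognize_glyph are all hypothetical, invented for this example; real OCR engines use far richer models.

```python
# Hypothetical 3x3 templates for a three-letter "alphabet" (1 = ink, 0 = paper).
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def pixel_distance(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def recognize_glyph(bitmap):
    """Return the template character whose bitmap is closest to the input."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(TEMPLATES[ch], bitmap))


def recognize_line(bitmaps):
    """Turn a sequence of glyph bitmaps into an editable string."""
    return "".join(recognize_glyph(b) for b in bitmaps)


scanned = [
    ("111", "010", "010"),  # clean T
    ("010", "010", "011"),  # I with one noisy pixel
    ("100", "100", "111"),  # clean L
]
text = recognize_line(scanned)
print(text)
```

Once the glyphs have become an ordinary string, they can be searched, copied or edited like any other text, which is what a word processor relies on.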

In this way, OCR makes many people's work easier. If someone scans a book in order to write a summary, OCR lets them work with the scanned text in a program such as Microsoft Word, cutting, copying and pasting any word. This would be impossible without the recognition step, since the computer cannot interpret text that exists only as an image.
In addition to the obvious advantage of storing a text as text and not as an image, there is a considerable difference in file size: images can take up much more disk space than text, which matters if you want to keep entire books in scanned form. Of course, running OCR is not always advisable, especially if there is no intention of editing the content.
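To get a feel for the size gap, here is a back-of-the-envelope calculation. The figures are assumptions chosen for illustration: an uncompressed 8-bit grayscale scan of a letter-size page at 300 dpi, and a page of about 3,000 characters stored at one byte each. The function names are hypothetical as well.

```python
def raw_image_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Size of an uncompressed scan: pixel count times bytes per pixel."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def plain_text_bytes(characters, bytes_per_char=1):
    """Size of the same page stored as plain text (one byte per character assumed)."""
    return characters * bytes_per_char


# Assumed page: 8.5 x 11 inches, 300 dpi, 1 byte per pixel, about 3,000 characters.
image_size = raw_image_bytes(8.5, 11, 300, 1)
text_size = plain_text_bytes(3000)

print(f"image: {image_size:,} bytes")
print(f"text:  {text_size:,} bytes")
print(f"ratio: about {image_size // text_size}x")
```

Compressed image formats narrow this gap considerably in practice, so the ratio above should be read as an upper bound under these assumptions rather than a typical figure.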
It is curious that a single application can change the capabilities of a computer so drastically, but this is true in general: although modern processors can be very efficient, especially when combined with state-of-the-art memory and disks, they are useless without the appropriate programs. That is why the same machine can go from being useless to extremely advanced simply because of the software it runs.
The case of OCR is particular, since it gives the computer a skill that is basic for most human beings: reading. It is worth mentioning that reading is not an easy task for either humans or computers, although people usually learn it from a very young age, which is why we become so skilled at it, even when faced with handwriting that is difficult to decipher.
Despite the advancement of technology, OCR still faces a number of problems. Getting a digital system to recognize handwritten text, for example, is quite difficult. The process often has trouble segmenting the various units of text, and the same happens when words appear close together.
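The segmentation problem can be seen in a small sketch. One simple approach, assumed here purely for illustration, is to split a line of text wherever a column contains no ink. The function name segment_columns and the tiny bitmaps are hypothetical; the point is only that units which touch leave no blank column to cut at.

```python
def segment_columns(bitmap):
    """Split a binary line image into (start, end) column spans.

    A span is a run of columns containing ink ('#'); spans are separated
    by at least one column that is blank in every row.
    """
    width = len(bitmap[0])
    has_ink = [any(row[x] == "#" for row in bitmap) for x in range(width)]

    spans = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                  # a new unit begins
        elif not ink and start is not None:
            spans.append((start, x))   # a blank column closes the unit
            start = None
    if start is not None:
        spans.append((start, width))   # unit runs to the right edge
    return spans


# Two units separated by blank columns.
spaced = [
    "##..##",
    "#...#.",
    "##..##",
]

# The same two units written so close together that they touch.
touching = [
    "####",
    "#.#.",
    "####",
]

print(segment_columns(spaced))
print(segment_columns(touching))
```

With the spaced input the sketch finds two spans, while the touching input collapses into a single span, so whatever step comes next would see one unfamiliar shape instead of two known ones. Connected handwriting causes the same kind of failure.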

Other OCR errors can appear when there is not enough contrast between the words and the background. Suppose that text in black letters is printed on a gray sheet: the OCR process may not be able to make out the letters and words.
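A minimal sketch of why low contrast hurts: many pipelines first turn the scan into pure black and white by comparing each pixel with a threshold. The fixed threshold of 128, the gray levels and the function names below are assumptions made for this example; real systems often choose the threshold adaptively.

```python
def contrast(ink_level, paper_level):
    """Michelson contrast between two gray levels (0 = black, 255 = white)."""
    hi = max(ink_level, paper_level)
    lo = min(ink_level, paper_level)
    return (hi - lo) / (hi + lo) if hi + lo else 0.0


def binarize(pixels, threshold=128):
    """Mark a pixel as ink ('#') when it is darker than the threshold."""
    return ["".join("#" if p < threshold else "." for p in row) for row in pixels]


# A vertical black stroke (level 20) on white paper (255) and on gray paper (100).
white_page = [[255, 20, 255], [255, 20, 255]]
gray_page = [[100, 20, 100], [100, 20, 100]]

print(contrast(20, 255), binarize(white_page))
print(contrast(20, 100), binarize(gray_page))
```

On the white page the stroke survives binarization as a single column of ink. On the gray page both the stroke and the paper fall below the assumed threshold, so the whole region is marked as ink and the letter shape is lost.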
Let us not forget that, just as an action as apparently simple as walking down the street requires a series of complementary actions to avoid obstacles and stay safe, reading a printed text is the result of several simultaneous recognition tasks, which we carry out almost unconsciously but which still take effort.
When faced with a text, our own OCR system is responsible for finding and recognizing the title, identifying paragraphs, punctuation marks, spaces between words and abbreviations, among other elements. It also makes an effort to decipher fonts that are too ornate or untidy, and to fill in the information in regions that have suffered some kind of damage, such as an ink stain or a missing piece of paper.

