What is OCR and when should it be used?

Optical Character Recognition (OCR) is the technology of analyzing the picture elements (pixels) of an image and converting them to a text code such as ASCII, in which each character is represented by a number. OCR is valuable because a computer can store a digital image as a pattern of pixels but cannot interpret what characters those pixels form without further processing.
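
To make "text code" concrete, here is a minimal Python sketch of how characters map to ASCII numbers and to the bits a computer stores. The function names to_ascii_codes and to_bits are illustrative only and do not come from any OCR product.

```python
def to_ascii_codes(text: str) -> list[int]:
    """Return the numeric code of every character in text."""
    return [ord(ch) for ch in text]


def to_bits(text: str) -> list[str]:
    """Return each character's code as an 8-bit binary string."""
    return [format(ord(ch), "08b") for ch in text]


print(to_ascii_codes("OCR"))  # [79, 67, 82]
print(to_bits("OCR"))         # ['01001111', '01000011', '01010010']
```

These numbers are what OCR has to produce from a picture of the letters; the picture itself contains only pixel values.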

First, let’s review the basic process of digital scanning. Scanning is the process of reflecting light off an object onto a photosensitive element. The reflected light strikes the photosensitive elements at various intensities, depending on how much of the light was reflected and how much was absorbed. A completely white object will reflect nearly 100% of the light, while a completely black object will reflect almost none. The reflected image is thus a pattern of light of various intensities, which the photo sensors convert to a pattern of dots, or pixels, of corresponding densities. How many pixels are captured per unit of length is the resolution of the scan. So digital scanning creates an image that is a pattern of pixels.
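
The conversion from reflected light to pixel values can be sketched roughly as follows. This assumes an 8-bit grayscale scan (256 levels, 0 for black and 255 for white) and uses made-up reflectance readings; actual scanner hardware and firmware will differ.

```python
def reflectance_to_pixel(reflectance: float, levels: int = 256) -> int:
    """Map a reflectance between 0.0 and 1.0 to a gray level (0 = black)."""
    clamped = min(max(reflectance, 0.0), 1.0)
    return round(clamped * (levels - 1))


def scan_line(reflectances: list[float]) -> list[int]:
    """Convert one row of sensor readings into one row of pixels."""
    return [reflectance_to_pixel(r) for r in reflectances]


# Hypothetical readings across a dark pen stroke on white paper.
row = scan_line([1.0, 0.98, 0.04, 0.0, 0.98])
print(row)  # [255, 250, 10, 0, 250]
```

The result is only a row of brightness numbers. Nothing in it says which letter, if any, the dark pixels belong to.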

This digital image can be stored in a variety of digital formats such as JPEG, TIFF or even PDF (Portable Document Format). One normally associates a PDF with a printed document, but a PDF created directly from scanning is really only a digital image that a human can easily interpret when viewing or printing it. You can tell whether the PDF you are viewing on your computer is an image or text by trying to select a line of text. If it is an image, the text will not highlight; the best you can do is select the entire area around the text. If the PDF is actually text, you can copy the text into another document, such as a Microsoft Word file, and edit it. So a PDF could contain actual text, which is encoded as a series of numbers that a computer can interpret, or it could contain an image, which is a pattern of dots that is not encoded as text.

A text-encoded PDF is interpreted as text by the computer and can therefore be processed as text. This includes editing, searching, and changing fonts or font sizes. An image PDF is interpreted as an image by the computer, so one cannot edit the text or search it by text strings.

With OCR technology, an image of text can be converted to text code. Patterns of pixels are grouped into objects and then analyzed by sophisticated computer algorithms that convert each pattern into a usable character code. All text elements, such as letters, symbols and numbers, are ultimately represented by a series of zeros and ones. Recognition can be quite challenging, since there are hundreds of fonts and font sizes, variations such as bold and italic, and hundreds of symbols beyond the normal alphanumeric set.
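
As a rough illustration of turning a pixel pattern into a character code, the sketch below uses simple template matching: it compares a small bitmap against a few stored letter shapes and picks the closest one. This is a deliberately simplified technique chosen for illustration, not a description of how any particular OCR product works, and the 3x5 bitmaps and function names are hypothetical.

```python
# Hypothetical 3x5 letter shapes: "#" is a dark pixel, "." is a light one.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def mismatch(glyph, template):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the template character with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda ch: mismatch(glyph, TEMPLATES[ch]))


# A scanned "T" with one stray dark pixel still matches "T" best.
noisy_t = ["###", ".#.", ".#.", "##.", ".#."]
letter = recognize(noisy_t)
print(letter, ord(letter))  # T 84
```

Even this toy version shows why fonts and styles matter: every additional typeface, size, or bold and italic variant is another set of shapes the algorithm has to tell apart, and a noisy scan can push a pattern closer to the wrong one.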

Scanning a document so that the text can be read by a computer requires an additional processing step and adds cost. There are some low-cost OCR processors on the market today, but they tend to be slow and are not 100% accurate. In other words, the algorithm can misinterpret some of the dot patterns. There are machines that are much faster and more accurate, but they are of course much more expensive. Most consumers, or even businesses, could not justify the expense of a high-quality OCR machine. It is normally much more cost-effective to hire a company such as Microfacs to scan documents into text-readable form.

There are options other than OCR for scanning documents. Sometimes indexing the document with identifiers such as date, title and location is all that is needed for efficient storage and retrieval. In other cases it is desirable to search by document content, which requires OCR technology. As a full-service scanning and digitizing company, Microfacs can accommodate your specific need in a cost-effective and timely manner.
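
The difference between the two retrieval options can be sketched as follows. The record layout, field names and sample documents are hypothetical and are used only to contrast index lookup with content search.

```python
from dataclasses import dataclass


@dataclass
class ScannedDocument:
    title: str
    date: str           # ISO format, e.g. "2019-03-14"
    location: str
    ocr_text: str = ""  # stays empty when the scan was not run through OCR


def find_by_index(docs, **fields):
    """Retrieve documents whose index fields match exactly."""
    return [
        d for d in docs
        if all(getattr(d, name) == value for name, value in fields.items())
    ]


def find_by_content(docs, phrase):
    """Retrieve documents whose OCR text contains the phrase."""
    return [d for d in docs if phrase.lower() in d.ocr_text.lower()]


# Made-up records: the first was only indexed, the second was also OCR'd.
docs = [
    ScannedDocument("Lease agreement", "2019-03-14", "Box 12"),
    ScannedDocument("Board minutes", "2020-01-09", "Box 7",
                    ocr_text="Motion to approve the budget carried."),
]

print([d.title for d in find_by_index(docs, location="Box 12")])  # ['Lease agreement']
print([d.title for d in find_by_content(docs, "budget")])         # ['Board minutes']
```

Index lookup works for every document but only on the fields someone entered. Content search can find a phrase anywhere in the text, but only in documents that went through OCR.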
