Optical Character Recognition
During your foray into the world of document scanning, you've probably come across the term "OCR". You may even know that it stands for "Optical Character Recognition". But what is OCR, really, and what do you need to know about it to make the best use of this sophisticated and valuable tool?
We're here to give you a run-down on Optical Character Recognition, answer any questions you might have, and recommend the best OCR software for your scanning project. Let's begin!
What is OCR?
The primary purpose of Optical Character Recognition is to quickly and automatically convert scanned images of machine-printed (typed) text into actual text data that you can search through and modify. To a computer, a scanned page is no more meaningful than any other collection of pixels, such as a landscape photo. The exact mechanics of this process are complicated, but suffice it to say that an OCR engine examines the pixel data, searches for patterns resembling letters, numbers, and other symbols, and creates a digital record of those symbols.
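To make the idea of "searching for patterns" a little more concrete, here is a toy sketch in Python. It is not how any particular OCR engine works; it simply compares a tiny bitmap against a few stored letter shapes and picks the closest one. The 3x5 templates and the function names are made up for illustration.

```python
# Toy illustration only: real OCR engines use far more elaborate methods.
# Each glyph is a 3x5 bitmap; '#' is ink and '.' is background.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def score(glyph, template):
    """Count the pixels where the glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the template character that agrees with the glyph on the most pixels."""
    return max(TEMPLATES, key=lambda ch: score(glyph, TEMPLATES[ch]))


# A slightly noisy "T": one pixel is missing from the top bar.
noisy_t = ["##.", ".#.", ".#.", ".#.", ".#."]
print(recognize(noisy_t))  # prints "T"
```

Even with a pixel missing, the noisy glyph still agrees with the "T" template more than with the others, which is why a reasonably clean scan can tolerate small defects.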
Types of OCR
There are two major types of Optical Character Recognition:
Full Page OCR - Converts the entire page into one of the following formats:
- Plain Text - Only the basic text information on the page is retained, in consecutive order.
- Formatted Text - Text information is retained in consecutive paragraphs, preserving font size and style. This can also preserve tables in a tabular format, such as a spreadsheet.
- Exact Copy - All information on the page is retained, including graphics, and placed on the page in such a way as to most closely recreate the original document.
- Searchable File - Text information is retained on a hidden layer behind the scanned image, allowing the file to be searched while retaining the appearance of the original.
Zone OCR - Recognizes strings of text located in particular areas of the page. This is usually done for the purpose of indexing and document management. The information can be used to name a file, save it to a particular location, or archive particular pieces of data in an organized format, such as a database.
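As a rough sketch of how Zone OCR output can drive file naming, the Python example below reads two fixed areas of a page and builds a file name from them. To keep it self-contained, the "page" is a small grid of already-recognized characters standing in for a scanned image, and the zone coordinates, field names, and file name pattern are all hypothetical.

```python
# Sketch only: PAGE stands in for a scanned image, and the zones are made up.
PAGE = [
    "INVOICE NO: 10482".ljust(24) + "DATE: 2021-03-15",
    "BILL TO: ACME SUPPLY CO.".ljust(40),
    "TOTAL DUE: 318.40".ljust(40),
]

# Each zone is (row, first column, one past the last column).
ZONES = {
    "invoice_no": (0, 12, 17),
    "date": (0, 30, 40),
}


def read_zone(page, zone):
    """Return the text found inside one rectangular zone of the page."""
    row, start, end = zone
    return page[row][start:end].strip()


def build_filename(page, zones):
    """Name the output file after the values read from the zones."""
    fields = {name: read_zone(page, zone) for name, zone in zones.items()}
    return f"invoice_{fields['invoice_no']}_{fields['date']}.pdf"


print(build_filename(PAGE, ZONES))  # prints "invoice_10482_2021-03-15.pdf"
```

The same extracted values could just as easily be used to choose a destination folder or fill a row in a database; only the last step changes.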
Levels of OCR Software
OCR software comes in many different types, which vary in price based on their features, speed, and accuracy. For instance, you can get a freeware program such as SimpleOCR that will serve in a pinch, but it will only be able to convert BMP, JPG, and TIF images of English or French text into plain text documents in TXT or DOC format, one page at a time.
On the other hand, you can invest a few hundred dollars in Batch OCR or even Server OCR software that will watch particular folders for incoming documents in a variety of image formats and languages, then automatically recreate exact copies of all of their pages in the format of your choice.
You can also find Desktop OCR software, which bridges the price gap and includes many of the features of the batch and server (corporate) editions but still requires some user input during conversion.
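The folder-watching behavior of batch and server products can be pictured with a short Python sketch. This is only an assumption about the general approach (periodically checking a folder for new image files), not the way any specific product is built. The `convert` argument is a placeholder for whatever OCR step you would actually run, and the extension list is illustrative.

```python
import time
from pathlib import Path

# Illustrative list of image types to pick up; adjust to suit your scanner.
IMAGE_EXTENSIONS = {".bmp", ".jpg", ".tif", ".png"}


def find_new_images(folder, seen):
    """Return image files in the folder that are not yet in `seen`, and record them."""
    new_files = []
    for path in sorted(Path(folder).iterdir()):
        is_image = path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
        if is_image and path not in seen:
            seen.add(path)
            new_files.append(path)
    return new_files


def watch(folder, convert, poll_seconds=5.0, max_polls=None):
    """Poll the folder and hand each new image to `convert` (a placeholder OCR step)."""
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in find_new_images(folder, seen):
            convert(path)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_seconds)
```

Calling `watch("incoming", print, max_polls=1)` would check a folder named `incoming` once and print the path of each image found there; leaving `max_polls` unset keeps the loop running until you stop it.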
Although some OCR engines are better than others, no software can guarantee 100% accuracy. This is because there are other factors in play, including scan quality. Recognition software will not be able to do its work if the scanner is not properly digitizing the page.
For best results, scan at a resolution of 300 dpi. Black & White (Bitonal) mode is preferred over Greyscale or Color modes. Although most modern scanners are fairly well configured out of the box, you may want to adjust your Brightness and Contrast settings for your particular documents.
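Two quick calculations help show what these settings mean in practice. The Python sketch below works out how many pixels a page produces at 300 dpi, and shows how a simple cutoff turns greyscale values into a bitonal image. The cutoff values and the sample pixel data are invented for illustration; real scanners and OCR packages use their own, usually more sophisticated, methods.

```python
def pixel_dimensions(width_in, height_in, dpi=300):
    """Return the scanned image size in pixels for a page measured in inches."""
    return round(width_in * dpi), round(height_in * dpi)


def to_bitonal(grey_rows, threshold=128):
    """Map grey values (0 = black, 255 = white) to 1 for ink or 0 for paper."""
    return [[1 if value < threshold else 0 for value in row] for row in grey_rows]


# A US Letter page (8.5 x 11 inches) scanned at 300 dpi.
print(pixel_dimensions(8.5, 11))  # prints (2550, 3300)

# A faint vertical stroke: the "ink" is only mid-grey (140-150).
faint_scan = [
    [250, 140, 245],
    [248, 150, 251],
]
print(to_bitonal(faint_scan))                 # prints [[0, 0, 0], [0, 0, 0]]
print(to_bitonal(faint_scan, threshold=200))  # prints [[0, 1, 0], [0, 1, 0]]
```

With the default cutoff the faint stroke disappears entirely, which is the kind of problem that adjusting Brightness and Contrast for light or faded originals is meant to avoid.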
If you do not have a scanner that has the necessary speed, quality, or other features that you require to scan your documents, you can always find a large selection of scanners at ScanStore!
ScanStore even has a handy scanners guide to help you find the perfect scanner for your specific requirements and price range.
Limitations of OCR
OCR software is also limited in what it is able to recognize. Most OCR software is designed to recognize only machine-printed text, as opposed to handwriting. While there is ICR (Intelligent Character Recognition) software that can recognize handwritten information, it tends to be an enterprise-level solution for forms processing work, rather than full page recognition.
Similarly, most OCR software is only able to convert traditional machine fonts, not cursive scripts or calligraphy. There are many fonts out there, and OCR engines depend on common, separated letter shapes to recognize the text, so fonts that are unusual or whose letters flow together are unlikely to be recognized correctly.