Qt Box Editor
Version: 1.12rc1 Web: https://github.com/zdenop/qt-box-editor
Tesseract is a great example of optical character recognition (OCR) technology. You might think that Tesseract should belong to the OpenCV family, but in fact it came out before OpenCV. Tesseract is a free alternative to ABBYY FineReader, a commercial product that delivers state-of-the-art OCR quality. There are many ways you can achieve a FineReader-like experience with Tesseract in Linux, and perhaps the best one would be using the gImageReader front-end (see LXF229). You’ll notice that while Tesseract has almost no trouble with quality images like screen grabs or high-resolution scans of laser printouts, it stumbles over less-readable images.
Various Tesseract training tutorials describe how to tackle this problem. The core idea is to take a sample image, extract characters from it (‘as is’) to form a Box file, and then manually edit that file to correct all erroneous characters. Tesseract can then match the way a letter looks on the image with the correct Unicode symbol. The more valid pairs Tesseract has learned, the more precise future recognition attempts will be.
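To make the idea concrete, here is a minimal Python sketch of what a manual Box file correction amounts to. It assumes the usual Tesseract Box layout, where each line holds a symbol followed by left, bottom, right and top pixel coordinates and a page number. The sample lines and helper names (Box, parse_box_line, format_box_line) are made up for illustration and are not part of Tesseract or QtBoxEditor.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One line of a Box file: a symbol plus its bounding box and page."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so a multi-character symbol stays in one piece.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def format_box_line(box: Box) -> str:
    return f"{box.symbol} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


# Hypothetical sample: the second glyph was misread as '3' instead of 'e'.
sample = "T 36 92 53 122 0\n3 55 92 70 114 0\ns 72 92 85 114 0\n"
boxes = [parse_box_line(line) for line in sample.splitlines()]

# The manual correction: keep the coordinates, replace only the symbol.
boxes[1] = boxes[1]._replace(symbol="e")

corrected = "\n".join(format_box_line(b) for b in boxes)
print(corrected)
```

Doing this by hand for every wrong symbol on a page is exactly the tedious part that a dedicated editor is meant to speed up.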
Editing a Box file is the most time-consuming operation. It requires lots of patience and diligence. QtBoxEditor is a tool that helps the process along by providing a smart GUI. It shows the source image on the right and a narrow spreadsheet-like area on the left. Navigating between cells is very fast and can be controlled by the arrow keys.
Compared to a conventional text editor, QtBoxEditor enables you to complete an average page nearly twice as fast. When you move to the next row in the ‘spreadsheet’ area, the application highlights the corresponding letter on the image. When working with scanned old typewriter sheets or other poorly decipherable images, Tesseract sometimes makes errors when detecting letter ‘boxes’. Luckily, QtBoxEditor features a selection tool that makes it simple to correct the box.
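If you are curious how a box row maps to a highlighted rectangle, the sketch below shows the arithmetic involved. It assumes Box coordinates are measured from the bottom-left corner of the image, as Tesseract Box files normally are, while most image viewers count from the top-left. The function name and sample numbers are hypothetical, and this is not taken from QtBoxEditor's own source.

```python
def box_to_rect(left: int, bottom: int, right: int, top: int,
                image_height: int) -> tuple:
    """Convert a bottom-left-origin box to a top-left-origin (x, y, w, h)."""
    x = left
    y = image_height - top      # flip the vertical axis
    width = right - left
    height = top - bottom
    return (x, y, width, height)


# Hypothetical box on an image assumed to be 200 pixels tall.
rect = box_to_rect(36, 92, 53, 122, image_height=200)
print(rect)  # (36, 78, 17, 30)
```

Redrawing a box with a selection tool boils down to running this conversion in reverse and writing the new four numbers back to the row.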