Tesseract Training Tools

2009-June:  While constructing some training-pages for the Tesseract OCR-program, I wrote several bash-scripts:

cvt-tesseract-box-to-source - converts a tesseract box-file to the "source" form of the TIFF-image
cvt-tesseract-source-to-box - revises a tesseract box-file using the sourcefile
cvt-tesseract-box-fixups - revises a tesseract box-file to work around certain bugs in "makebox"
tesseract-training-from-images - does all the training steps for a set of TIFF images
tesseract-training-from-source - does all the training steps for a sourcefile and a specified set of fonts

Using tesseract-training-from-images is a low-level alternative to using a boxfile-editor such as tesseractTrainer.py.  It pauses with instructions about which files you need to revise.  Your first choice should be to revise the SOURCEFILE, since that's easier; resort to revising the boxfile only if forced to.  When editing a SOURCEFILE, you may freely insert and remove spaces and newlines, but any other length-changing revision requires some care: to replace one character with several, enclose the new characters in parentheses; similarly, to ignore a character, replace it with the empty string enclosed in parentheses.  However, when you need to revise bounding-boxes, for example to split one into two, you'll need to edit the boxfile.  It's probably only in such cases that anyone will prefer this low-level method over a boxfile-editor such as tesseractTrainer.py.
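
As an illustration of the SOURCEFILE editing rules above, here is a minimal sketch of how such a file might be split into one label per bounding-box.  The function name sourcefile_tokens is hypothetical (it is not one of the scripts listed above), the square brackets in its output are only there to make empty labels visible, and it makes no attempt to handle literal parentheses in the page text.

```shell
# Hypothetical helper: read SOURCEFILE text on stdin and print one label
# per bounding-box, each wrapped in [ ] so that empty labels stay visible.
# Assumption: "?" in a shell pattern matches one whole character
# (true for bash in a UTF-8 locale).
sourcefile_tokens() (
    # Spaces and newlines may be inserted or removed freely, so drop them.
    text=$(tr -d '[:space:]')
    while [ -n "$text" ]; do
        rest=${text#?}          # everything after the first character
        ch=${text%"$rest"}      # the first character itself
        case $text in
            "("*")"*)
                # Parenthesized group: its content is the label for one box;
                # "()" yields an empty label, i.e. a character to ignore.
                group=${text%%\)*}
                printf '[%s]\n' "${group#\(}"
                text=${text#*\)}
                ;;
            *)
                # Ordinary character (or an unmatched "("): one box each.
                printf '[%s]\n' "$ch"
                text=$rest
                ;;
        esac
    done
)
```

For example, `printf 'a (rn) b ()c\n' | sourcefile_tokens` prints five labels, one per line: [a], [rn], [b], [] and [c].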

Using tesseract-training-from-source is a lot easier.  It is fully automated, requiring no manual intervention.  It constructs the images from a textfile, then proceeds to do the training with these "synthetic" images.  This method is simpler to use, fool-proof, and immune to bugs in tesseract's makebox output.  It is obviously the way to go if you're doing training for "screen fonts".  For OCRing scanned pages, however, training on images that have been degraded by printing & scanning might produce better results.  That would also be a lot more work, since subtle changes to bounding-box coordinates are unavoidable during that process...  Or can that sort of degradation be faked, by temporarily rasterizing to a higher-than-desired resolution in order to move characters around by "fractional" pixel amounts, and possibly applying small rotations?  I may yet give that a try, although only if I convince myself that it does indeed produce better results.

Note: both training scripts presently produce the set of 8 training-files as needed for tesseract versions through 2.04;  they'll be converted to the single-file format once version-3.0 is available and its training-procedure documented.
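
To check that a training run left all 8 files in place, something like the following sketch could be used.  The function name missing_training_files is hypothetical, and the list of file suffixes is my understanding of the tesseract 2.x tessdata layout rather than something taken from the scripts themselves.

```shell
# Hypothetical helper: print the names of any of the 8 per-language
# training-files that are absent from a directory.
# Usage: missing_training_files LANG [DIR]
missing_training_files() (
    lang=$1
    dir=${2:-.}
    # Assumed tesseract 2.x suffixes, e.g. eng.inttemp, eng.unicharset.
    for ext in DangAmbigs freq-dawg inttemp normproto pffmtable \
               unicharset user-words word-dawg; do
        # Only existence is tested; empty files are not reported.
        [ -f "$dir/$lang.$ext" ] || printf '%s\n' "$lang.$ext"
    done
)
```

Running `missing_training_files eng /usr/share/tessdata` prints nothing when all eight files are present.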

Other Stuff:

eng.SOURCEFILE - plain-text file suitable for English-language training
deu.SOURCEFILE - plain-text file suitable for Deutsch-language training

A message to tesseract-ocr@googlegroups.com from Jonathin (2009-11-07 15:49) mentions getting better results with either of:

	tesseract X.tif X   batch     makebox
	tesseract X.tif X   nobatch   makebox
as compared to the recommended use of batch.nochop, and he requests an explanation of the differences.  I have found no explanation of these mysterious command-line options (although nobatch is mentioned in Issue 59 from 2007-Aug25); however, I've deduced that they are "scripts" found under /usr/share/tessdata/tessconfigs/ that assign values to chop_enable, enable_assoc, and display_text; other files in that directory involve other "variables".  I suppose one has to read the source to learn what these do.
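
To see for yourself what each of these config "scripts" assigns, a small sketch like the one below prints their contents side by side.  The function name show_tessconfigs is hypothetical; the directory is passed as an argument because its location may differ between installations.

```shell
# Hypothetical helper: print every line of the named tessconfigs files,
# prefixed with the config name so the assignments are easy to compare.
# Usage: show_tessconfigs DIR NAME...
show_tessconfigs() (
    dir=$1
    shift
    for name in "$@"; do
        if [ -r "$dir/$name" ]; then
            awk -v n="$name" '{ print n ": " $0 }' "$dir/$name"
        else
            printf '%s: (not found in %s)\n' "$name" "$dir"
        fi
    done
)
```

For the three configs discussed above, the call would be `show_tessconfigs /usr/share/tessdata/tessconfigs batch nobatch batch.nochop`.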

Send your questions, suggestions, bug-reports to ereimer@shaw.ca.