DonationCoder.com Forum

Topic started by: ewemoa on December 01, 2013, 10:07 PM

Title: *NIX; tesseract OCR experiences
Post by: ewemoa on December 01, 2013, 10:07 PM

Recently I've been trying out the tesseract OCR option (both via gimagereader and via the command line) with mixed but, IMHO, tolerably good results, at least for English text.

In my usage, I notice occasional recognition results such as:

"fi" being recognized as "ﬁ" (see: http://www.fileformat.info/info/unicode/char/fb01/index.htm)
"fl" being recognized as "ﬂ" (see: http://www.fileformat.info/info/unicode/char/fb02/index.htm)
lowercase "w" being recognized as "W"
lowercase "v" being recognized as "V"
lowercase "w" being recognized as "vv"

It seems to me that for some of these cases there is little point in accepting the results as-is (e.g. "vv" seldom occurs in English words). I'm about to go through a page describing tesseract configuration parameters (http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version) in the hope of turning up something applicable. Does anyone have relevant tesseract experience to share?

I'm using tesseract 3.02.02.
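
Independently of any tesseract setting, the ligature and "vv" substitutions listed above could be undone in a post-processing pass over the output text. What follows is only a minimal sketch in Python, not something tesseract provides: the function name clean_ocr_text and the KEEP_VV exception list are made up for illustration, and the list is surely incomplete. The wrong-case "W" and "V" results are not handled, since fixing those would probably need a dictionary lookup.

```python
import re

# The two ligature code points seen in the OCR output (U+FB01, U+FB02).
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl"}

# Hypothetical, incomplete list of English words that really do contain "vv".
KEEP_VV = {"savvy", "revved", "revving", "navvy", "divvy", "skivvy", "chivvy"}


def clean_ocr_text(text):
    """Undo a few common OCR substitutions in already-recognized text."""
    # Expand each single-character ligature back into two plain letters.
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)

    def fix_word(match):
        word = match.group(0)
        # Leave words alone if they are known to contain a genuine "vv".
        if word.lower() in KEEP_VV:
            return word
        return word.replace("vv", "w")

    # Only words containing "vv" are matched, so everything else is untouched.
    return re.sub(r"[A-Za-z]*vv[A-Za-z]*", fix_word, text)


if __name__ == "__main__":
    print(clean_ocr_text("A \ufb01le with \ufb02ovving text from a savvy user"))
```

Keeping an explicit exception list errs on the side of not damaging real words, at the cost of having to extend it by hand when a new "vv" word turns up.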