*NIX; tesseract OCR experiences

ewemoa:
Recently I've been trying out tesseract OCR (both via gimagereader and from the command line), with mixed but tolerably good (IMHO) results, at least for English text.

In my usage, I notice occasional misrecognitions such as:

* "fi" being recognized as "ﬁ" (see: http://www.fileformat.info/info/unicode/char/fb01/index.htm)
* "fl" being recognized as "ﬂ" (see: http://www.fileformat.info/info/unicode/char/fb02/index.htm)
* lowercase "w" recognized as "W"
* lowercase "v" recognized as "V"
* lowercase "w" recognized as "vv"

It seems to me that for some of these cases there is little point in accepting the results as-is (e.g. "vv" seldom occurs in English words). I'm about to go through a page describing tesseract's configuration parameters in the hope of turning up something applicable, but does anyone have relevant tesseract experience to share?
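
Some of the cases listed above look fixable after the fact rather than inside tesseract. Here is a minimal Python sketch of a post-processing pass over the OCR text output. It is not a tesseract feature: the function name clean_ocr_text, the VV_WORDS exception list, and the mid-word capital heuristic are all assumptions made up for illustration, and the "vv" rule in particular will misfire on any legitimate "vv" word missing from the list.

```python
import re

# Map the two ligature code points (U+FB01, U+FB02) back to plain letters.
LIGATURES = str.maketrans({"\ufb01": "fi", "\ufb02": "fl"})

# Hypothetical, incomplete list of English words that really contain "vv".
VV_WORDS = {"savvy", "navvy", "divvy", "skivvy", "chivvy",
            "revved", "revving", "flivver", "bovver"}


def _fix_vv(match):
    """Turn "vv" into "w" unless the word is a known "vv" word."""
    word = match.group()
    if "vv" in word and word.lower() not in VV_WORDS:
        return word.replace("vv", "w")
    return word


def clean_ocr_text(text):
    """Apply simple heuristic fixes to OCR output (assumed English text)."""
    # 1. Expand the "fi" and "fl" ligatures.
    text = text.translate(LIGATURES)
    # 2. Lowercase a capital W or V sitting between two lowercase letters;
    #    capitals at the start of a word are left alone (too ambiguous).
    text = re.sub(r"(?<=[a-z])[WV](?=[a-z])",
                  lambda m: m.group().lower(), text)
    # 3. Replace "vv" with "w" word by word, skipping the exceptions.
    text = re.sub(r"[A-Za-z]+", _fix_vv, text)
    return text


if __name__ == "__main__":
    print(clean_ocr_text("The \ufb01le vvindow is loWer; a savvy user will hoVer."))
```

A capitalized "W" or "V" at the start of a word is deliberately left untouched, since telling "Window" from a misread "window" would need a dictionary or sentence context.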

I'm using tesseract 3.02.02.
