Author Topic: OCR - comparisons of different software/capability (Read 21169 times)

IainB · « **on:** October 14, 2012, 02:51 AM »

Objective:
To run a quick comparison of the accuracy of some of the OCR tools available on my laptop and in the Cloud:
   ○ MS Office 2007 MSPVIEW.EXE (laptop).
   ○ MS OneNote (laptop).
   ○ ABBYY Screenshot Reader (laptop).
   ○ Google docs (Cloud).

Method:
Input data is in the form of a .tif image document, made by scanning a laser-printed document (black on white). The image was of a date-ordered financial statement.
For the purposes of the comparison, I just focussed on a range of dates in the date column, not the whole document image.
The table below has had all the OCRed results put into the same font and font size, and the single image has been resized/aligned, so that it is easy to compare the results across columns.

Results:
   ○ The errors have been highlighted in yellow.
   ○ It is evidently a tie between OneNote OCR and ABBYY clip OCR, both with 100% accuracy.
   ○ Google docs OCR is a very close second, with only one error - it missed a dot(!).
   ○ MSPVIEW seems to have made no data errors, but has inserted spaces where there were none.
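
For anyone wanting to put numbers on a comparison like the one above, here is a minimal sketch using only Python's standard `difflib`. It is not how the table in this post was produced (that was done by eye); the function names and the sample date strings are made up for illustration, and the real client data is not shown.

```python
from difflib import SequenceMatcher


def char_accuracy(truth: str, ocr: str) -> float:
    """Fraction of ground-truth characters that the OCR output reproduces, in order."""
    if not truth:
        return 1.0 if not ocr else 0.0
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth)


def describe_errors(truth: str, ocr: str):
    """List the non-matching edits as (operation, expected_text, ocr_text) tuples."""
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    return [
        (tag, truth[i1:i2], ocr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical sample values, not taken from the scanned statement.
expected = "14.10.2012"
missed_dot = "1410.2012"          # one character dropped
extra_spaces = "14. 10. 2012"     # data intact, spaces inserted

print(char_accuracy(expected, missed_dot), describe_errors(expected, missed_dot))
print(char_accuracy(expected, extra_spaces), describe_errors(expected, extra_spaces))
```

Note the design choice: inserted spaces do not lower `char_accuracy`, because every expected character is still present; they only show up in `describe_errors`. That mirrors the distinction drawn above between a data error and spurious spacing.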

OCR - comparison of results 2012-10-14.png


cranioscopical · « **Reply #1 on:** October 14, 2012, 07:49 AM »

That's interesting, thanks for preparing and sharing the results.

Contro · « **Reply #2 on:** October 15, 2012, 07:42 AM »

Nice work indeed.
I have seen a new, more powerful scanning method for your own documents or books.

Scanner

IainB · « **Reply #3 on:** October 15, 2012, 08:46 PM »

Nice work indeed.
I have seen a new, more powerful scanning method for your own documents or books.

Scanner
-Contro (October 15, 2012, 07:42 AM)

Thank you! Nice find.

oversky · « **Reply #4 on:** October 15, 2012, 10:15 PM »

Can you provide the original .tif? I would like to test it with other OCR. Thank you.

IainB · « **Reply #5 on:** October 16, 2012, 01:11 AM »

Can you provide the original .tif? I would like to test it with other OCR. Thank you.
-oversky (October 15, 2012, 10:15 PM)

The .tiff file I copied the date column from is a copy of a client document, and I am not at liberty to publish it. Here is another .tif (not .tiff, but is the same thing, I think) in a .ZIP file. It is a scanned copy of a published hardcopy document about a now defunct corporation.


You could also scan your own .tiff copy, or get a copy of a .tif(f) file from the internet - e.g., search in Google images.

The reason I picked .tiff is that it seems to be the lowest common denominator:
   ○ The format apparently holds more data to make for an error-free OCR result.
   ○ MSPVIEW can only scan/OCR .mdi, .tiff, and .tif files.
   ○ Google docs can (currently) only scan/OCR .tiff files.
   ○ Windows 7 Search can only scan/OCR and index the text in .tiff files.

MS OneNote and ABBYY Screenshot Reader don't seem to have the same constraints, and can scan/OCR text copied from any image so far, with varying degrees of accuracy - depending on the quality of the image.

40hz · « **Reply #6 on:** October 17, 2012, 11:46 AM »

Thanks for posting this info. I have a very old copy of ABBYY, and I'd been debating whether or not to upgrade. But having seen this, and tried the OCR features I'd formerly been neglecting in OneNote, it seems I really don't need it. OneNote's capabilities work just fine for what I'm doing.

IainB · « **Reply #7 on:** October 18, 2012, 06:08 PM »

@40hz: Yes, OneNote has a lot of useful stuff that I had not realised was there.
I still find use for ABBYY Screenshot Reader though, so I would recommend that you get the free Christmas version per post copied below - if it is newer than your existing copy.
I think the ABBYY Screenshot Reader will only run if its licence Service is running too, so it will be one less overhead if you uninstall the one you no longer require, or disable the Service for it at any rate. I am not sure, but I think that otherwise you will find duplicated Services.

I just checked and you can apparently still get the FREE ABBYY Screenshot Reader "RETAIL" (2011 Christmas giveaway) software - download from http://fr7.abbyy.com...enshotReader_ESD.exe

The "newer" version I have (dated 2009-11-20) seems to work just the same as the giveaway version (apparently dated 2009-01), so I don't know what the difference is - if any.
-IainB (October 13, 2012, 09:46 PM)

If you are starting to use OneNote more, then there are other, potentially useful references in DCF that you might not have seen.
For example:

f0dder · « **Reply #8 on:** October 18, 2012, 06:25 PM »

You'd need a lot more samples to construct a corpus valid for comparing OCR products

FWIW, I've had very good results with ABBYY in the (now semi-distant!) past - IIRC, we chose the product mainly because the Danish National Museum was using it, and had decent results. Now, that was back in ~2002, and a lot could have happened since then, but it was relatively fast, had an intuitive UI, and dealt with relatively bad text quite fine (I was doing (civil) conscription at a smaller .dk museum, and some of the stuff I scanned was pretty bad) - oh, and it had decent support for furren languages, like Danish

I would expect Google to have some really wicked stuff internally these days, not the least because of reCAPTCHA, but I do wonder how much of that power they open to the wider public.

IainB · « **Reply #9 on:** October 18, 2012, 07:25 PM »

You'd need a lot more samples to construct a corpus valid for comparing OCR products
...
I would expect Google to have some really wicked stuff internally these days, not the least because of reCAPTCHA, but I do wonder how much of that power they open to the wider public.
-f0dder (October 18, 2012, 06:25 PM)

Yes, definitely a comprehensive and rigorous comparison test would be useful, but I ran a quick comparison of some of the tools, and the OCR scan was just of numeric data and full stops. I suspect that if alphabetic characters had been involved in the OCR scan, then there could have been some more interesting variability in the results.

The Google technology seems to have been quietly introduced without fanfare. I reckoned the OCR scan results were pretty darn good. I don't know if that service is being provided in any other Cloud services at present.

IainB · « **Reply #10 on:** October 18, 2012, 10:28 PM »

More differences between OneNote and ABBYY capability. OneNote seems to be unable to OCR scan red text in an image, whereas ABBYY has no problem.
Whilst inspecting my TEMP folder for something today, I noticed that ABBYY Screenshot Reader had set up a temp folder Scr3101.tmp with a special icon:

OneNote-ABBYY comparison - CHS Error - 02.png


ABBYY had placed into the folder two .tif files - images of an image of an error message from Clipboard Help & Spell that I had originally captured into OneNote. It was a long error message.

Though I had captured the image of the red text message directly into OneNote, OneNote was unable to get anything from OCRing it, so I took clips of the same image with ABBYY - which got the text 100% right. I had had to "ABBYY" it in two bites (hence 2 images) because of its length.
The text of the message in its entirety was in RED.
Interestingly, though the file extension for both ABBYY images was .tif, IrfanView told me that they were in fact .BMP files and asked me if I'd like to rename them to .BMP.
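
For the curious, a viewer can tell that a ".tif" is really a bitmap by looking at the first few bytes of the file rather than at its extension. Below is a minimal sketch of that kind of check. It is not IrfanView's actual code; the function names are made up, and only a handful of well-known signatures are covered.

```python
def sniff_image_type(header: bytes) -> str:
    """Guess an image format from the leading bytes of a file (its 'magic number')."""
    if header.startswith(b"BM"):
        return "bmp"                        # Windows bitmap
    if header[:4] in (b"II*\x00", b"MM\x00*"):
        return "tiff"                       # little-endian / big-endian TIFF
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if header.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of the file at `path` to identify it."""
    with open(path, "rb") as fh:
        return sniff_image_type(fh.read(8))
```

Calling `sniff_file()` on one of the files in that temp folder would be expected to report "bmp" if IrfanView's diagnosis was right, whatever the extension says.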
________________________________________________
EDIT 2015-05-31 1900hrs: Correction to this post.

OneNote - 03 Analysis of Text scrap OCR issues.png


InstantFundas · « **Reply #11 on:** October 19, 2012, 01:58 AM »

You should try Cuneiform OCR. It's free.

anandcoral · « **Reply #12 on:** October 19, 2012, 03:22 AM »

The mention of color got me thinking.

I do not always need OCR, but at times I get images whose text I have to put into a report, and I look for some simple OCR to extract the text quickly.

I have the following installed (all free):
Advanced OCR Free
Free Image OCR
Free OCR to Word
SuperGeek Free Document OCR

(Funnily, all have a similar interface.)

To check whether any of them works on a complex image with colors, I captured an image from my program, saved it as a PNG, and ran all of them on it. They all failed.

I know the image is very complex, with background colors, and is also a PNG file; still, I wanted to find out.

Then I checked online OCR services:

GoogleDoc and www.onlineocr.net did a more or less good job and extracted most of the text.
www.free-ocr.com and www.newocr.com came second, extracting text from light-colored backgrounds better than from complex colored backgrounds.

So finally I can rest assured: GoogleDoc is more robust, and free.

regards,

Anand

IainB · « **Reply #13 on:** October 19, 2012, 04:01 AM »

I dropped that image into OneNote and OCR-scanned it, to give this:

OCR Free test - 2012-10-19_130935 - 02.png

IainB · « **Reply #14 on:** July 22, 2013, 10:32 AM »

Cross-link to post here: Re: PDF-XChange Viewer ($FREE version) - Mini-Review

PDF-XChange Viewer now gets a 5 x from me (was 4½).
No issues re OCR, now - it all seems to work just fine. I see that it currently caters for English, French, German, Spanish, but I have only used the English OCR functionality so far.
-IainB (June 20, 2013, 12:56 AM)

I agree that PDF-XChange Viewer is excellent. I used the free version for years, and recently have upgraded to PRO.

Regarding OCR, it is generally fine for most purposes. Where it begins to have problems is with poorly scanned texts. For those situations I use ABBYY FineReader, and there can be a big difference: where PDF-XChange Viewer might have a 60-70% success rate in recognising text (which is basically unusable, as you can't understand a sentence where a third of the words are unintelligible), FineReader produces a 99.99% correct OCR. But these are marginal cases I'm talking about (book pages scanned at a low resolution).
-dr_andus (July 22, 2013, 07:09 AM)

IainB · « **Reply #15 on:** May 27, 2015, 12:47 AM »

The quality of any given OCR result is likely to depend on the quality of the scanned image input to the OCR engine. The better OCR scanning software will generally have done some pre-processing before the actual OCR scan.
The comparison of OCR scan results in the opening post of this thread showed some errors in the output.

As a result of reading about ScanTailor in the DC Forum and elsewhere, I had downloaded and installed ScanTailor v0.9.11.1 (2012-02-27) 64bit. (Note that this is not the same as ScanTailor Enhanced that is referred to by @Nod5 in one of his posts.)

I had a sample image - a scan of two pages from a book. I straightened it up and tried to improve the B/W contrast using Picasa, but it OCRed in OneNote with quite a lot of errors that would be tedious to manually correct for. I wanted to see whether ScanTailor could help by preprocessing the image to enable a more accurate OCR scan.
So, take a look at this:
This is the input image (2 pages in one image):

[image]

This is what ScanTailor saw had been input:

[image]

After splitting the pages up and doing its work (including adjusting the DPI, deskewing, despeckling, etc.), this is what ScanTailor saw had been output:

[image]

This is the output (2 separate pages) from ScanTailor showing the OCR result (with some errors highlighted) from OneNote:

ScanTailor 04 - Output and OCR result.png

Pretty impressive!
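
To give a feel for what one pre-processing step can look like, here is a minimal sketch of binarisation (turning a grey scan into pure black and white) using Otsu's method. This is a generic textbook technique chosen for illustration, and I am not claiming it is what ScanTailor does internally. The function names are made up, and the "image" is just a flat list of grey levels from 0 to 255.

```python
def otsu_threshold(pixels):
    """Return the grey level t that best separates dark from light pixels.

    Pixels with value <= t are treated as dark. 't' maximises the
    between-class variance of the two groups (Otsu's method).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue                  # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break                     # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels):
    """Map each pixel to 0 (black) or 255 (white) using the Otsu threshold."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]


# Greyish "ink" on an off-white "page": the output is clean black on white.
print(binarize([10, 12, 11, 200, 198, 205]))
```

A real tool would do this on a 2-D image and combine it with deskewing and despeckling, but the idea is the same: give the OCR engine crisp black text on a clean white background.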


Curt · « **Reply #16 on:** May 27, 2015, 10:56 AM »

Last refreshed at 1:09 pm ACE DdShbOdrd Next refresh at 1:14 pm
To display information as per your requirement on this dashboard please contact Coral Softwares
pos-20120225-SD BATCH DATA 230212
[NEW8]
Agent: No | Batch Yes | Location: No | VAT Yes
Datapath f:\Ace5\ACE5NEW8\
Cash balance as on date is 410 00
Todays Sale amount is 0 00
Todays Purchase amount is 0 00
This is NOT the final version.
Please check this version on proper data and
in all required combination.
Please inform for any anamolies to be fixed
and final version delivered
Debug mode enabled for XPPDEBUG LOG
□ Whats New inversion 10.0.019 Dated 18 Oct 2012 (13:13:11)
* Error in Transport Master, fixed.
See menu Help \ Whafs Newforfull list.
f:\Ace5\ACE5NEW8\IMAGE\user_Super/i sor jpg not fou nd
-ABBYY Screenshot Reader about anandcoral's picture

-------------------
-------------------

Actually, I came here hoping to find a mention of today's Bits du Jour offer: ABBYY PDF Transformer+

Does anyone here know if the ABBYY OCR part is the same in both PDF Transformer+ and Screenshot Reader ?

There is quite a difference in their prices...

IainB · « **Reply #17 on:** May 28, 2015, 12:48 AM »

I don't know for sure, but I would expect that the OCR engine would probably be a common core process throughout the ABBYY range of software that involves image OCR scanning.
What my post above shows is that the pre-processing of the scanned image that is fed into the OCR engine can make quite a bit of difference to the quality/accuracy of the OCR scan outputs.
ABBYY Screenshot Reader clearly does some pretty smart stuff - e.g., it will scan for and output columnar information if you tell it that is what you want.

Nod5 · « **Reply #18 on:** May 30, 2015, 04:38 PM »

IainB: ScanTailor and ScanTailor Enhanced produce the same output in many cases; Enhanced has some extra settings and a few tweaks useful for command line processing. ScanTailor is great, and preprocessing with it helps a lot for OCR, though development has been on hold for quite some time now. There is a new maintainer but no feature updates yet. The small helper tools I've made aim to automate or speed up some steps. If you only want black and white text output and already have well cropped inputs (or precrop them with BookCrop or SoloCrop) then you don't need to open the ScanTailor GUI.

Edit: for ScanTailor_multi_core, see this later version, which supports more cores.

wraith808 · « **Reply #19 on:** May 30, 2015, 06:48 PM »

I don't know for sure, but I would expect that the OCR engine would probably be a common core process throughout the ABBYY range of software that involves image OCR scanning.
What my post above shows is that the pre-processing of the scanned image that is fed into the OCR engine can make quite a bit of difference to the quality/accuracy of the OCR scan outputs.
ABBYY Screenshot Reader clearly does some pretty smart stuff - e.g., it will scan for and output columnar information if you tell it that is what you want.
-IainB (May 28, 2015, 12:48 AM)

It is, and it isn't. This is from someone who just did an implementation of the ABBYY FineReader Engine in a custom app. For low-volume OCR, you won't really notice the difference. But as you get higher, there are several different capabilities, settings, and languages that are left out. Even at the level where we are, having the ability to use 4 cores to scan, there is still a higher level that we won't get to until we scan 1,000,000 pages: the ability to use 8 cores.

So they build it with the ability to nickel-and-dime you for features, and believe me, they do nickel-and-dime you. We were very close to just scrapping the whole contract because they're so parasitic, but when they saw that, they just gave in.

IainB · « **Reply #20 on:** May 31, 2015, 01:50 AM »

@Nod5: Thanks for the link re extra cores support. I did pick up on that reference whilst I was reading over your detailed posts in the DC Forum.

@wraith808: Thanks for your comment. From what you say, it rather sounds like software in the 2000s is not too dissimilar to what cars were reportedly like in the 60s/70s - you had to pay through the nose for any "extra" feature - e.g. a heater for the passengers, delayed wiper switch, ... Sheesh. Archaic and greedy marketing for little or no real added value.

IainB · « **Reply #21 on:** May 31, 2015, 03:19 AM »

Just made a significant correction to my post at Re: OCR - comparisons of different software/capability

Looks like I was mistaken about OneNote being unable to scan that red-coloured text error message.
Trouble is, I learned from this that OneNote's OCR is disabled by default (and cannot be enabled) for scrap images of single lines of text. Of what purpose/use that is, I have no idea.

Ath · « **Reply #22 on:** May 31, 2015, 03:57 AM »

Of what purpose/use that is, I have no idea.
-IainB (May 31, 2015, 03:19 AM)

Maybe to avoid it being used for captcha recognition?

Nod5 · « **Reply #23 on:** May 31, 2015, 06:34 AM »

IainB: I put the newer version of ScanTailor_multi_core (with one more bug fixed) on the official page.

wraith808: I'm not surprised! It feels like ~~Acrobat~~ Adobe have a similar approach. I think they've intentionally made Acrobat hard to automate in order to sell their various more expensive corporate solutions.