Author Topic: OCR/Text Recognition in a .pdf document? How do I do this.  (Read 19580 times)

vegas

« on: November 04, 2006, 03:21 PM »
I downloaded an old magazine the other day that had been scanned into a .pdf file. What I don't understand is that when I searched for something within the file, results actually came back. How the heck is this done? I have tons of old magazines I want to be able to throw away someday (the sooner the better), but I'd like to keep all the information in them by scanning them to PDF files. Do I have to buy Adobe Acrobat to do this, or is other software used? Has anyone tried doing this with print material so that it is completely searchable? Thanks for any responses.   -vegas

cranioscopical

« Reply #1 on: November 04, 2006, 05:07 PM »
...scanned into a .pdf file, but what I didn't understand, I went to search for something within the file and results actually came back...
Do I have to buy Adobe Acrobat to do this or is other software used?
Searchable text is certainly an option with Acrobat (not the reader).
I think there's at least one other .pdf 'handler' out there that claims to do the job. Unhelpfully, I can't recall what it is right now. If my brain starts working again (unlikely) I'll let you know the name. Probably others here will fill in the blank, anyway. It's certainly a useful feature.

Try looking here:  http://searchable-text.qarchive.org/
« Last Edit: November 04, 2006, 05:13 PM by cranioscopical »

Darwin

« Reply #2 on: August 28, 2008, 02:41 PM »
A couple of years late... PDF Converter Professional 3 (and 4 and 5) will allow you to select "Save As" and "Searchable PDF" from the File menu....

I've had very good results doing this on various pdfs that I have downloaded from the internet. Most of them are ancient journal articles (such as Science magazine articles from the 19th century) but some recent pdfs are generated as image files rather than searchable text. PDF Converter handles these as well. Most recently I've done this on a 400+ page PhD dissertation that features pages that are at an angle and have things like lint and other debris that were on the platen glass visible... Worked like a charm.
« Last Edit: August 28, 2008, 02:45 PM by Darwin »

Edvard

« Reply #3 on: August 28, 2008, 04:56 PM »
I work in a copy shop where we just got a new Xerox machine - a WorkCentre 7655.
It will scan to a searchable pdf - pretty amazing stuff.
I'd recommend going to a reputable copy place in your area and asking their prices for scanning.
Our place is charging 50 cents (US) per, but other folks might be charging less.
Other than that, I don't know of any solution besides Acrobat Pro.

Darwin

« Reply #4 on: August 28, 2008, 05:13 PM »
Other than that, I don't know of any solution besides Acrobat Pro.

Well, there is PDF Converter Professional, as noted above! Actually, this also means that Zeon PDFDoc Gold should do it as well, since Scansoft/Nuance licence their product from them.

Curt

« Reply #5 on: August 28, 2008, 08:29 PM »
Searching/Indexing Documents

The Search tool helps to locate information in PDF documents by looking for phrases in the current document, a specified folder, or an indexed archive:

    * Search current document.
    * Search all files in specified folder.
    * Search against pre-built index files.

"PDF Gold" (including PDF Plus Professional) is capable of creating indexes across gigabytes of PDF documents. This Unicode-based index engine allows searching both the contents and pre-defined custom fields.
-Zeon PDF Doc Gold

Zeon is $99
http://www.pdfwizard.../product/pdfgold.asp
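
For anyone curious what "searching against a pre-built index" means in practice, here is a minimal sketch in Python. It is not how Zeon's engine works (that isn't documented in this thread); it only illustrates the general idea of an inverted index, and it assumes the text has already been extracted from each PDF. The file names and sample text are made up.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index, phrase):
    """Return the names of documents that contain every word in the phrase."""
    words = re.findall(r"\w+", phrase.lower())
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


# Hypothetical extracted text, keyed by file name.
documents = {
    "magazine_1985_03.pdf": "Review of a new dot matrix printer",
    "magazine_1986_07.pdf": "Building a printer cable for your home computer",
}
index = build_index(documents)
matches = search(index, "printer cable")
```

The index is built once, so each later search only looks up words instead of re-reading every file, which is why this scales to a large archive.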

lanux128

« Reply #6 on: August 28, 2008, 09:10 PM »
if you have Microsoft Office, you can use JOCR (freeware).

JOCR enables you to capture the image on the screen and convert the captured image to text. It is useful to revive protected files whose text cannot be copied. JOCR enables you to copy text from any files and images on the screen, such as protected Web pages, PDF files, and error messages.

JOCR requires Microsoft Office 2003 or higher. If JOCR does not work, please manually install "Microsoft Office Document Imaging" (MODI), which is included in the setup file of Microsoft Office.

http://home.megapass...ng/Product_JOCR.html

Darwin

« Reply #7 on: August 29, 2008, 10:12 AM »
Thanks, lanux  :Thmbsup: I'm checking it out now...

Darwin

« Reply #8 on: August 29, 2008, 10:15 AM »
BTW, Curt, I notice that NovaPDF (all flavours) is on sale at the moment. Will it handle this as well (ie allow you to save a pdf created as an archive of image files as a searchable pdf)?

Darwin

« Reply #9 on: August 29, 2008, 10:19 AM »
Hmm... checking out the feature matrix, it doesn't look like NovaPDF will allow you to process existing pdfs. It *looks* like a very powerful print driver, no? Actually, I wonder if NovaPDF will do this (process/change existing pdfs) printing an existing pdf from a pdf reader...? Enquiring minds want to know!

lanux128

« Reply #10 on: August 29, 2008, 10:35 AM »
NovaPDF is a PDF creator and the 'Lite' version that i'm using is quite good. but this is a good site to visit for PDF related resources - Planet PDF. :)

Darwin

« Reply #11 on: August 29, 2008, 11:54 AM »
Thanks lanux - I'd forgotten about that site. Will take a look...

Curt

« Reply #12 on: August 29, 2008, 04:48 PM »
if you have Microsoft Office, you can use JOCR (freeware).

- thanks, lanux, for reminding me. I have JOCR, but had forgotten about it.  :-[

@ Darwin, your question caused a minor panic around here, because I couldn't remember that novaPdf comes from the company Softland (also the maker of Backup4All), so I couldn't find the (censored) folder... So finally I went to Revo_Uninstaller to ask, and realized that I actually have two (almost) identical programs:

softland.gif


- but they are both merely virtual printers, not editors, so I don't know anything about the more expensive version's features. But you can do more with PDF-XChange Viewer than with the novaPdf virtual printer. "I don't know why I have it, I guess it was on sale" is of course putting it much too harshly, but my next print-to-pdf purchase will be a different one:

Searchable.
Aloaha PDF files could be searched for  text-parts.

No special PS or PDF-printerdriver is needed.

http://www.aloaha.co...ware-en/versions.php
-Aloaha
- maybe
« Last Edit: August 29, 2008, 05:10 PM by Curt »

Curt

« Reply #13 on: August 29, 2008, 05:45 PM »
I wonder if NovaPDF will do this (process/change existing pdfs) printing an existing pdf from a pdf reader...?

I found an old link to PDFcamp Printer Pro (PDF Writer)
- and realized it is "only" $38:

http://www.verypdf.c.../pdfcamp/pdfcamp.htm

PDFcamp(pdf writer) Pro Features
PDFcamp(pdf writer) Pro includes all of the features in PDFcamp(pdf writer), plus:

(...many features listed...)

Option to merge or append to an existing PDF file (insert before the first page or append to the last page)

Supports Text Extraction
The newest version of PDFcamp(pdf writer) supports text extraction from printable documents (excluding graphics and PDF files) while keeping the original page layout. The extracted text can be used to re-construct the document, independent of the software that created the original document, and/or it can be posted to a searchable text database. Text Extract is ideal for archiving form documents, like invoices, statements and reports.

Integrates with Microsoft Office applications: creates a toolbar and icon in Microsoft Office 97, Office 2000, Office XP, Office 2003 and Office 2007, including Microsoft Excel, FrontPage, PowerPoint, Word and Outlook.

They also have a very fine editor @ $90 - http://www.verypdf.c...df-editor/index.html

But it seems in general to be impossible to edit a pdf file at the printing stage.   :(
Edited: I mean, in general I will use a virtual pdf printer when I am on some web page reading an article. I will click "print article", but will very often want to add some info about the page and the author, or a picture not already included. It is seldom that I want to edit an article that I have already saved as a pdf file. So, in most cases my wish is to edit the pdf file before it is saved from the Internet for the first time. This feature seems to be rare or nonexistent.
 :tellme:
« Last Edit: August 29, 2008, 05:55 PM by Curt »

Darwin

« Reply #14 on: August 30, 2008, 07:13 AM »
The newest version of PDFcamp(pdf writer) supports text extraction from printable documents (excluding graphics and PDF files) while keeping the original page layout.

Drat! So close...

lanux128

« Reply #15 on: August 30, 2008, 07:23 AM »
what about Foxit PDF Editor? i haven't used it but coming from the makers of much-touted Foxit Reader, it might be worth a shot.

lanux128

« Reply #16 on: August 30, 2008, 09:22 AM »
i noticed that there was a collaborative effort by tinjaw to purchase bundled products from Software995 which included some PDF toolkits such as PdfSuite995 and UltraPdf. Darwin, did you manage to try either of the programs since you were one of the purchasers? :)


Darwin

« Reply #17 on: August 30, 2008, 09:49 AM »
Heh, heh - I've been looking at the pdf995 toolset. However, the pdf portion is a print driver; sadly, it doesn't offer any advanced editing after the pdf has been created. I'd forgotten about Foxit PDF Editor, I'll take a look.

FWIW I LOVED Foxit Reader and used it until Adobe Acrobat Reader 8 came out. AAR 8 (and now 9) is *almost* as light on resources (when configured properly) as Foxit Reader but has a number of features that I need. At any rate, it's been a year or so and it's time to look at Foxit again...

Darwin

« Reply #18 on: August 30, 2008, 09:54 AM »
Also BTW, I have Nuance PDF Professional 5 (and 3 and 4...) so I actually have this functionality already. However, as described in this thread, I had to uninstall it from my XP machine. As my XP machine is still my main computer, I'm vaguely interested in finding an alternative solution for this computer. However, over the next few months I'll be switching to the Vista notebook, so in the long run this won't be a big deal...

PhilB66

« Reply #19 on: August 30, 2008, 09:57 AM »
TopOCR looks interesting.

Multiple text output formats, including searchable PDF and HTML
Able to read 11 different languages


Darwin

« Reply #20 on: August 30, 2008, 10:10 AM »
From a quick read of the Foxit PDF Editor user's manual (albeit for version 1.5, though there is no mention of this feature having been added for version 2), I don't think it has any OCR capabilities, though it does appear to be a powerful and impressive application.

TopOCR looks interesting - and  the price is right (free)! Using it in conjunction with Foxit PDF Editor might be an easy and effective workaround. Nice find  :Thmbsup:

Softland

« Reply #21 on: September 01, 2008, 01:15 AM »
BTW, Curt, I notice that NovaPDF (all flavours) is on sale at the moment. Will it handle this as well (ie allow you to save a pdf created as an archive of image files as a searchable pdf)?
novaPDF cannot save an image as a searchable pdf because it's a pdf printer driver without pdf editing capabilities (a goal in the long run, though). To do this you would have to perform OCR with a different program and after that convert the recognized text to a pdf with searchable text.
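
As a rough illustration of handing the OCR step to a separate program, here is a minimal Python sketch that wraps the open-source OCRmyPDF command-line tool. OCRmyPDF is not mentioned anywhere in this thread and has nothing to do with novaPDF; it is an assumption on my part that you have it installed. The function names and file names are made up for the example.

```python
import shutil
import subprocess


def build_ocr_command(scanned_pdf, searchable_pdf, language="eng", deskew=True):
    """Return the argument list for an OCRmyPDF run (nothing is executed here)."""
    cmd = ["ocrmypdf", "--language", language]
    if deskew:
        # Straighten pages that were scanned at an angle before recognition.
        cmd.append("--deskew")
    cmd += [str(scanned_pdf), str(searchable_pdf)]
    return cmd


def make_searchable(scanned_pdf, searchable_pdf, **options):
    """Run OCRmyPDF if it is on the PATH; return True only if it succeeded."""
    if shutil.which("ocrmypdf") is None:
        return False  # tool not installed, nothing was done
    result = subprocess.run(build_ocr_command(scanned_pdf, searchable_pdf, **options))
    return result.returncode == 0
```

Usage would be something like `make_searchable("magazine_scan.pdf", "magazine_searchable.pdf")`. The output keeps the scanned page images and adds the recognized text so that the reader's search box can find it.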

I wonder if NovaPDF will do this (process/change existing pdfs) printing an existing pdf from a pdf reader...?
novaPDF can create a PDF from a pdf reader (you open the pdf in foxit/adobe and hit print, select novapdf and save the pdf), however it cannot change existing pdfs (no editing capabilities).

Darwin, your question caused a minor panic around here, because I couldn't remember that novaPdf comes from the company Softland (also the maker of Backup4All), so I couldn't find the (censored) folder... So finally I went to Revo_Uninstaller to ask, and realized that I actually have two (almost) identical programs
Softland publishes both Backup4all and novaPDF (and doPDF). We also sell developer tools for creating/handling pdf, so as I see it, you have installed novaPDF yourself (as a separate pdf driver) but also purchased another program that uses our licensed technology. No need to panic, they are both well behaved and don't interact with each other (though randomly in our tests a 3rd smaller pdf printer sometimes appeared from nowhere :) ).

Darwin

« Reply #22 on: September 01, 2008, 01:38 AM »
Hi Softland - thank you for your reply and clarification about NovaPDF's editing capabilities (or lack thereof). Hope this is added in the future - it's a great feature!

Softland

« Reply #23 on: September 01, 2008, 01:54 AM »
NovaPDF's editing capabilities (or lack thereof). Hope this is added in the future - it's a great feature!
We know it's a great feature, if only it weren't that difficult to implement.

Curt

« Reply #24 on: September 02, 2008, 06:14 AM »
NovaPDF's editing capabilities (or lack thereof). Hope this is added in the future - it's a great feature!
We know it's a great feature, if only it weren't that difficult to implement.

I don't think it should all be implemented in the printer itself, but maybe made available as (stand-alone?) add-ons or whatever, otherwise I would expect the virtual printer to become much too slow. But I would certainly want my virtual printer to have a button named EDIT - and when clicked it would ask if I want to edit the file as a pdf or rtf or doc or or or..., or something.