Author Topic: OCR/Text Recognition in a .pdf document? How do I do this.  (Read 9201 times)
vegas
Charter Member
***
Posts: 352


« on: November 04, 2006, 03:21:55 PM »

I downloaded an old magazine the other day that had been scanned into a .pdf file. What I didn't understand: when I searched for something within the file, results actually came back. How the heck is this done? I have tons of old magazines I want to be able to throw away someday (the sooner the better), but I'd like to keep all the information in them by scanning them to PDF files. Do I have to buy Adobe Acrobat to do this, or is other software used? Has anyone tried doing this with print material so that it is completely searchable? Thanks for any responses.   -vegas
Logged
cranioscopical
Friend of the Site
Supporting Member
**
Posts: 4,195



« Reply #1 on: November 04, 2006, 05:07:57 PM »

Quote
...scanned into a .pdf file, but what I didn't understand, I went to search for something within the file and results actually came back...
Do I have to buy Adobe Acrobat to do this or is other software used?
Searchable text is certainly an option with Acrobat (not the reader).
I think there's at least one other .pdf 'handler' out there that claims to do the job. Unhelpfully, I can't recall what it is right now. If my brain starts working again (unlikely) I'll let you know the name. Probably others here will fill in the blank, anyway. It's certainly a useful feature.

Try looking here:  http://searchable-text.qarchive.org/
« Last Edit: November 04, 2006, 05:13:16 PM by cranioscopical » Logged

Chris
Darwin
Charter Member
***
Posts: 6,979



« Reply #2 on: August 28, 2008, 02:41:37 PM »

A couple of years late... PDF Converter Professional 3 (and 4 and 5) will allow you to select "Save As" and "Searchable PDF" from the File menu....

I've had very good results doing this on various pdfs that I have downloaded from the internet. Most of them are ancient journal articles (such as Science magazine articles from the 19th century) but some recent pdfs are generated as image files rather than searchable text. PDF Converter handles these as well. Most recently I've done this on a 400+ page PhD dissertation that features pages that are at an angle and have things like lint and other debris that were on the platen glass visible... Worked like a charm.
« Last Edit: August 28, 2008, 02:45:30 PM by Darwin » Logged

"Some people have a way with words, other people,... oh... have not way" - Steve Martin
Edvard
Coding Snacks Author
Charter Honorary Member
***
Posts: 2,625



« Reply #3 on: August 28, 2008, 04:56:31 PM »

I work in a copy shop where we just got a new Xerox machine - a WorkCentre 7655.
It will scan to a searchable pdf - pretty amazing stuff.
I'd recommend going to a reputable copy place in your area and asking their prices for scanning.
Our place is charging 50 cents (US) per, but other folks might be charging less.
Other than that, I don't know of any solution besides Acrobat pro.
Logged

All children left unattended will be given a mocha and a puppy.
Darwin
Charter Member
***
Posts: 6,979



« Reply #4 on: August 28, 2008, 05:13:42 PM »

Other than that, I don't know of any solution besides Acrobat pro.

Well, there is PDF Converter Professional, as noted above! Actually, this also means that Zeon PDFDoc Gold should do it as well, since Scansoft/Nuance licence their product from Zeon.
Logged

"Some people have a way with words, other people,... oh... have not way" - Steve Martin
Curt
Supporting Member
**
Posts: 6,349

« Reply #5 on: August 28, 2008, 08:29:30 PM »

Quote from: Zeon PDF Doc Gold
Searching/Indexing Documents

The Search tool helps to locate information in PDF documents by looking for phrases in the current document, a specified folder, or an indexed archive:

    * Search current document.
    * Search all files in specified folder.
    * Search against pre-built index files.

"PDF Gold" (including PDF Plus Professional) is capable of creating indexes across gigabytes of PDF documents. This Unicode based index engine allows searching both the contents and pre-defined custom fields.

Zeon is $99
http://www.pdfwizard.com/eng/product/pdfgold.asp
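
For anyone wondering what "searching against a pre-built index" means in practice, here is a minimal sketch in Python. It is not how Zeon's engine works (that isn't documented in this thread). It only shows the general idea of an inverted index: words map to the files and pages they occur on, so a search doesn't have to re-read every document. The function names and the sample data are made up for illustration, and the page text is assumed to have been extracted or OCR'd already by some other tool.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of (filename, page) where it occurs.

    `docs` is {filename: [page_text, ...]}. The page text is assumed to have
    been extracted (or OCR'd) beforehand; this sketch does not read PDFs.
    """
    index = defaultdict(set)
    for name, pages in docs.items():
        for page_no, text in enumerate(pages, start=1):
            for word in re.findall(r"\w+", text.lower()):
                index[word].add((name, page_no))
    return index


def search(index, query):
    """Return sorted (filename, page) hits that contain every word in `query`."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical sample data standing in for text pulled out of two PDFs.
docs = {
    "magazine_1985.pdf": ["Review of the new dot matrix printer", "Letters to the editor"],
    "magazine_1986.pdf": ["Printer buying guide", "Matrix math in BASIC"],
}
index = build_index(docs)
print(search(index, "matrix printer"))
```

Note that this matches pages containing all the query words in any order, which is simpler than true phrase search. A real index would also be saved to disk so it only has to be built once.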
Logged
lanux128
Global Moderator
*****
Posts: 6,132



« Reply #6 on: August 28, 2008, 09:10:48 PM »

if you have Microsoft Office, you can use JOCR (freeware).

JOCR enables you to capture the image on the screen and convert the captured image to text. It is useful to revive the protected files whose text can not be copied.  JOCR enables you to copy text from any files and images on the screen such as protected Web pages, PDF files, error messages.

JOCR requires Microsoft Office 2003 or a higher version. If JOCR does not work, please manually install "Microsoft Office Document Imaging" (MODI), which is included in the setup file of Microsoft Office.


http://home.megapass.co.k...oosjung/Product_JOCR.html
Logged

Darwin
Charter Member
***
Posts: 6,979



« Reply #7 on: August 29, 2008, 10:12:26 AM »

Thanks, lanux! I'm checking it out now...
Logged

"Some people have a way with words, other people,... oh... have not way" - Steve Martin
Darwin
Charter Member
***
Posts: 6,979



« Reply #8 on: August 29, 2008, 10:15:50 AM »

BTW, Curt, I notice that NovaPDF (all flavours) is on sale at the moment. Will it handle this as well (ie allow you to save a pdf created as an archive of image files as a searchable pdf)?
Logged

"Some people have a way with words, other people,... oh... have not way" - Steve Martin
Darwin
Charter Member
***
Posts: 6,979



« Reply #9 on: August 29, 2008, 10:19:23 AM »

Hmm... checking out the feature matrix, it doesn't look like NovaPDF will allow you to process existing pdfs. It *looks* like a very powerful print driver, no? Actually, I wonder if NovaPDF will do this (process/change existing pdfs) by printing an existing pdf from a pdf reader...? Enquiring minds want to know!
Logged

"Some people have a way with words, other people,... oh... have not way" - Steve Martin
lanux128
Global Moderator
*****
Posts: 6,132



« Reply #10 on: August 29, 2008, 10:35:40 AM »

NovaPDF is a PDF creator and the 'Lite' version that i'm using is quite good. but this is a good site to visit for PDF related resources - Planet PDF.
Logged

Darwin
Charter Member
***
Posts: 6,979



« Reply #11 on: August 29, 2008, 11:54:01 AM »

Thanks lanux - I'd forgotten about that site. Will take a look...
Logged

"Some people have a way with words, other people,... oh... have not way" - Steve Martin
Curt
Supporting Member
**
Posts: 6,349

« Reply #12 on: August 29, 2008, 04:48:45 PM »

if you have Microsoft Office, you can use JOCR (freeware).

- thanks, lanux, for reminding me. I have JOCR, but had forgotten about it.

@ Darwin, your question caused a minor panic around here, because I couldn't remember that novaPdf comes from the company Softland (also the maker of Backup4All), so I couldn't find the (censored) folder... So finally I went to Revo_Uninstaller to ask, and realized that I actually have two (almost) identical programs:




- but they are both merely virtual printers, not editors, so I don't know anything about the more expensive version's features. But you can do more with PDF-XChange Viewer than with the novaPdf virtual printer. "I don't know why I have it, I guess it was on sale" is of course much too harsh a way to put it, but my next print-to-pdf purchase will be a different one:

Quote from: Aloaha
Searchable.
Aloaha PDF files could be searched for  text-parts.

No special PS or PDF-printerdriver is needed.

http://www.aloaha.com/wi-software-en/versions.php
- maybe
« Last Edit: August 29, 2008, 05:10:31 PM by Curt » Logged
Curt
Supporting Member
**
Posts: 6,349

« Reply #13 on: August 29, 2008, 05:45:32 PM »

I wonder if NovaPDF will do this (process/change existing pdfs) printing an existing pdf from a pdf reader...?

I found an old link to PDFcamp Printer Pro (PDF Writer)
- and realized it is "only" $38:

Quote
http://www.verypdf.com/pdfcamp/pdfcamp.htm

PDFcamp(pdf writer) Pro Features
PDFcamp(pdf writer) Pro includes all of the features in PDFcamp(pdf writer), plus:

(...many features listed...)

Option to merge or append to an existing PDF file (insert before the first page or append to the last page)

Supports Text Extraction
The newest version of PDFcamp(pdf writer) supports text extraction from printable documents (excluding graphics and PDF files) while keeping the original page layout. The extracted text can be used to re-construct the document, independent of the software that created the original document, and/or it can be posted to a searchable text database. Text Extract is ideal for archiving form documents, like invoices, statements and reports.

Integrate with Microsoft Office application, Create toolbar and icon in Microsoft Office 97, Office 2000, Office XP, Office 2003 and Office 2007, include Microsoft Excel, FrontPage, PowerPoint, Word, Outlook applications.

They also have a very fine editor @ $90 - http://www.verypdf.com/pdf-editor/index.html

But it seems in general to be impossible to edit a pdf file at the printing stage.
Edited: I mean, I generally use a virtual pdf printer when I am on some web page reading an article. I click "print article", but very often I would like to add some info about the page and the author, or a picture not already included. It is seldom that I want to edit an article I have already saved as a pdf file. So in most cases my wish is to edit the pdf file before it is saved from the Internet the first time. This feature seems to be rare or nonexistent.
« Last Edit: August 29, 2008, 05:55:03 PM by Curt » Logged
Darwin
Charter Member
***
Posts: 6,979



« Reply #14 on: August 30, 2008, 07:13:07 AM »

The newest version of PDFcamp(pdf writer) supports text extraction from printable documents (excluding graphics and PDF files) while keeping the original page layout.

Drat! So close...
Logged

"Some people have a way with words, other people,... oh... have not way" - Steve Martin
lanux128
Global Moderator
*****
Posts: 6,132



« Reply #15 on: August 30, 2008, 07:23:43 AM »

what about Foxit PDF Editor? i haven't used it but coming from the makers of much-touted Foxit Reader, it might be worth a shot.
Logged

lanux128
Global Moderator
*****
Posts: 6,132



« Reply #16 on: August 30, 2008, 09:22:17 AM »

i noticed that there was a collaborative effort by tinjaw to purchase bundled products from Software995, which included some PDF toolkits such as PdfSuite995 and UltraPdf. Darwin, did you manage to try either of the programs, since you were one of the purchasers?

Logged

Darwin
Charter Member
***
Posts: 6,979



« Reply #17 on: August 30, 2008, 09:49:03 AM »

Heh, heh - I've been looking at the pdf995 toolset. However, the pdf portion is a print driver; it doesn't offer any advanced editing after the pdf is created, sadly. I'd forgotten about Foxit PDF Editor, I'll take a look.

FWIW I LOVED Foxit Reader and used it until Adobe Acrobat Reader 8 came out. AAR 8 (and now 9) is *almost* as light on resources (when configured properly) as Foxit Reader but has a number of features that I need. At any rate, it's been a year or so and it's time to look at Foxit again...
Logged

"Some people have a way with words, other people,... oh... have not way" - Steve Martin
Darwin
Charter Member
***
Posts: 6,979



« Reply #18 on: August 30, 2008, 09:54:42 AM »

Also BTW, I have Nuance PDF Professional 5 (and 3 and 4...) so I actually have this functionality already. However, as described in this thread, I had to uninstall it from my XP machine. As my XP machine is still my main computer, I'm vaguely interested in finding an alternative solution for this computer. However, over the next few months I'll be switching to the Vista notebook, so in the long run this won't be a big deal...
Logged

"Some people have a way with words, other people,... oh... have not way" - Steve Martin
PhilB66
Supporting Member
**
Posts: 1,510


« Reply #19 on: August 30, 2008, 09:57:10 AM »

TopOCR looks interesting.

Quote
Multiple text output formats, including searchable PDF and HTML
Able to read 11 different languages

Logged
Darwin
Charter Member
***
Posts: 6,979



« Reply #20 on: August 30, 2008, 10:10:23 AM »

From a quick read of the Foxit PDF Editor user's manual (albeit for version 1.5, though there is no mention of this feature having been added for version 2), I don't think it has any OCR capabilities, though it does appear to be a powerful and impressive application.

TopOCR looks interesting - and the price is right (free)! Using it in conjunction with Foxit PDF Editor might be an easy and effective workaround. Nice find!
Logged

"Some people have a way with words, other people,... oh... have not way" - Steve Martin
Softland
Software Author
Charter Honorary Member
***
Posts: 30

« Reply #21 on: September 01, 2008, 01:15:15 AM »

BTW, Curt, I notice that NovaPDF (all flavours) is on sale at the moment. Will it handle this as well (ie allow you to save a pdf created as an archive of image files as a searchable pdf)?
novaPDF cannot save an image as a searchable pdf because it's a pdf printer driver without pdf editing capabilities (that is a long-term goal, though). To do this you would have to perform OCR with a different program and then convert the recognized text to a pdf with searchable text.

I wonder if NovaPDF will do this (process/change existing pdfs) printing an existing pdf from a pdf reader...?
novaPDF can create a PDF from a pdf reader (you open the pdf in foxit/adobe and hit print, select novapdf and save the pdf), however it cannot change existing pdfs (no editing capabilities).

Darwin, your question caused a minor panic around here, because I couldn't remember that novaPdf comes from the company Softland (also the maker of Backup4All), so I couldn't find the (censored) folder... So finally I went to Revo_Uninstaller to ask, and realized that I actually have two (almost) identical programs
Softland publishes both Backup4all and novaPDF (and doPDF). We also sell developer tools for creating/handling pdf, so as far as I can see, you have installed novaPDF yourself (as a separate pdf driver) but also purchased another program that uses our licensed technology. No need to panic: both are well behaved and don't interact with each other (though in our tests a 3rd, smaller pdf printer sometimes randomly appeared from nowhere).
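
To make the "OCR first, then write the recognized text into a pdf" route a bit more concrete, here is a rough Python sketch of the second half only. It is not how novaPDF or any product named in this thread does it. It assumes some separate OCR step has already produced words with page positions (the sample words and coordinates below are invented), and it writes them into a bare one-page PDF using text render mode 3, which the PDF format defines as invisible text. A real searchable scan would also draw the page image underneath this text layer, which is left out here to keep the sketch short. The function names are hypothetical.

```python
def escape_pdf_text(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_text_layer_pdf(words):
    """Build a one-page PDF whose only content is invisible text.

    `words` is a list of (x, y, text) tuples in PDF points, as a separate OCR
    step might report them. Render mode 3 ("3 Tr") paints nothing on the page
    but leaves the text available to search and copy.
    """
    ops = ["BT", "/F1 10 Tf", "3 Tr"]
    for x, y, text in words:
        # Tm sets an absolute position, so each word is placed independently.
        ops.append(f"1 0 0 1 {x} {y} Tm ({escape_pdf_text(text)}) Tj")
    ops.append("ET")
    # Latin-1 only: other scripts would need an embedded font and encoding.
    stream = "\n".join(ops).encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


# Invented OCR output: two words near the top of a US Letter page.
pdf_bytes = build_text_layer_pdf([(72, 720, "Scanned"), (120, 720, "magazine")])
print(len(pdf_bytes), "bytes")
```

Writing `pdf_bytes` to a file should give a page that looks blank but can be searched for "Scanned" or "magazine". That hidden layer is presumably why the scanned magazine in the first post returned search results.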
Logged

Darwin
Charter Member
***
Posts: 6,979



« Reply #22 on: September 01, 2008, 01:38:48 AM »

Hi Softland - thank you for your reply and clarification about NovaPDF's editing capabilities (or lack thereof). Hope this is added in the future - it's a great feature!
Logged

"Some people have a way with words, other people,... oh... have not way" - Steve Martin
Softland
Software Author
Charter Honorary Member
***
Posts: 30

« Reply #23 on: September 01, 2008, 01:54:15 AM »

NovaPDF's editing capabilities (or lack thereof). Hope this is added in the future - it's a great feature!
We know it's a great feature, if only it wouldn't be that difficult to implement.
Logged

Curt
Supporting Member
**
Posts: 6,349

« Reply #24 on: September 02, 2008, 06:14:58 AM »

NovaPDF's editing capabilities (or lack thereof). Hope this is added in the future - it's a great feature!
We know it's a great feature, if only it wouldn't be that difficult to implement.

I don't think it should all be implemented, but maybe be made available as (stand-alone?) add-ons or whatever; otherwise I would expect the virtual printer to become much too slow. But I would certainly want my virtual printer to have a button named EDIT, and when clicked it would ask if I want to edit the file as a pdf or rtf or doc or or or..., or something.
Logged
DonationCoder.com | About Us
DonationCoder.com Forum | Powered by SMF