Topic: DONE: Check folder and tell me which PDFs are images (non-searchable)

vevola

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 91
  • VeVoLa
I have a huge number of PDFs, and often I perform text searches across my collection. But I just realized: some of these PDFs are images!

I'd like to be able to know which PDFs are images so that I can convert them to searchable text. But there are so many of them in my folder, I'd have to open them one by one and check manually.

Would it be possible to have a little program that would check in certain folders and tell me which PDFs are images (even with a certain degree of certainty)?

TIA!!

skwire

  • Global Moderator
  • Joined in 2005
  • *****
  • Posts: 4,593
Could you please attach one of those image PDFs here?  Or send me one in a PM?  Thanks.

vevola

Here's an example of an image PDF (non-searchable text).

Other examples are all the downloadable ebooks and pages from Google Books (either the free versions, or those fetched with Google Book Downloader for Greasemonkey from http://book.huhiho.com/).

Thanks!

skwire

I have some working code, so how do you want this to work?  I can make a full GUI for it, or I can simply make it recurse through your PDF folder(s) and spit out a text file at the end listing which files had no searchable text.  Your thoughts?

vevola

Well, I'm a GUI kinda guy! It would be great to choose which folders I want to search in, and then have a list that I could sort by filename or folder so it would be easier to work with...

bob99

  • Supporting Member
  • Joined in 2008
  • **
  • Posts: 339
Quote from: vevola
I have a huge number of PDFs, and often I perform text searches across my collection.

Not sure if I'm asking this right...
Are you searching inside each PDF in the collection one file at a time, or are you able to search all of them from start to finish without manually opening each file, searching it, opening the next, searching, and so on?

If it's all at once (automatically, one after the other), what are you using to do this?


vevola

I use two programs for viewing and handling PDFs. One is a simple viewer and annotator (Foxit), which can search for a word or a phrase in a single PDF or in all PDFs in a folder. The other is called Mendeley Desktop, which is more of an "iTunes for PDFs" (made for academia).

But if a PDF is an image, obviously it won't show up in the results. If I have a list, I could OCR and convert them.

So, no, I don't have to open each file one at a time to perform a search. Both programs index the text of the collection.
« Last Edit: July 14, 2011, 12:59:40 PM by vevola »

bob99

Son of a gun.  It's something I can do in PDF-XChange also.  Just never thought about it.  Most of the time I know which PDF/manual I need to go to for reference.  Occasionally though, knowing I can do this will be helpful. 
Good luck with your request. Didn't mean to go off topic.

vevola

I know how it feels, no worries, and thanks!

skwire

Here's a test application to try:  PDF Text Checker

This is a native AHK solution and, currently, it doesn't work as well as I'd like.  Basically, what it does is look for the hex equivalent of "FontName" within the file.  If it's found, that PDF is definitely searchable.  The problem comes in when it's NOT found.  I have some PDFs that don't have that string but are still searchable.  Anyway, give it a shot.
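
For readers who want to try the same idea outside AHK, here is a minimal Python sketch of a raw byte search for "FontName". It is not skwire's code; the function name `contains_marker` and the chunk size are made up for illustration. It reads the file in chunks so a large PDF never has to sit in memory all at once. A miss only means the string was not found in the raw bytes, so, as the post says, it does not prove the PDF is unsearchable.

```python
NEEDLE = b"FontName"  # the marker the post describes searching for


def contains_marker(path, needle=NEEDLE, chunk_size=1 << 20):
    """Return True if `needle` occurs anywhere in the file's raw bytes.

    The file is read in fixed-size chunks. The last len(needle) - 1 bytes of
    each window are carried over, so a match that straddles a chunk boundary
    is still found.
    """
    overlap = len(needle) - 1
    tail = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return False  # reached end of file without a match
            window = tail + chunk
            if needle in window:
                return True
            tail = window[-overlap:] if overlap else b""
```

Calling `contains_marker("some.pdf")` (a hypothetical path) returns True or False for that one file.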

  • Download, extract and run PDFTextChecker.exe.
  • You will be presented with the standard folder selection dialog asking for your PDF source folder.
  • You will then be asked if you want the application to search sub-directories as well.
  • Scanning will take time but you can check which file it's on by hovering over the tray icon.
  • Another message box will alert you as to when it's finished.
  • Two text files will be generated in your chosen folder:  !Searchable.txt and !Not_Searchable.txt

If the results aren't satisfactory, I may have to approach this in a different manner and use a PDF-to-Text type of application to determine if a PDF has any searchable text.  At any rate, let me know how this works.
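
The scan-and-report steps listed above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the actual tool: `find_pdfs` and `write_report` are hypothetical names, and `is_searchable` stands for whatever per-file check is plugged in. Only the two output file names are taken from the post.

```python
import os


def find_pdfs(root, recurse=True):
    """Yield paths of .pdf files under `root`, optionally including subfolders."""
    for folder, subfolders, files in os.walk(root):
        if not recurse:
            subfolders.clear()  # stops os.walk from descending any further
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                yield os.path.join(folder, name)


def write_report(root, is_searchable, recurse=True):
    """Sort PDFs into two lists and write each list to a text file in `root`."""
    searchable, not_searchable = [], []
    for path in find_pdfs(root, recurse):
        (searchable if is_searchable(path) else not_searchable).append(path)
    for filename, paths in (("!Searchable.txt", searchable),
                            ("!Not_Searchable.txt", not_searchable)):
        with open(os.path.join(root, filename), "w", encoding="utf-8") as out:
            out.write("\n".join(paths) + ("\n" if paths else ""))
    return searchable, not_searchable
```

Keeping the check as a parameter means the walking and reporting code stays the same if the detection method is swapped for a different one later.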

vevola

Thanks for the update!

I tried running it and after a couple of minutes I get the following error:
Quote
Error: Memory limit reached (see #MaxMem in the help file). The current thread
will exit.
Line#
---> 083: Return,str

BTW, the folder I'm searching has about 4GB of PDFs, dunno if that means anything.

EDIT: I thought the program had stopped, but it's still running in the background. I imagine the error was for a particular file, and not for the entire program. Any idea what it is? Will it show up in the final log? We'll see!  ;)
Thanks again!
« Last Edit: July 15, 2011, 02:17:32 PM by vevola »

skwire

I'm going to assume that you have some large PDFs?  Clear your cache, re-download and see if this latest build gets through that.  Thanks.

vevola

Nope. I still get errors, and indeed eventually it crashes.

The largest of my PDFs are 340MB, 210MB, and 140MB, and all the others are under 100MB. Maybe exclude all PDFs over 100MB?

skwire

Quote from: vevola
Maybe exclude all PDFs over 100MB?

Done, please re-grab it.  Thanks.
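
A size cutoff like this could look something like the Python sketch below. The thread does not show how the real tool does it, so `split_by_size` is a hypothetical helper, and treating 100MB as 100 × 1024 × 1024 bytes is an assumption. Returning the skipped files as their own list is a suggestion, so the oversized PDFs can still be checked by hand.

```python
import os

MAX_BYTES = 100 * 1024 * 1024  # assumed meaning of the 100MB cutoff


def split_by_size(paths, max_bytes=MAX_BYTES):
    """Return (to_scan, skipped): files at or under the limit, and the rest."""
    to_scan, skipped = [], []
    for path in paths:
        if os.path.getsize(path) > max_bytes:
            skipped.append(path)
        else:
            to_scan.append(path)
    return to_scan, skipped
```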

vevola

So I temporarily moved the larger PDFs out and ran a scan of only the files under 60MB, and it worked (I didn't try it with the very latest version).

However, the lists produced don't seem reliable: non-searchable 817; searchable 616.

(The share of non-searchable files should be lower.)

skwire

Quote from: vevola
However, the lists produced don't seem reliable: non-searchable 817; searchable 616.

Yep, that's what I was afraid of.  I'll work on a different method.

skwire

Let's give this a shot:  PDF Text Checker

Extract all the contents to a folder and run PDFTextChecker.exe.  This version uses pdftotext.exe to extract any text from the PDF and then checks the resulting text file for any alphanumeric characters.  If any are found, it considers that PDF searchable.  It's still not perfect but it seems to be better than the previous version.  If this version doesn't do it for you, you may be better off using something like File Hound to search through your PDFs.
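
As a rough illustration of this approach (not the AHK source), the Python sketch below runs pdftotext and looks for any letter or digit in its output. It assumes a `pdftotext` executable is on the PATH and sends the text to stdout with `-` instead of writing a temporary text file; the function names are hypothetical.

```python
import subprocess


def has_alphanumeric(text):
    """True if the extracted text holds at least one letter or digit."""
    return any(ch.isalnum() for ch in text)


def extract_text(pdf_path, pdftotext="pdftotext"):
    """Run pdftotext on one file and return what it writes to stdout."""
    result = subprocess.run(
        [pdftotext, pdf_path, "-"],  # "-" asks pdftotext to write to stdout
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Undecodable bytes become U+FFFD, which does not count as alphanumeric.
    return result.stdout.decode("utf-8", errors="replace")


def is_searchable(pdf_path):
    """Apply the rule from the post: any alphanumeric character counts."""
    return has_alphanumeric(extract_text(pdf_path))
```

Because a single character is enough to pass, a scanned PDF with even a small text overlay will be reported as searchable under this rule.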

vevola

Thanks again!

Better: 225 non-searchable vs. 1205 searchable

But I also see many false positives (about 1/3).

vevola

EDIT: Actually, it is much better than what I wrote. There are false positives, but probably not 1 in 3.

I think that unless you have any other tweaks, this may do! Thanks!

skwire

Quote from: vevola
I think that unless you have any other tweaks, this may do! Thanks!

You're welcome; apologies that it couldn't be more accurate.

rjbull

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 2,896
Quote from: skwire
you may be better off using something like File Hound to search through your PDFs.

I'm pretty sure Hound uses pdftotext too.  That probably means it ignores image files as being empty of text, which should at least make the search hands-off.

vevola

Quote from: rjbull
Quote from: skwire
you may be better off using something like File Hound to search through your PDFs.

I'm pretty sure Hound uses pdftotext too.  That probably means it ignores image files as being empty of text, which should at least make the search hands-off.

I'm not sure exactly how it works, but here's what I've noticed in some of my PDFs: the library scans the pages (as images) and then adds some kind of "copyright" stamp, which is text. So some PDFs are principally images, with just a little text.
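
One way to cope with stamped scans like these is to count characters per page instead of accepting any text at all. Nothing in the thread says the tool does this; the Python sketch below is only an illustration. It relies on pdftotext ending each page with a form feed character, and the thresholds (`min_chars`, `min_fraction`) are guesses that would need tuning against a real collection.

```python
def text_page_fraction(extracted, min_chars=200):
    """Fraction of pages with at least `min_chars` letters or digits.

    `extracted` is pdftotext output, in which pages are separated by
    form feed characters.
    """
    pages = extracted.split("\f")
    if pages and not pages[-1].strip():
        pages.pop()  # drop the empty piece after the final form feed
    if not pages:
        return 0.0
    texty = sum(1 for page in pages
                if sum(ch.isalnum() for ch in page) >= min_chars)
    return texty / len(pages)


def looks_like_scan(extracted, min_chars=200, min_fraction=0.5):
    """Flag a PDF as a probable scan when too few pages carry real text."""
    return text_page_fraction(extracted, min_chars) < min_fraction
```

With these defaults, a PDF whose pages hold only a short stamp falls below the threshold and is flagged, while a normal text PDF is not.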

suntsu

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 1
Re: DONE: Check folder and tell me which PDFs are images (non-searchable)
« Reply #22 on: July 21, 2011, 04:31:53 AM »
:Thmbsup: I do have a similar problem. The app works fine for me!
Thanks a lot (Win XP Pro, old Toshiba Laptop).
Suntsu

vevola

« Reply #23 on: July 21, 2011, 08:46:26 AM »
Kudos to skwire! I'm happy it wasn't just for me!

skwire

« Reply #24 on: July 21, 2011, 08:48:01 AM »
Quote from: suntsu
:Thmbsup: I do have a similar problem. The app works fine for me!
Thanks a lot (Win XP Pro, old Toshiba Laptop).
Suntsu

You're welcome, Suntsu, and welcome to the site.   :D