DONE: Check folder and tell me which PDFs are images (non-searchable)

vevola · « **on:** July 13, 2011, 02:50 AM »

I have a huge number of PDFs, and often I perform text searches across my collection. But I just realized: some of these PDFs are images!

I'd like to be able to know which PDFs are images so that I can convert them to searchable text. But there are so many of them in my folder, I'd have to open them one by one and check manually.

Would it be possible to have a little program that would check in certain folders and tell me which PDFs are images (even with a certain degree of certainty)?

TIA!!

skwire · « **Reply #1 on:** July 13, 2011, 07:18 AM »

Could you please attach one of those image PDFs here? Or send me one in a PM? Thanks.

vevola · « **Reply #2 on:** July 14, 2011, 03:50 AM »

Here's an example of a PDF image (non-searchable text)

Other examples are all the downloadable ebooks and pages from Google Books (either the free versions, or using Google Book Downloader for Greasemonkey from http://book.huhiho.com/).

Thanks!

skwire · « **Reply #3 on:** July 14, 2011, 08:34 AM »

I have some working code so how do you want this to work? I can make a full GUI for it or I can simply make it recurse through your PDF folder(s) and spit out a text file at the end listing which files had no searchable text. Your thoughts?

vevola · « **Reply #4 on:** July 14, 2011, 09:02 AM »

Well, I'm a GUI kinda-guy! It would be great to choose which folders I want to search in, and then have a list that I could order in terms of filename or folder so it would be easier to work with...

bob99 · « **Reply #5 on:** July 14, 2011, 12:23 PM »

I have a huge number of PDFs, and often I perform text searches across my collection.
-vevola (July 13, 2011, 02:50 AM)

Not sure if I'm asking this right...
Are you searching inside each of the individual PDFs in the collection one file at a time, or are you able to search all of them start to finish, without manually opening each file, searching it, opening the next, searching, and so on?

If it's all at once (automatically one after the other) what are you using to do this?

vevola · « **Reply #6 on:** July 14, 2011, 12:54 PM »

I use two programs for viewing and handling PDFs. One is a simple viewer and annotator (Foxit), which can search for a word or a phrase in a single PDF or in all PDFs in a folder. The other is called Mendeley Desktop, which is more of an "iTunes for PDFs" (made for academia).

But if a PDF is an image, obviously it won't show up in the results. If I have a list, I could OCR and convert them.

So, no, I don't have to open each file one at a time to perform a search. Both programs index the text of the collection.

bob99 · « **Reply #7 on:** July 14, 2011, 01:44 PM »

Son of a gun. It's something I can do in PDF-XChange also. Just never thought about it. Most of the time I know which PDF/manual I need to go to for reference. Occasionally though, knowing I can do this will be helpful.
Good luck with your request. Didn't mean to go off topic.

vevola · « **Reply #8 on:** July 14, 2011, 01:46 PM »

I know how it feels, no worries, and thanks!

skwire · « **Reply #9 on:** July 15, 2011, 01:56 PM »

Here's a test application to try: PDF Text Checker

This is a native AHK solution and, currently, it doesn't work as well as I'd like. Basically, what it does is look for the hex equivalent of "FontName" within the file. If it's found, that PDF is definitely searchable. The problem comes in when it's NOT found. I have some PDFs that don't have that string but are still searchable. Anyway, give it a shot.
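For anyone who wants to try the same idea outside AHK, here is a minimal Python sketch of the "FontName" check. It is not the code inside PDFTextChecker.exe: the function name `contains_marker` and the 1 MiB chunk size are assumptions made for illustration. The file is read in chunks rather than all at once, so large PDFs do not have to fit in memory.

```python
MARKER = b"FontName"        # the byte string the post describes looking for
CHUNK_SIZE = 1024 * 1024    # 1 MiB per read; an arbitrary, assumed value


def contains_marker(pdf_path, marker=MARKER, chunk_size=CHUNK_SIZE):
    """Return True if `marker` occurs anywhere in the raw bytes of the file."""
    overlap = len(marker) - 1
    tail = b""
    with open(pdf_path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return False
            # Prepend the end of the previous chunk so that a marker
            # straddling a chunk boundary is still found.
            window = tail + chunk
            if marker in window:
                return True
            tail = window[-overlap:] if overlap else b""
```

A hit means the file declares at least one font. A miss proves less: one possible reason for a searchable PDF without the string is that the font information sits inside a compressed stream, where a raw byte search cannot see it.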

Download, extract and run PDFTextChecker.exe.
You will be presented with the standard folder selection dialog asking for your PDF source folder.
You will then be asked if you want the application to search sub-directories as well.
Scanning will take time but you can check which file it's on by hovering over the tray icon.
Another message box will alert you as to when it's finished.
Two text files will be generated in your chosen folder: !Searchable.txt and !Not_Searchable.txt

If the results aren't satisfactory, I may have to approach this in a different manner and use a PDF-to-Text type of application to determine if a PDF has any searchable text. At any rate, let me know how this works.
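The workflow in the steps above (pick a folder, optionally recurse, write two lists) can be sketched in Python as follows. This is only an illustration, not the AHK source: `write_reports` and its `check` parameter are hypothetical names, and `check` stands for whatever test decides that a single PDF is searchable.

```python
from pathlib import Path


def write_reports(root, check, recurse=True):
    """Sort the PDFs under `root` into two lists and save both as text files.

    `check` is any function that takes a path and returns True when the
    PDF counts as searchable.
    """
    root = Path(root)
    pattern = "**/*.pdf" if recurse else "*.pdf"
    searchable, not_searchable = [], []
    for pdf in sorted(root.glob(pattern)):
        (searchable if check(pdf) else not_searchable).append(str(pdf))
    # Same file names as the ones the post says the tool generates.
    (root / "!Searchable.txt").write_text(
        "\n".join(searchable) + "\n", encoding="utf-8")
    (root / "!Not_Searchable.txt").write_text(
        "\n".join(not_searchable) + "\n", encoding="utf-8")
    return len(searchable), len(not_searchable)
```

The returned pair is the number of searchable and non-searchable files. Note that the `*.pdf` pattern is case-sensitive on some systems, so files named `.PDF` may need a second pattern.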

vevola · « **Reply #10 on:** July 15, 2011, 02:13 PM »

Thanks for the update!

I tried running it and after a couple of minutes I get the following error:

Error: Memory limit reached (see #MaxMem in the help file). The current thread
will exit.
Line#
—> 083: Return,str

BTW, the folder I'm searching has about 4GB of PDFs, dunno if that means anything.

EDIT: I thought the program stopped, but it's still running in the background. I imagine the error was for a particular file, and not for the entire program. Any idea what it is? Will it show up in the final log? We'll see!

Thanks again!

skwire · « **Reply #11 on:** July 15, 2011, 02:20 PM »

I'm going to assume that you have some large PDFs? Clear your cache, re-download and see if this latest build gets through that. Thanks.

vevola · « **Reply #12 on:** July 15, 2011, 03:37 PM »

Nope. I still get errors, and indeed eventually it crashes.

The largest of my PDFs are 340MB, 210MB, and 140MB; all the others are under 100MB. Maybe exclude all PDFs over 100MB?

skwire · « **Reply #13 on:** July 15, 2011, 04:00 PM »

Maybe exclude all PDFs over 100MB?
-vevola (July 15, 2011, 03:37 PM)

Done, please re-grab it. Thanks.
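A size cutoff like the one suggested here is simple to express. The Python sketch below is an assumption about how such a filter could look, not the actual change in the new build; `split_by_size` is a hypothetical helper, and it reports the oversized files instead of silently dropping them so they can be checked by hand.

```python
import os

MAX_BYTES = 100 * 1024 * 1024   # the 100MB cutoff suggested in the thread


def split_by_size(paths, max_bytes=MAX_BYTES):
    """Return (to_scan, skipped): files up to `max_bytes`, and larger ones."""
    to_scan, skipped = [], []
    for path in paths:
        if os.path.getsize(path) > max_bytes:
            skipped.append(path)
        else:
            to_scan.append(path)
    return to_scan, skipped
```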

vevola · « **Reply #14 on:** July 15, 2011, 04:13 PM »

So I temporarily moved the larger PDFs out and searched only the ones under 60MB, and it worked (I didn't try it with the very latest version).

However, the lists produced don't seem reliable: non-searchable 817; searchable 616.

(The share of non-searchable PDFs should be lower.)

skwire · « **Reply #15 on:** July 15, 2011, 04:46 PM »

However, the lists produced don't seem reliable: non-searchable 817; searchable 616.
-vevola (July 15, 2011, 04:13 PM)

Yep, that's what I was afraid of. I'll work on a different method.

skwire · « **Reply #16 on:** July 16, 2011, 12:27 PM »

Let's give this a shot: PDF Text Checker

Extract all the contents to a folder and run PDFTextChecker.exe. This version uses pdftotext.exe to extract any text from the PDF and then checks the resulting text file for any alphanumeric characters. If any are found, it considers the PDF searchable. It's still not perfect but it seems to be better than the previous version. If this version doesn't do it for you, you may be better off using something like File Hound to search through your PDFs.
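The same approach can be sketched in Python for readers who would rather script it. This is a rough equivalent under stated assumptions, not the tool's source: it assumes a `pdftotext` executable is on the PATH, and the function names are made up for illustration. Passing `-` as the output file makes pdftotext write to standard output, so no temporary text file is needed.

```python
import subprocess


def has_alphanumeric(text):
    """True if the extracted text contains at least one letter or digit."""
    return any(ch.isalnum() for ch in text)


def extract_text(pdf_path, pdftotext="pdftotext"):
    """Run pdftotext on one file and return its output ("" if it fails)."""
    result = subprocess.run(
        [pdftotext, "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True,
    )
    if result.returncode != 0:
        return ""
    return result.stdout.decode("utf-8", errors="replace")


def is_searchable(pdf_path):
    """The rule from the post: any alphanumeric character counts."""
    return has_alphanumeric(extract_text(pdf_path))
```

Because a single letter or digit is enough, a scanned book with one line of real text on it is reported as searchable, which is one way false positives can arise.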

vevola · « **Reply #17 on:** July 17, 2011, 08:05 AM »

Thanks again!

Better: 225 vs. 1205

But I also see many false-positives (about 1/3).

vevola · « **Reply #18 on:** July 17, 2011, 08:12 AM »

EDIT: Actually it is much better than what I wrote. There are false-positives, but probably not 1:3.

I think that unless you have any other tweaks, this may do! Thanks!

skwire · « **Reply #19 on:** July 17, 2011, 02:48 PM »

I think that unless you have any other tweaks, this may do! Thanks!
-vevola (July 17, 2011, 08:12 AM)

You're welcome; apologies that it couldn't be more accurate.

rjbull · « **Reply #20 on:** July 17, 2011, 04:40 PM »

you may be better off using something like File Hound to search through your PDFs.
-skwire (July 16, 2011, 12:27 PM)

I'm pretty sure Hound uses pdftotext too. That probably means it ignores image files as being empty of text, which should at least make the search hands-off.

vevola · « **Reply #21 on:** July 20, 2011, 04:04 AM »

you may be better off using something like File Hound to search through your PDFs.
-skwire (July 16, 2011, 12:27 PM)
I'm pretty sure Hound uses pdftotext too. That probably means it ignores image files as being empty of text, which should at least make the search hands-off.
-rjbull (July 17, 2011, 04:40 PM)

I'm not sure exactly how it works, but here's what I've noticed in some of my PDFs: the library makes a scan (an image) and then adds some type of "copyright" stamp, which is text. So some PDFs are principally images, with just a little text.
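One way to catch stamped scans like these would be to look at how much text each page has instead of whether the file has any text at all. The Python sketch below is only an idea, not something the tool does: it assumes the text came from pdftotext, which by default separates pages with a form feed character, and the 200-character cutoff is a made-up starting value.

```python
MIN_CHARS_PER_PAGE = 200   # hypothetical cutoff; tune it on your own files


def count_alnum(page_text):
    """Number of letters and digits on one page of extracted text."""
    return sum(1 for ch in page_text if ch.isalnum())


def sparse_page_ratio(extracted_text, min_chars=MIN_CHARS_PER_PAGE):
    """Fraction of pages with fewer than `min_chars` letters or digits."""
    pages = extracted_text.split("\f")
    # A form feed after the last page leaves an empty final piece; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    if not pages:
        return 1.0   # no text at all: treat every page as sparse
    sparse = sum(1 for page in pages if count_alnum(page) < min_chars)
    return sparse / len(pages)
```

A file where nearly every page is sparse is probably a scan with a stamp and worth sending to OCR. Title pages and full-page figures are sparse too, so a ratio close to 1.0 is a safer trigger than any single sparse page.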

suntsu · « **Reply #22 on:** July 21, 2011, 04:31 AM »

I do have a similar problem. The app works fine for me!
Thanks a lot (Win XP Pro, old Toshiba Laptop).
Suntsu

vevola · « **Reply #23 on:** July 21, 2011, 08:46 AM »

kudos to skwire! I'm happy it wasn't just for me!

skwire · « **Reply #24 on:** July 21, 2011, 08:48 AM »

I do have a similar problem. The app works fine for me!
Thanks a lot (Win XP Pro, old Toshiba Laptop).
Suntsu
-suntsu (July 21, 2011, 04:31 AM)

You're welcome, Suntsu, and welcome to the site.