Topic: DONE: Check folder and tell me which PDFs are images (non-searchable)
vevola
Charter Member
Posts: 88
« on: July 13, 2011, 02:50:04 AM »

I have a huge number of PDFs, and often I perform text searches across my collection. But I just realized: some of these PDFs are images!

I'd like to be able to know which PDFs are images so that I can convert them to searchable text. But there are so many of them in my folder, I'd have to open them one by one and check manually.

Would it be possible to have a little program that would check in certain folders and tell me which PDFs are images (even with a certain degree of certainty)?

TIA!!
Logged
skwire
Global Moderator
Posts: 4,114
Another Coding Snack request? Om nom nom...
« Reply #1 on: July 13, 2011, 07:18:51 AM »

Could you please attach one of those image PDFs here?  Or send me one in a PM?  Thanks.
Logged

vevola
« Reply #2 on: July 14, 2011, 03:50:23 AM »

Here's an example of a PDF image (non-searchable text)

Other examples are all the downloadable ebooks and pages from Google Books (either the free versions, or those obtained using Google Book Downloader for Greasemonkey from http://book.huhiho.com/).

Thanks!
Logged
skwire
« Reply #3 on: July 14, 2011, 08:34:15 AM »

I have some working code so how do you want this to work?  I can make a full GUI for it or I can simply make it recurse through your PDF folder(s) and spit out a text file at the end listing which files had no searchable text.  Your thoughts?
Logged

vevola
« Reply #4 on: July 14, 2011, 09:02:16 AM »

Well, I'm a GUI kinda-guy! It would be great to choose which folders I want to search in, and then have a list that I could order in terms of filename or folder so it would be easier to work with...
Logged
bob99
Supporting Member
Posts: 332
« Reply #5 on: July 14, 2011, 12:23:27 PM »

Quote from: vevola
I have a huge number of PDFs, and often I perform text searches across my collection.

Not sure if I'm asking this right...
Are you searching inside each PDF in the collection one file at a time, or are you able to search all of them in one go, without manually opening each file, searching it, opening the next, and so on?

If it's all at once (automatically one after the other) what are you using to do this?

Logged
vevola
« Reply #6 on: July 14, 2011, 12:54:55 PM »

I use two programs for viewing and handling PDFs. One is a simple viewer and annotator (Foxit), which can search for a word or a phrase in a single PDF or in all PDFs in a folder. The other is called Mendeley Desktop, which is more of an "iTunes for PDFs" (made for academia).

But if a PDF is an image, obviously it won't show up in the results. If I have a list, I could OCR and convert them.

So, no, I don't have to open each file one at a time to perform a search. Both programs index the text of the whole collection.
« Last Edit: July 14, 2011, 12:59:40 PM by vevola »
Logged
bob99
« Reply #7 on: July 14, 2011, 01:44:19 PM »


Son of a gun.  It's something I can do in PDF-XChange also.  Just never thought about it.  Most of the time I know which PDF/manual I need to go to for reference.  Occasionally though, knowing I can do this will be helpful. 
Good luck with your request. Didn't mean to go off topic.
Logged
vevola
« Reply #8 on: July 14, 2011, 01:46:38 PM »

I know how it feels, no worries, and thanks!
Logged
skwire
« Reply #9 on: July 15, 2011, 01:56:00 PM »

Here's a test application to try:  PDF Text Checker

This is a native AHK solution and, currently, it doesn't work as well as I'd like.  Basically, what it does is look for the hex equivalent of "FontName" within the file.  If it's found, that PDF is definitely searchable.  The problem comes in when it's NOT found.  I have some PDFs that don't have that string but are still searchable.  Anyway, give it a shot.

  • Download, extract and run PDFTextChecker.exe.
  • You will be presented with the standard folder selection dialog asking for your PDF source folder.
  • You will then be asked if you want the application to search sub-directories as well.
  • Scanning will take time but you can check which file it's on by hovering over the tray icon.
  • Another message box will alert you as to when it's finished.
  • Two text files will be generated in your chosen folder:  !Searchable.txt and !Not_Searchable.txt

If the results aren't satisfactory, I may have to approach this in a different manner and use a PDF-to-Text type of application to determine if a PDF has any searchable text.  At any rate, let me know how this works.
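
For readers who want to experiment with the marker check described above, here is a minimal Python sketch. It is not the AHK code behind PDF Text Checker; the function name `contains_marker` and the 1 MB chunk size are assumptions chosen for illustration. The file is read in fixed-size chunks, so a very large PDF never has to fit in memory at once, and a few trailing bytes are carried over so a marker split across two reads is still found.

```python
def contains_marker(path, marker=b"FontName", chunk_size=1 << 20):
    """Return True if the raw bytes of `marker` occur anywhere in the file.

    The last len(marker) - 1 bytes of everything seen so far are kept and
    prepended to the next chunk, so a marker that straddles a chunk
    boundary is not missed.
    """
    overlap = len(marker) - 1
    tail = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return False  # end of file reached without a match
            window = tail + chunk
            if marker in window:
                return True
            tail = window[-overlap:] if overlap else b""
```

As in the post, a hit suggests the PDF carries font information, while a miss is inconclusive: possibly the font dictionaries sit inside compressed streams, where the plain string is not visible.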
Logged

vevola
« Reply #10 on: July 15, 2011, 02:13:42 PM »

Thanks for the update!

I tried running it and after a couple of minutes I get the following error:
Quote
Error: Memory limit reached (see #MaxMem in the help file). The current thread
will exit.
Line#
—> 083: Return,str

BTW, the folder I'm searching has about 4GB of PDFs, dunno if that means anything.

EDIT: I thought the program stopped, but it's still running in the background. I imagine the error was for a particular file, and not for the entire program. Any ideas of what it is? Will it show up in the final log? We'll see!
Thanks again!
« Last Edit: July 15, 2011, 02:17:32 PM by vevola »
Logged
skwire
« Reply #11 on: July 15, 2011, 02:20:12 PM »

I'm going to assume that you have some large PDFs?  Clear your cache, re-download and see if this latest build gets through that.  Thanks.
Logged

vevola
« Reply #12 on: July 15, 2011, 03:37:21 PM »

Nope. I still get errors, and indeed eventually it crashes.

The largest of my PDFs are 340MB, 210MB, and 140MB; all the others are under 100MB. Maybe exclude all PDFs over 100MB?
Logged
skwire
« Reply #13 on: July 15, 2011, 04:00:25 PM »

Quote from: vevola
Maybe exclude all PDFs over 100MB?

Done, please re-grab it.  Thanks.
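
A size cut-off like this can be sketched in a few lines of Python. This is only an illustration, not the actual change made to the tool: the helper name `iter_pdfs` and its parameters are hypothetical, and the 100MB default simply mirrors the suggestion quoted above.

```python
import os


def iter_pdfs(root, recurse=True, max_bytes=100 * 1024 * 1024):
    """Yield paths of .pdf files under `root` that are at most `max_bytes` in size."""
    for folder, subfolders, files in os.walk(root):
        if not recurse:
            subfolders.clear()  # stops os.walk from descending into sub-directories
        for name in sorted(files):
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(folder, name)
            if os.path.getsize(path) <= max_bytes:
                yield path
```

Skipped files are silently ignored here; a real tool would probably want to list them separately so they can be checked by hand.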
Logged

vevola
« Reply #14 on: July 15, 2011, 04:13:44 PM »

So I temporarily moved the larger PDFs out and scanned only the ones under 60MB, and it worked (I didn't try it with the very latest version).

However, the lists produced don't seem reliable: non-searchable 817; searchable 616.

(The proportion of non-searchable PDFs should be lower.)
Logged
skwire
« Reply #15 on: July 15, 2011, 04:46:31 PM »

Quote from: vevola
However, the lists produced don't seem reliable: non-searchable 817; searchable 616.

Yep, that's what I was afraid of.  I'll work on a different method.
Logged

skwire
« Reply #16 on: July 16, 2011, 12:27:39 PM »

Let's give this a shot:  PDF Text Checker

Extract all the contents to a folder and run PDFTextChecker.exe.  This version uses pdftotext.exe to extract any text from the PDF and then checks the resulting text file for any alphanumeric characters.  If any are found, it considers that PDF searchable.  It's still not perfect but it seems to be better than the previous version.  If this version doesn't do it for you, you may be better off using something like File Hound to search through your PDFs.
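
The same idea can be sketched in Python for anyone who wants to try it outside the tool. This is an illustrative sketch, not the code in PDF Text Checker: it assumes a `pdftotext` executable is available on the PATH, and the function names are hypothetical. Passing `-` as the output file asks pdftotext to write the extracted text to standard output instead of a file.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext on one file and return whatever text it printed."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        check=True,
    )
    # The output encoding depends on the pdftotext build, so undecodable
    # bytes are replaced rather than raising an error.
    return result.stdout.decode("utf-8", errors="replace")


def has_alphanumeric(text):
    """True if the text contains at least one letter or digit."""
    return any(ch.isalnum() for ch in text)


def is_searchable(pdf_path):
    """Treat a PDF as searchable if pdftotext finds any letter or digit in it."""
    return has_alphanumeric(extract_text(pdf_path))
```

Because a single extracted character is enough to count as searchable, a scan with only a small text stamp on it would still pass this check.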
Logged

vevola
« Reply #17 on: July 17, 2011, 08:05:32 AM »

Thanks again!

Better: 225 non-searchable vs. 1205 searchable.

But I also see many false-positives (about 1/3).
Logged
vevola
« Reply #18 on: July 17, 2011, 08:12:39 AM »


EDIT: Actually it is much better than what I wrote. There are false-positives, but probably not 1:3.

I think that unless you have any other tweaks, this may do! Thanks!
Logged
skwire
« Reply #19 on: July 17, 2011, 02:48:31 PM »

Quote from: vevola
I think that unless you have any other tweaks, this may do! Thanks!

You're welcome; apologies that it couldn't be more accurate.
Logged

rjbull
Charter Member
Posts: 2,776
« Reply #20 on: July 17, 2011, 04:40:40 PM »

Quote from: skwire
you may be better off using something like File Hound to search through your PDFs.

I'm pretty sure Hound uses pdftotext too.  That probably means it ignores image files as being empty of text, which should at least make the search hands-off.
Logged
vevola
« Reply #21 on: July 20, 2011, 04:04:19 AM »

Quote from: rjbull
Quote from: skwire
you may be better off using something like File Hound to search through your PDFs.

I'm pretty sure Hound uses pdftotext too.  That probably means it ignores image files as being empty of text, which should at least make the search hands-off.

I'm not sure exactly how it works, but here is what I've noticed in some of my PDFs: the library scans the document (an image) and then adds some type of "copyright" stamp, which is text. So some PDFs are mostly images, with just a little text.
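
One way to cope with stamped scans like these is to require a minimum amount of text per page rather than any text at all. The Python sketch below is an assumption-laden illustration, not part of the tool: it expects text as produced by pdftotext, which normally ends each page with a form feed character, and the thresholds (200 letters or digits per page, half of the pages) are arbitrary starting points that would need tuning against a real collection.

```python
def text_page_ratio(text, min_chars=200):
    """Fraction of pages whose extracted text has at least `min_chars` letters or digits.

    Pages are assumed to be separated by form feed characters.
    """
    pages = text.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()  # drop the empty piece after the final form feed
    if not pages:
        return 0.0
    rich = sum(
        1 for page in pages
        if sum(ch.isalnum() for ch in page) >= min_chars
    )
    return rich / len(pages)


def looks_like_scan(text, min_chars=200, min_ratio=0.5):
    """Guess that a PDF is a scan if too few of its pages carry real text."""
    return text_page_ratio(text, min_chars) < min_ratio
```

With these defaults, a three-page scan whose only text is a short stamp on every page would be flagged as a scan, while a document with full pages of text would not.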
Logged
suntsu
Supporting Member
Posts: 1
« Reply #22 on: July 21, 2011, 04:31:53 AM »

I do have a similar problem. The app works fine for me!
Thanks a lot (Win XP Pro, old Toshiba Laptop).
Suntsu
Logged
vevola
« Reply #23 on: July 21, 2011, 08:46:26 AM »

kudos to skwire! I'm happy it wasn't just for me!
Logged
skwire
« Reply #24 on: July 21, 2011, 08:48:01 AM »

Quote from: suntsu
I do have a similar problem. The app works fine for me!
Thanks a lot (Win XP Pro, old Toshiba Laptop).
Suntsu

You're welcome, Suntsu, and welcome to the site.
Logged
