Author Topic: Extract text PDF using AHK  (Read 5452 times)

weedseed85

  • Participant
  • Joined in 2013
  • *
  • Posts: 2
Extract text PDF using AHK
« on: January 18, 2013, 02:42:53 AM »
Dear Developers,

Might I please ask for your help?

(My English writing skills are very... disappointing, but I think I can make myself clear. If not, please ask. Thank you.)

I would really like to create a script with the help of AutoHotkey (AHK). If I need some external software for that, so be it.
My primary objective is to create a "closed" kind of script: one that does not need any more input than I have already given it.

I would like to write a script that retrieves the text from a searchable PDF file and returns the output of the "search".

I was thinking about having the script/program extract all the available text from the PDF and put it in a txt file; then, with the help of AHK, I could
search or retrieve the text I need.

I have limited knowledge of AHK, but I've created some scripts before, so I'm not a complete noob. Besides, I know where to find the AHK help documentation.

Any suggestion is fine and will help me get closer to my goal.

Thanks in advance.

Greets,
Michel
 


4wd

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 4,471
Re: Extract text PDF using AHK
« Reply #1 on: January 18, 2013, 03:48:03 AM »
This might get you started: it's an AutoIt thread with source code provided, so it shouldn't be too hard to convert to AutoHotkey, or you can just leave it as AutoIt and modify it to do what you want:

Search PDF files for specific word and delete them

It uses pdftotext.exe (part of the Xpdf distribution) to convert each PDF to text before searching. Naturally you don't need the delete part of it; you could just change that to output a list of found files.
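For comparison, here is a minimal Python sketch of the same convert-then-search idea, collecting a list of matching files instead of deleting anything. It is not part of the AutoIt thread. The function names are made up for illustration, and it assumes a `pdftotext` executable is on the PATH and is called as `pdftotext input.pdf output.txt`.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, exe="pdftotext"):
    """Build the command line: pdftotext <input.pdf> <output.txt>."""
    return [exe, str(pdf_path), str(txt_path)]


def text_matches(text, needle):
    """Case-insensitive substring test (AutoIt's StringInStr also ignores case by default)."""
    return needle.lower() in text.lower()


def find_matching_pdfs(folder, needle, exe="pdftotext"):
    """Return the PDFs under `folder` (recursively) whose extracted text contains `needle`."""
    matches = []
    with tempfile.TemporaryDirectory() as tmp:
        txt_path = Path(tmp) / "out.txt"
        for pdf in sorted(Path(folder).rglob("*.pdf")):
            # Raises FileNotFoundError if the pdftotext executable cannot be found.
            result = subprocess.run(pdftotext_command(pdf, txt_path, exe))
            converted = result.returncode == 0 and txt_path.exists()
            # Output encoding depends on the pdftotext build, so undecodable bytes are replaced.
            if converted and text_matches(
                txt_path.read_text(encoding="utf-8", errors="replace"), needle
            ):
                matches.append(pdf)
            txt_path.unlink(missing_ok=True)  # start clean for the next file
    return matches
```

Calling `find_matching_pdfs(r"r:\updatecd\gcr_document", "Obsolete")` would then return the matching paths, which can be printed or written to a file.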

Code reproduced below, not tested by me, (although I'm thinking this might be something useful....hhmm...off to the Bat Cave  :) ).

Code: AutoIt
; Written by DavidFromLafayette & wakillon
; http://www.autoitscript.com/forum/topic/127980-search-pdf-files-for-specific-word-and-delete-them/
#include <File.au3>
$current = "r:\updatecd\gcr_document"
$ext = "*.pdf"
$_pdftotextPath = 'c:\gnuwin32\bin\pdftotext.exe'
$_OutPutFilepath = 'C:\temp\file.txt'
_FileCreate($_OutPutFilepath)

Search($current, $ext)

Func Search($current, $ext)
        Local $file, $path, $filecontents, $ObsoleteinFile
        Local $search = FileFindFirstFile($current & "\*.*")
        If $search = -1 Then Return ; nothing to enumerate in this folder
        While 1
                $file = FileFindNextFile($search)
                If @error Or StringLen($file) < 1 Then ExitLoop
                If $file = "." Or $file = ".." Then ContinueLoop ; the original test (<> combined with Or) was always true
                $path = $current & "\" & $file
                If StringInStr(FileGetAttrib($path), "D") Then
                        Search($path, $ext) ; recurse into the subfolder
                ElseIf StringRight($file, 4) = StringTrimLeft($ext, 1) Then ; only convert files matching $ext (previously unused)
                        ; pdftotext <input.pdf> <output.txt>, with every path quoted
                        RunWait('"' & $_pdftotextPath & '" "' & $path & '" "' & $_OutPutFilepath & '"', '', @SW_HIDE)
                        $filecontents = FileRead($_OutPutFilepath)
                        $ObsoleteinFile = StringInStr($filecontents, "Obsolete")
                        If $ObsoleteinFile > 0 And $ObsoleteinFile < 100 Then ; check to ensure Obsolete is on title page
                                ConsoleWrite("Obsolete in file " & $file & " at position " & $ObsoleteinFile & @CRLF)
                                FileDelete($path)
                        EndIf
                        FileDelete($_OutPutFilepath)
                EndIf
                Sleep(1000)
        WEnd
        FileClose($search)
EndFunc   ;==>Search

weedseed85

  • Participant
  • Joined in 2013
  • *
  • Posts: 2
Re: Extract text PDF using AHK
« Reply #2 on: January 18, 2013, 04:28:56 AM »
Dear Four Wheel Drive (4wd),

Thank you!

I'll try to cook something up with the code you provided. I might even take a look at the
AutoIt software; it looks very interesting.

Thank You

 

4wd

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 4,471
Re: Extract text PDF using AHK
« Reply #3 on: January 18, 2013, 08:31:45 AM »
Here's a simple AutoIt program that will let you search a directory of PDF files (optionally recursing into subdirectories) for specific text. At the end it outputs a list of the files that had matches. It started off based on the code above.

Nothing fancy but it might get you started.

Code: AutoIt
#Region ;**** Directives created by AutoIt3Wrapper_GUI ****
#AutoIt3Wrapper_UseUpx=n
#AutoIt3Wrapper_Res_requestedExecutionLevel=asInvoker
#AutoIt3Wrapper_Res_File_Add=pdftotext.exe
#EndRegion ;**** Directives created by AutoIt3Wrapper_GUI ****

; Include lots of constants for GUI creation
#include <Constants.au3>
#include <ButtonConstants.au3>
#include <EditConstants.au3>
#include <GUIConstantsEx.au3>
#include <StaticConstants.au3>
#include <WindowsConstants.au3>
#include <RFLTA.au3>    ; Recursive filelist function

; If pdftotext.exe doesn't exist then extract it from within the SearchPDF executable
If Not FileExists('pdftotext.exe') Then FileInstall('pdftotext.exe', '.\pdftotext.exe')

; Global variables: inifile
; The declaration is missing from the post as quoted; the filename below is an assumption
Global $inifile = @ScriptDir & '\SearchPDF.ini'

; Initial search directory = users Documents
$current = @MyDocumentsDir
$_text = ''

_GetIni() ; Get previous directory & search from inifile

; Simple GUI interface
#Region ### START Koda GUI section ###
$Form1_1 = GUICreate("SearchPDF", 354, 376)             ; GUI window name/size
$Input1 = GUICtrlCreateInput("", 7, 65, 250, 21)        ; Search text input field
GUICtrlSetData($Input1, $_text)                         ; Set with previous data if it exists
$Input2 = GUICtrlCreateInput("", 8, 10, 250, 21, BitOR($GUI_SS_DEFAULT_INPUT,$ES_READONLY))     ; Search dir display field
GUICtrlSetData($Input2, $current)                               ; Set with previous data if it exists
$Button1 = GUICtrlCreateButton("Path", 270, 7, 75, 25)          ; Path select button
$Button2 = GUICtrlCreateButton("Exit", 136, 339, 75, 25)        ; Exit button
$Button3 = GUICtrlCreateButton("Search", 272, 65, 75, 25)       ; Search button
$Checkbox1 = GUICtrlCreateCheckbox("Recurse", 272, 40, 65, 17)  ; Recurse checkbox
$Edit1 = GUICtrlCreateEdit("", 8, 100, 337, 229)                                ; Edit control for output, will let you copy text, (and edit it)
GUISetState(@SW_SHOW)                                   ; Show the GUI
#EndRegion ### END Koda GUI section ###


While 1                                         ; Wait for something to happen
        $nMsg = GUIGetMsg()                     ; Get any messages from the interface
        Switch $nMsg
                Case $Button1                   ; Path button pressed
                        $temp = FileSelectFolder('Folder to monitor:', '', 4)
                        If @error <> 1 Then $current = $temp
                        GUICtrlSetData($Input2, $current)               ; Change input field to reflect chosen path
                Case $GUI_EVENT_CLOSE, $Button2         ; GUI closed/Exit pressed: Write out current path/search text
                        IniWrite($inifile, 'General', 'Path', $current)
                        IniWrite($inifile, 'General', 'Search', $_text)
                        Exit
                Case $Button3           ; Search pressed
                        If (GUICtrlRead($Checkbox1) = $GUI_CHECKED) Then        ; Read state of Recurse checkbox and set flag
                                $_recurse = 1
                        Else
                                $_recurse = 0
                        EndIf
                        $_text = GUICtrlRead($Input1)   ; Read search text, if empty pop up a message
                        If $_text = '' Then
                                MsgBox(48, 'SearchPDF', 'No search text entered')
                                ContinueLoop            ; Back to waiting for messages
                        EndIf
                        $_pdfFiles = _RecFileListToArray($current, '*.pdf', 1, $_recurse, 0, 2) ; Call RFLTA to list all .pdf files
                        If Not IsArray($_pdfFiles) Or $_pdfFiles[0] = 0 Then    ; If none found, pop up a message
                                MsgBox(48, 'SearchPDF', 'No PDF files found')
                                ContinueLoop            ; Back to waiting for messages
                        EndIf
                        Search($_pdfFiles, $_text)      ; Call Search function with array of pdf files and search text
                Case Else
        EndSwitch
WEnd


Func Search($files, $text) ; Arguments passed: Array containing PDF path\filenames and text to search for
        Local $_tempFile = @TempDir & '\tempPDF.txt', $output = ''      ; Local variables: temporary txt file and output
        For $i = 1 To $files[0] ; First array element [0] contains number of files
                $_RunWait = '"' & $files[$i] & '" "' & $_tempFile & '"' ; Compose command for conversion, both paths quoted in case of spaces
                RunWait(@comspec & ' /c pdftotext.exe ' & $_RunWait, '.', @SW_HIDE)     ; Execute command for conversion with hidden CLI
                $filecontents = FileRead($_tempFile)    ; Read converted PDF file into variable
                $textInFile = StringInStr($filecontents, $text) ; Look for search text
                If $textInFile > 0 Then ; If it exists add filename to output list
                        $output &= $files[$i] & @CRLF
                EndIf
                FileDelete($_tempFile)  ; Delete temporary txt file ready for next conversion
        Next
        If $output = '' Then $output = 'No files matched'
        GUICtrlSetData($Edit1, $output) ; Write the results into the edit control
EndFunc   ;==>Search

Func _GetIni() ; Reads data from ini file if it exists
        If FileExists($inifile) Then
                $current = IniRead($inifile, 'General', 'Path', @MyDocumentsDir)
                $_text = IniRead($inifile, 'General', 'Search', '')
        EndIf
EndFunc   ;==>_GetIni

pdftotext.exe is contained within SearchPDF.exe as a resource and will be extracted if necessary. The previous path and search text are written to an ini file on exit, to be used the next time it's run.
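If you end up porting this to another language, the "remember the last path and search" part is small. Below is a rough Python sketch using the standard `configparser` module. The section and key names mirror the IniWrite/IniRead calls in the listing, while the function names are invented for this example.

```python
import configparser


def save_settings(ini_path, folder, search):
    """Write the last folder and search text to the [General] section."""
    config = configparser.ConfigParser(interpolation=None)  # keep '%' characters literal
    config["General"] = {"Path": folder, "Search": search}
    with open(ini_path, "w", encoding="utf-8") as handle:
        config.write(handle)


def load_settings(ini_path, default_folder):
    """Return (folder, search); fall back to defaults if the file or keys are missing."""
    config = configparser.ConfigParser(interpolation=None)
    config.read(ini_path, encoding="utf-8")  # a missing file is silently ignored
    folder = config.get("General", "Path", fallback=default_folder)
    search = config.get("General", "Search", fallback="")
    return folder, search
```

As in the AutoIt version, loading happens once at startup and saving once on exit.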

Seems to work OK but no great bug finding was performed, no error trapping either.

EDIT: Cleaned it up a bit, (laugh if you like), now outputs to edit control so you can copy results, added lots of comments because mouser likes that kind of thing.
EDIT2: Tells you if no matches are found.

Anyway, I'm stopping there - you could add a progressbar, some way to cancel, better output (eg. line numbers within the file, no text within file, etc), etc.
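The "better output" idea could look roughly like the Python sketch below. It works on text that has already been extracted, reports the line numbers of each hit, and tells "no text within file" apart from "no matches". The function names and the report format are only illustrative, and the line numbers refer to the extracted text, not to the PDF pages.

```python
def matching_lines(text, needle):
    """Return (line_number, line) pairs for lines containing `needle`, ignoring case."""
    needle = needle.lower()
    return [
        (number, line.strip())
        for number, line in enumerate(text.split("\n"), start=1)
        if needle in line.lower()
    ]


def format_report(filename, text, needle):
    """Build a small report for one file."""
    if not text.strip():
        # Typically a scanned PDF with no text layer
        return f"{filename}: no text within file"
    hits = matching_lines(text, needle)
    if not hits:
        return f"{filename}: no matches"
    return "\n".join(f"{filename}:{number}: {line}" for number, line in hits)
```

The returned string can go straight into whatever output control or file the results are shown in.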
« Last Edit: January 19, 2013, 12:05:35 AM by 4wd, Reason: Cleaned up code, now outputs to edit control. »

phitsc

  • Honorary Member
  • Joined in 2008
  • **
  • Posts: 1,187
Re: Extract text PDF using AHK
« Reply #4 on: January 18, 2013, 09:10:08 AM »
I have used pdftotext for my NANY 2013 entry pdfautomv. The code / script is in Ruby and can be found here: https://bitbucket.or...automv.rb?at=default

kyrathaba

  • N.A.N.Y. Organizer
  • Honorary Member
  • Joined in 2006
  • **
  • Posts: 3,120
Re: Extract text PDF using AHK
« Reply #5 on: February 24, 2013, 01:11:27 PM »
Quote
added lots of comments because mouser likes that kind of thing...

LOL  ;D