Extract text PDF using AHK

weedseed85:
Dear Developers,

Might I please ask for your help?

(My English writing skills are very... disappointing, but I think I can make myself clear. If not, please ask. Thank you.)

I would really like to create a script with the help of AutoHotkey (AHK). Perhaps I will need some external software; so be it.
My primary objective is to create a "closed" kind of script: one that does not need any more input than I have already given it.

I would like to write a script that retrieves the text from a searchable PDF file and returns the output of the "search" to me.

I was thinking of having the script / program extract all the available text from the PDF and put it in a txt file; then, with the help of AHK, I could
search for or retrieve the text I need.

I have limited knowledge of AHK, but I've created some scripts before, so I'm not a complete noob. Besides, I know where to find the AHK help documentation.

Any suggestion is fine and will help me get closer to my goal.

Thanks in advance.

Greets,
Michel
 

4wd:
This might get you started: it's an AutoIt thread with source code provided, so it shouldn't be too hard to convert to AutoHotkey, or you could just leave it as AutoIt and modify it to do what you want:

Search PDF files for specific word and delete them

It uses pdftotext.exe (part of the Xpdf distribution) to convert the PDF to text before searching. Naturally you don't need the delete part of it; you could just change that to output a list of found files.

Code reproduced below, not tested by me, (although I'm thinking this might be something useful....hhmm...off to the Bat Cave  :) ).


--- Code: AutoIt ---
; Written by DavidFromLafayette & wakillon
; http://www.autoitscript.com/forum/topic/127980-search-pdf-files-for-specific-word-and-delete-them/
#include <File.au3>

$current = "r:\updatecd\gcr_document"
$ext = "*.pdf"
$_pdftotextPath = 'c:\gnuwin32\bin\pdftotext.exe'
$_OutPutFilepath = 'C:\temp\file.txt'
_FileCreate($_OutPutFilepath)

Search($current, $ext)

Func Search($current, $ext)
    Local $search = FileFindFirstFile($current & "\*.*")
    While 1
        Dim $file = FileFindNextFile($search)
        If @error Or StringLen($file) < 1 Then ExitLoop
        ;ConsoleWrite('-->-- $file : ' & $file & @CRLF)
        ; Plain file: convert it to text and look for the keyword.
        ; The "." / ".." test uses And; with Or (as originally posted) it was always true.
        If Not StringInStr(FileGetAttrib($current & "\" & $file), "D") And ($file <> "." And $file <> "..") Then
            $_RunWait = '"' & $_pdftotextPath & '" "' & $current & '\' & $file & '" ' & $_OutPutFilepath
            RunWait($_RunWait, '', @SW_HIDE)
            $filecontents = FileRead($_OutPutFilepath)
            $ObsoleteinFile = StringInStr($filecontents, "Obsolete")
            If $ObsoleteinFile > 0 And $ObsoleteinFile < 100 Then ; check to ensure Obsolete is on title page
                ConsoleWrite("Obsolete in File" & $file & $ObsoleteinFile & @CRLF)
                FileDelete($current & '\' & $file)
            EndIf
            FileDelete($_OutPutFilepath)
        EndIf
        ; Directory: recurse into it
        If StringInStr(FileGetAttrib($current & "\" & $file), "D") And ($file <> "." And $file <> "..") Then
            Search($current & "\" & $file, $ext)
        EndIf
        Sleep(1000)
    WEnd
    FileClose($search)
EndFunc   ;==>Search
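
For anyone who just wants to see the convert-then-search idea without the delete step, here is a rough sketch in Python (not AutoIt or AHK, purely to show the logic). It assumes pdftotext is on the PATH; the function names `pdftotext_extract`, `find_pdfs` and `search_pdfs` are made up for this illustration. Passing `-` as the output file asks pdftotext to write to stdout, and `-enc UTF-8` selects the output encoding. The match is case-insensitive, which as far as I know is also what AutoIt's StringInStr does by default.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List


def pdftotext_extract(pdf_path: Path, exe: str = "pdftotext") -> str:
    """Convert one PDF to text by running pdftotext (assumed to be installed)."""
    result = subprocess.run(
        [exe, "-enc", "UTF-8", str(pdf_path), "-"],  # "-" = write text to stdout
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def find_pdfs(folder: Path, recurse: bool = True) -> List[Path]:
    """List the PDF files in a folder, optionally including subfolders."""
    pattern = "**/*.pdf" if recurse else "*.pdf"
    return sorted(folder.glob(pattern))


def search_pdfs(
    pdfs: Iterable[Path],
    needle: str,
    extract: Callable[[Path], str] = pdftotext_extract,
) -> List[Path]:
    """Return the PDFs whose extracted text contains needle (case-insensitive)."""
    wanted = needle.lower()
    return [pdf for pdf in pdfs if wanted in extract(pdf).lower()]
```

Usage would be something like `search_pdfs(find_pdfs(Path("some/folder")), "Obsolete")`, which gives back a list of matching files instead of deleting anything. The `extract` argument is only there so the converter can be swapped out, for example to try the search logic without pdftotext installed.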

weedseed85:
Dear Four Wheel Drive (4wd),

Thank you!

I'll try to cook something up with the code you provided. I might even take a look at the
AutoIt software; it looks very interesting.

Thank You

 

4wd:
Here's a simple AutoIt program that will let you search a directory of PDF files (optionally recursing into subdirectories) for specific text. At the end it will output a list of files that had matches. It started off based on the code above.

Nothing fancy but it might get you started.


--- Code: AutoIt ---
#Region ;**** Directives created by AutoIt3Wrapper_GUI ****
#AutoIt3Wrapper_UseUpx=n
#AutoIt3Wrapper_Res_requestedExecutionLevel=asInvoker
#AutoIt3Wrapper_Res_File_Add=pdftotext.exe
#EndRegion ;**** Directives created by AutoIt3Wrapper_GUI ****

; Include lots of constants for GUI creation
#include <Constants.au3>
#include <ButtonConstants.au3>
#include <EditConstants.au3>
#include <GUIConstantsEx.au3>
#include <StaticConstants.au3>
#include <WindowsConstants.au3>
#include <RFLTA.au3>    ; Recursive filelist function

; If pdftotext.exe doesn't exist then extract it from within the SearchPDF executable
If Not FileExists('pdftotext.exe') Then
    FileInstall('pdftotext.exe', '.\pdftotext.exe')
EndIf

; Global variables: inifile
Global $inifile = StringLeft(@ScriptName, StringLen(@ScriptName) - 4) & '.ini'

; Initial search directory = users Documents
$current = @MyDocumentsDir
$_text = ''

_GetIni() ; Get previous directory & search from inifile

; Simple GUI interface
#Region ### START Koda GUI section ###
$Form1_1 = GUICreate("SearchPDF", 354, 376)  ; GUI window name/size
$Input1 = GUICtrlCreateInput("", 7, 65, 250, 21)  ; Search text input field
GUICtrlSetData($Input1, $_text)  ; Set with previous data if it exists
$Input2 = GUICtrlCreateInput("", 8, 10, 250, 21, BitOR($GUI_SS_DEFAULT_INPUT,$ES_READONLY))  ; Search dir display field
GUICtrlSetData($Input2, $current)  ; Set with previous data if it exists
$Button1 = GUICtrlCreateButton("Path", 270, 7, 75, 25)  ; Path select button
$Button2 = GUICtrlCreateButton("Exit", 136, 339, 75, 25)  ; Exit button
$Button3 = GUICtrlCreateButton("Search", 272, 65, 75, 25)  ; Search button
$Checkbox1 = GUICtrlCreateCheckbox("Recurse", 272, 40, 65, 17)  ; Recurse checkbox
$Edit1 = GUICtrlCreateEdit("", 8, 100, 337, 229)  ; Edit control for output, will let you copy text, (and edit it)
GUISetState(@SW_SHOW)  ; Show the GUI
#EndRegion ### END Koda GUI section ###

While 1  ; Wait for something to happen
    $nMsg = GUIGetMsg()  ; Get any messages from the interface
    Switch $nMsg
        Case $Button1  ; Path button pressed
            $temp = FileSelectFolder('Folder to monitor:', '', 4)
            If @error <> 1 Then $current = $temp
            GUICtrlSetData($Input2, $current)  ; Change input field to reflect chosen path
        Case $GUI_EVENT_CLOSE, $Button2  ; GUI closed/Exit pressed: Write out current path/search text
            IniWrite($inifile, 'General', 'Path', $current)
            IniWrite($inifile, 'General', 'Search', $_text)
            Exit
        Case $Button3  ; Search pressed
            If (GUICtrlRead($Checkbox1) = $GUI_CHECKED) Then  ; Read state of Recurse checkbox and set flag
                $_recurse = 1
            Else
                $_recurse = 0
            EndIf
            $_text = GUICtrlRead($Input1)  ; Read search text, if empty pop up a message
            If $_text = '' Then
                MsgBox(48, 'SearchPDF', 'No search text entered')
                ContinueCase
            EndIf
            $_pdfFiles = _RecFileListToArray($current, '*.pdf', 1, $_recurse, 0, 2)  ; Call RFLTA to list all .pdf files
            If Not IsArray($_pdfFiles) Or $_pdfFiles[0] = 0 Then  ; If none found, pop up a message
                MsgBox(48, 'SearchPDF', 'No PDF files found')
                ContinueCase
            EndIf
            Search($_pdfFiles, $_text)  ; Call Search function with array of pdf files and search text
        Case Else
    EndSwitch
WEnd

Func Search($files, $text)  ; Arguments passed: Array containing PDF path\filenames and text to search for
    Local $_tempFile = @TempDir & '\tempPDF.txt', $output = ''  ; Local variables: temporary txt file and output
    For $i = 1 To $files[0]  ; First array element [0] contains number of files
        $_RunWait = '"' & $files[$i] & '" ' & $_tempFile  ; Compose command for conversion
        RunWait(@comspec & ' /c pdftotext.exe ' & $_RunWait, '.', @SW_HIDE)  ; Execute command for conversion with hidden CLI
        $filecontents = FileRead($_tempFile)  ; Read converted PDF file into variable
        $textInFile = StringInStr($filecontents, $text)  ; Look for search text
        If $textInFile > 0 Then  ; If it exists add filename to output list
            $output &= $files[$i] & @CRLF
        EndIf
        FileDelete($_tempFile)  ; Delete temporary txt file ready for next conversion
    Next
    If $output = '' Then $output = 'No files matched'
    GUICtrlSetData($Edit1, $output)  ; Write the results into the edit control
EndFunc   ;==>Search

Func _GetIni()  ; Reads data from ini file if it exists
    If FileExists($inifile) Then
        $current = IniRead($inifile, 'General', 'Path', @MyDocumentsDir)
        $_text = IniRead($inifile, 'General', 'Search', '')
    EndIf
EndFunc

pdftotext.exe is contained within SearchPDF.exe as a resource; it'll be extracted if necessary. The previous path and search text are written to an ini file on exit, to be used the next time it's run.
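
If you end up porting this to another language, the ini part is easy to reproduce. A minimal Python sketch of the same remember-the-last-path-and-search idea follows. It uses the `General` section and the `Path` / `Search` keys from the script above; the function names `load_settings` and `save_settings` are hypothetical, just for this example.

```python
import configparser
from pathlib import Path
from typing import Tuple


def load_settings(ini_path: Path, default_path: str) -> Tuple[str, str]:
    """Return (path, search) from the ini file, or defaults if it is missing."""
    parser = configparser.ConfigParser(interpolation=None)  # keep '%' literal
    parser.read(ini_path, encoding="utf-8")  # a missing file is silently skipped
    path = parser.get("General", "Path", fallback=default_path)
    search = parser.get("General", "Search", fallback="")
    return path, search


def save_settings(ini_path: Path, path: str, search: str) -> None:
    """Write the current path and search text so the next run can reuse them."""
    parser = configparser.ConfigParser(interpolation=None)
    parser["General"] = {"Path": path, "Search": search}
    with open(ini_path, "w", encoding="utf-8") as handle:
        parser.write(handle)
```

You would call `load_settings` once at startup and `save_settings` when the window closes, which mirrors what _GetIni() and the two IniWrite calls do.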

Seems to work OK but no great bug finding was performed, no error trapping either.

EDIT: Cleaned it up a bit, (laugh if you like), now outputs to edit control so you can copy results, added lots of comments because mouser likes that kind of thing.
EDIT2: Tells you if no matches are found.

Anyway, I'm stopping there - you could add a progressbar, some way to cancel, better output (eg. line numbers within the file, no text within file, etc), etc.
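
On the "better output" idea, here is a small Python sketch (again only to show the logic, not AutoIt) of how page and line numbers could be reported from the converted text. It relies on pdftotext putting a form feed character between pages, which is its default behaviour unless page breaks are switched off. The names `locate_matches` and `has_text` are invented for this example.

```python
from typing import List, Tuple


def has_text(text: str) -> bool:
    """False when the conversion produced nothing but whitespace (e.g. a scanned PDF)."""
    return bool(text.strip())


def locate_matches(text: str, needle: str) -> List[Tuple[int, int, str]]:
    """Return (page, line on that page, line text) for every line containing needle."""
    wanted = needle.lower()
    hits = []
    # Form feed (\f) separates pages in pdftotext output
    for page_no, page in enumerate(text.split("\f"), start=1):
        for line_no, line in enumerate(page.splitlines(), start=1):
            if wanted in line.lower():
                hits.append((page_no, line_no, line.strip()))
    return hits
```

Each hit could then be written to the output as something like "file.pdf, page 2, line 3", and `has_text` lets you list files that contained no extractable text separately from files that simply had no match.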

phitsc:
I have used pdftotext for my NANY 2013 entry pdfautomv. The code / script is in Ruby and can be found here: https://bitbucket.org/phitsc/pdfautomv/src/1ff9d93070163d5e57c21d0ef479cc8404fe1daf/pdfautomv.rb?at=default
