Extract text from a PDF using AHK
weedseed85:
Dear Developers,
Might I please ask for your help?
(My English writing skills are very... disappointing, but I think I can make myself clear. If not, please ask. Thank you.)
I would really like to create a script with the help of AutoHotkey (AHK). Perhaps I will need some external software; so be it.
My primary objective is to create a "closed" kind of script: one that does not need any more input than I have already given it.
I would like to write a script that retrieves the text from a searchable PDF file and returns the output of the "search" to me.
I was thinking of having the script / program extract all the available text from the PDF and put it in a txt file, so that with the help of AHK I could search or retrieve the text I need.
I have limited knowledge of AHK, but I've created some scripts before, so I'm not a complete noob. Besides, I know where to find the AHK help documentation.
Any suggestion is fine and will help me get closer to my goal.
Thanks in advance.
Greets,
Michel
4wd:
This might get you started. It's an AutoIt thread with source code provided, so it shouldn't be too hard to convert to AutoHotkey, or just leave it as AutoIt and modify it to do what you want:
Search PDF files for specific word and delete them
It uses pdftotext.exe (part of the Xpdf distribution) to convert the PDF before searching. Naturally you don't need the delete part of it; you could just change that to output a list of found files.
Code reproduced below, not tested by me, (although I'm thinking this might be something useful....hhmm...off to the Bat Cave :) ).
--- Code: AutoIt ---
; Written by DavidFromLafayette & wakillon
; http://www.autoitscript.com/forum/topic/127980-search-pdf-files-for-specific-word-and-delete-them/
#include <File.au3>

$current = "r:\updatecd\gcr_document"
$ext = "*.pdf"
$_pdftotextPath = 'c:\gnuwin32\bin\pdftotext.exe'
$_OutPutFilepath = 'C:\temp\file.txt'
_FileCreate($_OutPutFilepath)
Search($current, $ext)

Func Search($current, $ext)
    Local $search = FileFindFirstFile($current & "\*.*")
    While 1
        Dim $file = FileFindNextFile($search)
        If @error Or StringLen($file) < 1 Then ExitLoop
        ;ConsoleWrite('-->-- $file : ' & $file & @CRLF)
        ; Plain file (not a directory, not "." or ".."): convert it and search the text
        If Not StringInStr(FileGetAttrib($current & "\" & $file), "D") And ($file <> "." And $file <> "..") Then
            $_RunWait = '"' & $_pdftotextPath & '" "' & $current & '\' & $file & '" ' & $_OutPutFilepath
            RunWait($_RunWait, '', @SW_HIDE)
            $filecontents = FileRead($_OutPutFilepath)
            $ObsoleteinFile = StringInStr($filecontents, "Obsolete")
            If $ObsoleteinFile > 0 And $ObsoleteinFile < 100 Then ; check to ensure Obsolete is on title page
                ConsoleWrite("Obsolete in File" & $file & $ObsoleteinFile & @CRLF)
                FileDelete($current & '\' & $file)
            EndIf
            FileDelete($_OutPutFilepath)
        EndIf
        ; Directory (other than "." or ".."): recurse into it
        If StringInStr(FileGetAttrib($current & "\" & $file), "D") And ($file <> "." And $file <> "..") Then
            Search($current & "\" & $file, $ext)
        EndIf
        Sleep(1000)
    WEnd
    FileClose($search)
EndFunc   ;==>Search
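For anyone who would rather prototype the "list the found files instead of deleting them" variant outside AutoIt, here is a minimal Python sketch of the same convert-then-search idea. It is only a sketch under assumptions: the function names `pdf_to_text` and `find_matching_pdfs` and the injectable `convert` parameter are made up for illustration, and it assumes a `pdftotext` binary is on the PATH and accepts `-enc UTF-8`. The match is case-insensitive, which is also how AutoIt's StringInStr behaves by default.

```python
import os
import subprocess
import tempfile


def pdf_to_text(pdf_path, exe="pdftotext"):
    """Convert one PDF to text by running pdftotext and return the text."""
    # Assumption: `exe` resolves to an Xpdf (or compatible) pdftotext binary.
    fd, txt_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)  # pdftotext writes the file itself; only the name is needed
    try:
        subprocess.run([exe, "-enc", "UTF-8", pdf_path, txt_path], check=True)
        with open(txt_path, encoding="utf-8", errors="replace") as handle:
            return handle.read()
    finally:
        os.remove(txt_path)  # temporary text file is deleted after each conversion


def find_matching_pdfs(pdf_paths, needle, convert=pdf_to_text):
    """Return the paths whose extracted text contains `needle` (case-insensitive)."""
    wanted = needle.lower()
    return [path for path in pdf_paths if wanted in convert(path).lower()]
```

Called as `find_matching_pdfs(list_of_pdf_paths, "Obsolete")`, it returns the list of matching files and deletes nothing. The `convert` parameter exists only so the search logic can be tried without pdftotext installed.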
weedseed85:
Dear Four Wheel Drive (4wd),
Thank you!
I'll try to cook something up with the code you provided. I might even take a look at the AutoIt software; it looks very interesting.
Thank You
4wd:
Here's a simple AutoIt program that will let you search a directory of PDF files (optionally recursing into subdirectories) for specific text. At the end it outputs a list of the files that had matches. It started off based on the code above.
Nothing fancy but it might get you started.
--- Code: AutoIt ---
#Region ;**** Directives created by AutoIt3Wrapper_GUI ****
#AutoIt3Wrapper_UseUpx=n
#AutoIt3Wrapper_Res_requestedExecutionLevel=asInvoker
#AutoIt3Wrapper_Res_File_Add=pdftotext.exe
#EndRegion ;**** Directives created by AutoIt3Wrapper_GUI ****

; Include lots of constants for GUI creation
#include <Constants.au3>
#include <ButtonConstants.au3>
#include <EditConstants.au3>
#include <GUIConstantsEx.au3>
#include <StaticConstants.au3>
#include <WindowsConstants.au3>
#include <RFLTA.au3> ; Recursive filelist function

; If pdftotext.exe doesn't exist then extract it from within the SearchPDF executable
If Not FileExists('pdftotext.exe') Then
    FileInstall('pdftotext.exe', '.\pdftotext.exe')
EndIf

; Global variables: inifile
Global $inifile = StringLeft(@ScriptName, StringLen(@ScriptName) - 4) & '.ini'

; Initial search directory = users Documents
$current = @MyDocumentsDir
$_text = ''

_GetIni() ; Get previous directory & search from inifile

; Simple GUI interface
#Region ### START Koda GUI section ###
$Form1_1 = GUICreate("SearchPDF", 354, 376) ; GUI window name/size
$Input1 = GUICtrlCreateInput("", 7, 65, 250, 21) ; Search text input field
GUICtrlSetData($Input1, $_text) ; Set with previous data if it exists
$Input2 = GUICtrlCreateInput("", 8, 10, 250, 21, BitOR($GUI_SS_DEFAULT_INPUT,$ES_READONLY)) ; Search dir display field
GUICtrlSetData($Input2, $current) ; Set with previous data if it exists
$Button1 = GUICtrlCreateButton("Path", 270, 7, 75, 25) ; Path select button
$Button2 = GUICtrlCreateButton("Exit", 136, 339, 75, 25) ; Exit button
$Button3 = GUICtrlCreateButton("Search", 272, 65, 75, 25) ; Search button
$Checkbox1 = GUICtrlCreateCheckbox("Recurse", 272, 40, 65, 17) ; Recurse checkbox
$Edit1 = GUICtrlCreateEdit("", 8, 100, 337, 229) ; Edit control for output, will let you copy text, (and edit it)
GUISetState(@SW_SHOW) ; Show the GUI
#EndRegion ### END Koda GUI section ###

While 1 ; Wait for something to happen
    $nMsg = GUIGetMsg() ; Get any messages from the interface
    Switch $nMsg
        Case $Button1 ; Path button pressed
            $temp = FileSelectFolder('Folder to monitor:', '', 4)
            If @error <> 1 Then $current = $temp
            GUICtrlSetData($Input2, $current) ; Change input field to reflect chosen path
        Case $GUI_EVENT_CLOSE, $Button2 ; GUI closed/Exit pressed: Write out current path/search text
            IniWrite($inifile, 'General', 'Path', $current)
            IniWrite($inifile, 'General', 'Search', $_text)
            Exit
        Case $Button3 ; Search pressed
            If (GUICtrlRead($Checkbox1) = $GUI_CHECKED) Then ; Read state of Recurse checkbox and set flag
                $_recurse = 1
            Else
                $_recurse = 0
            EndIf
            $_text = GUICtrlRead($Input1) ; Read search text, if empty pop up a message
            If $_text = '' Then
                MsgBox(48, 'SearchPDF', 'No search text entered')
                ContinueCase
            EndIf
            $_pdfFiles = _RecFileListToArray($current, '*.pdf', 1, $_recurse, 0, 2) ; Call RFLTA to list all .pdf files
            If Not IsArray($_pdfFiles) Or $_pdfFiles[0] = 0 Then ; If none found, pop up a message
                MsgBox(48, 'SearchPDF', 'No PDF files found')
                ContinueCase
            EndIf
            Search($_pdfFiles, $_text) ; Call Search function with array of pdf files and search text
        Case Else
    EndSwitch
WEnd

Func Search($files, $text) ; Arguments passed: Array containing PDF path\filenames and text to search for
    Local $_tempFile = @TempDir & '\tempPDF.txt', $output = '' ; Local variables: temporary txt file and output
    For $i = 1 To $files[0] ; First array element [0] contains number of files
        $_RunWait = '"' & $files[$i] & '" "' & $_tempFile & '"' ; Compose command for conversion, quoting both paths in case they contain spaces
        RunWait(@comspec & ' /c pdftotext.exe ' & $_RunWait, '.', @SW_HIDE) ; Execute command for conversion with hidden CLI
        $filecontents = FileRead($_tempFile) ; Read converted PDF file into variable
        $textInFile = StringInStr($filecontents, $text) ; Look for search text
        If $textInFile > 0 Then ; If it exists add filename to output list
            $output &= $files[$i] & @CRLF
        EndIf
        FileDelete($_tempFile) ; Delete temporary txt file ready for next conversion
    Next
    If $output = '' Then $output = 'No files matched'
    GUICtrlSetData($Edit1, $output) ; Write the results into the edit control
EndFunc   ;==>Search

Func _GetIni() ; Reads data from ini file if it exists
    If FileExists($inifile) Then
        $current = IniRead($inifile, 'General', 'Path', @MyDocumentsDir)
        $_text = IniRead($inifile, 'General', 'Search', '')
    EndIf
EndFunc   ;==>_GetIni
pdftotext.exe is contained within SearchPDF.exe as a resource and will be extracted if necessary. The previous path and search text are written to an ini file on exit, to be used the next time it's run.
It seems to work OK, but no serious bug hunting was done and there is no error trapping either.
EDIT: Cleaned it up a bit, (laugh if you like), now outputs to edit control so you can copy results, added lots of comments because mouser likes that kind of thing.
EDIT2: Tells you if no matches are found.
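The ini handling mentioned above (path and search text saved on exit, read back at startup) could be sketched in Python roughly as follows. This is a hedged illustration, not a port of the AutoIt code: `load_settings` and `save_settings` are hypothetical names, and only the section and key names (`General`, `Path`, `Search`) are taken from the script.

```python
import configparser


def load_settings(ini_path, default_dir):
    """Return (path, search) from the ini file, or the defaults if it is missing."""
    parser = configparser.ConfigParser(interpolation=None)  # allow '%' in values
    parser.optionxform = str  # keep key case ('Path' rather than 'path')
    parser.read(ini_path, encoding="utf-8")  # a missing file is silently skipped
    path = parser.get("General", "Path", fallback=default_dir)
    search = parser.get("General", "Search", fallback="")
    return path, search


def save_settings(ini_path, path, search):
    """Write the current path and search text, overwriting any previous file."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str
    parser["General"] = {"Path": path, "Search": search}
    with open(ini_path, "w", encoding="utf-8") as handle:
        parser.write(handle)
```

The `fallback` arguments play the same role as the default values passed to IniRead: a missing file or key simply yields the default.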
Anyway, I'm stopping there. You could add a progress bar, some way to cancel, better output (e.g. line numbers within the file, a "no text within file" message, etc.), and so on.
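As a rough idea of what the "better output" suggestion might look like, here is a small Python sketch that reports line numbers for each match and flags files with no extractable text. The names `find_line_matches` and `describe_file` are hypothetical, and the line numbers refer to lines of the converted text file, not to pages or lines of the original PDF.

```python
def find_line_matches(text, needle):
    """Return (line_number, line) pairs for lines containing `needle`."""
    wanted = needle.lower()  # case-insensitive, like AutoIt's default StringInStr
    return [
        (number, line.strip())
        for number, line in enumerate(text.split("\n"), start=1)
        if wanted in line.lower()
    ]


def describe_file(name, text, needle):
    """Build a short report for one converted file."""
    if not text.strip():
        return f"{name}: no text within file"  # e.g. a scanned, image-only PDF
    hits = find_line_matches(text, needle)
    if not hits:
        return f"{name}: no matches"
    return "\n".join([f"{name}:"] + [f"  line {n}: {line}" for n, line in hits])
```

The report string for each file could be appended to the output in place of the bare filename.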
phitsc:
I have used pdftotext for my NANY 2013 entry pdfautomv. The code / script is in Ruby and can be found here: https://bitbucket.org/phitsc/pdfautomv/src/1ff9d93070163d5e57c21d0ef479cc8404fe1daf/pdfautomv.rb?at=default