REQ: CLI : basically unzip + run PDFTextChecker and move resulting files

jity2:
Dear all,

I am trying to speed up my manual monthly archiving work by automating parts of it. I am on Windows 8.1 Home, 64-bit.

[0) I am using SyncBackPro V8 to periodically move files (zip, rar, pdf, jpg and png) to a folder named "folder1".
Once this SyncBackPro profile has run, it can launch a program after the above files have been moved ("Run after profile").
I have discovered that I can run a .bat file which itself runs several .bat files (https://stackoverflow.com/questions/1103994/how-to-run-multiple-bat-files-within-a-bat-file). ]

The main.bat file (or powershell?) should do the following :

1) delete duplicate files (MD5 checksum?)
(maybe using this powershell script https://n3wjack.net/2015/04/06/find-and-delete-duplicate-files-with-just-powershell/ ?)

2) Recursively unzip/unrar the zip and rar files in "folder1" (and delete the originals once done)
I thought of using this old coding snack in AHK (RecurUnZip https://www.donationcoder.com/forum/index.php?topic=21424.msg192366#msg192366) but I need something automatic, as my "folder1" path doesn't change.
(I have tried to adapt this PowerShell code using its comments, alas unsuccessfully: https://superuser.com/questions/620056/recursively-unzip-files-where-they-reside-then-delete-the-archives/620077 !)

3) Reduce file paths to fewer than 260 characters (because some old programs can't open file paths longer than 260 characters later on). I currently do this manually with "Path Scanner" (http://www.parhelia-tools.com/products/pathscanner/Download.aspx)

4) Delete PDF files smaller than 2 KB (because they are garbage). (I do this manually using the freeware Everything, sorting the PDF files by size.)

5) Run PDFTextChecker (https://www.donationcoder.com/forum/index.php?topic=27311.msg255322#msg255322) on "folder1" by itself (it creates 2 files, "!Not_Searchable.txt" and "!Searchable.txt"). Move the Not_Searchable PDF files to "folder2" and the Searchable files to "folder3" (maybe using this old coding snack https://www.donationcoder.com/forum/index.php?topic=35339.msg330784#msg330784 ?).


Then I use ABBYY FineReader 12 Corporate (not the latest version, which limits the number of pages you can OCR per month!), which automatically OCRs "folder2" every day (be careful if you follow this process, as it sometimes deletes PDF files without warning! So use the option to keep the original files in a separate folder! Then at the end of the month I manually use the freeware "PathSync" (https://www.cockos.com/pathsync/) to check for differences and find those FineReader bugs!)

Many thanks in advance, ;)
Jity2

superboyac:
Then I use ABBYY FineReader 12 Corporate (not the latest version, which limits the number of pages you can OCR per month!), which automatically OCRs "folder2" every day (be careful if you follow this process, as it sometimes deletes PDF files without warning! So use the option to keep the original files in a separate folder! Then at the end of the month I manually use the freeware "PathSync" (https://www.cockos.com/pathsync/) to check for differences and find those FineReader bugs!)
-jity2 (November 14, 2019, 02:41 AM)
--- End quote ---
why is there a limit?  i thought finereader is just a local windows software.  why would there be a monthly limit?  is there a subscription that goes along with it?  I don't remember this at all with finereader, but I haven't looked at it for like 10 years maybe.

I'm thinking of getting X1 Search or something similar to easily search through documents with full fidelity.  I have Archivarius right now, but I wish the output were a little fancier than plain text.

jity2:
why is there a limit?  i thought finereader is just a local windows software.  why would there be a monthly limit?  is there a subscription that goes along with it?  I don't remember this at all with finereader, but I haven't looked at it for like 10 years maybe.

I'm thinking of getting X1 Search or something similar to easily search through documents with full fidelity.  I have Archivarius right now, but I wish the output were a little fancier than plain text.
-superboyac (November 14, 2019, 12:43 PM)
--- End quote ---
See ABBYY FineReader 15 Corporate for individuals: "Automate digitization and conversion routines 5,000 pages/month, 2 cores" https://www.abbyy.com/en-eu/finereader/specifications/

4wd:
The main.bat file (or powershell?) should do the following :

1) delete duplicate files (MD5 checksum?)
(maybe using this PowerShell script https://n3wjack.net/2015/04/06/find-and-delete-duplicate-files-with-just-powershell/ ?)
-jity2 (November 14, 2019, 02:41 AM)
--- End quote ---

Looks good

2) Recursively unzip/unrar the zip and rar files in "folder1" (and delete the originals once done)
I thought of using this old coding snack in AHK (RecurUnZip https://www.donationcoder.com/forum/index.php?topic=21424.msg192366#msg192366) but I need something automatic, as my "folder1" path doesn't change.
(I have tried to adapt this PowerShell code using its comments, alas unsuccessfully: https://superuser.com/questions/620056/recursively-unzip-files-where-they-reside-then-delete-the-archives/620077 !)
--- End quote ---


--- Code: PowerShell ---
<#
  Uses 7Zip4Powershell module: https://www.powershellgallery.com/packages/7Zip4Powershell/1.9.0
  NOTE: You need to use an Admin PowerShell console to install the module.

  .\Extract.ps1 <archive>

  <archive> = full path to archive with quotes if necessary
#>

Param (
  [string]$archive
)

# Below, change the R:\folder1 to point to your particular location
$tempdest = "R:\folder1\$(([io.path]::GetFileNameWithoutExtension($archive)))"
Expand-7Zip -ArchiveFileName $archive -TargetPath $tempdest
Remove-Item "$($archive)" -Force

3) Reduce file paths to fewer than 260 characters (because some old programs can't open file paths longer than 260 characters later on). I currently do this manually with "Path Scanner" (http://www.parhelia-tools.com/products/pathscanner/Download.aspx)
--- End quote ---

How would you determine what the shortened name should be?
Truncate, remove every second character, etc, etc.

You can't use extended-length paths ?
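
If plain truncation turns out to be acceptable, the shortening could look roughly like the sketch below. It assumes Windows-style backslash paths and a limit of 259 characters (MAX_PATH is 260 including the terminating null). `Get-ShortenedPath` is a hypothetical helper name, not part of any tool mentioned in this thread, and it only computes the new path without renaming anything.

```powershell
# Hypothetical helper: trims the base name so the full path fits within $MaxLength.
# Pure string handling, so it also works on paths that are already too long.
function Get-ShortenedPath {
  param (
    [Parameter(Mandatory)][string]$Path,
    [int]$MaxLength = 259   # assumption: MAX_PATH (260) minus the terminating null
  )
  if ($Path.Length -le $MaxLength) { return $Path }

  $sepIndex = $Path.LastIndexOf('\')
  if ($sepIndex -lt 0) { throw "Expected a path containing a folder: $Path" }
  $dir  = $Path.Substring(0, $sepIndex)
  $name = $Path.Substring($sepIndex + 1)

  # Split the file name into base name and extension (extension keeps its dot).
  $dot = $name.LastIndexOf('.')
  if ($dot -gt 0) {
    $base = $name.Substring(0, $dot)
    $ext  = $name.Substring($dot)
  } else {
    $base = $name
    $ext  = ''
  }

  # Characters left for the base name after the folder, the separator and the extension.
  $room = $MaxLength - $dir.Length - 1 - $ext.Length
  if ($room -lt 1) { throw "Folder path is too long to fix by renaming the file: $dir" }

  return "$dir\$($base.Substring(0, $room))$ext"
}
```

Two truncated names can collide in the same folder, so a real rename step would still need a uniqueness check (for example, appending a counter).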

4) Delete PDF files smaller than 2 KB (because they are garbage). (I do this manually using the freeware Everything, sorting the PDF files by size.)
--- End quote ---

Same as you're using here for files <3kb.


--- Code: PowerShell ---
Get-ChildItem $path -Filter *.pdf -recurse -file | ? {$_.length -lt 2048} | % {Remove-Item $_.fullname -Force}

5) Run PDFTextChecker (https://www.donationcoder.com/forum/index.php?topic=27311.msg255322#msg255322) on "folder1" by itself (it creates 2 files, "!Not_Searchable.txt" and "!Searchable.txt"). Move the Not_Searchable PDF files to "folder2" and the Searchable files to "folder3" (maybe using this old coding snack https://www.donationcoder.com/forum/index.php?topic=35339.msg330784#msg330784 ?).
--- End quote ---

Would be a simple matter of parsing the output files and issuing the appropriate Move-Item sub-command - I'm assuming you also want to keep the existing folder structure?
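
As a rough sketch of that parse-and-move step: the thread doesn't show what PDFTextChecker's output files look like, so this assumes each line of "!Searchable.txt" / "!Not_Searchable.txt" is the full path of one PDF. `Move-ListedPdfs` and its parameter names are hypothetical. It keeps the folder structure below the source root.

```powershell
# Hypothetical helper: moves every file listed in $ListFile from $SourceRoot to
# $DestRoot, recreating the same sub-folders under $DestRoot.
# ASSUMPTION: the list file contains one full file path per line.
function Move-ListedPdfs {
  param (
    [Parameter(Mandatory)][string]$ListFile,    # e.g. '!Searchable.txt'
    [Parameter(Mandatory)][string]$SourceRoot,  # e.g. folder1
    [Parameter(Mandatory)][string]$DestRoot     # e.g. folder2 or folder3
  )
  # Normalise the source root so the relative part can be taken with Substring.
  $sep  = [IO.Path]::DirectorySeparatorChar
  $root = (Resolve-Path -LiteralPath $SourceRoot).ProviderPath.TrimEnd([char[]]'\/') + $sep

  foreach ($line in Get-Content -LiteralPath $ListFile) {
    $file = $line.Trim()
    # Skip blank lines and entries that no longer exist.
    if (-not $file -or -not (Test-Path -LiteralPath $file -PathType Leaf)) { continue }

    $full = (Resolve-Path -LiteralPath $file).ProviderPath
    # Ignore anything that is not under the source root.
    if (-not $full.StartsWith($root, [StringComparison]::OrdinalIgnoreCase)) { continue }

    $target    = Join-Path $DestRoot $full.Substring($root.Length)
    $targetDir = Split-Path -Path $target -Parent
    if (-not (Test-Path -LiteralPath $targetDir)) {
      New-Item -Path $targetDir -ItemType Directory -Force | Out-Null
    }
    Move-Item -LiteralPath $full -Destination $target
  }
}
```

It would be called once per list file, for example once with "!Not_Searchable.txt" and "folder2" and once with "!Searchable.txt" and "folder3".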

Then I use ABBYY FineReader 12 Corporate (not the latest version, which limits the number of pages you can OCR per month!), which automatically OCRs "folder2" every day (be careful if you follow this process, as it sometimes deletes PDF files without warning! So use the option to keep the original files in a separate folder!
--- End quote ---

Probably be easier to just clone folder2 to folder2-orig and then let it run on folder2 letting it delete the originals.
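
A minimal sketch of that cloning step, using only built-in cmdlets. `Copy-FolderBackup` is a hypothetical name, and the sketch does not involve FineReader at all; it just copies the tree before the OCR run.

```powershell
# Hypothetical helper: copies everything inside $Source into $Backup
# (creating $Backup if needed) and leaves $Source untouched.
function Copy-FolderBackup {
  param (
    [Parameter(Mandatory)][string]$Source,   # e.g. folder2
    [Parameter(Mandatory)][string]$Backup    # e.g. folder2-orig
  )
  if (-not (Test-Path -LiteralPath $Backup)) {
    New-Item -Path $Backup -ItemType Directory | Out-Null
  }
  # Copy the children rather than the folder itself, so the result is
  # $Backup\file.pdf and not $Backup\folder2\file.pdf.
  Get-ChildItem -LiteralPath $Source -Force |
    Copy-Item -Destination $Backup -Recurse -Force
}
```

Running it right before the daily OCR job would leave folder2-orig as the reference copy to compare against at the end of the month.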


I'll look at patching something together, might be a few days though.  Also, I don't have ABBYY FineReader, so can you give me the various command-line options (both for keeping the originals and for not keeping them)?

4wd:
Requires (in the same folder):
7z.exe
7z.dll
itextsharp.dll

Attached the versions I'm using, (for x64 only).

Set the variables at the start as required.

jityPDFt3.ps1


--- Code: PowerShell ---
<#
  Copy 7z.exe and 7z.dll into the same folder as this script.

  Requires iTextSharp.dll in the same folder as this script - can be extracted from
  itextsharp.5.5.13.1.nupkg by changing extension to .zip - package can be downloaded from
  https://github.com/itext/itextsharp/releases
#>

# Initial folder with archives
$folder1 = 'G:\test\folder1'
# Folder for text based PDFs
$folder2 = 'G:\test\Text'
# Folder for non-text PDFs
$folder3 = 'G:\test\Image'
# Threshold to determine whether PDF is text or image based, equals minimum number of
# lines of text to be detected before PDF is considered text based.
# The threshold applies to the number of lines detected across first 5 pages, (or all
# if less than 5 pages).
$TextPDFThreshold = 20
# Delete extracted archives: $true or $false
# Archives that fail extraction won't be deleted regardless of this setting.
$delArc = $false

<# ------------------- #>

Add-Type -Path ".\itextsharp.dll"

Function Delete-Dupes {
  Get-ChildItem "$($folder1)\*" -File -Include *.zip,*.rar,*.7z,*.pdf | Get-FileHash | Group -Property Hash | Where { $_.count -gt 1 } `
  | % { $_.group | Select -Skip 1 } | Remove-Item -Force
}

Function Extract-Items {
  # Pass 0 handles archives directly in folder1, pass 1 handles archives found recursively
  for ($j = 0; $j -lt 2; $j++) {
    if ($j -ne 1) {
      $files = Get-ChildItem "$($folder1)\*" -File -Include *.zip,*.rar,*.7z
    } else {
      $files = Get-ChildItem "$($folder1)" -Include *.zip,*.rar,*.7z -Recurse
    }
    for ($i = 0; $i -lt $files.Count; $i++) {
      $tempdest = "$(([io.path]::GetDirectoryName($files[$i])))\$(([io.path]::GetFileNameWithoutExtension($files[$i])))"
      & ".\7z.exe" "x" "-y" "$($files[$i])" "-o$tempdest" | Out-Null
      if ($? -and $delArc) {
        Remove-Item "$($files[$i])" -Force
      }
    }
  }
}

Function Delete-SmallPDF {
  param (
    [bool]$delSmall
  )
  if ($delSmall) {
    Get-ChildItem "$($folder1)\*" -Include *.pdf -Recurse | ? {$_.length -lt 2048} | % {Remove-Item $_.fullname -Force}
  }
  # Remove folders that contain no files anywhere below them
  Get-ChildItem "$($folder1)" -recurse | Where {$_.PSIsContainer -and `
    @(Get-ChildItem -Lit $_.Fullname -r | Where {!$_.PSIsContainer}).Length -eq 0} | Remove-Item -recurse
  Get-ChildItem "$($folder2)" -recurse | Where {$_.PSIsContainer -and `
    @(Get-ChildItem -Lit $_.Fullname -r | Where {!$_.PSIsContainer}).Length -eq 0} | Remove-Item -recurse
  Get-ChildItem "$($folder3)" -recurse | Where {$_.PSIsContainer -and `
    @(Get-ChildItem -Lit $_.Fullname -r | Where {!$_.PSIsContainer}).Length -eq 0} | Remove-Item -recurse
}

Function Fix-FileNames {
  $files = Get-ChildItem "$($folder1)\*" -Include *.pdf -Recurse
  for ($i = 0; $i -lt $files.Count; $i++) {
    $j = 0
    $noSpecialChars = (Convert-ToLatinCharacters "$([io.path]::GetFileNameWithoutExtension($files[$i]))") -replace '[\[\]]', '_'
    $tempName = "$(([io.path]::GetDirectoryName($files[$i])))\$($noSpecialChars)"
    $pathLength = ([io.path]::GetDirectoryName($files[$i])).Length
    $totalLength = $tempName.Length + 4
    if ($pathLength -lt 248) {
      if ($totalLength -gt 251) {
        # Clamp the length so Substring can't run past the end of the name
        $fn1 = "$($noSpecialChars.Substring(0, [Math]::Min((251 - $pathLength), $noSpecialChars.Length)))"
      } else {
        $fn1 = "$($noSpecialChars)"
      }
    } else {
      # Folder path itself is too long: skip this file but keep processing the rest
      # (was 'break', which stopped the whole loop at the first such file)
      $fn1 = $null
      continue
    }
    if (($fn1 -ne $null) -and ($fn1 -ne "$([io.path]::GetFileNameWithoutExtension($files[$i]))")) {
      $newName = "$([io.path]::GetDirectoryName($files[$i]))\$($fn1).pdf"
      if (Test-Path $newName) {
        # Name already taken: append _0001, _0002, ... until it is unique
        do {
          $j++
          $k = "{0:0000}" -f $j
          $newName = "$([io.path]::GetDirectoryName($files[$i]))\$($fn1)_$($k).pdf"
        } while (Test-Path $newName)
      }
      Rename-Item -LiteralPath "$($files[$i])" "$($newName)"
    }
  }
}

Function Convert-ToLatinCharacters {
# https://lazywinadmin.com/2015/05/powershell-remove-diacritics-accents.html
# https://lazywinadmin.com/2015/08/powershell-remove-special-characters.html
  param (
    [string]$inputString
  )
  return ([Text.Encoding]::ASCII.GetString([Text.Encoding]::GetEncoding("Cyrillic").GetBytes($inputString)) -replace '[/;]|[^\p{L}\p{Nd}/(/)/_/ \[\]]', '')
}

Function Check-PDF {
# https://superuser.com/questions/1278479/search-pdf-contents-with-powershell-and-output-a-file-list/1278521#1278521
  $files = (Get-ChildItem "$($folder1)\*" -Include *.pdf -Recurse)
  for ($i = 0; $i -lt $files.Count; $i++) {
    if ($files[$i].FullName.Length -lt 260) {
      Write-Host "Processing - $($files[$i]) ..."
      $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $files[$i].FullName
      if ($?) {
        # Count lines of text across the first 5 pages (or all pages if fewer)
        $linesOfText = 0
        for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
          $pageText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page).Split([char]0x000A)
          $linesOfText += $pageText.Count
          if ($page -gt 4) {
            break
          }
        }
        $reader.Close()
        if ($linesOfText -ge $TextPDFThreshold) {
          # Text based PDF: move to folder2, keeping the sub-folder structure
          $outfile = "$($folder2)\$(($files[$i].FullName).Substring($folder1.Length))"
          If(-Not (Test-Path (Split-Path -Path $outfile))) {
            New-Item (Split-Path -Path $outfile) -Type Directory | Out-Null
          }
          Move-Item "$($files[$i])" -Destination "$($outfile)"
        } else {
          # Image based PDF: move to folder3, keeping the sub-folder structure
          $outfile = "$($folder3)\$(($files[$i].FullName).Substring($folder1.Length))"
          If(-Not (Test-Path (Split-Path -Path $outfile))) {
            New-Item (Split-Path -Path $outfile) -Type Directory | Out-Null
          }
          Move-Item "$($files[$i])" -Destination "$($outfile)"
        }
      }
    }
  }
}

Write-Host 'Removing duplicate archives ...'
Delete-Dupes
Write-Host 'Extracting archives ...'
Extract-Items
Write-Host 'Deleting small PDFs and empty folders ...'
Delete-SmallPDF $true
Write-Host 'Removing diacritics, etc, and fix long paths ...'
Fix-FileNames
Write-Host 'Testing for text PDFs ...'
Check-PDF
Write-Host 'Deleting empty folders ...'
Delete-SmallPDF $false
Write-Host 'Finished ...'

No longer requires PDFTextChecker, uses the itextsharp library.
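
To make the threshold rule easier to try out without a PDF at hand, here is a stand-alone sketch of just the decision logic. `Test-TextPdf` and its parameters are hypothetical names, and the page text is passed in as plain strings instead of coming from iTextSharp. As in the script, each page is split on line feeds, so even an empty page counts as one line.

```powershell
# Hypothetical helper: returns $true when the first $MaxPages pages together
# contain at least $Threshold lines of extracted text.
function Test-TextPdf {
  param (
    [string[]]$PageText,     # extracted text of each page, in page order
    [int]$Threshold = 20,    # same default as $TextPDFThreshold in the script
    [int]$MaxPages = 5       # only the first 5 pages are inspected
  )
  $lines = 0
  $count = [Math]::Min($PageText.Count, $MaxPages)
  for ($i = 0; $i -lt $count; $i++) {
    # Split on line feed (0x0A); an empty string still yields one element.
    $lines += $PageText[$i].Split([char]0x000A).Count
  }
  return ($lines -ge $Threshold)
}
```

With the defaults, five blank pages give a count of 5 and are treated as image based, while two pages of ten lines each just reach the threshold and are treated as text based.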
