Author Topic: REQ: CLI : basically unzip + run PDFTextChecker and move resulting files  (Read 18934 times)

jity2

  • Charter Member
  • Joined in 2006
  • Posts: 126
Dear all,

I am trying to speed up my monthly manual archiving work by automating parts of it. I am on Windows 8.1 Home, 64-bit.

[0) I am using SyncBackPro V8 to periodically move files (zip, rar, pdf, jpg and png) to a folder named "folder1".
When this SyncBackPro profile has run, it can run a program (see image 2019-11-14_091418.png) after the above files have been moved ("Run after profile").
I have discovered that I can run a .bat file which itself runs several .bat files (https://stackoverflo...es-within-a-bat-file). ]

The main.bat file (or PowerShell script?) should do the following:

1) delete duplicate files (MD5 checksum?)
(maybe using this powershell script https://n3wjack.net/...ith-just-powershell/ ?)

2) unzip+unrar zip and rar files recursively (and once done delete the originals) in "folder1"
I thought of using this old coding snack in AHK (RecurUnZip https://www.donation....msg192366#msg192366) but I need something automatic, as my "folder1" path doesn't change.
(I have tried to adapt this PowerShell code using its comments, alas unsuccessfully: https://superuser.co...-the-archives/620077 !)

3) Reduce file paths to less than 260 characters (because some old programs can't open file paths that are longer than 260 characters later on). I manually use "Path Scanner" (http://www.parhelia-...canner/Download.aspx)

4) Delete PDF files that are smaller than 2 KB (because they are garbage). (I do this manually by using the freeware Everything and sorting PDF files by size.)

5) Run PDFTextChecker (https://www.donation....msg255322#msg255322) on "folder1" by itself (it creates 2 files, "!Not_Searchable.txt" and "!Searchable.txt"). Move the Not_Searchable PDF files to "folder2" and the Searchable files to "folder3" (maybe using this old coding snack https://www.donation....msg330784#msg330784 ?).


Then I use ABBYY FineReader 12 Corporate (not the latest version, which limits the number of pages you can OCR per month!), which automatically runs OCR on "folder2" every day (be careful if you follow this process, as it sometimes deletes PDF files without warning! So use the option to keep the original files in a separate folder! Then at the end of the month I manually use the freeware "pathsync" (https://www.cockos.com/pathsync/) to check for differences and find those FineReader bugs!)

Many thanks in advance, ;)
Jity2
« Last Edit: November 14, 2019, 04:05 AM by jity2, Reason: spelling »

superboyac

  • Charter Member
  • Joined in 2005
  • Posts: 6,347
> Then I use ABBYY FineReader 12 Corporate (not the latest version, which limits the number of pages you can OCR per month!), which automatically runs OCR on "folder2" every day (be careful if you follow this process, as it sometimes deletes PDF files without warning! So use the option to keep the original files in a separate folder! Then at the end of the month I manually use the freeware "pathsync" (https://www.cockos.com/pathsync/) to check for differences and find those FineReader bugs!)

Why is there a limit? I thought FineReader was just local Windows software. Why would there be a monthly limit? Is there a subscription that goes along with it? I don't remember this at all with FineReader, but I haven't looked at it for maybe 10 years.

I'm thinking of getting X1 Search or something to easily search through documents with full fidelity. I have Archivarius right now, but I wish the output were a little fancier than plain text.

jity2

> Why is there a limit? I thought FineReader was just local Windows software. Why would there be a monthly limit? Is there a subscription that goes along with it? I don't remember this at all with FineReader, but I haven't looked at it for maybe 10 years.
>
> I'm thinking of getting X1 Search or something to easily search through documents with full fidelity. I have Archivarius right now, but I wish the output were a little fancier than plain text.

See ABBYY FineReader 15 Corporate for individuals: "Automate digitization and conversion routines 5,000 pages/month, 2 cores" https://www.abbyy.co...ader/specifications/

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 5,641
> The main.bat file (or PowerShell script?) should do the following:
>
> 1) delete duplicate files (MD5 checksum?)
> (maybe using this powershell script https://n3wjack.net/...ith-just-powershell/ ?)

Looks good
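
For readers who want to see the idea without following the link, a hash-based duplicate check might look like the minimal sketch below. It is not the n3wjack script; the function name `Get-DuplicateFiles` and its parameters are made up for illustration. It only lists the extra copies, so they can be reviewed before anything is deleted.

```powershell
# Hypothetical helper: returns the paths of all but the first file in each group
# of files that share the same hash. Nothing is deleted here.
function Get-DuplicateFiles {
    param(
        [Parameter(Mandatory)][string]$Root,
        [string]$Algorithm = 'MD5'   # MD5 as suggested in the request; SHA256 also works
    )
    Get-ChildItem -LiteralPath $Root -Recurse -File |
        Get-FileHash -Algorithm $Algorithm |
        Group-Object -Property Hash |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object { $_.Group | Select-Object -Skip 1 -ExpandProperty Path }
}

# Example (path is an assumption): review the list first, then delete.
# Get-DuplicateFiles -Root 'R:\folder1' | ForEach-Object { Remove-Item -LiteralPath $_ -Force }
```

Which copy counts as "the first" in each group is simply the order in which Get-ChildItem returns the files, so check the list if it matters which one survives.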

> 2) unzip+unrar zip and rar files recursively (and once done delete the originals) in "folder1"
> I thought of using this old coding snack in AHK (RecurUnZip https://www.donation....msg192366#msg192366) but I need something automatic, as my "folder1" path doesn't change.
> (I have tried to adapt this PowerShell code using its comments, alas unsuccessfully: https://superuser.co...-the-archives/620077 !)

Code: PowerShell
<#
  Uses 7Zip4Powershell module: https://www.powershellgallery.com/packages/7Zip4Powershell/1.9.0

  NOTE: You need to use an Admin PowerShell console to install the module.

  .\Extract.ps1 <archive>

  <archive> = full path to archive, with quotes if necessary
#>

Param (
  [string]$archive
)

# Below, change the R:\folder1 to point to your particular location
$tempdest = "R:\folder1\$([io.path]::GetFileNameWithoutExtension($archive))"
Expand-7Zip -ArchiveFileName $archive -TargetPath $tempdest

# Note: the archive is removed here whether or not the extraction succeeded.
Remove-Item -LiteralPath $archive -Force

> 3) Reduce file paths to less than 260 characters (because some old programs can't open file paths that are longer than 260 characters later on). I manually use "Path Scanner" (http://www.parhelia-...canner/Download.aspx)

How would you determine what the shortened name should be?
Truncate, remove every second character, etc.?

Can't you use extended-length paths?

> 4) Delete PDF files that are smaller than 2 KB (because they are garbage). (I do this manually by using the freeware Everything and sorting PDF files by size.)

Same as you're using here for files <3kb.

Code: PowerShell
# $path is the folder to scan, e.g. the full path to folder1
Get-ChildItem $path -Filter *.pdf -Recurse -File | Where-Object { $_.Length -lt 2048 } | ForEach-Object { Remove-Item -LiteralPath $_.FullName -Force }

> 5) Run PDFTextChecker (https://www.donation....msg255322#msg255322) on "folder1" by itself (it creates 2 files, "!Not_Searchable.txt" and "!Searchable.txt"). Move the Not_Searchable PDF files to "folder2" and the Searchable files to "folder3" (maybe using this old coding snack https://www.donation....msg330784#msg330784 ?).

Would be a simple matter of parsing the output files and issuing the appropriate Move-Item sub-command - I'm assuming you also want to keep the existing folder structure?
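
As a rough illustration of that parse-and-move step, here is a minimal sketch. It assumes each line of the PDFTextChecker output file holds one full path to a PDF, which the thread does not confirm. The function name `Move-ListedFiles` and its parameters are hypothetical. The folder structure below the source root is recreated under the destination.

```powershell
# Hypothetical helper: moves every file named in $ListFile from below $SourceRoot
# to the same relative location below $DestRoot.
# Assumption: the list file contains one full file path per line.
function Move-ListedFiles {
    param(
        [Parameter(Mandatory)][string]$ListFile,    # e.g. the "!Not_Searchable.txt" file
        [Parameter(Mandatory)][string]$SourceRoot,  # e.g. folder1
        [Parameter(Mandatory)][string]$DestRoot     # e.g. folder2
    )
    $root = (Resolve-Path -LiteralPath $SourceRoot).ProviderPath.TrimEnd('\', '/')
    foreach ($line in Get-Content -LiteralPath $ListFile) {
        $file = $line.Trim()
        # Skip blank lines, missing files and anything outside the source root.
        if (-not $file) { continue }
        if (-not (Test-Path -LiteralPath $file -PathType Leaf)) { continue }
        if (-not $file.StartsWith($root, [StringComparison]::OrdinalIgnoreCase)) { continue }

        $relative = $file.Substring($root.Length).TrimStart('\', '/')
        $target   = Join-Path $DestRoot $relative
        # CreateDirectory builds the whole folder chain and does nothing if it exists.
        [IO.Directory]::CreateDirectory([IO.Path]::GetDirectoryName($target)) | Out-Null
        Move-Item -LiteralPath $file -Destination $target
    }
}
```

Usage would be one call per list, for example `Move-ListedFiles -ListFile '...\!Not_Searchable.txt' -SourceRoot $folder1 -DestRoot $folder2` (paths here are placeholders).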

> Then I use ABBYY FineReader 12 Corporate (not the latest version, which limits the number of pages you can OCR per month!), which automatically runs OCR on "folder2" every day (be careful if you follow this process, as it sometimes deletes PDF files without warning! So use the option to keep the original files in a separate folder!

Probably be easier to just clone folder2 to folder2-orig and then let it run on folder2 letting it delete the originals.
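
Cloning the folder before the OCR run could be as simple as the sketch below. The function name `Backup-Folder` is made up, and refusing to overwrite an existing backup is my own assumption about what would be safest here.

```powershell
# Hypothetical helper: copies $Source (with all sub-folders) to a new folder $Backup.
function Backup-Folder {
    param(
        [Parameter(Mandatory)][string]$Source,  # e.g. folder2
        [Parameter(Mandatory)][string]$Backup   # e.g. folder2-orig
    )
    # Stop rather than merge into an older backup.
    if (Test-Path -LiteralPath $Backup) {
        throw "Backup folder already exists: $Backup"
    }
    Copy-Item -LiteralPath $Source -Destination $Backup -Recurse
}
```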


I'll look at patching something together, might be a few days though.  Also, I don't have ABBYY FineReader, so can you give me the various command-line options (both to keep the originals and not)?

4wd

Requires (in the same folder):
7z.exe
7z.dll
itextsharp.dll

Attached the versions I'm using, (for x64 only).

Set the variables at the start as required.

jityPDFt3.ps1

Code: PowerShell
<#
  Copy 7z.exe and 7z.dll into the same folder as this script.

  Requires iTextSharp.dll in the same folder as this script - can be extracted from
  itextsharp.5.5.13.1.nupkg by changing extension to .zip - package can be downloaded from
  https://github.com/itext/itextsharp/releases
#>

# Initial folder with archives
$folder1 = 'G:\test\folder1'
# Folder for text based PDFs
$folder2 = 'G:\test\Text'
# Folder for non-text PDFs
$folder3 = 'G:\test\Image'
# Threshold to determine whether PDF is text or image based, equals minimum number of
# lines of text to be detected before PDF is considered text based.
# The threshold applies to the number of lines detected across first 5 pages, (or all
# if less than 5 pages).
$TextPDFThreshold = 20
# Delete extracted archives: $true or $false
# Archives that fail extraction won't be deleted regardless of this setting.
$delArc = $false
<# ------------------- #>

# Load the DLL and 7z.exe from the script's own folder, not the current directory.
Add-Type -Path (Join-Path $PSScriptRoot 'itextsharp.dll')
$sevenZip = Join-Path $PSScriptRoot '7z.exe'

Function Delete-Dupes {
  # Hash the top-level archives/PDFs and remove all but the first file of each identical group.
  Get-ChildItem "$($folder1)\*" -File -Include *.zip,*.rar,*.7z,*.pdf | Get-FileHash |
    Group-Object -Property Hash | Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group | Select-Object -Skip 1 } | Remove-Item -Force
}

Function Extract-Items {
  # Pass 0 extracts archives in the top level of $folder1; pass 1 recurses, which also
  # picks up archives in sub-folders, including those produced by pass 0.
  for ($j = 0; $j -lt 2; $j++) {
    if ($j -ne 1) {
      $files = @(Get-ChildItem "$($folder1)\*" -File -Include *.zip,*.rar,*.7z)
    } else {
      $files = @(Get-ChildItem "$($folder1)" -File -Include *.zip,*.rar,*.7z -Recurse)
    }
    for ($i = 0; $i -lt $files.Count; $i++) {
      # Extract into a folder named after the archive, next to the archive.
      $tempdest = "$($files[$i].DirectoryName)\$($files[$i].BaseName)"
      & $sevenZip x -y "$($files[$i].FullName)" "-o$tempdest" | Out-Null
      # 7z.exe returns 0 on success; anything else is a warning or an error.
      if (($LASTEXITCODE -eq 0) -and $delArc) {
        Remove-Item -LiteralPath $files[$i].FullName -Force
      }
    }
  }
}

Function Delete-SmallPDF {
  param (
    [bool]$delSmall
  )
  if ($delSmall) {
    Get-ChildItem "$($folder1)\*" -Include *.pdf -Recurse | Where-Object { $_.Length -lt 2048 } |
      ForEach-Object { Remove-Item -LiteralPath $_.FullName -Force }
  }
  # Remove folders that contain no files at any depth.
  foreach ($root in $folder1, $folder2, $folder3) {
    Get-ChildItem "$($root)" -Recurse | Where-Object { $_.PSIsContainer -and
      @(Get-ChildItem -LiteralPath $_.FullName -Recurse | Where-Object { !$_.PSIsContainer }).Length -eq 0 } |
      Remove-Item -Recurse
  }
}

Function Fix-FileNames {
  $files = @(Get-ChildItem "$($folder1)\*" -Include *.pdf -Recurse)
  for ($i = 0; $i -lt $files.Count; $i++) {
    $j = 0
    $dir = $files[$i].DirectoryName
    $baseName = $files[$i].BaseName
    $noSpecialChars = (Convert-ToLatinCharacters $baseName) -replace '[\[\]]', '_'
    $pathLength = $dir.Length
    # Folder path + separator + name + 4 characters for '.pdf'
    $totalLength = $pathLength + 1 + $noSpecialChars.Length + 4
    if ($pathLength -ge 248) {
      # The folder path itself is too long: leave this file alone and carry on with the
      # rest (the original 'break' here stopped processing all remaining files).
      continue
    }
    if ($totalLength -gt 251) {
      # Truncate the name, never asking Substring for more characters than exist.
      $fn1 = $noSpecialChars.Substring(0, [Math]::Min($noSpecialChars.Length, 251 - $pathLength))
    } else {
      $fn1 = $noSpecialChars
    }
    if ($fn1 -ne $baseName) {
      $newName = "$($dir)\$($fn1).pdf"
      if (Test-Path -LiteralPath $newName) {
        # Name already taken: append _0001, _0002, ... until it is unique.
        do {
          $j++
          $k = "{0:0000}" -f $j
          $newName = "$($dir)\$($fn1)_$($k).pdf"
        } while (Test-Path -LiteralPath $newName)
      }
      Rename-Item -LiteralPath $files[$i].FullName -NewName ([io.path]::GetFileName($newName))
    }
  }
}

Function Convert-ToLatinCharacters {
# https://lazywinadmin.com/2015/05/powershell-remove-diacritics-accents.html
# https://lazywinadmin.com/2015/08/powershell-remove-special-characters.html
  param (
    [string]$inputString
  )
  return ([Text.Encoding]::ASCII.GetString([Text.Encoding]::GetEncoding("Cyrillic").GetBytes($inputString)) -replace '[/;]|[^\p{L}\p{Nd}/(/)/_/ \[\]]', '')
}

Function Check-PDF {
# https://superuser.com/questions/1278479/search-pdf-contents-with-powershell-and-output-a-file-list/1278521#1278521
  $files = @(Get-ChildItem "$($folder1)\*" -Include *.pdf -Recurse)
  for ($i = 0; $i -lt $files.Count; $i++) {
    if ($files[$i].FullName.Length -lt 260) {
      Write-Host "Processing - $($files[$i].FullName) ..."
      $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $files[$i].FullName
      if ($?) {
        # Count the lines of text on the first 5 pages (or fewer if the PDF is shorter).
        $linesOfText = 0
        for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
          $pageText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page).Split([char]0x000A)
          $linesOfText += $pageText.Count
          if ($page -gt 4) {
            break
          }
        }
        $reader.Close()
        # Text based PDFs go to $folder2, image based ones to $folder3.
        if ($linesOfText -ge $TextPDFThreshold) { $destRoot = $folder2 } else { $destRoot = $folder3 }
        $outfile = "$($destRoot)\$(($files[$i].FullName).Substring($folder1.Length).TrimStart('\'))"
        $outdir = [io.path]::GetDirectoryName($outfile)
        if (-not (Test-Path -LiteralPath $outdir)) {
          New-Item $outdir -Type Directory | Out-Null
        }
        # -LiteralPath stops '[' and ']' in names being treated as wildcards.
        Move-Item -LiteralPath $files[$i].FullName -Destination $outfile
      }
    }
  }
}


Write-Host 'Removing duplicate archives ...'
Delete-Dupes
Write-Host 'Extracting archives ...'
Extract-Items
Write-Host 'Deleting small PDFs and empty folders ...'
Delete-SmallPDF $true
Write-Host 'Removing diacritics, etc, and fix long paths ...'
Fix-FileNames
Write-Host 'Testing for text PDFs ...'
Check-PDF
Write-Host 'Deleting empty folders ...'
Delete-SmallPDF $false
Write-Host 'Finished ...'

No longer requires PDFTextChecker, uses the itextsharp library.
« Last Edit: January 01, 2020, 11:42 PM by 4wd »

jity2

Hi "4wd",

Wow! Many thanks. ;)

>How would you determine what the shortened name should be?
Truncate is fine as long as it keeps the file extension. ;)
For instance : If a filename is longer than 200 characters (let's say that basically my folder subfolder paths are always less than 60 characters), truncate the last part of the filename to less than 200 characters and keep the extension.
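
The truncation rule described here (shorten the name, keep the extension) could be sketched as below. The function name `Get-TruncatedFileName` and the default of 199 characters (to stay "less than 200") are assumptions for illustration. It works on the file name only, not the folder path.

```powershell
# Hypothetical helper: shortens a file name to at most $MaxLength characters,
# always keeping the extension intact.
function Get-TruncatedFileName {
    param(
        [Parameter(Mandatory)][string]$FileName,
        [int]$MaxLength = 199
    )
    if ($FileName.Length -le $MaxLength) { return $FileName }

    $extension = [IO.Path]::GetExtension($FileName)
    $baseName  = [IO.Path]::GetFileNameWithoutExtension($FileName)
    # Number of base-name characters that still fit in front of the extension.
    $keep = $MaxLength - $extension.Length
    if ($keep -lt 1) { throw "MaxLength $MaxLength leaves no room for extension '$extension'" }
    return $baseName.Substring(0, $keep) + $extension
}
```

Two long names that share their first characters would end up identical after truncation, so a real script still needs a collision check before renaming.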


I did a few tests with your code and found the following:
(note: I had trouble installing https://www.powershe...Zip4Powershell/1.9.0, this helped me: How to Fix Install-Module is missing in PowerShell https://winaero.com/...-missing-powershell/ ;) )

- It doesn't 'unzip' rar files (it is working fine for zip files).
- If possible, do not delete the original zip or rar file if there is an error unzipping.
- It does not find zip files inside zip files (even if I run it twice). Maybe it is because it doesn't look for zip and rar files inside subfolders?


>PDFTextChecker.exe.  This version uses the pdftotext.exe to extract any text from the PDF and then checks the resulting text file for any alphanumeric characters.  If any are found, it considers that searchable.
[The following is very optional: from experience, maybe raise the test from "any alphanumeric characters" to a minimum of 1,000 (?) alphanumeric characters? (In the past I stumbled on some strange PDF files with more than 0 but fewer than 10 characters = first page OCRed and all the other pages not OCRed.) Anyway, nothing is perfect for every kind of PDF file!]
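
A minimum-character test of that kind might look like the following sketch. It works on text that has already been extracted from the PDF (by pdftotext.exe or any other tool). The function name `Test-HasEnoughText` is hypothetical, and the default of 1,000 is just the figure floated above.

```powershell
# Hypothetical helper: $true when the extracted text contains at least
# $MinAlphanumeric letters or digits.
function Test-HasEnoughText {
    param(
        [AllowEmptyString()][string]$Text,
        [int]$MinAlphanumeric = 1000
    )
    # \p{L} = any letter, \p{Nd} = any decimal digit
    $count = [regex]::Matches($Text, '[\p{L}\p{Nd}]').Count
    return ($count -ge $MinAlphanumeric)
}
```

With this rule, an image PDF whose only extractable text is something like a page-number footer falls well below the limit and is treated as not searchable.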

I realize that I also remove some strange characters from the PDF filenames (accents, °, !, +, &, etc.) with the freeware "Bulk Rename Utility", as PDFTextChecker can't check files with those names. I don't know if this is possible in PowerShell?

[ As a side note: in fact I manually do other steps!
I run PDFInfoGUI https://www.dcmember...download/pdfinfogui/. I import "!Not_Searchable.txt" and sort by the "Encrypted" column:
I copy/paste into Excel the PDF files with neither yes nor no in the Encrypted column. Then I run an Excel macro to delete those files, as they are 'buggy' and can't be opened by SumatraPDF.
I copy/paste into Excel the PDF files with no in the Encrypted column, and I replace the paths in the "!Not_Searchable.txt" file. Then I use this https://www.donation....msg330784#msg330784 to move the files. And then I do the OCR with FineReader.
I ignore the PDF files with yes in the Encrypted column. In the past I tried to use a "Pdf Password Remover" with CLI commands. It worked well for most of the files but it destroyed a few (maybe 2%?). I also realized that once decrypted, most of the PDF files were already OCRed! So I decided that it was one step too many! ]



For FineReader, I am sorry I was not clear enough. There is no need to send a CLI command to it, as I use its "Hot Folder" feature, which runs automatically every day, and that is fine for me. ;)
(As a side note: in FineReader I have 3 folders: folder_in (original PDF files), folder_moved (when FineReader has OCRed a file it moves it from in to moved), folder_out (when OCR is done).
Then in pathsync I use the following settings: see 2019-11-15_082720.png. After writing this I realize that I forgot one more step, as this has already happened to me in the past: I need to check whether FineReader deletes a few PDF files without moving them to folder_moved or simply leaving them in folder_in!)


I have just asked skwire if he can help for a CLI PDFTextChecker. ;)

Thanks in advance ;)
« Last Edit: November 15, 2019, 05:00 AM by jity2 »

4wd

> Truncate is fine as long as it keeps the file extension. ;)
> For instance : If a filename is longer than 200 characters (let's say that basically my folder subfolder paths are always less than 60 characters), truncate the last part of the filename to less than 200 characters and keep the extension.

OK, see what I can do.

What happens if the path is longer than 260 characters?

> - It doesn't 'unzip' rar files (it is working fine for zip files).

Strange, works fine with zip and rar files here.
Maybe a later version of the 7z libraries is required for them, or just switch to calling 7zsa.exe instead.

Do you have a rar file you can let me play with?

> - If possible, do not delete the original zip or rar file if there is an error unzipping.

OK, have to see if there is an error code returned.

> - It does not find zip files inside zip files (even if I run it twice). Maybe it is because it doesn't look for zip and rar files inside subfolders?

That's because I missed the word 'recursively' in your OP.  :-[

Guess it'd have to keep looping until there were no archives left or something ... have to think about it.
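
One way to sketch the "keep looping until there are no archives left" idea is shown below. Everything named here (`Expand-AllArchives`, `$Extractor`, `$MaxPasses`) is hypothetical. The default extractor assumes 7z.exe can be found on the PATH and uses its `x`, `-y` and `-o` switches. Unlike the request to keep originals on error only, this sketch always deletes an archive once it has extracted cleanly, because that is what lets the loop finish. Archives that fail are remembered and skipped, so a corrupt file cannot cause an endless loop.

```powershell
# Hypothetical helper: repeatedly extracts every archive below $Root until none are left.
function Expand-AllArchives {
    param(
        [Parameter(Mandatory)][string]$Root,
        # Receives (archive path, destination folder) and must return $true on success.
        [scriptblock]$Extractor = {
            param($archive, $dest)
            & 7z.exe x -y $archive "-o$dest" | Out-Null
            $LASTEXITCODE -eq 0
        },
        [int]$MaxPasses = 10   # safety limit on nesting depth
    )
    $failed = @{}
    for ($pass = 0; $pass -lt $MaxPasses; $pass++) {
        $archives = @(Get-ChildItem -LiteralPath $Root -Recurse -File |
            Where-Object { $_.Extension -in '.zip', '.rar', '.7z' -and -not $failed.ContainsKey($_.FullName) })
        if ($archives.Count -eq 0) { break }

        foreach ($a in $archives) {
            # Extract into a folder named after the archive, next to the archive.
            $dest = Join-Path $a.DirectoryName $a.BaseName
            if (& $Extractor $a.FullName $dest) {
                Remove-Item -LiteralPath $a.FullName -Force
            } else {
                $failed[$a.FullName] = $true   # keep the file, do not retry it
            }
        }
    }
    # Return the archives that could not be extracted.
    $failed.Keys
}
```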

> PDFTextChecker.exe.  This version uses the pdftotext.exe to extract any text from the PDF and then checks the resulting text file for any alphanumeric characters.  If any are found, it considers that searchable.

Yeah, I created an image PDF for a test.  After running pdftotext.exe on it there was a much smaller text file with 'Page 1/1' in it.  So might end up thinking image PDFs are text PDFs due to headers/footers in the PDF.

If there's not likely to be headers/footers then it'd be easy enough to check and act on the result.

> I realize that I also remove some strange characters from the PDF filenames (accents, °, !, +, &, etc.) with the freeware "Bulk Rename Utility", as PDFTextChecker can't check files with those names. I don't know if this is possible in PowerShell?

Should be easy enough by removing any characters not in the old ASCII table.
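
A minimal sketch of that approach, under the assumption that accented letters should become their plain base letter rather than disappear: decompose the string, drop the combining marks, then drop whatever is still outside printable ASCII. The function name `ConvertTo-AsciiName` is made up for illustration.

```powershell
# Hypothetical helper: strips diacritics and any remaining non-ASCII characters.
function ConvertTo-AsciiName {
    param([Parameter(Mandatory)][string]$Name)
    # FormD splits an accented letter into base letter + combining mark.
    $decomposed = $Name.Normalize([Text.NormalizationForm]::FormD)
    # \p{Mn} matches the combining marks; removing them leaves the base letters.
    $noMarks = $decomposed -replace '\p{Mn}', ''
    # Drop anything left outside the printable ASCII range (space to tilde).
    return ($noMarks -replace '[^\x20-\x7E]', '')
}
```

Characters such as !, + and & are plain ASCII and survive this step, so they would need their own `-replace` rule if they should go as well.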

BTW, just as a matter of interest, are any files other than PDFs required?

ie. Extract archives then delete anything that's too small or not a PDF.
« Last Edit: November 15, 2019, 05:22 AM by 4wd »

jity2

Thanks 4wd, ;)

> What happens if the path is longer than 260 characters?

Some old software can't open the file later on (it also adds complexity on some of my hard drives and can cause various errors https://www.donation....msg373167#msg373167).

> Do you have a rar file you can let me play with?

Here is a simple rar file created by "WinRAR 5.71 64-bit" for your tests: https://www.cjoint.com/c/IKprhRLGMRD

Here is what PowerShell tells me when I try to run it with the above rar file (note: it deleted the rar file):
Expand-7Zip : Invalid archive: open/read error! Is it encrypted and a wrong password was provided?
If your archive is an exotic one, it is possible that SevenZipSharp has no signature for its format and thus decided it is TAR by mistake.
At C:\Users\E\Documents\S\jityPDFt3v2.ps1:25 char:5
+     Expand-7Zip -ArchiveFileName $files[$i] -TargetPath $tempdest
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (SevenZip4PowerS...ip+ExpandWorker:ExpandWorker) [Expand-7Zip], SevenZipArchiveException
    + FullyQualifiedErrorId : err01,SevenZip4PowerShell.Expand7Zip


> BTW, just as a matter of interest, are any files other than PDFs required?
>
> ie. Extract archives then delete anything that's too small or not a PDF.

I am not sure that I understand, but I would like to keep all the kinds of files that could be inside the zip files, even if they are not PDF files.

Thanks in advance ;)
Jity2

4wd

> > What happens if the path is longer than 260 characters?
>
> Some old software can't open the file later on (it also adds complexity on some of my hard drives and can cause various errors https://www.donation....msg373167#msg373167).

I meant if the path section of the full path to the file is longer than 260 characters, eg. "R:\a path\that is really\really long, like\over 260 characters\file.pdf"

> Here is a simple rar file created by "WinRAR 5.71 64-bit" for your tests: https://www.cjoint.com/c/IKprhRLGMRD

Well there you go - 7zip can't open that archive, was it created using some strange options?

First one I've come across that it can't handle, (admittedly I stopped using anything but 7z archives a long long time ago though).

Guess I'll have to do a workaround for rar archives and use unrar instead.

> I am not sure that I understand, but I would like to keep all the kinds of files that could be inside the zip files, even if they are not PDF files.

OK, just wondering since small PDFs were deleted and it didn't look like anything other than PDFs were being worked on.

Have updated the script above so it doesn't delete the archive if it gets an error.
« Last Edit: November 15, 2019, 05:25 PM by 4wd »

jity2

Hi 4wd,

Many thanks for your answer. ;)

> I meant if the path section of the full path to the file is longer than 260 characters, eg. "R:\a path\that is really\really long, like\over 260 characters\file.pdf"

Thanks for the explanation. I was only thinking about "filename_with_more_than_260_characters" and not "folder_name_path_with_more_than_260_characters"!
I didn't realize. This is more difficult indeed!
Let's keep it easy: either forget about this step ;) or just truncate "filename_with_more_than_260_characters", and I will manually check for long folder paths at the end of the month just in case. Sorry about the trouble.


> Well there you go - 7zip can't open that archive, was it created using some strange options?
In fact I think the last big change for WinRAR was the new Version 5 in 2017 (https://www.ghacks.n...h-important-changes/). And alas 7Zip4Powershell won't be updated soon (see https://github.com/t...98f1e274d20675cc2e57 and https://github.com/t...f03ed9e4d0aa226a635f).
edit: "7-Zip v15.06 and later support extraction of files in the RAR5 format" https://en.wikipedia.org/wiki/7-Zip
I did a test with an old rar file and it worked.

> Have updated the script above so it doesn't delete the archive if it gets an error.
Thank you. This is working well. ;)

Thanks in advance, ;)
Jity2

« Last Edit: November 16, 2019, 04:45 AM by jity2, Reason: added quote with source »

jity2

Hi again,

Would this help for testing with PowerShell whether a PDF has been OCRed?
https://superuser.com/a/1278521/27956

Thanks in advance, ;)
Jity2

4wd

> In fact I think the last big change for WinRAR was the new Version 5 in 2017 (https://www.ghacks.n...h-important-changes/). And alas 7Zip4Powershell won't be updated soon (see https://github.com/t...98f1e274d20675cc2e57 and https://github.com/t...f03ed9e4d0aa226a635f).
> edit: "7-Zip v15.06 and later support extraction of files in the RAR5 format" https://en.wikipedia.org/wiki/7-Zip
> I did a test with an old rar file and it worked.

I think the problem is with the intermediate DLL it uses, SevenZipSharp, which is actively maintained and did get updated for RAR5: SevenZipSharp

If I can get an updated DLL then it should work OK - might have to reinstall Visual Studio and compile it.

In the meantime I'll get it to use unrar instead.

Changed my mind and just made it use 7z.exe, see the updated script.  Just copy the latest versions of 7z.exe and 7z.dll into the same folder as the script.

Still doesn't do extracted archives ... still thinking about it  :P

You can uninstall the 7Zip4Powershell module by opening an Admin PowerShell console and entering:
Code: PowerShell
Uninstall-Module -Name 7Zip4Powershell
« Last Edit: November 17, 2019, 03:46 AM by 4wd »

jity2

Hi 4wd,

Many thanks for your detailed explanations. ;) Your last code update is working great for 'unzipping' zip and rar files. ;)  :Thmbsup:

(A simple note for those following along: I downloaded this 7-Zip version (https://www.7-zip.org/a/7z1900-x64.exe) and installed it in "C:\Program Files\7-Zip". Then I copied "7z.exe" and "7z.dll" into my working directory.)

> Still doesn't do extracted archives ... still thinking about it  :P

I thought that if I run it twice it would find them (example : folder1\6\6\example.zip)  during the next run (which would be fine) but no.

Thanks in advance ;)
Jity2

4wd

Okey dokey, most of the way there now ...

  • Deletes duplicate archives
  • Extracts archives, (including in sub-folders)
  • Deletes small PDFs and empty folders
  • Checks for text/image based PDFs and moves them into different folders, (recreating folder tree) - all it does is count the number of text lines in the first 5 pages, (or less if there's less than 5 total), and if the number is greater than a set threshold regards it as a text based PDF

Only thing left is the long names bit I think.

Updated
« Last Edit: November 17, 2019, 08:18 PM by 4wd »

jity2

Hi 4wd,
Wow. Many thanks. This is fantastic. ;)

I did some tests and the only small things that are not working are :
- it doesn't delete empty folders of folder1 (if some subfolders are empty or not with other kind of files).

-
Testing for text PDFs ...
Move-Item : Cannot retrieve the dynamic parameters for the cmdlet. The specified wildcard character pattern is not valid: [Lac_ven_Drnyvn,_Sr_ehajoe_Uduizn,_Giles_Suilo-Sm(e-kjd.org).pdf
At C:\Users\E\Documents\test\jityPDFt3v5_7zip.ps1:82 char:7
+       Move-Item "$($files[$i])" -Destination "$($outfile)"
+       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [Move-Item], ParameterBindingException
    + FullyQualifiedErrorId : GetDynamicParametersException,Microsoft.PowerShell.Commands.MoveItemCommand

Then, the test "for text PDFs" stops and the other same kind of files are not processed.

Also if the pdf file has "[" in its filename, it ignores it (but creates an empty folder in folder2 if the pdf is located into a subfolder).

And the great thing is that it finds encrypted PDFs, with or without OCR already done, and moves them accordingly. ;)

Thanks in advance ;)
Jity2

4wd

> Hi 4wd,
> Wow. Many thanks. This is fantastic. ;)

You're welcome  :)

> I did some tests and the only small things that are not working are :
> - it doesn't delete empty folders of folder1 (if some subfolders are empty or not with other kind of files).

It just needs to call the Delete-SmallPDF function again after doing the Check-PDF function.

Not sure whether you mean folder1 should be completely empty at the end or not since files that aren't PDF will be remaining - ie. delete everything in folder1 after running the Check-PDF function.

In which case, what happens to the non-PDF files that were in the folders?
Delete or move with PDF?

Testing for text PDFs ...
Move-Item : Cannot retrieve the dynamic parameters for the cmdlet. The specified wildcard character pattern is not valid: [Lac_ven_Drnyvn,_Sr_ehajoe_Uduizn,_Giles_Suilo-Sm(e-kjd.org).pdf
At C:\Users\E\Documents\test\jityPDFt3v5_7zip.ps1:82 char:7
+       Move-Item "$($files[$i])" -Destination "$($outfile)"
+       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [Move-Item], ParameterBindingException
    + FullyQualifiedErrorId : GetDynamicParametersException,Microsoft.PowerShell.Commands.MoveItemCommand

Then the "Testing for text PDFs" step stops, and the remaining files of that kind are not processed.

Also, if a PDF file has "[" in its filename, the script ignores it (but still creates an empty folder in folder2 if the PDF is located in a subfolder).

Replacing strange characters in the filename was going to be part of the too-long-filename function; I just haven't got there yet.
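As a side note on the Move-Item error itself: the "[" is parsed as the start of a wildcard pattern when the source is passed positionally (i.e. as -Path). A minimal sketch of the usual workaround, assuming the move has no need for wildcards, is to pass the source with -LiteralPath. The folder and file names below are made up for the demonstration and are not taken from the script.

```powershell
# Scratch folders for the demonstration (hypothetical names, created under %TEMP%)
$root = Join-Path ([System.IO.Path]::GetTempPath()) ("lp_demo_" + [guid]::NewGuid().ToString("N"))
$src  = Join-Path $root "folder1"
$dst  = Join-Path $root "folder2"
New-Item -ItemType Directory -Path $src, $dst -Force | Out-Null

# A file name with an unbalanced "[" similar to the one in the error message
$name = "[Lac_ven (e-kjd.org).pdf"
$file = [System.IO.Path]::Combine($src, $name)
[System.IO.File]::WriteAllText($file, "dummy")

# -LiteralPath takes the name exactly as written, so "[" is not treated as a wildcard
Move-Item -LiteralPath $file -Destination $dst

# Existence checks on such names need -LiteralPath as well
$moved = Test-Path -LiteralPath ([System.IO.Path]::Combine($dst, $name))
```

The same applies to any other cmdlet that touches these files (Test-Path, Remove-Item, Get-ChildItem): where a -LiteralPath parameter exists, it avoids the wildcard parsing.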

Can you zip up some of the PDFs it stops on (strange characters, etc.; 3 or 4 would be good)?

Unfortunately, going to be a bit busy this week so might not get back to this until next week, I'll see how I go.

EDIT: OK, besides having a full path less than 260 characters, you can apparently only have a full directory path of 247 characters maximum.

ie.

C:\plus 243 characters\                            (max 247 chars)
C:\plus 243 characters\12345678.pdf        (max 259 chars, i.e. 247 plus a 12-character file name)

Otherwise you get this:
New-Item : The specified path, file name, or both are too long. The fully qualified file name must be less than
260 characters, and the directory name must be less than 248 characters.

ADDENDUM: Give the update a try, it should remove any special characters from file names and in theory it'll shorten filenames if it's the length of the filename that pushes the total path length over the 259 character limit.

If, however, it's the path that exceeds the 247 character limit then the file won't be touched, it'll remain in the initial folder ... so goes the theory :-\

ie.
- If the filename has diacritics and various other strange characters, they'll be removed. (At this point no rename happens.)
  An example: This: François-Xavier!!#@$%^&()_+}{ €$¥£¢ ^$.+()[{ 0123456789.pdf will turn into this: FrancoisXavier()_ Yc () 0123456789.pdf
- If the folder path is less than 248 characters and the full path is 259 characters or less, the file will be renamed.
- If the folder path is less than 248 characters and the full path is greater than 259 characters, the new filename will be truncated and then the file is renamed.
- If the folder path is greater than 247 characters, nothing happens: the file isn't renamed and it remains in the initial folder.
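
The rules above can be sketched roughly as follows. This is only an illustration, not the function from the attached script: the name Get-ShortenedName, its parameters and the whitelist of allowed characters are all assumptions, and it won't reproduce the exact output shown in the example above. It also assumes the folder length is counted without the trailing backslash.

```powershell
# Hypothetical helper: returns a cleaned (and if needed truncated) file name,
# or $null when the folder path alone is already too long.
function Get-ShortenedName {
  param(
    [string]$Directory,      # folder path without trailing backslash
    [string]$FileName,       # file name with extension
    [int]$MaxDir  = 247,     # longest folder path accepted
    [int]$MaxFull = 259      # longest full path accepted
  )
  if ($Directory.Length -gt $MaxDir) { return $null }   # leave the file alone

  $base = [System.IO.Path]::GetFileNameWithoutExtension($FileName)
  $ext  = [System.IO.Path]::GetExtension($FileName)

  # Strip diacritics: decompose the characters, then drop the combining marks
  $decomposed = $base.Normalize([System.Text.NormalizationForm]::FormD)
  $sb = New-Object System.Text.StringBuilder
  foreach ($ch in $decomposed.ToCharArray()) {
    $cat = [System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($ch)
    if ($cat -ne [System.Globalization.UnicodeCategory]::NonSpacingMark) {
      [void]$sb.Append($ch)
    }
  }

  # Assumed whitelist: letters, digits, space, underscore, parentheses, hyphen
  $clean = $sb.ToString() -replace '[^A-Za-z0-9 _()\-]', ''

  # Characters left for the base name: full limit minus folder, backslash and extension
  $room = $MaxFull - $Directory.Length - 1 - $ext.Length
  if ($room -lt 1) { return $null }
  if ($clean.Length -gt $room) { $clean = $clean.Substring(0, $room) }
  return "$clean$ext"
}
```

For instance, Get-ShortenedName -Directory 'C:\folder1' -FileName 'report!!.pdf' drops the exclamation marks, and a call with a folder path longer than 247 characters returns $null so the caller can skip the file.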

I might have to tweak the Get-ChildItem statement in the Check-PDF routine to ignore file paths greater than 259 characters, see how you go.

Currently doesn't check for the existence of a file with the same name before renaming.
« Last Edit: November 20, 2019, 04:24 AM by 4wd »

jity2

Hi 4wd,

Thank you so much for your updated code. ;) This is working great. ;)

Not sure whether you mean folder1 should be completely empty at the end, since files that aren't PDFs will remain, i.e. delete everything in folder1 after running the Check-PDF function.

In which case, what happens to the non-PDF files that were in the folders?
Delete or move with PDF?
Keeping (like you have already done) what is left inside folder1 is fine. ;)


I did some tests and here is a zip file with examples:
The main bug (see "2-itextsharp.pdfa"): when a folder inside folder1 is named "bla.pdf", it causes a "stackoverflow" error in PowerShell (PowerShell closes and restarts when I run the script in edit mode).
The other thing is strange filenames (maybe some Asian text?). Edit 1: it is because on my computer the folder path is too long for some of them! Otherwise it renames them fine! ;)
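
For the folder named "bla.pdf", one possible guard is sketched below. It assumes that Check-PDF collects its candidates with Get-ChildItem and a *.pdf pattern, in which case a folder whose name ends in .pdf matches too. Whether this is really what triggers the stack overflow is only a guess; the sketch just shows how to keep folders out of the list. The scratch folder names are made up.

```powershell
# Scratch folder containing one real file and one folder that looks like a PDF
$root = Join-Path ([System.IO.Path]::GetTempPath()) ("pdfdir_demo_" + [guid]::NewGuid().ToString("N"))
New-Item -ItemType Directory -Path (Join-Path $root "bla.pdf") -Force | Out-Null
[System.IO.File]::WriteAllText((Join-Path $root "real.pdf"), "dummy")

# The name filter alone does not tell files and folders apart
$all = @(Get-ChildItem -LiteralPath $root -Recurse -Filter "*.pdf")

# Keep files only; PSIsContainer is $true for folders
$pdfFiles = @($all | Where-Object { -not $_.PSIsContainer })
```

On PowerShell 3.0 and later, adding the -File switch to Get-ChildItem should have the same effect as the Where-Object filter.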

The other examples are invalid PDF files (see "1-malformed_pdf"; from what I understand there are at least 2 problems: 1) invalid PDF file and 2) the image file format has not been recognized). From experience it is a complex problem, and I think it is better if I handle those by hand with PDFinfoGUI (*) and remove them with an Excel macro, as I can see very quickly whether I need to download some important files again. So please forget about those. ;)
(*) files showing neither "yes" nor "no" in the Encrypted column and the other columns; then I copy the list, except for the important ones
https://filebin.net/...tests.zip?t=6qfxn4l3

Also, many thanks for the detailed explanation of long names. I appreciate it. ;) Your current filename-truncating code is already just fine for me. ;) Thanks. ;)

Currently doesn't check for the existence of a file with the same name before renaming.
If I understand correctly, this matters when, for instance, "folder1\venise-.pdf" would end up with the same name as a file already in folder3 ("folder3\venise.pdf"). Renaming the new one with a counter (for instance "venise1.pdf") would be fine.
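
A rough sketch of such a counter, in case it helps. Get-UniquePath is a made-up helper name, not something from 4wd's script; it only picks a free name and leaves the actual rename or move to the caller.

```powershell
# Hypothetical helper: returns $Path unchanged if nothing exists there yet,
# otherwise appends 1, 2, 3, ... to the base name until the name is free.
function Get-UniquePath {
  param([string]$Path)
  if (-not (Test-Path -LiteralPath $Path)) { return $Path }

  $dir  = [System.IO.Path]::GetDirectoryName($Path)
  $base = [System.IO.Path]::GetFileNameWithoutExtension($Path)
  $ext  = [System.IO.Path]::GetExtension($Path)

  $n = 1
  do {
    $candidate = [System.IO.Path]::Combine($dir, "$base$n$ext")
    $n++
  } while (Test-Path -LiteralPath $candidate)
  return $candidate
}

# Demonstration in a scratch folder (hypothetical names)
$demo = Join-Path ([System.IO.Path]::GetTempPath()) ("uniq_demo_" + [guid]::NewGuid().ToString("N"))
New-Item -ItemType Directory -Path $demo -Force | Out-Null
$wanted = Join-Path $demo "venise.pdf"
[System.IO.File]::WriteAllText($wanted, "dummy")
$free = Get-UniquePath -Path $wanted   # "venise.pdf" is taken, so a numbered name comes back
```

Note that the counter makes the name longer, so the 259-character check would have to run on the numbered name as well.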

I have added a small function in order to delete empty folders in folder3:
Function Delete-SmallPDF2 {
  param (
    [bool]$delSmall
  )
  # Optionally delete PDFs smaller than 2 KB in folder3
  if ($delSmall) {
    Get-ChildItem "$($folder3)\*" -Include *.pdf -Recurse | Where-Object {$_.Length -lt 2048} | ForEach-Object {Remove-Item -LiteralPath $_.FullName -Force}
  }
  # Remove every folder under folder3 that contains no files at any depth
  Get-ChildItem "$($folder3)" -Recurse | Where-Object {$_.PSIsContainer -and `
    @(Get-ChildItem -LiteralPath $_.FullName -Recurse | Where-Object {!$_.PSIsContainer}).Count -eq 0} | Remove-Item -Recurse
}


Edit 2:
I forgot I had this error message :
PS C:\Windows\System32\WindowsPowerShell\v1.0> C:\Users\E\Documents\tests\jityPDF.ps1
Add-Type : Cannot bind parameter 'Path' to the target. Exception setting "Path": "Cannot find path 'C:\Windows\System32\WindowsPowerShell\v1.0\itextsharp.dll' because it does not exist."
So I have copied "itextsharp.dll" into C:\Windows\System32\WindowsPowerShell\v1.0\itextsharp.dll
That may explain why, when I try to use 4wd's code on another hard drive (for example L:\), even after copying the 7z.dll, 7z.exe and itextsharp.dll files and adapting the code to the new locations, the script shows no errors but the Check-PDF part does not run properly: it moves some image-based PDFs to folder2 instead of folder3. So I stay on C:\ ;)


Thanks in advance ;)
Jity2
« Last Edit: November 21, 2019, 12:07 PM by jity2 »

4wd

Edit 2:
I forgot I had this error message :
PS C:\Windows\System32\WindowsPowerShell\v1.0> C:\Users\E\Documents\tests\jityPDF.ps1
Add-Type : Cannot bind parameter 'Path' to the target. Exception setting "Path": "Cannot find path 'C:\Windows\System32\WindowsPowerShell\v1.0\itextsharp.dll' because it does not exist."
So I have copied "itextsharp.dll" into C:\Windows\System32\WindowsPowerShell\v1.0\itextsharp.dll
That may explain why, when I try to use 4wd's code on another hard drive (for example L:\), even after copying the 7z.dll, 7z.exe and itextsharp.dll files and adapting the code to the new locations, the script shows no errors but the Check-PDF part does not run properly: it moves some image-based PDFs to folder2 instead of folder3. So I stay on C:\ ;)

That's because when you start the script from outside the folder it resides in, the working directory is no longer its folder.

PS C:\Windows\System32\WindowsPowerShell\v1.0> C:\Users\E\Documents\tests\jityPDF.ps1

Since you started the script from C:\Windows\System32\WindowsPowerShell\v1.0 (the folder shown in the prompt above), that becomes the working directory, so if a file that's referenced by '.\' isn't found within that folder it'll fail with the above error.

PS C:\Users\E\Documents\tests> C:\Users\E\Documents\tests\jityPDF.ps1
PS C:\Users\E\Documents\tests> .\jityPDF.ps1

Changing your current directory to the same as the script would have let either of the above work OK.

An easier way would be just to use a shortcut and set its "Start in" field to the script's folder once you've set up the folder1/folder2/folder3 variables; then you can just double-click the shortcut.

I've attached an example shortcut.
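
Another option, as a sketch only: build the DLL path from the script's own location instead of '.\', so the working directory no longer matters. $PSScriptRoot is an automatic variable in PowerShell 3.0 and later that holds the folder of the running script. The fallback to the current location and the warning text are my own additions for illustration, and the sketch assumes itextsharp.dll sits next to the script.

```powershell
# Folder of the running script; fall back to the current location when the
# code is pasted into a console and $PSScriptRoot is empty
$scriptDir = if ($PSScriptRoot) { $PSScriptRoot } else { (Get-Location).Path }

# Assumption: itextsharp.dll is kept in the same folder as the script
$dllPath = Join-Path $scriptDir 'itextsharp.dll'

if (Test-Path -LiteralPath $dllPath) {
  Add-Type -Path $dllPath
} else {
  Write-Warning "itextsharp.dll not found in $scriptDir"
}
```

The same idea would apply to 7z.exe and 7z.dll if they are also referenced with '.\'.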

I'll have a look at the other things.