DonationCoder.com Software > Post New Requests Here

REQ: CLI : basically unzip + run PDFTextChecker and move resulting files

jity2:
Hi "4wd",

Wow! Many thanks. ;)

>How would you determine what the shortened name should be?
--- End quote ---
Truncate is fine as long as it keeps the file extension. ;)
For instance : If a filename is longer than 200 characters (let's say that basically my folder subfolder paths are always less than 60 characters), truncate the last part of the filename to less than 200 characters and keep the extension.
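A minimal sketch of that truncation rule, written in Python for illustration rather than in the PowerShell of the script under discussion. The function name `truncate_filename` and the default limit of 200 are assumptions; the only rule taken from the post is "cut the end of the name, keep the extension".

```python
import os


def truncate_filename(name: str, max_len: int = 200) -> str:
    """Shorten `name` to at most `max_len` characters, keeping the extension.

    Hypothetical helper: the limit of 200 is the figure from the post,
    not something PDFTextChecker or the script requires.
    """
    if len(name) <= max_len:
        return name
    stem, ext = os.path.splitext(name)
    keep = max_len - len(ext)  # characters left for the name itself
    if keep < 1:
        raise ValueError("extension alone exceeds the length limit")
    return stem[:keep] + ext


if __name__ == "__main__":
    print(truncate_filename("a" * 300 + ".pdf"))
```

Two long names that differ only near the end would collide after truncation, so a real script would probably need to check whether the shortened name already exists.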


I did a few tests with your code and found the following :
(note: I had trouble installing https://www.powershellgallery.com/packages/7Zip4Powershell/1.9.0, and this helped me: How to Fix Install-Module is missing in PowerShell https://winaero.com/blog/fix-install-module-missing-powershell/ ;) )

- It doesn't 'unzip' rar files (it is working fine for zip files).
- If possible do not delete the original zip or rar file if there is an error unzipping.
- It does not find zip files inside zip files (even if I run it twice). Maybe that is because it doesn't look for zip and rar files inside subfolders?


>PDFTextChecker.exe.  This version uses the pdftotext.exe to extract any text from the PDF and then checks the resulting text file for any alphanumeric characters.  If any are found, it considers that searchable.
--- End quote ---
[The following is very optional: from experience, maybe change "any alphanumeric characters" to a minimum of 1,000 (?) alphanumeric characters. In the past I stumbled on some strange pdf files with more than 0 but fewer than 10 characters (first page OCRed and all the other pages not OCRed). Anyway, nothing is perfect for every kind of pdf file!]

I realize that I also remove some strange characters from the pdf filenames, like accents, °, !, +, &, etc., with the freeware "Bulk Rename Utility", as PDFTextChecker can't check files that contain them. I don't know if this is possible in PowerShell?

[ Just as a side note: in fact, I do some other steps manually!
I run PDFInfoGUI https://www.dcmembers.com/skwire/download/pdfinfogui/, import "!Not_Searchable.txt", and sort by the "Encrypted" column:
I copy/paste into Excel the pdf files with neither yes nor no in the Encrypted column. Then I run an Excel macro to delete those files, as they are 'buggy' and can't be opened by SumatraPDF.
I copy/paste into Excel the pdf files with no in the Encrypted column, and I replace the path of the "!Not_Searchable.txt" file. Then I use this https://www.donationcoder.com/forum/index.php?topic=35339.msg330784#msg330784 to move the files, and then I do the OCR with Finereader.
I ignore the pdf files with yes in the Encrypted column. In the past I tried to use a "Pdf Password Remover" with CLI commands. It worked well for most of the files but it destroyed a few (maybe 2%?). I also realized that, once un-encrypted, most of the pdf files were already OCRed! So I decided that it was one step too many! ]


For Finereader, I am sorry I was not clear enough. There is no need to send a CLI command to it, as I use its "Hot Folder" feature, which runs automatically every day; that is fine for me. ;)
(As a side note: in Finereader I have 3 folders: folder_in (original pdf files), folder_moved (when Finereader has OCRed a file it moves it from in to moved), and folder_out (when OCR is done).
Then in pathsync I use the following settings: see REQ: CLI : basically unzip + run PDFTextChecker and move resulting files. After writing this I realize that I forgot one more step, as this has already happened to me in the past: I need to check whether Finereader deletes a few pdf files without moving them to folder_moved, or simply leaves them in folder_in!)

I have just asked skwire if he can help for a CLI PDFTextChecker. ;)

Thanks in advance ;)

4wd:
Truncate is fine as long as it keeps the file extension. ;)
For instance : If a filename is longer than 200 characters (let's say that basically my folder subfolder paths are always less than 60 characters), truncate the last part of the filename to less than 200 characters and keep the extension.-jity2 (November 15, 2019, 04:27 AM)
--- End quote ---

OK, see what I can do.

What happens if the path is longer than 260 characters?

- It doesn't 'unzip' rar files (it is working fine for zip files).
--- End quote ---

Strange, works fine with zip and rar files here.
Maybe a later version of the 7z libraries is required for them, or I could just switch to calling 7zsa.exe instead.

Do you have a rar file you can let me play with?

- If possible do not delete the original zip or rar file if there is an error unzipping.
--- End quote ---

OK, have to see if there is an error code returned.

- It does not find zip files inside zip files (even if I run it twice). Maybe that is because it doesn't look for zip and rar files inside subfolders?
--- End quote ---

That's because I missed the word 'recursively' in your OP.  :-[

Guess it'd have to keep looping until there were no archives left or something ... have to think about it.
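The "keep looping until there are no archives left" idea could look roughly like the sketch below. It is in Python rather than PowerShell and handles zip files only (Python's standard library has no rar support), so treat it as an outline of the loop, not as the script's actual code. The name `extract_nested_zips` and the choice to extract into a folder named after the archive are assumptions. An archive that fails to open is left on disk and not retried, which also covers the request not to delete the original on error.

```python
import zipfile
from pathlib import Path


def extract_nested_zips(root: Path) -> int:
    """Extract every .zip under `root`, repeating until none are left.

    Returns the number of archives extracted. Archives that raise an
    error are kept on disk and skipped on later passes.
    """
    extracted = 0
    failed = set()  # archives that errored; never deleted, never retried
    while True:
        # Take a snapshot of the archives first, since extraction adds files.
        pending = [p for p in root.rglob("*.zip") if p not in failed]
        if not pending:
            return extracted
        for archive in pending:
            target = archive.with_suffix("")  # folder named after the archive
            try:
                with zipfile.ZipFile(archive) as zf:
                    zf.extractall(target)
            except (zipfile.BadZipFile, OSError):
                failed.add(archive)
                continue
            archive.unlink()  # delete only after a clean extraction
            extracted += 1
```

Each pass picks up archives that the previous pass unpacked, so zip-inside-zip is handled without knowing the nesting depth in advance.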

>PDFTextChecker.exe.  This version uses the pdftotext.exe to extract any text from the PDF and then checks the resulting text file for any alphanumeric characters.  If any are found, it considers that searchable.
--- End quote ---

Yeah, I created an image PDF for a test.  After running pdftotext.exe on it there was a much smaller text file with 'Page 1/1' in it.  So might end up thinking image PDFs are text PDFs due to headers/footers in the PDF.

If there's not likely to be headers/footers then it'd be easy enough to check and act on the result.
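One way to make the check less sensitive to a stray header or footer is the threshold suggested earlier in the thread: require a minimum number of alphanumeric characters instead of any at all. A rough Python sketch follows (not PowerShell, and not how PDFTextChecker itself works); the name `looks_searchable` and the default of 1,000 are assumptions, and `text` is assumed to be the content of the file that pdftotext.exe produced.

```python
def looks_searchable(text: str, min_chars: int = 1000) -> bool:
    """Return True if `text` holds at least `min_chars` alphanumeric characters.

    `text` is assumed to be the output of pdftotext for one PDF; the
    threshold of 1000 is the tentative figure from the thread.
    """
    count = sum(1 for ch in text if ch.isalnum())
    return count >= min_chars


if __name__ == "__main__":
    # An image-only PDF that yields just a footer falls below the threshold.
    print(looks_searchable("Page 1/1"))
```

A fixed threshold will misjudge genuinely short documents, so a per-page minimum might be a better fit if short PDFs are common.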

I realize that I also remove some strange characters from the pdf filenames, like accents, °, !, +, &, etc., with the freeware "Bulk Rename Utility", as PDFTextChecker can't check files that contain them. I don't know if this is possible in PowerShell?
--- End quote ---

Should be easy enough by removing any characters not in the old ASCII table.
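A small sketch of that clean-up, in Python for illustration rather than PowerShell. Note that !, + and & are themselves ASCII, so dropping non-ASCII characters alone would not remove them; the hypothetical `extra` parameter below handles those. The function name `clean_name` is also an assumption.

```python
import unicodedata


def clean_name(name: str, extra: str = "!+&") -> str:
    """Replace accented letters with plain ones and drop unwanted characters.

    `extra` lists ASCII characters to remove as well; its default is
    taken from the examples in the thread and is only an assumption.
    """
    # Split accented letters into base letter + combining mark (é -> e + ´),
    # then discard everything that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_name if ch not in extra)


if __name__ == "__main__":
    print(clean_name("résumé 20°.pdf"))
```

Characters with no ASCII base form, such as °, are simply dropped, so two different names can end up identical after cleaning.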

BTW, just as a matter of interest, are any files other than PDFs required?

ie. Extract archives then delete anything that's too small or not a PDF.

jity2:
Thanks 4wd, ;)

What happens if the path is longer than 260 characters?
--- End quote ---
Some old software can't open the file later on (it also adds complexity on some of my hard drives and can cause various errors: https://www.donationcoder.com/forum/index.php?topic=39992.msg373167#msg373167).

Do you have a rar file you can let me play with?
--- End quote ---
Here is a simple rar file created by "WinRAR 5.71 64bits" for your tests : https://www.cjoint.com/c/IKprhRLGMRD

Here is what PowerShell told me when I tried to run it with the above rar file (note: it deleted the rar file):

Expand-7Zip : Invalid archive: open/read error! Is it encrypted and a wrong password was provided?
If your archive is an exotic one, it is possible that SevenZipSharp has no signature for its format and thus decided it is TAR by mistake.
At C:\Users\E\Documents\S\jityPDFt3v2.ps1:25 char:5
+     Expand-7Zip -ArchiveFileName $files[$i] -TargetPath $tempdest
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (SevenZip4PowerS...ip+ExpandWorker:ExpandWorker) [Expand-7Zip], SevenZipArchiveException
    + FullyQualifiedErrorId : err01,SevenZip4PowerShell.Expand7Zip

BTW, just as a matter of interest, are any files other than PDFs required?

ie. Extract archives then delete anything that's too small or not a PDF.
--- End quote ---

I am not sure that I understand, but I would like to keep all the kinds of files that could be inside the zip files, even if they are not pdf files.

Thanks in advance ;)
Jity2

4wd:
What happens if the path is longer than 260 characters?
--- End quote ---
Some old software can't open the file later on (it also adds complexity on some of my hard drives and can cause various errors: https://www.donation....msg373167#msg373167).-jity2 (November 15, 2019, 11:34 AM)
--- End quote ---

I meant if the path section of the full path to the file is longer than 260 characters, eg. "R:\a path\that is really\really long, like\over 260 characters\file.pdf"

Here is a simple rar file created by "WinRAR 5.71 64bits" for your tests : https://www.cjoint.com/c/IKprhRLGMRD
--- End quote ---

Well there you go - 7zip can't open that archive, was it created using some strange options?

First one I've come across that it can't handle, (admittedly I stopped using anything but 7z archives a long long time ago though).

Guess I'll have to do a workaround for rar archives and use unrar instead.

I am not sure that I understand, but I would like to keep all the kinds of files that could be inside the zip files, even if they are not pdf files.
--- End quote ---

OK, just wondering since small PDFs were deleted and it didn't look like anything other than PDFs were being worked on.

Have updated the script above so it doesn't delete the archive if it gets an error.

jity2:
Hi 4wd,

Many thanks for your answer. ;)

I meant if the path section of the full path to the file is longer than 260 characters, eg. "R:\a path\that is really\really long, like\over 260 characters\file.pdf"
--- End quote ---
Thanks for the explanation. I was only thinking about "filename_with_more_than_260_characters" and not "folder_name_path_with_more_than_260_characters"!
I didn't realize. This is more difficult indeed!
Let's keep it easy: either forget about this step ;) or just truncate "filename_with_more_than_260_characters", and I will manually check for long folder paths at the end of the month, just in case. Sorry about the trouble.
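The manual check for long folder paths could be done with something like this short Python sketch (again not PowerShell, and not part of the script). The name `find_long_paths` is an assumption, and so is treating 260 characters or more as "too long", which is based on the classic Windows MAX_PATH limit.

```python
from pathlib import Path


def find_long_paths(root, limit: int = 260) -> list:
    """Return every path under `root` whose full length is `limit` or more.

    The default of 260 assumes the classic Windows MAX_PATH limit.
    """
    return sorted(str(p) for p in Path(root).rglob("*") if len(str(p)) >= limit)


if __name__ == "__main__":
    for path in find_long_paths("."):
        print(len(path), path)
```

This only reports the offending paths; deciding whether to shorten a folder name or a filename is left to the person reading the list.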


Well there you go - 7zip can't open that archive, was it created using some strange options?
--- End quote ---
In fact I think the last big change for WinRAR was version 5.50 in 2017, which made RAR5 the default archive format (https://www.ghacks.net/2017/08/14/winrar-5-50-released-with-important-changes/). And alas 7Zip4Powershell won't be updated soon (see https://github.com/thoemmi/7Zip4Powershell/commit/7857e2f6a7132d6bc19d98f1e274d20675cc2e57 and https://github.com/thoemmi/7Zip4Powershell/commit/6a2ed494bf8ec305bf4cf03ed9e4d0aa226a635f).
edit: "7-Zip v15.06 and later support extraction of files in the RAR5 format" https://en.wikipedia.org/wiki/7-Zip
I did a test with an old rar file and it worked.
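To tell which rar files are in the newer format before handing them to the extractor, the first bytes of the file can be inspected: RAR 4.x and RAR 5.0 archives start with slightly different signatures. Below is a minimal Python sketch (not PowerShell); the function name `rar_format` is an assumption, and it only identifies the format without extracting anything.

```python
# Signatures at the start of a rar file: "Rar!" followed by version bytes.
RAR4_MAGIC = b"Rar!\x1a\x07\x00"      # RAR 1.5 to 4.x
RAR5_MAGIC = b"Rar!\x1a\x07\x01\x00"  # RAR 5.0 and later


def rar_format(path):
    """Return "RAR5", "RAR4", or None if the file does not start like a rar."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(RAR5_MAGIC):
        return "RAR5"
    if head.startswith(RAR4_MAGIC):
        return "RAR4"
    return None
```

A script could use this to route RAR5 files to unrar and everything else to the 7z library. Self-extracting archives carry the signature further into the file, so this simple check would report None for them.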

Have updated the script above so it doesn't delete the archive if it gets an error.
--- End quote ---
Thank you. This is working well. ;)

Thanks in advance, ;)
Jity2
