Show Posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.


Messages - jity2

1
Screenshot Captor / Re: Scroll capture - Errors and Home button
« on: July 19, 2020, 04:18 AM »
Dear Mouser,

I think I have a similar problem with the scrolling window capture (for some large Google Drive shared pdf files that can't be downloaded).
Maybe an idea: the "save each capture as a separate image" feature is great, but could it be automatic, without keeping all the captured images in RAM at once? That way, no more RAM problem! ;)
I.e. Screenshot Captor saves each screenshot to the hard drive just before scrolling to the next page. ;)
Next, I could split the saved image files all at once, for instance with XnView (see Tools/Batch processing), and make a big pdf out of it. ;)
 
Many thanks in advance ;)


2
Hi 4wd,

Thank you so much for your updated code. ;) This is working great. ;)

Not sure whether you mean folder1 should be completely empty at the end or not since files that aren't PDF will be remaining - ie. delete everything in folder1 after running the Check-PDF function.

In which case, what happens to the non-PDF files that were in the folders?
Delete or move with PDF?
Keeping what is left inside folder1 (as you have already done) is fine. ;)


I did some tests and here is a zip file with examples:
The main bug (see "2-itextsharp.pdfa"): when a folder inside folder1 is named "bla.pdf", it causes a "stackoverflow" bug in PowerShell (it closes PowerShell and restarts it when I run the script in edit mode).
The other thing is strange filenames (maybe some Asian text?). Edit 1: it is because, on my computer, the folder path is too long for some of them! Otherwise it renames them fine! ;)

The others are detections of invalid pdf files (see "1-malformed_pdf"; from what I understand there are at least 2 problems: 1) invalid pdf file and 2) the image file format was not recognized). From experience it is a complex problem, and I think it is better if I handle those by hand with PDFinfoGUI (*) and remove them with an Excel macro, as I can see very quickly whether I need to download some important files again. So please forget about those. ;)
(*) Neither yes nor no in the encrypted column and other columns; then I copy the list, except the important ones.
https://filebin.net/...tests.zip?t=6qfxn4l3

Also, many thanks for the detailed explanations about long names. I appreciate it. ;) Your current truncate-filename code is already just fine for me. ;) Thanks. ;)

Currently doesn't check for the existence of a file with the same name before renaming.
If I understand correctly, it is because "folder1\venise-.pdf" would have the same name as a file already present in folder3, "folder3\venise.pdf". Renaming the new one, for instance with a counter ("venise1.pdf"), would be fine.
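The counter idea could be sketched in PowerShell like this; `Get-UniqueName` is a hypothetical helper name, not part of 4wd's script, and the calling side would pass the destination folder and the proposed file name:

```powershell
# Sketch: find a non-clashing name by appending a counter before the extension.
Function Get-UniqueName {
  param (
    [string]$folder,   # destination folder, e.g. $folder3
    [string]$name      # proposed file name, e.g. 'venise.pdf'
  )
  $base = [System.IO.Path]::GetFileNameWithoutExtension($name)
  $ext  = [System.IO.Path]::GetExtension($name)
  $candidate = $name
  $i = 1
  while (Test-Path -LiteralPath (Join-Path $folder $candidate)) {
    $candidate = "$base$i$ext"   # venise.pdf -> venise1.pdf -> venise2.pdf ...
    $i++
  }
  return $candidate
}
```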

I have added a small function in order to delete empty folders in folder3:
Function Delete-SmallPDF2 {
  param (
    [bool]$delSmall
  )
  if ($delSmall) {
    # Remove PDFs under 2 KB anywhere below folder3
    Get-ChildItem "$($folder3)\*" -Include *.pdf -Recurse |
      Where-Object {$_.Length -lt 2048} |
      ForEach-Object {Remove-Item $_.FullName -Force}
  }
  # Remove folders below folder3 that contain no files at any depth
  Get-ChildItem "$($folder3)" -Recurse |
    Where-Object {$_.PSIsContainer -and
      @(Get-ChildItem -LiteralPath $_.FullName -Recurse | Where-Object {!$_.PSIsContainer}).Length -eq 0} |
    Remove-Item -Recurse
}


Edit 2:
I forgot I had this error message:
PS C:\Windows\System32\WindowsPowerShell\v1.0> C:\Users\E\Documents\tests\jityPDF.ps1
Add-Type : Cannot bind parameter 'Path' to the target. Exception setting "Path": "Cannot find path 'C:\Windows\System32\WindowsPowerShell\v1.0\itextsharp.dll' because it does not exist."
So I copied "itextsharp.dll" into C:\Windows\System32\WindowsPowerShell\v1.0\.
It may explain why, if I try to use 4wd's code on another hard drive (example: L:\), even though I copied the 7z.dll, 7z.exe and itextsharp.dll files and adapted the code for the new locations, the script shows no errors but fails to run the Check-PDF part properly: it moves some image-based pdf files into folder2 instead of folder3. So I stay on C:\ ;)
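As an alternative to copying the DLL into the PowerShell folder, a hedged sketch: resolve the DLL path relative to the script itself via `$PSScriptRoot` (available from PowerShell 3), so the script keeps working when moved to another drive. This assumes itextsharp.dll sits next to the .ps1 file:

```powershell
# Sketch: load itextsharp.dll from the script's own folder instead of the
# current PowerShell working directory.
# $PSScriptRoot (PS3+) is the folder containing the running .ps1 file;
# fall back to the current location when run interactively.
$scriptDir = if ($PSScriptRoot) { $PSScriptRoot } else { (Get-Location).Path }
$dll = Join-Path $scriptDir 'itextsharp.dll'
if (Test-Path -LiteralPath $dll) {
  Add-Type -Path $dll
} else {
  Write-Host "itextsharp.dll not found next to the script: $dll"
}
```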


Thanks in advance ;)
Jity2

3
Hi 4wd,
Wow. Many thanks. This is fantastic. ;)

I did some tests, and the only small things that are not working are:
- it doesn't delete empty folders in folder1 (subfolders that are empty or that contain only other kinds of files).

-
Testing for text PDFs ...
Move-Item : Cannot retrieve the dynamic parameters for the cmdlet. The specified wildcard character pattern is not valid: [Lac_ven_Drnyvn,_Sr_ehajoe_Uduizn,_Giles_Suilo-Sm(e-kjd.org).pdf
At C:\Users\E\Documents\test\jityPDFt3v5_7zip.ps1:82 char:7
+       Move-Item "$($files[$i])" -Destination "$($outfile)"
+       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [Move-Item], ParameterBindingException
    + FullyQualifiedErrorId : GetDynamicParametersException,Microsoft.PowerShell.Commands.MoveItemCommand

Then the "testing for text PDFs" step stops, and the other files of the same kind are not processed.

Also, if a pdf file has "[" in its filename, the script ignores it (but it creates an empty folder in folder2 if the pdf is located in a subfolder).
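For reference, the "[" error above comes from Move-Item's -Path parameter treating square brackets as wildcard characters; -LiteralPath takes the name verbatim. A small self-contained demo (the temp folder and file names here are made up for illustration):

```powershell
# Demo: a file name containing '[' breaks wildcard matching, but
# -LiteralPath moves it without interpreting the brackets.
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) ([guid]::NewGuid())
New-Item $tmp -ItemType Directory | Out-Null
$src = Join-Path $tmp '[Lac_ven]test.pdf'
Set-Content -LiteralPath $src -Value 'dummy'   # -LiteralPath creates it verbatim
$dst = Join-Path $tmp 'moved.pdf'
# In the script this would be: Move-Item -LiteralPath $files[$i] -Destination $outfile
Move-Item -LiteralPath $src -Destination $dst
```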

And the great thing is that it finds encrypted pdfs, with or without OCR already done, and moves them accordingly. ;)

Thanks in advance ;)
Jity2

4
Hi 4wd,

Many thanks for your detailed explanations. ;) Your last code update is working great for 'unzipping' zip and rar files. ;)  :Thmbsup:

(A simple note for those following: I downloaded this 7zip version (https://www.7-zip.org/a/7z1900-x64.exe) and extracted it to "C:\Program Files\7-Zip". Then I copied "7z.exe" and "7z.dll" into my working directory.)

Still doesn't do extracted archives ... still thinking about it  :P

I thought that if I ran it twice it would find them (example: folder1\6\6\example.zip) during the next run (which would be fine), but no.
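One way the nested-archive case could be handled, as a sketch only: keep re-scanning until no archive remains. This uses the built-in Expand-Archive (PowerShell 5+, zip only; rar files would still need 7z.exe), and `Invoke-RecursiveUnzip` is a hypothetical helper name, not 4wd's code:

```powershell
# Sketch: repeat extraction passes until no .zip remains anywhere under $root,
# so archives inside archives get picked up on the next pass.
Function Invoke-RecursiveUnzip {
  param ([string]$root)
  do {
    $archives = @(Get-ChildItem -Path $root -Filter *.zip -Recurse -File)
    foreach ($a in $archives) {
      $dest = Join-Path $a.DirectoryName $a.BaseName
      try {
        Expand-Archive -LiteralPath $a.FullName -DestinationPath $dest -Force
        Remove-Item -LiteralPath $a.FullName   # delete only after success
      } catch {
        Write-Host "Could not extract: $($a.FullName)"
        return   # bail out rather than loop forever on a bad archive
      }
    }
  } while ($archives.Count -gt 0)
}
```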

Thanks in advance ;)
Jity2

5
Hi again,

Would this help for testing with PowerShell whether PDFs have been OCRed?
https://superuser.com/a/1278521/27956

Thanks in advance, ;)
Jity2

6
Hi 4wd,

Many thanks for your answer. ;)

I meant if the path section of the full path to the file is longer than 260 characters, eg. "R:\a path\that is really\really long, like\over 260 characters\file.pdf"
Thanks for the explanation. I was only thinking about "filename_with_more_than_260_characters" and not "folder_path_with_more_than_260_characters"!
I hadn't realized. That is indeed more difficult!
Let's keep it easy: either forget about this step ;) (or just truncate "filename_with_more_than_260_characters", and I will manually check for long folder paths at the end of the month, just in case. Sorry about the trouble.)


Well there you go - 7zip can't open that archive, was it created using some strange options?
In fact I think the last big change for WinRAR was the new version 5 in 2017 (https://www.ghacks.n...h-important-changes/). And alas, 7Zip4Powershell won't be updated soon (see https://github.com/t...98f1e274d20675cc2e57 and https://github.com/t...f03ed9e4d0aa226a635f).
edit: "7-Zip v15.06 and later support extraction of files in the RAR5 format" https://en.wikipedia.org/wiki/7-Zip
I did a test with an old rar file and it worked.

Have updated the script above so it doesn't delete the archive if it gets an error.
Thank you. This is working well. ;)

Thanks in advance, ;)
Jity2


7
Thanks 4wd, ;)

What happens if the path is longer than 260 characters?
Some old software can't open the file later on (it also adds complexity on some of my hard drives and can cause various errors: https://www.donation....msg373167#msg373167).

Do you have a rar file you can let me play with?
Here is a simple rar file created with "WinRAR 5.71 64-bit" for your tests: https://www.cjoint.com/c/IKprhRLGMRD

Here is what PowerShell tells me when I tried to run it on the above rar file (note: it deleted the rar file):
Expand-7Zip : Invalid archive: open/read error! Is it encrypted and a wrong password was provided?
If your archive is an exotic one, it is possible that SevenZipSharp has no signature for its format and thus decided it is TAR by mistake.
At C:\Users\E\Documents\S\jityPDFt3v2.ps1:25 char:5
+     Expand-7Zip -ArchiveFileName $files[$i] -TargetPath $tempdest
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (SevenZip4PowerS...ip+ExpandWorker:ExpandWorker) [Expand-7Zip], SevenZipArchiveException
    + FullyQualifiedErrorId : err01,SevenZip4PowerShell.Expand7Zip


BTW, just as a matter of interest, are any files other than PDFs required?

ie. Extract archives then delete anything that's too small or not a PDF.

I am not sure I understand, but I would like to keep all the kinds of files that could be inside the zip files, even if they are not pdf files.

Thanks in advance ;)
Jity2

8
Hi "4wd",

Wow! Many thanks. ;)

>How would you determine what the shortened name should be?
Truncating is fine as long as it keeps the file extension. ;)
For instance: if a filename is longer than 200 characters (let's say my folder/subfolder paths are basically always less than 60 characters), truncate the last part of the filename to under 200 characters and keep the extension.
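The rule described above could be sketched like this (`Get-TruncatedName` is a hypothetical helper; 4wd's actual implementation may differ):

```powershell
# Sketch: cut the base name so the whole file name stays within $maxLen
# characters, keeping the extension intact.
Function Get-TruncatedName {
  param (
    [string]$name,
    [int]$maxLen = 200
  )
  if ($name.Length -le $maxLen) { return $name }
  $ext  = [System.IO.Path]::GetExtension($name)
  $base = [System.IO.Path]::GetFileNameWithoutExtension($name)
  $keep = $maxLen - $ext.Length   # room left for the base name
  return $base.Substring(0, $keep) + $ext
}
```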


I did a few tests with your code and found the following :
(note: I had trouble installing https://www.powershe...Zip4Powershell/1.9.0; this helped me: How to Fix Install-Module is missing in PowerShell https://winaero.com/...-missing-powershell/ ;) )

- It doesn't 'unzip' rar files (it works fine for zip files).
- If possible, do not delete the original zip or rar file if there is an error unzipping.
- It does not find zip files inside zip files (even if I run it twice). Maybe because it doesn't look for zip and rar files inside subfolders?


>PDFTextChecker.exe.  This version uses the pdftotext.exe to extract any text from the PDF and then checks the resulting text file for any alphanumeric characters.  If any are found, it considers that searchable.
[The following is very optional: from experience, maybe add a step going from "any alphanumeric characters" to a minimum of 1,000 (?) alphanumeric characters. I have stumbled in the past on some strange pdf files with more than 0 but fewer than 10 characters: first page OCRed and all the other pages not OCRed. Anyway, nothing is perfect for every kind of pdf file!]
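The minimum-character idea might be sketched as follows, assuming pdftotext.exe has already written a .txt file for the PDF (`Test-EnoughText` is a hypothetical name, not part of PDFTextChecker, and the 1,000 threshold is the suggestion above):

```powershell
# Sketch: count alphanumeric characters in the extracted text and only treat
# the PDF as searchable when the count reaches a minimum.
Function Test-EnoughText {
  param (
    [string]$txtFile,
    [int]$minChars = 1000
  )
  $text = Get-Content -LiteralPath $txtFile -Raw
  if ($null -eq $text) { return $false }
  # Count only letters and digits, ignoring whitespace and punctuation.
  $count = ([regex]::Matches($text, '[A-Za-z0-9]')).Count
  return ($count -ge $minChars)
}
```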

I realize that I also remove some strange characters from the pdf filenames, like accents, °, !, +, &, etc., with the freeware "Bulk Rename Utility", as PDFTextChecker can't check files with them. I don't know if this is possible in PowerShell?
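It does seem possible in PowerShell, as a sketch: decompose accented characters with Unicode normalization (FormD), drop the combining marks, then strip the remaining special characters. `Remove-StrangeChars` is a hypothetical helper, and the exact character whitelist here is an assumption to adjust:

```powershell
# Sketch: 'café °test!.pdf' -> 'cafe test.pdf'
Function Remove-StrangeChars {
  param ([string]$name)
  # Decompose accented characters: é becomes e + a combining accent mark.
  $formD = $name.Normalize([Text.NormalizationForm]::FormD)
  # Drop the combining diacritical marks left over from decomposition.
  $noAccents = $formD -replace '\p{Mn}', ''
  # Keep letters, digits, dot, dash, underscore and spaces; drop °, !, +, & ...
  return ($noAccents -replace '[^\w\.\- ]', '')
}
```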

[ Just for the record: in fact I do other manual steps!
I run PDFInfoGUI https://www.dcmember...download/pdfinfogui/, import "!Not_Searchable.txt", and sort by the "Encrypted" column:
I copy/paste into Excel the pdf files with neither yes nor no in the encrypted column. Then I run an Excel macro to delete those files, as they are 'buggy' and can't be opened by SumatraPDF.
I copy/paste into Excel the pdf files with no in the encrypted column. I replace the path in the "!Not_Searchable.txt" file, then use this https://www.donation....msg330784#msg330784 to move the files. And then I do the OCR with Finereader.
I ignore the pdf files with yes in the encrypted column. In the past I tried a "Pdf Password Remover" with CLI commands. It worked well for most files but destroyed a few (maybe 2%?). I also realized that, once decrypted, most of those pdf files were already OCRed! So I decided it was one step too many! ]



Regarding Finereader, I am sorry I was not clear enough. There is no need to send it a CLI command, as I use its "Hot Folder" feature, which runs automatically every day; that is fine for me. ;)
(For the record: in Finereader I have 3 folders: folder_in (original pdf files), folder_moved (when Finereader has OCRed a file, it moves it from in to moved), folder_out (when OCR is done).
Then in pathsync I use the following settings: see 2019-11-15_082720.png. After writing this, I realize I forgot one more step, as it has already happened to me in the past: I need to check whether Finereader deletes a few pdf files without moving them to folder_moved, or simply leaves them in folder_in!)


I have just asked skwire if he can help with a CLI version of PDFTextChecker. ;)

Thanks in advance ;)

9
Then I use ABBYY Finereader 12 Corporate (not the latest version, which limits the number of pages you can OCR per month!), which automatically OCRs "folder2" every day. (Be careful if you follow this process, as it sometimes deletes pdf files without warning! So use the options to keep original files in a separate folder! Then at the end of the month I manually use the freeware "pathsync" (https://www.cockos.com/pathsync/) to check differences and find those Finereader bugs!)
why is there a limit? I thought Finereader was just local Windows software. Why would there be a monthly limit? Is there a subscription that goes with it? I don't remember this at all with Finereader, but I haven't looked at it for maybe 10 years.

I'm thinking of getting X1 Search or something to easily search through documents with full fidelity. I have Archivarius right now, but I wish the output were a little fancier than plain text.
See Abbyy Finereader 15 Corporate for individuals: "Automate digitization and conversion routines 5,000 pages/month, 2 cores" https://www.abbyy.co...ader/specifications/

10
Dear all,

I am trying to speed up my monthly manual archiving work by automating some things. I am on Win 8.1 64-bit Home.

[0) I am using SyncBackPro V8 to periodically move files (zip, rar, pdf, jpg and png) to a folder named "folder1".
When this SyncBackPro profile has run, it can run a program after the above files have been moved ("Run after profile"; see image 2019-11-14_091418.png).
I have discovered that I can run a .bat file which itself runs several bat files (https://stackoverflo...es-within-a-bat-file). ]

The main.bat file (or PowerShell?) should do the following:

1) delete duplicate files (MD5 checksum?)
(maybe using this powershell script https://n3wjack.net/...ith-just-powershell/ ?)
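Step 1 might be sketched in PowerShell like this (`Remove-DuplicateFiles` is a hypothetical helper name; it uses SHA256 via Get-FileHash, available since PowerShell 4, though `-Algorithm MD5` would work the same way):

```powershell
# Sketch: group files by content hash and delete all but the first file in
# each group of duplicates.
Function Remove-DuplicateFiles {
  param ([string]$root)
  Get-ChildItem -Path $root -Recurse -File |
    Get-FileHash -Algorithm SHA256 |
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 } |                  # only hashes seen twice+
    ForEach-Object { $_.Group | Select-Object -Skip 1 } |  # keep the first copy
    ForEach-Object { Remove-Item -LiteralPath $_.Path }
}
```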

2) Recursively unzip/unrar the zip and rar files in "folder1" (and delete the originals once done).
I thought of using this old coding snack in AHK (RecurUnZip https://www.donation....msg192366#msg192366), but I need something automatic, as my "folder1" path doesn't change.
(I have tried to adapt this powershell code with its comments, alas unsuccessfully: https://superuser.co...-the-archives/620077 !)

3) Reduce file paths to fewer than 260 characters (because some old programs can't later open paths longer than 260 characters). I manually use "Path Scanner" (http://www.parhelia-...canner/Download.aspx).
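The detection half of step 3 can be sketched in PowerShell (`Get-LongPaths` is a hypothetical helper name; note that on older Windows versions Get-ChildItem itself can fail on paths past the 260-character MAX_PATH limit, so this is only a rough check):

```powershell
# Sketch: list files whose full path exceeds a limit, like the manual
# "Path Scanner" check; the limit is a parameter so it can be tested easily.
Function Get-LongPaths {
  param (
    [string]$root,
    [int]$limit = 260
  )
  Get-ChildItem -Path $root -Recurse -File |
    Where-Object { $_.FullName.Length -gt $limit } |
    Select-Object -ExpandProperty FullName
}
```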

4) Delete pdf files smaller than 2 KB (because they are garbage). (I currently do this manually with the freeware Everything, sorting pdf files by size.)

5) Run PDFTextChecker (https://www.donation....msg255322#msg255322) on "folder1" by itself (it creates 2 files, "!Not_Searchable.txt" and "!Searchable.txt"). Move the not-searchable pdf files to "folder2" and the searchable files to "folder3" (maybe using this old coding snack https://www.donation....msg330784#msg330784 ?).


Then I use ABBYY Finereader 12 Corporate (not the latest version, which limits the number of pages you can OCR per month!), which automatically OCRs "folder2" every day. (Be careful if you follow this process, as it sometimes deletes pdf files without warning! So use the options to keep original files in a separate folder! Then at the end of the month I manually use the freeware "pathsync" (https://www.cockos.com/pathsync/) to check differences and find those Finereader bugs!)

Many thanks in advance, ;)
Jity2

11
I have tested your update and it is so great, 4wd! ;)
Many thanks. ;)

Inspired by https://stackoverflo...fore-printing-to-pdf, I have added for Chrome (see row #49):
Code: PowerShell
$args = "`"$($inFile)`" --headless --run-all-compositor-stages-before-draw --virtual-time-budget=10000 --print-to-pdf=`"$($outFile)`""
For me it doesn't seem to change anything! Maybe this will help someone in the future... ;)

Thank you again ;)
See ya

12
Wow ! Many thanks 4wd. This is working like a charm ! ;)

After several tests, I realize that I have to close your script(s) each night even if it has not finished its job.
Note: wkhtmltopdf uses very little CPU and Chrome far more, so I run several copies of your script at the same time (especially for folders containing htm and html files).
And the next day, when I reopen the script(s), it would be great if it could avoid converting to pdf when the pdf file already exists in the destination folder.
Currently I have to manually dig through and move the files and specific folders in order to avoid spending a few hours just to continue where it stopped.

I have tried to insert this code at row #43:

Code: PowerShell
if {($inFile.Substring([Math]::Max($inFile.Length - 3, 0))) = $outFile
next i
}

After testing it, alas, this proves that... I am still not a coder!! ;)
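For the record, a working version of the intended skip-if-exists check, sketched with a stand-in temp folder instead of the real conversion loop (in the actual script, the `Test-Path` line would go right after `$outFile` is computed):

```powershell
# Demo: pretend 'a.htm' was already converted yesterday; the check skips it
# and only 'b.htm' would be sent to wkhtmltopdf/Chrome.
$dstFolder = Join-Path ([System.IO.Path]::GetTempPath()) ([guid]::NewGuid())
New-Item $dstFolder -ItemType Directory | Out-Null
New-Item (Join-Path $dstFolder 'a.htm.pdf') -ItemType File | Out-Null
$converted = @()
foreach ($name in @('a.htm', 'b.htm')) {
  $outFile = Join-Path $dstFolder "$name.pdf"
  if (Test-Path -LiteralPath $outFile) { continue }  # already done: skip it
  $converted += $name   # the real script would run the converter here
}
```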

Many thanks in advance ;)

13
Wow this is fantastic 4wd ! ;)
For the second part, I have seen no visible difference when adding the "working directory".

Many thanks. Much appreciated. ;)

So as a summary:
In order to convert htm and html files to pdf, use this PowerShell script written by 4wd (if PowerShell is new to you, have a look here: https://www.donation....msg399588#msg399588):
Code: PowerShell
<#
  CTP.ps1

  Recursively convert *.htm and *.html to PDF + exclude htm and html files that are smaller than 3kb.
#>

Function Get-Folder {
  Add-Type -AssemblyName System.Windows.Forms
  $FolderBrowser = New-Object System.Windows.Forms.FolderBrowserDialog
  [void]$FolderBrowser.ShowDialog()
  $temp = $FolderBrowser.SelectedPath
  If($temp -eq '') {Exit}
  If(-Not $temp.EndsWith('\')) {$temp = $temp + '\'}
  Return $temp
}

If($PSVersionTable.PSVersion.Major -lt 3) {
  Write-Host '** Script requires at least Powershell V3 **'
} else {
  Write-Host 'Choose folder with HTML files: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  $srcFolder = (Get-Folder)
  Write-Host $srcFolder
  Write-Host 'Choose output folder: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  Do {$dstFolder = (Get-Folder)} While($dstFolder -eq $srcFolder)
  Write-Host $dstFolder

  $aFiles = (Get-ChildItem -Include *.html,*.htm -Path ($srcFolder + "*") -Recurse | Where-Object {$_.Length -gt 3kb} )
  for($i = 0; $i -lt $aFiles.Count; $i++) {
    $inFile = [string]$aFiles[$i]
    Write-Host 'File:' $inFile -BackgroundColor DarkBlue -ForegroundColor Yellow
    $outFile = $dstFolder + $inFile.Replace($srcFolder, "") + '.pdf'
    $temp = Split-Path $outFile -Parent
    if (!(Test-Path $temp)) {
      New-Item $temp -ItemType Directory | Out-Null
    }
    $args = "`"$($infile)`" -p 127.0.0.1 `"$($outFile)`""
    Start-Process -FilePath ".\wkhtmltopdf.exe" -Wait -NoNewWindow -ArgumentList $args
  }
}


In order to convert mht files to pdf (mht files created with Google Chrome), use this PowerShell script written by 4wd:
Code: PowerShell
<#
  CTP.ps1

  Recursively convert *.mht to PDF.
#>

Function Get-Folder {
  Add-Type -AssemblyName System.Windows.Forms
  $FolderBrowser = New-Object System.Windows.Forms.FolderBrowserDialog
  [void]$FolderBrowser.ShowDialog()
  $temp = $FolderBrowser.SelectedPath
  If($temp -eq '') {Exit}
  If(-Not $temp.EndsWith('\')) {$temp = $temp + '\'}
  Return $temp
}

If($PSVersionTable.PSVersion.Major -lt 3) {
  Write-Host '** Script requires at least Powershell V3 **'
} else {
  Write-Host 'Choose folder with MHT files: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  $srcFolder = (Get-Folder)
  Write-Host $srcFolder
  Write-Host 'Choose output folder: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  Do {$dstFolder = (Get-Folder)} While($dstFolder -eq $srcFolder)
  Write-Host $dstFolder

  $aFiles = (Get-ChildItem -Include *.mht -Path ($srcFolder + "*") -Recurse)
  for($i = 0; $i -lt $aFiles.Count; $i++) {
    $inFile = [string]$aFiles[$i]
    Write-Host 'File:' $inFile -BackgroundColor DarkBlue -ForegroundColor Yellow
    $outFile = $dstFolder + $inFile.Replace($srcFolder, "") + '.pdf'
    $temp = Split-Path $outFile -Parent
    if (!(Test-Path $temp)) {
      New-Item $temp -ItemType Directory | Out-Null
    }
    $args = "`"$($infile)`" --headless --print-to-pdf=`"$($outFile)`""
    Start-Process -FilePath "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" -Wait -NoNewWindow -ArgumentList $args -WorkingDirectory "C:\Program Files (x86)\Google\Chrome\Application"
  }
}


Many thanks again to 4wd for the great help. ;)
See ya

14
Hi,
Many thanks, 4wd. ;) Much appreciated. ;)

I did some tests (one old month saved with IE and one old month saved with Firefox) with your updated script, and it works fine for html and htm files at the same time. ;)
As I have quite a lot of files, converting the htm and html files should run for about a few months! But it seems I can speed up the conversion if I run several (copied and renamed, shortcut included) PowerShell instances. ;)
2019-03-16_103749.png


I have added a few manual steps :
Modified from https://stackoverflo...irectory-recursively, here is the PowerShell code I use to remove the pdf files smaller than 3 KB (created by wkhtmltopdf; in my case they contain no text):
Get-ChildItem $path -Filter *.pdf -Recurse -File | Where-Object {$_.Length -lt 3000} | ForEach-Object {Remove-Item $_.FullName}

Then I use the freeware "Remove Empty Directories" (http://www.jonasjohn.de/red.htm), which removes all empty directories in all the subfolders. It is very powerful IMHO. ;)

I don't know if this is possible, but it would be great if the PowerShell script could exclude converting htm and html files smaller than 3 KB. Thanks in advance ;)


You also have a great memory for my 2016 request. ;) But I must acknowledge that I wouldn't have been able to modify the 10% left in the new code!!!

For mht files to pdf:
Thanks for the mht-to-html link. It reminds me that finding a simple solution can lead to many trials!
Mine were not created with Internet Explorer; they were and are created with Google Chrome. In my manual tests, these mht files are often displayed better in Chrome than in IE.

Here are my tests :
[windows+R]
cmd
cd C:\Program Files (x86)\Google\Chrome\Application

Apparently Chrome also understands it if I change the command from
chrome --headless --print-to-pdf="C:\result\20170619_075623.pdf" "C:\source\t2\20170619_075623.htm"
to
chrome --headless "C:\source\t2\20170619_075623.htm" --print-to-pdf="C:\result\20170619_075623.pdf"

And after some tests (thanks to https://www.autohotk...iewtopic.php?t=26819), it helped me get working code! It works ;) but it copies other files (png...) into the target folder!

2019-03-16_124647.png:
WorKingDir := "C:\Program Files (x86)\Google\Chrome\Application"      
pdParams := "chrome.exe --headless "
FileSelectFolder,SourcePath,,0,Select Source Folder
If SourcePath =
ExitApp

FileSelectFolder,TargetPath,*%SourcePath%,0,Select Target Folder
If TargetPath =
ExitApp

RunWait % comspec " /c xCopy """ SourcePath A_loopField """ """ TargetPath A_loopField """ *.mht /s /i /y",, Hide


Loop, Files, % TargetPath "\*.mht" , R
   {       
   SplitPath, A_LoopFileFullPath, name, dir, ext, name_no_ext
   outPDF_repared :=  dir "\" name_no_ext "" ".pdf"
   pCmd := pdParams " " """"  A_LoopFileFullPath """"  " " """" "--print-to-pdf="outPDF_repared """"   
   RunWait % comspec " /c " pCmd , % WorKingDir , Hide
   FileAppend, % "Result pdrepair`n" outPDF_repared "`n", % A_Temp "\LOG_pdrepair.txt"
   FileRead, outLOG, % TargetPath "\LOG.txt"
   FileAppend, % outLOG "`n" , % A_Temp "\LOG_pdrepair.txt"
   FileDelete, % A_LoopFileFullPath
   }                             

Msgbox 0x40000,, % "END!",1                                           

ExitApp


I have tried to change:
RunWait % comspec " /c xCopy """ SourcePath A_loopField """ """ TargetPath A_loopField """ /s /i /y",, Hide
with
RunWait % comspec " /c xCopy """ SourcePath A_loopField """ """ TargetPath A_loopField """ *.mht /s /i /y",, Hide
or
RunWait % comspec " /c xCopy """ SourcePath A_loopField """ *.mht """ TargetPath A_loopField """ /s /i /y",, Hide
or
RunWait % comspec " /c xCopy *.mht """ SourcePath A_loopField """ """ TargetPath A_loopField """ /s /i /y",, Hide
or
RunWait % comspec " /c xCopy "\*.mht" """ SourcePath A_loopField """ """ TargetPath A_loopField """ /s /i /y",, Hide
Alas I am stuck !


So I have tried to modify your Powershell script :
<#
  CTP.ps1
 
  Recursively convert *.mht to PDF.
#>
 
Function Get-Folder {
  Add-Type -AssemblyName System.Windows.Forms
  $FolderBrowser = New-Object System.Windows.Forms.FolderBrowserDialog
  [void]$FolderBrowser.ShowDialog()
  $temp = $FolderBrowser.SelectedPath
  If($temp -eq '') {Exit}
  If(-Not $temp.EndsWith('\')) {$temp = $temp + '\'}
  Return $temp
}
 
If($PSVersionTable.PSVersion.Major -lt 3) {
  Write-Host '** Script requires at least Powershell V3 **'
} else {
  Write-Host 'Choose folder with PDF files: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  $srcFolder = (Get-Folder)
  Write-Host $srcFolder
  Write-Host 'Choose output folder: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  Do {$dstFolder = (Get-Folder)} While($dstFolder -eq $srcFolder)
  Write-Host $dstFolder
 
  $aFiles = (Get-ChildItem -Include *.mht -Path ($srcFolder + "*") -Recurse)
  for($i = 0; $i -lt $aFiles.Count; $i++) {
    $inFile = [string]$aFiles[$i]
    Write-Host 'File:' $inFile -BackgroundColor DarkBlue -ForegroundColor Yellow
    $outFile = $dstFolder + $inFile.Replace($srcFolder, "") + '.pdf'
    $temp = Split-Path $outFile -Parent
    if (!(Test-Path $temp)) {
      New-Item $temp -ItemType Directory | Out-Null
    }
    $args = "`"$($infile)`" chrome --headless --print-to-pdf=`"$($outFile)`""
    Start-Process -FilePath "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" -Wait -NoNewWindow -ArgumentList $args
  }
}

See "--print-to-pdf=". Alas this doesn't work !


@lainB: I am saving mht files like you now (https://www.donation....msg417446#msg417446). I no longer use Google Docs for uploaded files (I used to save htm files from Firefox, but I have stopped). I now mainly just upload pdf files to Google Drive.


Many thanks in advance ;)

15
Wow 4wd! Thank you ! ;)
It is working for html and htm, but not for mht, as wkhtmltopdf can't convert mht to pdf.

Sorry, bit busy prepping to go overseas atm, if I have time in the next day or two I'll clean it up.
No problem. By that time I will do some tests. Many thanks in advance. ;)

16
I just tested with Chrome Headless browser :
(Adapted from https://superuser.com/a/1211603/27956)
[windows+R]
cmd
cd C:\Program Files (x86)\Google\Chrome\Application
chrome --headless --print-to-pdf="C:\result\20170619_075623.pdf" "C:\source\t2\20170619_075623.htm"

I now need to try to adapt the above ahk script.
Edit: Here is what I have tried, but I am stuck!

WorKingDir := "C:\Program Files (x86)\Google\Chrome\Application"      
pdParams := "chrome.exe --headless --print-to-pdf= "
FileSelectFolder,SourcePath,,0,Select Source Folder
If SourcePath =
ExitApp

FileSelectFolder,TargetPath,*%SourcePath%,0,Select Target Folder
If TargetPath =
ExitApp

If (SourcePath==TargetPath){
  Msgbox 0x40000,, % "SourcePath and TargetPath cant be the same "    TargetPath
ExitApp
}

Loop, Files, %SourcePath%\*.htm, R
   {
SplitPath, A_LoopFileFullPath, , , , OutNameNoExt
pCmd := pdParams """"  A_LoopFileFullPath """"  " " """" TargetPath "\" OutNameNoExt "_.pdf" """"   
RunWait % comspec " /c " pCmd , % WorKingDir
   }
ExitApp

Thanks in advance ;)

17
I did some Pandoc tests, but I keep getting the same error:

Code example:
pandoc C:\prog\pandoc\pandoc-2.7.1-windows-x86_64\t5\20170611_074645.htm -t latex --pdf-engine=xelatex -s -o C:\prog\pandoc\pandoc-2.7.1-windows-x86_64\t5\20170611_074645.pdf

Error: see 2019-03-15_103115.png
I did some Google searches but I am stuck!

Thanks in advance ;)

18
Hi,
Thanks Shades. I am trying PanDoc right now !

@Tomos: It is just that, if there is some text content inside the html, it should remain text that can be read (or later indexed by Google Drive) in the pdf file.
There is no need for a program to OCR the image files contained inside the html files.
I am not sure I am being clear! But I don't want an image-only pdf file as a result.

Thanks in advance ;)

19
Dear all,

I would like to convert the html files in my archives into pdf (OCRed text + related locally saved images included), so I can do keyword searches on them with Google Drive.
I need the related images saved with the html file (usually in a related folder) to be included, as well as the URLs available in the html. I would prefer a tool that works in offline mode: all the info I have is already saved in the local html files, so it shouldn't spend time trying all the missing URLs.
 

My archives of about 20 years (many thousands of files) were mostly saved with Internet Explorer (Maxthon) for a few years, then mostly with Firefox, and now with Google Chrome (and httrack).

The idea: I choose one big "Source" folder (usually one month of archives); it scans all the html, htm and mht files in all the subfolders by itself, then creates in a big "Target" folder all the converted pdf files, with the original names, in the same subfolders.
Example :
C:\Source\2009\2019-04\15\2009_04_15_075256.html
...
C:\Target\2009\2019-04\15\2009_04_15_075256.pdf



I have tried to use wkhtmltopdf, which is based on WebKit (Safari; https://github.com/w...tmltopdf/issues/3163), with the following script (based on an old AHK script found here):

WorKingDir := "C:\prog\wkhtmltox\bin"      
pdParams := "wkhtmltopdf.exe "
FileSelectFolder,SourcePath,,0,Select Source Folder
If SourcePath =
ExitApp

FileSelectFolder,TargetPath,*%SourcePath%,0,Select Target Folder
If TargetPath =
ExitApp

If (SourcePath==TargetPath){
  Msgbox 0x40000,, % "SourcePath and TargetPath cant be the same "    TargetPath
ExitApp
}

Loop, Files, %SourcePath%\*.htm, R
   {
SplitPath, A_LoopFileFullPath, , , , OutNameNoExt
pCmd := pdParams """"  A_LoopFileFullPath """"  " " """" TargetPath "\" OutNameNoExt "_.pdf" """"   
RunWait % comspec " /c " pCmd , % WorKingDir
   }
ExitApp

The results seem OK, but here are the problems I have found:
- The output folders are not created: wkhtmltopdf puts all the created pdf files into the Target folder without subfolders. This creates problems when html files have the same name, as wkhtmltopdf overwrites them!
Feature request: create output folders if necessary
https://github.com/w...tmltopdf/issues/2421

- Sometimes it creates small unnecessary pdf files (2 KB!). I can delete them later. I think they are created because, when saving a webpage with CTRL+S, some small htm files are also created in the related folder.
Example:
C:\A\save01.html
C:\A\save01\image.gif
C:\A\save01\image.htm
...
so in fact this is normal and ok ! ;)
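Those stray 2 KB pdfs come from the helper .htm files that the browser drops inside the saved page's support folder (named after the page, with or without the "_files" suffix). A small check could skip them before they ever reach wkhtmltopdf — just a sketch, and `is_helper_html` is a name of my choosing:

```python
from pathlib import Path

def is_helper_html(path):
    """True for .htm/.html files living inside a saved page's support
    folder, i.e. a folder that sits next to an html file of the same
    base name (with or without the '_files' suffix)."""
    p = Path(path)
    if p.suffix.lower() not in (".htm", ".html"):
        return False
    base = p.parent.name
    if base.endswith("_files"):
        base = base[: -len("_files")]
    # the "main" page the support folder belongs to, next to the folder
    sibling_dir = p.parent.parent
    return any((sibling_dir / (base + ext)).exists() for ext in (".html", ".htm"))
```

With the example above, C:\A\save01\image.htm would be skipped because C:\A\save01.html exists next to its folder, while top-level pages are kept.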



- I have tried to implement the following trick :
offline mode: does not try to look for missing component online for locally saved html files
https://github.com/w...tmltopdf/issues/3294


I have replaced this line of code:
pdParams := "wkhtmltopdf.exe "

with :
pdParams := "wkhtmltopdf.exe --proxy=http://127.0.0.1:0 "

Alas it didn't work : it still slowly tries to fetch the missing urls online.


- No silent mode : the flashing cmd windows don't let me continue working on my computer !
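Not a fix inside wkhtmltopdf itself, but the missing-subfolders and flashing-cmd-window problems can both be worked around by driving it from a wrapper script that recreates the Source tree under Target and launches the converter without a console window. A Python sketch of the idea (the exe path is just the one from the AHK script above, adjust as needed):

```python
import subprocess
from pathlib import Path

def target_pdf(src_root, dst_root, html_path):
    """Map Source/sub/page.html -> Target/sub/page.pdf, keeping subfolders,
    so same-named pages in different subfolders no longer overwrite each other."""
    rel = Path(html_path).relative_to(src_root)
    return (Path(dst_root) / rel).with_suffix(".pdf")

def convert_tree(src_root, dst_root, exe=r"C:\prog\wkhtmltox\bin\wkhtmltopdf.exe"):
    for page in Path(src_root).rglob("*"):
        if page.suffix.lower() not in (".htm", ".html"):
            continue
        pdf = target_pdf(src_root, dst_root, page)
        pdf.parent.mkdir(parents=True, exist_ok=True)   # create the output subfolders
        # CREATE_NO_WINDOW exists only on Windows; it stops the cmd window flashing
        flags = getattr(subprocess, "CREATE_NO_WINDOW", 0)
        subprocess.run([exe, str(page), str(pdf)], creationflags=flags)
```

The path mapping reproduces the C:\Source\...\2009_04_15_075256.html → C:\Target\...\2009_04_15_075256.pdf example from the post.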


If I am not using the correct tool, I'd be pleased to try other ideas (maybe based on other rendering engines ?). ;)
I realize that no method is error free, so text or images may not render perfectly every time, but as long as I get most of them it will be fine. ;)

Many thanks in advance ;) 
Jity2

Win 8.1 64bits home

20
Dear all,

After some tests:
I will keep using ScreenGrab and save mht and (scrolled) png files locally. And when I know in advance that I need the maximum data from a webpage, I will use the Google Chrome addons "Save Page WE" with all options and/or "WARCreate".
Then use DtSearch to make keywords searches inside them.

For Google Drive (GD), the only workable option for me right now is to convert mht files into Docx files. But I am not sure I want to spend time doing this each month. It would be much easier if Google Drive indexed the content of mht files like it does for html files.
I'll keep asking them but I am not very optimistic !
Mht files are, after all, similar to email files. See this interesting old link for several file format options : "What's the best “file format” for saving complete web pages (images, etc.) in a single archive?"  https://stackoverflo...ages-images-etc-in-a
(also : https://gsuite-devel...hable-in-google.html
https://developers.g...ference/files/insert
https://developers.g...dk#drive_integration
more generally:
https://support.goog...s/answer/35287?hl=en
https://www.google.c...ts/file_formats.html

Note: html files saved with "Save Page WE" in GD: the html is indexed, but as it is displayed when you preview an html file. The result is awful, as if it were read with a simple text reader. So if you search for two keywords that have some html code garbage between them, GD won't find them. The strange thing is that the thumbnails of those html files are displayed correctly, with images, if you click on the "i" (info) icon: it appears on the right part of your screen and shows the top part of the html. IMHO this thumbnail uses some kind of sandbox ?? I thought GD didn't display html correctly because they were afraid some javascript worm (or..?) would cause damage to GD ? )

From mht to Docx:
I use a Word macro [ALT+F11] to convert mht files with Word 2013 into Docx (it can also convert them into non-OCRed pdf, etc.) :
Here is the code, copied/adapted from http://muzso.hu/2013...-in-microsoft-office :

Option Explicit

Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

Sub ConvertDocs()
    Dim fs As Object
    Dim oFolder As Object
    Dim tFolder As Object
    Dim oFile As Object
    Dim strDocName As String
    Dim intPos As Integer
    Dim locFolder As String
    Dim fileType As String
    Dim office2007 As Boolean
    Dim lf As LinkFormat
    Dim oField As Field
    Dim oIShape As InlineShape
    Dim oShape As Shape
    On Error Resume Next
    locFolder = InputBox("Enter the path to the folder with the documents to be converted", "File Conversion", "C:\Users\your_username\Documents\mht_to_docx_results\")



  '  If Application.Version >= 12 Then
  '      office2007 = True
        Do
            fileType = UCase(InputBox("Enter one of the following formats (to convert to): TXT, RTF, HTML, DOC, DOCX or PDF", "File Conversion", "DOCX"))
        'Loop Until (fileType = "HTML") 'fileType = "TXT" Or fileType = "RTF" Or   Or fileType = "PDF" Or fileType = "DOC" Or fileType = "DOCX"
        Loop Until (fileType = "TXT" Or fileType = "RTF" Or fileType = "HTML" Or fileType = "PDF" Or fileType = "DOC" Or fileType = "DOCX")
       
  '  Else
   '     office2007 = False
    '    Do
     '       fileType = UCase(InputBox("Enter one of the following formats (to convert to): TXT, RTF, HTML or DOC", "File Conversion", "TXT"))
     '   Loop Until (fileType = "TXT" Or fileType = "RTF" Or fileType = "HTML" Or fileType = "DOC")
    'End Select
   
  '  End If
   
   
    Application.ScreenUpdating = False
    Set fs = CreateObject("Scripting.FileSystemObject")
    Set oFolder = fs.GetFolder(locFolder)
    Set tFolder = fs.CreateFolder(locFolder & "Converted")
    Set tFolder = fs.GetFolder(locFolder & "Converted")
    For Each oFile In oFolder.Files
        Dim d As Document
        Set d = Application.Documents.Open(oFile.Path)
        ' put the document into print view
     '   If fileType = "RTF" Or fileType = "DOC" Or fileType = "DOCX" Then
     '       With ActiveWindow.View
     '           .ReadingLayout = False
     '           .Type = wdPrintView
     '       End With
     '   End If
        ' try to embed linked images from fields, shapes and inline shapes into the document
        ' (for some reason this does not work for all images in all HTML files I've tested)
       ' If Not fileType = "HTML" Then
       '     For Each oField In d.Fields
       '         Set lf = oField.LinkFormat
       '         If oField.Type = wdFieldIncludePicture And Not lf Is Nothing And Not lf.SavePictureWithDocument Then
       '             lf.SavePictureWithDocument = True
       '             Sleep (2000)
       '             lf.BreakLink()
       '             d.UndoClear()
       '         End If
       '     Next
 '           For Each oShape In d.Shapes
 '               Set lf = oShape.LinkFormat
 '               If Not lf Is Nothing And Not lf.SavePictureWithDocument Then
 '                   lf.SavePictureWithDocument = True
 '                   Sleep (2000)
 '                   lf.BreakLink()
 '                   d.UndoClear()
 '               End If
  '          Next
 '           For Each oIShape In d.InlineShapes
 '               Set lf = oIShape.LinkFormat
 '               If Not lf Is Nothing And Not lf.SavePictureWithDocument Then
 '                   lf.SavePictureWithDocument = True
  '                  Sleep (2000)
  '                  lf.BreakLink() = d.UndoClear()
  ''              End If
  '          Next
  '      End If
        strDocName = d.Name
        intPos = InStrRev(strDocName, ".")
        strDocName = Left(strDocName, intPos - 1)
        ChangeFileOpenDirectory (tFolder)
        ' Check out these links for a comprehensive list of supported file formats and format constants:
        ' http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.wdsaveformat.aspx
        ' http://msdn.microsoft.com/en-us/library/office/bb238158.aspx
        ' (In the latter list you can see the values that the constants are associated with.
        '  Office 2003 only supported values up to wdFormatXML(=11). Values from wdFormatXMLDocument(=12)
        '  til wdFormatDocumentDefault(=16) were added in Office 2007, and wdFormatPDF(=17) and wdFormatXPS(=18)
        '  were added in Office 2007 SP2. Office 2010 added the various wdFormatFlatXML* formats and wdFormatOpenDocumentText.)
       ' If Not office2007 And fileType = "DOCX" Then
       '     fileType = "DOC"
       ' End If
        Select Case fileType
            Case Is = "TXT"
                strDocName = strDocName & ".txt"
                d.SaveAs FileName:=strDocName, FileFormat:=wdFormatText
                'ActiveDocument.SaveAs FileName:=strDocName, FileFormat:=wdFormatText
            Case Is = "RTF"
                strDocName = strDocName & ".rtf"
                d.SaveAs FileName:=strDocName, FileFormat:=wdFormatRTF
            Case Is = "HTML"
                strDocName = strDocName & ".html"
                d.SaveAs FileName:=strDocName, FileFormat:=wdFormatFilteredHTML
            Case Is = "DOC"
                strDocName = strDocName & ".doc"
                d.SaveAs FileName:=strDocName, FileFormat:=wdFormatDocument
            Case Is = "DOCX"
                strDocName = strDocName & ".docx"
                ' *** Word 2007+ users - remove the apostrophe at the start of the next line ***
                d.SaveAs FileName:=strDocName, FileFormat:=wdFormatDocumentDefault
            Case Is = "PDF"
                strDocName = strDocName & ".pdf"
                ' *** Word 2007 SP2+ users - remove the apostrophe at the start of the next line ***
                d.ExportAsFixedFormat OutputFileName:=strDocName, ExportFormat:=wdExportFormatPDF
        End Select
        d.Close
        ChangeFileOpenDirectory (oFolder)
    Next oFile
    Application.ScreenUpdating = True
End Sub


This works ok. The Docx files are about half the size of the mht files !
But:
- it crashed on mht files with video links inside them!
- And alas, Word tends to freeze my computer while converting some files!
- And sometimes wide webpages are cut off (so not all the images are displayed properly)!
- Be careful: the folder must contain only mht files (no other kind of files) before converting them to docx, otherwise the Word macro won't work!
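To satisfy that mht-only constraint without sorting files by hand, the non-mht files could be moved aside first and put back afterwards. A sketch only — `isolate_mht` and the "not_mht" holding folder are names of my choosing:

```python
import shutil
from pathlib import Path

def isolate_mht(folder, holding="not_mht"):
    """Move every non-.mht file out of `folder` into a `holding` subfolder,
    so the Word macro only ever opens mht files.  Returns the moved names."""
    folder = Path(folder)
    out = folder / holding
    moved = []
    for f in sorted(folder.iterdir()):   # snapshot first, then move
        if f.is_file() and f.suffix.lower() != ".mht":
            out.mkdir(exist_ok=True)
            shutil.move(str(f), str(out / f.name))
            moved.append(f.name)
    return moved
```

Run it on the folder you feed to the macro, convert, then move the contents of "not_mht" back.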

I also tested this batch software a long time ago (https://www.coolutil...m/TotalHTMLConverter https://www.coolutil...om/online/MHT-to-DOC), but it was too slow and converted mht into doc, not docx. And not all images made it from the html files into the doc files...


(I also tested this option but it didn't work well :
From png to pdf:
- It seems to work with a trial version of Nuance Power PDF Advanced, but the OCR result is not very good (to check, open the OCRed pdf file and save it as a txt file)!
- Acrobat Pro 8 : in two passes : one to create the pdf file, and a second to do the OCR. Alas, OCR is not always possible because the 45”x45” size limit is reached (even for some png files of 200”x200” with this trick https://acrobatusers...at-topics/ocr-error/). ;(  )

In GD:
Word files are indexed. Limit : 1 million characters.
PNG is indexed only if less than 2 MB. And only the title is OCRed if the saved page is too wide.
(Also, big pdf files are not indexed in full. From memory it is something like : at most the first 100 pages of OCRed pdf files, and at most the first 10 pages of non-OCRed files. There may also be a 50 MB indexing limit. But for bigger files you can still preview the file and do a keyword search with CTRL+F, and it will find the correct pages in the full pdf file. The full content of such a pdf just won't be indexed automatically by GD.)

Note: Mht files can’t be previewed nor indexed in OneDrive !

The last remaining option that I see would be for ScreenGrab to save the webpage as an OCRed pdf file (like some print-screen drivers do) + mht. That way it would work fine for most files natively in GD.
And I am sure this is technically doable !

Please let me know if you have other ideas ! ;)

21
Dear all,

Thanks for the comments.

Main idea:
With one click on a button, the Web Extension (in Google Chrome or Firefox) silently saves the complete webpage as html and as a complete screenshot. Then it silently closes the tab. 

Happily for me, the author of ScreenGrab (https://chrome.googl...jmaomipdeegbpk?hl=en) kindly helped me and released a new version that can take a screenshot, save the webpage as an .mht file, and close the tab, all when I use a shortcut (like CTRL+Q for instance). I encourage you to, like me, make a donation to him. ;)

Notes:
- This works only in Google Chrome and not in Firefox.
- YEAR_DAY_TIME_.mht (only one file instead of several. In my case, in the past, after a few million files I hit some kind of limit - Windows or NTFS?? - which prevented me from adding more files to the partition where I unzipped my files. This happened faster when my html filenames were longer than just YEAR-DAY_TIME.html + its associated folder).
- The mht results can sometimes contain less, the same, or more (yes, I didn't think it was possible!) information than the standard CTRL+S.
- The mht files are displayed better in Google Chrome than in Internet Explorer.
- I can index the mht files fine locally with DtSearch.
- Alas, online I can't search inside mht files with Google Drive (it doesn't index mht files. There is an external viewer, but I am reluctant to use it.)
I am in the process of deciding what to do : convert all my old html files and new mht files to .docx or pdf (with Word or ..). I also noticed that png (or jpg) files are not correctly OCRed in Google Drive (often only the title of a test article webpage is actually OCRed). The OCR is ok only if I manually convert the png file into a Google Document (which doesn't count against your Google Drive storage. But be careful if you have more than 1 million files in it, as its GUI will start to slow down - G Suite is far more powerful). Surprisingly, previewing a docx in Google Drive and doing a keyword search is also faster than doing the same with a Google Document.
- The mht and the png files from ScreenGrab are stored in "C:\Users\{username}\Downloads\ScreenGrab". If you need, like me, the data elsewhere you can use Syncback.




22
BTW, one good website that may be useful to you for Picasa and Google Photos is : https://sites.google...ite/picasaresources/

23
Maybe try Bulk Rename Utility / Actions / Import Rename Pairs (see their help file) + Renaming Options / Prevent Duplicates ?
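If the built-in duplicate prevention isn't enough, the list of new names could also be de-duplicated with a small script before importing it. A naive sketch — `unique_names` is my name, and I'm simply assuming you want repeated targets suffixed with a counter:

```python
def unique_names(names):
    """Append _1, _2, ... before the extension of repeated names,
    so no two rename targets collide."""
    seen = {}
    out = []
    for n in names:
        if n in seen:
            seen[n] += 1
            stem, dot, ext = n.rpartition(".")
            out.append(f"{stem}_{seen[n]}.{ext}" if dot else f"{n}_{seen[n]}")
        else:
            seen[n] = 0
            out.append(n)
    return out
```

(It doesn't guard against a generated name clashing with one already in the list; good enough as a starting point.)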

24
Dear all,

Main idea:
With one click on a button, the Web Extension (in Google Chrome or Firefox) silently saves the complete webpage as html and as a complete screenshot. Then it silently closes the tab. 

Process detailed :
1) Save the complete webpage as if I used the shortcut [CTRL+S], and rename the page as MONTH_DAY_MIN_SEC.html + its related folder ("htmlfilename_files", containing small images etc....),
all without prompting for a filename and/or directory.

2) Save a complete screenshot of the page
This can be the equivalent of this extension : ScreenGrab "Save Complete page" :
https://chrome.googl...kihagkjmaomipdeegbpk
(or something like this trick : [CTRL+I], then [CTRL+P], then type "capture" to run the full-size screenshot command (see https://zapier.com/b...reenshots-in-chrome/). In my tests this doesn't always capture the whole page, but it may be better than nothing!)

3) close the tab
Like when I use [CTRL+W]
(or a webextension like https://chrome.googl...midmoifolgpjkmhdmalf )


Optional :
- the button can be a standard icon or a bookmark button. It could change color while the saving process is running.
- it would be great if the saved content were saved in a different folder each day (like: C:\NEWS\MONTH_DAY\MONTH_DAY_MIN_SEC.html + folder : C:\NEWS\MONTH_DAY\MONTH_DAY_MIN_SEC_files ...)
- the html save step could be replaced by "Save Page WE" https://chrome.googl...geafimnjhojgjamoafof
but it creates only one html file containing the images, a little like a .mht file. It works ok, but probably not 100% on all pages.
- save a WARC file of the webpage ("WARC is now recognized by most national library systems as the standard to follow for web archival" https://en.wikipedia...org/wiki/Web_ARChive). Example : WARCreate https://chrome.googl...lcbmckhiljgaabnpcaaa
- Force closing the tab if, after 10 seconds, not everything is saved ?
- maybe saving the screenshot first is faster than saving the html first ?
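The per-day folder layout from the optional list above is easy to sketch. The names below just follow the MONTH_DAY / MONTH_DAY_MIN_SEC pattern of the C:\NEWS example; the root folder and the `daily_save_path` name are placeholders of mine:

```python
from datetime import datetime
from pathlib import Path

def daily_save_path(root, ext=".html", now=None):
    """Build <root>/MONTH_DAY/MONTH_DAY_MIN_SEC<ext>, creating today's
    folder so each day's captures land together."""
    now = now or datetime.now()
    day = now.strftime("%m_%d")               # MONTH_DAY folder
    name = now.strftime("%m_%d_%M%S") + ext   # MONTH_DAY_MIN_SEC file
    path = Path(root) / day / name
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

A saving extension (or a helper script it calls) would ask this for the destination just before writing the html/screenshot.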


Why am I asking for this ?
With the new Firefox Quantum released a few months ago, I am forced to change my habits. I save a lot of pages daily with one click on a bookmark, and I am very happy with it. But I currently use outdated Firefox + iMacros addon versions in order to save webpages very fast. See https://www.reddit.c...d_web_pages/dvbnycf/ and https://www.donation...ex.php?topic=23782.0
Previously I used a FireGestures shortcut to save webpages. This resulted in too many webpages saved for my taste! So now I prefer one click on a bookmark in the top middle of my screen !
But why save the same webpage in different formats ?
Because I realized that sometimes the saved html page is not saved properly. Then, when I try to open it many years later with the current browser of the day, it does not display well because of some small inserted urls or something ! Too bad for me, as there is a small missing image that I either have to dig out of the related html folder, or that was not saved at all for some reason !
So being able to see a screenshot might not be a bad idea (and in Windows Explorer I can also display the screenshots as 'extra large icons'). ;)
Furthermore, now, years later, I use DtSearch locally to index all my content, and online Google Drive, which automatically OCRs screenshots (even if it is far from perfect!). ;)


Many thanks in advance ;)

25
 :-[ Thanks for the links, NetRunner. ;)
