DONE: Mass convert already locally saved html (+htm +mht) files to pdf

DonationCoder.com Software > Finished Programs

jity2:
Hi,
Many thanks, 4wd. ;) Much appreciated. ;)

I did some tests (one old month saved with IE and one old month saved with Firefox) with your updated script, and it works fine for html and htm files at the same time. ;)
As I have quite a lot of files, converting the htm and html files should run for a few months! But it seems that I can speed up the conversion if I run several PowerShell instances (script copied and renamed, shortcut included). ;)

I have added a few manual steps:
Modified from https://stackoverflow.com/questions/27897669/powershell-remove-files-smaller-than-500bytes-in-directory-recursively , here is the PowerShell code that I use to remove the pdf files that are smaller than 3 KB (created by wkhtmltopdf; in my case they in fact contain no text):

--- Code: PowerShell ---
Get-ChildItem $path -Filter *.pdf -recurse -file | ? {$_.length -lt 3000} | % {Remove-Item $_.fullname}

Then I use the freeware "Remove Empty Directories" (http://www.jonasjohn.de/red.htm), which removes all empty directories in all the subfolders. It is very powerful IMHO. ;)

I don't know if this is possible, but it would be great if the PowerShell script could skip htm and html files that are smaller than 3 KB? Thanks in advance ;)

You also have a great memory for my 2016 request. ;) But I must acknowledge that I wouldn't have been able to modify the 10% left in the new code!!!

For mht files to pdf:
Thanks for the mht to html link. It reminds me that finding a simple solution can lead to many trials!
Mine were not created with Internet Explorer; they were, and still are, created with Google Chrome. In my manual tests, these mht files are often displayed better in Chrome than in IE.

Here are my tests :
[windows+R]
cmd

cd C:\Program Files (x86)\Google\Chrome\Application

Apparently Chrome also understands if I change the code from
chrome --headless --print-to-pdf="C:\result\20170619_075623.pdf" "C:\source\t2\20170619_075623.htm"
to
chrome --headless "C:\source\t2\20170619_075623.htm" --print-to-pdf="C:\result\20170619_075623.pdf"

And after some tests (thanks to https://www.autohotkey.com/boards/viewtopic.php?t=26819) I ended up with working code! It works ;) but it also copies other files (png, etc.) into the target folder!

--- Code: AutoHotkey ---
WorKingDir := "C:\Program Files (x86)\Google\Chrome\Application"
pdParams := "chrome.exe --headless "
FileSelectFolder,SourcePath,,0,Select Source Folder
If SourcePath =
  ExitApp

FileSelectFolder,TargetPath,*%SourcePath%,0,Select Target Folder
If TargetPath =
  ExitApp

pdParams := "chrome.exe --headless "
WorKingDir := "C:\Program Files (x86)\Google\Chrome\Application"
RunWait % comspec " /c xCopy """ SourcePath A_loopField """ """ TargetPath A_loopField """ *.mht /s /i /y",, Hide

Loop, Files, % TargetPath "\*.mht" , R
{
  SplitPath, A_LoopFileFullPath, name, dir, ext, name_no_ext
  outPDF_repared := dir "\" name_no_ext "" ".pdf"
  pCmd := pdParams " " """" A_LoopFileFullPath """" " " """" "--print-to-pdf="outPDF_repared """"
  RunWait % comspec " /c " pCmd , % WorKingDir , Hide
  FileAppend, % "Result pdrepair`n" outPDF_repared "`n", % A_Temp "\LOG_pdrepair.txt"
  FileRead, outLOG, % TargetPath "\LOG.txt"
  FileAppend, % outLOG "`n" , % A_Temp "\LOG_pdrepair.txt"
  FileDelete, % A_LoopFileFullPath
}

Msgbox 0x40000,, % "END!",1

ExitApp

I have tried to change:
RunWait % comspec " /c xCopy """ SourcePath A_loopField """ """ TargetPath A_loopField """ /s /i /y",, Hide
with
RunWait % comspec " /c xCopy """ SourcePath A_loopField """ """ TargetPath A_loopField """ *.mht /s /i /y",, Hide
or
RunWait % comspec " /c xCopy """ SourcePath A_loopField """ *.mht """ TargetPath A_loopField """ /s /i /y",, Hide
or
RunWait % comspec " /c xCopy *.mht """ SourcePath A_loopField """ """ TargetPath A_loopField """ /s /i /y",, Hide
or
RunWait % comspec " /c xCopy "\*.mht" """ SourcePath A_loopField """ """ TargetPath A_loopField """ /s /i /y",, Hide
Alas I am stuck !
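
For the xcopy attempts above, one thing that may be worth trying is putting the wildcard inside the quoted source path (for example "C:\source\*.mht"), since xcopy normally takes the file mask as part of its source argument; that has not been tested in this thread. Alternatively, the same "copy only the mht files, keep the subfolders" step can be sketched in PowerShell. The function name Copy-MhtTree and its parameters are hypothetical, chosen only for illustration:

```powershell
# Hypothetical helper (not from the thread): copy only *.mht files from
# $Source to $Target and recreate the sub-folder layout on the way.
function Copy-MhtTree {
    param(
        [Parameter(Mandatory = $true)] [string]$Source,
        [Parameter(Mandatory = $true)] [string]$Target
    )
    $seps = [char[]]'\/'
    # Full path of the source folder; its length (without a trailing
    # separator) is used to cut the relative part out of each file path.
    $full = (Resolve-Path -LiteralPath $Source).ProviderPath
    $prefixLength = $full.TrimEnd($seps).Length

    Get-ChildItem -LiteralPath $full -Filter *.mht -File -Recurse |
        Where-Object { $_.Extension -eq '.mht' } |   # -Filter can also match longer extensions such as .mhtml
        ForEach-Object {
            $relative = $_.FullName.Substring($prefixLength).TrimStart($seps)
            $dest     = Join-Path $Target $relative
            $destDir  = Split-Path $dest -Parent
            if (-not (Test-Path -LiteralPath $destDir)) {
                New-Item -Path $destDir -ItemType Directory -Force | Out-Null
            }
            Copy-Item -LiteralPath $_.FullName -Destination $dest -Force
        }
}
```

It could be called as Copy-MhtTree -Source 'C:\source\t2' -Target 'C:\result'; png and other companion files are left behind because only files whose extension is exactly .mht are copied.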

So I have tried to modify your Powershell script :

--- Code: PowerShell ---
<#
  CTP.ps1

  Recursively convert *.mht to PDF.
#>

Function Get-Folder {
  Add-Type -AssemblyName System.Windows.Forms
  $FolderBrowser = New-Object System.Windows.Forms.FolderBrowserDialog
  [void]$FolderBrowser.ShowDialog()
  $temp = $FolderBrowser.SelectedPath
  If($temp -eq '') {Exit}
  If(-Not $temp.EndsWith('\')) {$temp = $temp + '\'}
  Return $temp
}

If($PSVersionTable.PSVersion.Major -lt 3) {
  Write-Host '** Script requires at least Powershell V3 **'
} else {
  Write-Host 'Choose folder with PDF files: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  $srcFolder = (Get-Folder)
  Write-Host $srcFolder
  Write-Host 'Choose output folder: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  Do {$dstFolder = (Get-Folder)} While($dstFolder -eq $srcFolder)
  Write-Host $dstFolder

  $aFiles = (Get-ChildItem -Include *.mht -Path ($srcFolder + "*") -Recurse)
  for($i = 0; $i -lt $aFiles.Count; $i++) {
    $inFile = [string]$aFiles[$i]
    Write-Host 'File:' $inFile -BackgroundColor DarkBlue -ForegroundColor Yellow
    $outFile = $dstFolder + $inFile.Replace($srcFolder, "") + '.pdf'
    $temp = Split-Path $outFile -Parent
    if (!(Test-Path $temp)) {
      New-Item $temp -ItemType Directory | Out-Null
    }
    $args = "`"$($infile)`" chrome --headless --print-to-pdf=`"$($outFile)`""
    Start-Process -FilePath "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" -Wait -NoNewWindow -ArgumentList $args
  }
}
See the line containing "--print-to-pdf=". Alas, this doesn't work!

@lainB: I am saving mht like you now (https://www.donationcoder.com/forum/index.php?topic=45152.msg417446#msg417446). I don't use Google Docs now for uploaded files (I used to save htm files from Firefox but I have stopped). I just now mainly upload pdf files into Google Drive.

Many thanks in advance ;)

4wd:
I don't know if this is possible but it would be great if the Powershell could exclude converting htm and html files that are smaller than 3 KB? -jity2 (March 16, 2019, 06:48 AM)
--- End quote ---

You already have the answer, same idea as deleting the small PDFs.

--- Code: PowerShell ---
$aFiles = (Get-ChildItem -Include *.html,*.htm -Path ($srcFolder + "*") -Recurse | Where-Object {$_.Length -gt 3kb} )

As for Chrome, I couldn't get it to work from PowerShell either, I'll have another look when I have time.

But you do have the args wrong, should be:

--- Code: PowerShell ---
$args = "`"$($infile)`" --headless --print-to-pdf=`"$($outFile)`""

Just the arguments for the command.

Could try adding a working directory also:

--- Code: PowerShell ---
Start-Process -FilePath "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" -Wait -NoNewWindow -ArgumentList $args -WorkingDirectory "C:\Program Files (x86)\Google\Chrome\Application"

jity2:
Wow, this is fantastic, 4wd! ;)
For the second part, I have seen no visible difference when adding the "working directory".

Many thanks. Much appreciated. ;)

So as a summary:
In order to convert htm to pdf and html to pdf, use this PowerShell script written by 4wd (if PowerShell is new to you, have a look here: https://www.donationcoder.com/forum/index.php?topic=42713.msg399588#msg399588):

--- Code: PowerShell ---
<#
  CTP.ps1

  Recursively convert *.htm and *.html to PDF + exclude htm and html files that are smaller than 3kb.
#>

Function Get-Folder {
  Add-Type -AssemblyName System.Windows.Forms
  $FolderBrowser = New-Object System.Windows.Forms.FolderBrowserDialog
  [void]$FolderBrowser.ShowDialog()
  $temp = $FolderBrowser.SelectedPath
  If($temp -eq '') {Exit}
  If(-Not $temp.EndsWith('\')) {$temp = $temp + '\'}
  Return $temp
}

If($PSVersionTable.PSVersion.Major -lt 3) {
  Write-Host '** Script requires at least Powershell V3 **'
} else {
  Write-Host 'Choose folder with PDF files: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  $srcFolder = (Get-Folder)
  Write-Host $srcFolder
  Write-Host 'Choose output folder: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  Do {$dstFolder = (Get-Folder)} While($dstFolder -eq $srcFolder)
  Write-Host $dstFolder

  $aFiles = (Get-ChildItem -Include *.html,*.htm -Path ($srcFolder + "*") -Recurse | Where-Object {$_.Length -gt 3kb} )
  for($i = 0; $i -lt $aFiles.Count; $i++) {
    $inFile = [string]$aFiles[$i]
    Write-Host 'File:' $inFile -BackgroundColor DarkBlue -ForegroundColor Yellow
    $outFile = $dstFolder + $inFile.Replace($srcFolder, "") + '.pdf'
    $temp = Split-Path $outFile -Parent
    if (!(Test-Path $temp)) {
      New-Item $temp -ItemType Directory | Out-Null
    }
    $args = "`"$($infile)`" -p 127.0.0.1 `"$($outFile)`""
    Start-Process -FilePath ".\wkhtmltopdf.exe" -Wait -NoNewWindow -ArgumentList $args
  }
}

In order to convert mht to pdf, use this Powershell script written by 4wd (mht files created with Google Chrome) :

--- Code: PowerShell ---
<#
  CTP.ps1

  Recursively convert *.mht to PDF.
#>

Function Get-Folder {
  Add-Type -AssemblyName System.Windows.Forms
  $FolderBrowser = New-Object System.Windows.Forms.FolderBrowserDialog
  [void]$FolderBrowser.ShowDialog()
  $temp = $FolderBrowser.SelectedPath
  If($temp -eq '') {Exit}
  If(-Not $temp.EndsWith('\')) {$temp = $temp + '\'}
  Return $temp
}

If($PSVersionTable.PSVersion.Major -lt 3) {
  Write-Host '** Script requires at least Powershell V3 **'
} else {
  Write-Host 'Choose folder with PDF files: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  $srcFolder = (Get-Folder)
  Write-Host $srcFolder
  Write-Host 'Choose output folder: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  Do {$dstFolder = (Get-Folder)} While($dstFolder -eq $srcFolder)
  Write-Host $dstFolder

  $aFiles = (Get-ChildItem -Include *.mht -Path ($srcFolder + "*") -Recurse)
  for($i = 0; $i -lt $aFiles.Count; $i++) {
    $inFile = [string]$aFiles[$i]
    Write-Host 'File:' $inFile -BackgroundColor DarkBlue -ForegroundColor Yellow
    $outFile = $dstFolder + $inFile.Replace($srcFolder, "") + '.pdf'
    $temp = Split-Path $outFile -Parent
    if (!(Test-Path $temp)) {
      New-Item $temp -ItemType Directory | Out-Null
    }
    $args = "`"$($infile)`" --headless --print-to-pdf=`"$($outFile)`""
    Start-Process -FilePath "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" -Wait -NoNewWindow -ArgumentList $args -WorkingDirectory "C:\Program Files (x86)\Google\Chrome\Application"
  }
}

Many thanks again to 4wd for the great help. ;)
See ya

4wd:
Many thanks again to 4wd for the great help. ;)-jity2 (March 16, 2019, 09:24 AM)
--- End quote ---

Thanks, edited my OP, now does the lot, (.html, .htm, .mht).

Updated

jity2:
Wow! Many thanks, 4wd. This is working like a charm! ;)

After several tests I realize that I have to close your script(s) each night, even if they have not finished their job.
Note: wkhtmltopdf uses very little CPU and Chrome far more, so I run several copies of your script at the same time (especially for folders containing htm and html files).
The next day, when I start the script(s) again, it would be great if they could skip the conversion when the pdf file already exists in the destination folder.
Currently I have to manually dig through and move the files and specific folders, so that the script does not spend a few hours just getting back to where it stopped.

I have tried to insert this code at row #43:

--- Code: PowerShell ---
if {($inFile.Substring([Math]::Max($inFile.Length - 3, 0))) = $outFilenext i}

After testing it, alas this proves that ...I am still not a coder !! ;)

Many thanks in advance ;)
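
One possible way to let the script resume where it stopped is to test whether the output PDF already exists before calling the converter. The sketch below is an assumption about how that could look, not part of 4wd's script: Test-NeedsConversion is a hypothetical helper name, and $outFile is the variable the script already builds inside its loop.

```powershell
# Hypothetical helper (not part of 4wd's script): returns $true when the
# target PDF is missing or empty and therefore still needs to be converted.
function Test-NeedsConversion {
    param([Parameter(Mandatory = $true)] [string]$OutFile)

    # No file at that path yet: conversion is needed.
    if (-not (Test-Path -LiteralPath $OutFile -PathType Leaf)) { return $true }

    # A zero-byte file is treated as a leftover from an interrupted run.
    return ((Get-Item -LiteralPath $OutFile -Force).Length -eq 0)
}

# Assumed placement inside the existing for-loop, right after $outFile is set:
#   if (-not (Test-NeedsConversion $outFile)) { continue }
```

With that check in place, files that already have a non-empty PDF in the destination folder would be skipped on the next run, so the folders should no longer need to be moved by hand.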
