Topic: DONE: Mass convert already locally saved html (+htm +mht) files to pdf

jity2

Dear all,

I would like to convert the html files in my archives to pdf (with real, searchable text and the locally saved images included) so that I can run keyword searches on them in Google Drive.
The images saved alongside each html file (usually in a companion folder) need to be included, as well as the URLs present in the html. I would prefer a tool that works offline, since everything I need is already saved in the local html files, so that it doesn't waste time trying to fetch every missing URL.

My archives span about 20 years (many thousands of files). They were mostly saved with Internet Explorer (Maxthon) for a few years, then mostly with Firefox, and now with Google Chrome (and HTTrack).

The idea: I choose one big "Source" folder (usually a month of archives). The tool scans all the html, htm and mht files in all the subfolders on its own, then creates the converted pdf files in a big "Target" folder, with the original names and the same subfolder structure.
Example :
C:\Source\2009\2009-04\15\2009_04_15_075256.html
...
C:\Target\2009\2009-04\15\2009_04_15_075256.pdf
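
The Source-to-Target mapping in this example can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not one of the scripts used in this thread: the function name `target_pdf_path` and the sample path are made up, and the extension is swapped for `.pdf` as in the example above.

```python
from pathlib import PureWindowsPath


def target_pdf_path(source_file, source_root, target_root):
    """Mirror source_file (located under source_root) into target_root as a .pdf path."""
    # Path of the file relative to the chosen Source folder, subfolders included
    relative = PureWindowsPath(source_file).relative_to(source_root)
    # Same subfolders and file name under Target, with the extension swapped for .pdf
    return PureWindowsPath(target_root) / relative.with_suffix(".pdf")


if __name__ == "__main__":
    # Hypothetical sample path, for illustration only
    print(target_pdf_path(r"C:\Source\sub\page.html", r"C:\Source", r"C:\Target"))
```

One design point to keep in mind: swapping the extension means that page.htm and page.html in the same folder would map to the same pdf. Appending `.pdf` to the full file name instead avoids that collision.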



I have tried wkhtmltopdf, which is based on WebKit (as is Safari; see https://github.com/w...tmltopdf/issues/3163), with the following script (based on an old AHK script found here):

WorKingDir := "C:\prog\wkhtmltox\bin"      
pdParams := "wkhtmltopdf.exe "
FileSelectFolder,SourcePath,,0,Select Source Folder
If SourcePath =
ExitApp

FileSelectFolder,TargetPath,*%SourcePath%,0,Select Target Folder
If TargetPath =
ExitApp

If (SourcePath==TargetPath){
  Msgbox 0x40000,, % "SourcePath and TargetPath cant be the same "    TargetPath
ExitApp
}

Loop, Files, %SourcePath%\*.htm, R
   {
SplitPath, A_LoopFileFullPath, , , , OutNameNoExt
pCmd := pdParams """"  A_LoopFileFullPath """"  " " """" TargetPath "\" OutNameNoExt "_.pdf" """"   
RunWait % comspec " /c " pCmd , % WorKingDir
   }
ExitApp

The results seem OK, but here are the problems I have found:
- The output subfolders are not created: wkhtmltopdf puts every pdf directly into the Target folder. This is a problem when html files in different subfolders share the same name, as the later pdf overwrites the earlier one!
Feature request: create output folders if necessary
https://github.com/w...tmltopdf/issues/2421

- Sometimes it creates small, unnecessary pdf files (2 KB!), which I can delete later. I think they appear because saving a web page with CTRL+S also creates some small htm files in the companion folder.
Example:
C:\A\save01.html
C:\A\save01\image.gif
C:\A\save01\image.htm
...
so in fact this is normal and ok ! ;)



- I have tried to implement the following trick :
offline mode: does not try to look for missing component online for locally saved html files
https://github.com/w...tmltopdf/issues/3294


I have replaced this line of code:
pdParams := "wkhtmltopdf.exe "

with :
pdParams := "wkhtmltopdf.exe --proxy=http://127.0.0.1:0 "

Alas, it didn't work: it still slowly tries to fetch the missing URLs online.


- There is no silent mode: the flashing cmd windows keep me from working on my computer in the meantime!


If I am not using the right tool, I'd be pleased to try other ideas (maybe based on other rendering engines?). ;)
I realize that no method is error-free, so text or images may not be rendered perfectly every time, but as long as I get most of them it will be fine. ;)

Many thanks in advance ;) 
Jity2

Win 8.1 64bits home

Shades

There is a piece of software called Pandoc.

It converts a lot of text-based formats to other text-based formats, and HTML to PDF is one of those conversions. It is available for all the major operating systems, it is free, and it is really good at what it does. However, it is a command-line tool, and that immediately makes it software non grata to some. A manual is included. For PDF output you will likely need to install a separate PDF engine as well (a LaTeX distribution by default; wkhtmltopdf is also supported). The number of parameters you can adjust is staggering, and while that may frighten you a bit, the default values have worked well for me on the occasions that I used Pandoc.

Both the offline and online documentation are easy to follow. As it is a command-line tool, you can script it to go through your whole collection automatically, even at timed intervals if you want that as well.

tomos

@jity2 you mention OCR - what for? Are there images with text involved?
Tom

jity2

Hi,
Thanks Shades. I am trying Pandoc right now!

@Tomos: I only mean that when there is text content in the html, it should end up as real text in the pdf file, so that it can be read (or later indexed by Google Drive).
There is no need to run OCR on the image files contained in the html files.
I am not sure I am being clear! I just don't want an image-only pdf file as a result.

Thanks in advance ;)

jity2

I did some Pandoc tests but I keep getting the same error :

Code example:
pandoc C:\prog\pandoc\pandoc-2.7.1-windows-x86_64\t5\20170611_074645.htm -t latex --pdf-engine=xelatex -s -o C:\prog\pandoc\pandoc-2.7.1-windows-x86_64\t5\20170611_074645.pdf

Error:
[Screenshot of the error: 2019-03-15_103115.png]
I did some google searches but I am stuck!

Thanks in advance ;)

jity2

I just tested with Chrome Headless browser :
(Adapted from https://superuser.com/a/1211603/27956)
[windows+R]
cmd
cd C:\Program Files (x86)\Google\Chrome\Application
chrome --headless --print-to-pdf="C:\result\20170619_075623.pdf" "C:\source\t2\20170619_075623.htm"

Now I need to adapt the AHK script above.
Edit: Here is what I have tried, but I am stuck!

WorKingDir := "C:\Program Files (x86)\Google\Chrome\Application"      
pdParams := "chrome.exe --headless --print-to-pdf= "
FileSelectFolder,SourcePath,,0,Select Source Folder
If SourcePath =
ExitApp

FileSelectFolder,TargetPath,*%SourcePath%,0,Select Target Folder
If TargetPath =
ExitApp

If (SourcePath==TargetPath){
  Msgbox 0x40000,, % "SourcePath and TargetPath cant be the same "    TargetPath
ExitApp
}

Loop, Files, %SourcePath%\*.htm, R
   {
SplitPath, A_LoopFileFullPath, , , , OutNameNoExt
pCmd := pdParams """"  A_LoopFileFullPath """"  " " """" TargetPath "\" OutNameNoExt "_.pdf" """"   
RunWait % comspec " /c " pCmd , % WorKingDir
   }
ExitApp

Thanks in advance ;)
« Last Edit: March 15, 2019, 05:21 AM by jity2 »

4wd

Something quick in PowerShell using wkhtmltopdf, (put the executable in the same directory as the script), and Chrome.

Recursively converts .html, .htm, and .mht to PDF files.

Btw, if it looks familiar, 90% came from here.


Run it from a PoSh console or use a shortcut with the following as the Target (assuming the shortcut is in the same folder as the script):
Code: Text
%SystemRoot%\system32\WindowsPowerShell\v1.0\powershell.exe -sta -NoProfile -ExecutionPolicy Bypass -File "CTP.ps1"

CTP.ps1
Code: PowerShell [Select]
  1. <#
  2.   CTP.ps1
  3.  
  4.   Convert .htm(l) and .mht to PDF
  5. #>
  6.  
  7. Function Get-Folder {
  8.   Add-Type -AssemblyName System.Windows.Forms
  9.   $FolderBrowser = New-Object System.Windows.Forms.FolderBrowserDialog
  10.   [void]$FolderBrowser.ShowDialog()
  11.   $temp = $FolderBrowser.SelectedPath
  12.   If($temp -eq '') {Exit}
  13.   Return $temp
  14. }  
  15.  
  16. If($PSVersionTable.PSVersion.Major -lt 3) {
  17.   Write-Host '** Script requires at least Powershell V3 **'
  18. } else {
  19.   Write-Host 'Choose input folder: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  20.   $srcFolder = (Get-Folder)
  21.   Write-Host $srcFolder
  22.   Write-Host 'Choose output folder: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  23.   Do {$dstFolder = (Get-Folder)} While($dstFolder -eq $srcFolder)
  24.   Write-Host $dstFolder
  25.  
  26.   # PowerShell doesn't care how many stacked '\' there are in a path, multiples will always be seen as one
  27.   # ie. C:\\\\Windows = C:\Windows
  28.   # Means you can just add to the path without worrying about the trailing character
  29.   # Collect all files bigger than 3kB
  30.   $aFiles = (Get-ChildItem -Include *.html,*.htm,*.mht -Path ($srcFolder + "\*") -Recurse | Where-Object {$_.Length -gt 3kb} )
  31.   for($i = 0; $i -lt $aFiles.Count; $i++) {
  32.     $inFile = [string]$aFiles[$i]
  33. # Substitute destination folder for source folder and tack .pdf on the end
  34. # Could probably replace the extension instead ... but laziness and all that
  35.     $outFile = $dstFolder + $inFile.Replace($srcFolder, "") + '.pdf'
  36. # If output file doesn't exist then process
  37.     if (!(Test-Path $outFile)) {
  38. # Grab the parent of the output file, create the folder structure if it doesn't exist
  39.       $temp = Split-Path $outFile -Parent
  40.       if (!(Test-Path $temp)) {
  41.         New-Item $temp -ItemType Directory | Out-Null
  42.       }
  43.  
  44.       Write-Host 'File:' $inFile -BackgroundColor DarkBlue -ForegroundColor Yellow
  45. # Switch command based on the last 3 characters of the file name, anything that isn't 'mht' gets processed
  46. # as htm(l)
  47.       switch ($inFile.Substring([Math]::Max($inFile.Length - 3, 0))) {
  48.         mht {
  49.           $args = "`"$($inFile)`" --headless --print-to-pdf=`"$($outFile)`""
  50.           Start-Process -FilePath "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" -Wait -NoNewWindow -ArgumentList $args
  51.           }
  52.         default {
  53.           $args = "-p 127.0.0.1 -q `"$($inFile)`" `"$($outFile)`""
  54.           Start-Process -FilePath ".\wkhtmltopdf.exe" -Wait -NoNewWindow -ArgumentList $args
  55.           }
  56.       }
  57.     }
  58.   }
  59. }
  60. Write-Host ''
  61. Write-Host 'Close window to exit ...'
  62. cmd /c pause | out-null
« Last Edit: March 17, 2019, 06:53 AM by 4wd »
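
For readers who would rather drive the two converters from Python, the extension switch in the script above can be sketched as follows. This is a rough sketch under stated assumptions, not part of 4wd's script: `build_command` is a made-up name, the Chrome and wkhtmltopdf locations are simply the ones used in this thread, and only the options that already appear in the script are used.

```python
from pathlib import PureWindowsPath

# Locations as used in this thread; adjust them to your own installation (assumption).
CHROME = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
WKHTMLTOPDF = r".\wkhtmltopdf.exe"


def build_command(in_file, out_file):
    """Return the argument list that converts in_file to the PDF out_file."""
    if PureWindowsPath(in_file).suffix.lower() == ".mht":
        # .mht goes to headless Chrome, as in the script above
        return [CHROME, in_file, "--headless", f"--print-to-pdf={out_file}"]
    # Anything that isn't .mht is processed as htm(l) by wkhtmltopdf
    return [WKHTMLTOPDF, "-p", "127.0.0.1", "-q", in_file, out_file]
```

A list built this way can be handed to `subprocess.run` as it is, which takes care of quoting paths that contain spaces.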

jity2

Wow 4wd! Thank you ! ;)
It works for html and htm files, but not for mht files, as wkhtmltopdf can't convert from mht to pdf.

Quote from 4wd: Sorry, bit busy prepping to go overseas atm, if I have time in the next day or two I'll clean it up.
No problem. I will do some tests in the meantime. Many thanks in advance. ;)

4wd

Added *.htm to the Get-ChildItem filter so no need to edit and run separately, (not tested but works in other scripts - just remove it if there's a problem).

As for .mht, you could use PowerShell to pass them through the IE core to output as PDF ... so goes the theory.

Alternatively, convert to HTML first then run the script, interesting blog post: http://raywoodcocksl...-html-mht-files.html
« Last Edit: March 15, 2019, 05:58 PM by 4wd »

IainB

The solution to meet the requirement need not necessarily be complex. Try going to the lowest common denominator - e.g., the data type. For example, I have been saving documents and web pages as .html and now (usually) .mht/.mhtml for years and searching them successfully with WDS (Windows Desktop Search) and GDS (Google Desktop Search). The files are all backed up (synced) to Google Drive.

Interesting points:
  • The search and preview of these files in Google Drive itself though is not much use as it seems to have become somewhat proprietary in the way it enforces the preferred proprietary and/or Google docs extensions.
    Moreover, the user is obliged to risk permanent degradation of their data if they convert to Google Docs format(s) from other formats.
  • The best browsers for being able to view text and images in .html and .mht and .mhtml seem to be Internet Explorer :Thmbsup: and Firefox (not sure about the latest Firefox though). Chrome  :down: doesn't seem to do it at all well, and Brave's capability  :down: appears to be nonexistent. :o
  • Other tools for viewing these files include Everything (search tool) and xplorer² (Windows Explorer alternative), and not forgetting Universal Viewer.
« Last Edit: March 15, 2019, 05:49 PM by IainB »

jity2

Hi,
Many thanks 4wd. ;) Much appreciated. ;)

I did some tests (one old month saved with IE and one old month saved with Firefox) with your updated script, and it works fine for html and htm files at the same time. ;)
As I have quite a lot of files, converting the htm and html files would take a few months! But it seems that I can speed up the conversion by running several PowerShell instances at once (from copied and renamed scripts, shortcuts included). ;)
[Screenshot: 2019-03-16_103749.png]
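
The same idea of running several conversions side by side can also be sketched with a small worker pool in Python instead of several script copies. This is a hedged sketch, not something tested on these archives: `convert_all`, `workers` and `run` are hypothetical names, and each command is assumed to be a ready-made argument list for wkhtmltopdf or Chrome.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def convert_all(commands, workers=4, run=subprocess.run):
    """Run each command (an argument list) and return the exit codes in input order."""
    def exit_code(command):
        return run(command).returncode

    # Threads are enough here: each worker only waits for an external converter process
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(exit_code, commands))
```

The `run` parameter only exists so the function can be tried without launching a real converter. How many workers actually help will depend on the machine and on which converter is doing the work.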


I have added a few manual steps:
Modified from https://stackoverflo...irectory-recursively , here is the PowerShell code that I use to remove the pdf files smaller than 3 KB (created by wkhtmltopdf; in my case they contain no text):
Get-ChildItem $path -Filter *.pdf -recurse -file | ? {$_.length -lt 3000} | % {Remove-Item $_.fullname}

Then I use the freeware "Remove Empty Directories" (http://www.jonasjohn.de/red.htm), which removes all the empty directories in all the subfolders. It is very powerful IMHO. ;)

I don't know if this is possible, but it would be great if the PowerShell script could skip htm and html files that are smaller than 3 KB. Thanks in advance ;)


You also have a great memory for my 2016 request. ;) But I must acknowledge that I wouldn't have been able to modify the remaining 10% of the new code myself!!!

For mht files to pdf:
Thanks for the mht to html link. It reminds me that finding a simple solution can take many trials!
My mht files were not created with Internet Explorer; they were, and still are, created with Google Chrome. In my manual tests, these mht files are often displayed better in Chrome than in IE.

Here are my tests :
[windows+R]
cmd
cd C:\Program Files (x86)\Google\Chrome\Application

Apparently Chrome also understands if I change the code from
chrome --headless --print-to-pdf="C:\result\20170619_075623.pdf" "C:\source\t2\20170619_075623.htm"
to
chrome --headless "C:\source\t2\20170619_075623.htm" --print-to-pdf="C:\result\20170619_075623.pdf"

After some tests (thanks to https://www.autohotk...iewtopic.php?t=26819), I got working code! It works ;) but it also copies other files (png...) into the target folder!

[Screenshot: 2019-03-16_124647.png]
WorKingDir := "C:\Program Files (x86)\Google\Chrome\Application"      
pdParams := "chrome.exe --headless "
FileSelectFolder,SourcePath,,0,Select Source Folder
If SourcePath =
ExitApp

FileSelectFolder,TargetPath,*%SourcePath%,0,Select Target Folder
If TargetPath =
ExitApp

; Copy the source tree into the target folder; the mht copies are converted there and then deleted.
RunWait % comspec " /c xCopy """ SourcePath A_loopField """ """ TargetPath A_loopField """ *.mht /s /i /y",, Hide


Loop, Files, % TargetPath "\*.mht" , R
   {       
   SplitPath, A_LoopFileFullPath, name, dir, ext, name_no_ext
   outPDF_repared :=  dir "\" name_no_ext "" ".pdf"
   pCmd := pdParams " " """"  A_LoopFileFullPath """"  " " """" "--print-to-pdf="outPDF_repared """"   
   RunWait % comspec " /c " pCmd , % WorKingDir , Hide
   FileAppend, % "Result pdrepair`n" outPDF_repared "`n", % A_Temp "\LOG_pdrepair.txt"
   FileRead, outLOG, % TargetPath "\LOG.txt"
   FileAppend, % outLOG "`n" , % A_Temp "\LOG_pdrepair.txt"
   FileDelete, % A_LoopFileFullPath
   }                             

Msgbox 0x40000,, % "END!",1                                           

ExitApp


I have tried to change:
RunWait % comspec " /c xCopy """ SourcePath A_loopField """ """ TargetPath A_loopField """ /s /i /y",, Hide
with
RunWait % comspec " /c xCopy """ SourcePath A_loopField """ """ TargetPath A_loopField """ *.mht /s /i /y",, Hide
or
RunWait % comspec " /c xCopy """ SourcePath A_loopField """ *.mht """ TargetPath A_loopField """ /s /i /y",, Hide
or
RunWait % comspec " /c xCopy *.mht """ SourcePath A_loopField """ """ TargetPath A_loopField """ /s /i /y",, Hide
or
RunWait % comspec " /c xCopy "\*.mht" """ SourcePath A_loopField """ """ TargetPath A_loopField """ /s /i /y",, Hide
Alas I am stuck !


So I have tried to modify your Powershell script :
<#
  CTP.ps1
 
  Recursively convert *.mht to PDF.
#>
 
Function Get-Folder {
  Add-Type -AssemblyName System.Windows.Forms
  $FolderBrowser = New-Object System.Windows.Forms.FolderBrowserDialog
  [void]$FolderBrowser.ShowDialog()
  $temp = $FolderBrowser.SelectedPath
  If($temp -eq '') {Exit}
  If(-Not $temp.EndsWith('\')) {$temp = $temp + '\'}
  Return $temp
}
 
If($PSVersionTable.PSVersion.Major -lt 3) {
  Write-Host '** Script requires at least Powershell V3 **'
} else {
  Write-Host 'Choose folder with PDF files: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  $srcFolder = (Get-Folder)
  Write-Host $srcFolder
  Write-Host 'Choose output folder: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  Do {$dstFolder = (Get-Folder)} While($dstFolder -eq $srcFolder)
  Write-Host $dstFolder
 
  $aFiles = (Get-ChildItem -Include *.mht -Path ($srcFolder + "*") -Recurse)
  for($i = 0; $i -lt $aFiles.Count; $i++) {
    $inFile = [string]$aFiles[$i]
    Write-Host 'File:' $inFile -BackgroundColor DarkBlue -ForegroundColor Yellow
    $outFile = $dstFolder + $inFile.Replace($srcFolder, "") + '.pdf'
    $temp = Split-Path $outFile -Parent
    if (!(Test-Path $temp)) {
      New-Item $temp -ItemType Directory | Out-Null
    }
    $args = "`"$($infile)`" chrome --headless --print-to-pdf=`"$($outFile)`""
    Start-Process -FilePath "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" -Wait -NoNewWindow -ArgumentList $args
  }
}

See the "--print-to-pdf=" line. Alas, this doesn't work!


@IainB: I am saving mht files like you now (http://www.donationc....msg417446#msg417446). I no longer use Google Docs for uploaded files (I used to save htm files from Firefox, but I have stopped). I now mainly upload pdf files to Google Drive.


Many thanks in advance ;)
« Last Edit: March 16, 2019, 07:05 AM by jity2 »

4wd

Quote from jity2: I don't know if this is possible, but it would be great if the PowerShell script could skip htm and html files that are smaller than 3 KB.

You already have the answer, same idea as deleting the small PDFs.

Code: PowerShell
$aFiles = (Get-ChildItem -Include *.html,*.htm -Path ($srcFolder + "*") -Recurse | Where-Object {$_.Length -gt 3kb} )

As for Chrome, I couldn't get it to work from PowerShell either, I'll have another look when I have time.

But you do have the args wrong, should be:
Code: PowerShell
$args = "`"$($infile)`" --headless --print-to-pdf=`"$($outFile)`""

Just the arguments for the command, not the program name.

Could try adding a working directory also:

Code: PowerShell
Start-Process -FilePath "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" -Wait -NoNewWindow -ArgumentList $args -WorkingDirectory "C:\Program Files (x86)\Google\Chrome\Application"
« Last Edit: March 16, 2019, 08:30 AM by 4wd »

jity2

Wow this is fantastic 4wd ! ;)
For the second part, I have seen no visible difference when adding the working directory.

Many thanks. Much appreciated. ;)

So, as a summary:
To convert htm and html files to pdf, use this PowerShell script written by 4wd (if PowerShell is new to you, have a look here: http://www.donationc....msg399588#msg399588):
Code: PowerShell
<#
  CTP.ps1

  Recursively convert *.htm and *.html to PDF, skipping htm and html files of 3 KB or less.
#>

# Show a folder picker; exit the script if the dialog is cancelled.
Function Get-Folder {
  Add-Type -AssemblyName System.Windows.Forms
  $FolderBrowser = New-Object System.Windows.Forms.FolderBrowserDialog
  [void]$FolderBrowser.ShowDialog()
  $temp = $FolderBrowser.SelectedPath
  If($temp -eq '') {Exit}
  If(-Not $temp.EndsWith('\')) {$temp = $temp + '\'}
  Return $temp
}

If($PSVersionTable.PSVersion.Major -lt 3) {
  Write-Host '** Script requires at least Powershell V3 **'
} else {
  Write-Host 'Choose input folder: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  $srcFolder = (Get-Folder)
  Write-Host $srcFolder
  Write-Host 'Choose output folder: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  Do {$dstFolder = (Get-Folder)} While($dstFolder -eq $srcFolder)
  Write-Host $dstFolder

  # Collect every .html/.htm file bigger than 3 KB, in all subfolders
  $aFiles = (Get-ChildItem -Include *.html,*.htm -Path ($srcFolder + "*") -Recurse | Where-Object {$_.Length -gt 3kb} )
  for($i = 0; $i -lt $aFiles.Count; $i++) {
    $inFile = [string]$aFiles[$i]
    Write-Host 'File:' $inFile -BackgroundColor DarkBlue -ForegroundColor Yellow
    # Mirror the source subfolders under the output folder; '.pdf' is appended to the full file name
    $outFile = $dstFolder + $inFile.Replace($srcFolder, "") + '.pdf'
    $temp = Split-Path $outFile -Parent
    if (!(Test-Path $temp)) {
      New-Item $temp -ItemType Directory | Out-Null
    }
    # -p 127.0.0.1 sets a local proxy address, the trick used here to keep wkhtmltopdf offline
    $args = "`"$($infile)`" -p 127.0.0.1 `"$($outFile)`""
    Start-Process -FilePath ".\wkhtmltopdf.exe" -Wait -NoNewWindow -ArgumentList $args
  }
}


To convert mht files to pdf, use this PowerShell script written by 4wd (for mht files created with Google Chrome):
Code: PowerShell
<#
  CTP.ps1

  Recursively convert *.mht to PDF.
#>

# Show a folder picker; exit the script if the dialog is cancelled.
Function Get-Folder {
  Add-Type -AssemblyName System.Windows.Forms
  $FolderBrowser = New-Object System.Windows.Forms.FolderBrowserDialog
  [void]$FolderBrowser.ShowDialog()
  $temp = $FolderBrowser.SelectedPath
  If($temp -eq '') {Exit}
  If(-Not $temp.EndsWith('\')) {$temp = $temp + '\'}
  Return $temp
}

If($PSVersionTable.PSVersion.Major -lt 3) {
  Write-Host '** Script requires at least Powershell V3 **'
} else {
  Write-Host 'Choose input folder: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  $srcFolder = (Get-Folder)
  Write-Host $srcFolder
  Write-Host 'Choose output folder: ' -NoNewline -BackgroundColor DarkGreen -ForegroundColor White
  Do {$dstFolder = (Get-Folder)} While($dstFolder -eq $srcFolder)
  Write-Host $dstFolder

  # Collect every .mht file in all subfolders
  $aFiles = (Get-ChildItem -Include *.mht -Path ($srcFolder + "*") -Recurse)
  for($i = 0; $i -lt $aFiles.Count; $i++) {
    $inFile = [string]$aFiles[$i]
    Write-Host 'File:' $inFile -BackgroundColor DarkBlue -ForegroundColor Yellow
    # Mirror the source subfolders under the output folder; '.pdf' is appended to the full file name
    $outFile = $dstFolder + $inFile.Replace($srcFolder, "") + '.pdf'
    $temp = Split-Path $outFile -Parent
    if (!(Test-Path $temp)) {
      New-Item $temp -ItemType Directory | Out-Null
    }
    # Only the arguments go here; the program itself is given by -FilePath
    $args = "`"$($infile)`" --headless --print-to-pdf=`"$($outFile)`""
    Start-Process -FilePath "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" -Wait -NoNewWindow -ArgumentList $args -WorkingDirectory "C:\Program Files (x86)\Google\Chrome\Application"
  }
}


Many thanks again to 4wd for the great help. ;)
See ya

4wd

Quote from jity2: Many thanks again to 4wd for the great help. ;)

Thanks, edited my OP, now does the lot, (.html, .htm, .mht).

Updated

jity2

Wow! Many thanks 4wd. This is working like a charm! ;)

After several tests I realize that I have to close your script(s) each night, even if the job is not finished.
Note: wkhtmltopdf uses very little CPU and Chrome far more, so I run several copies of your script at the same time (especially for folders containing htm and html files).
The next day, when I start the script(s) again, it would be great if they could skip the conversion when the pdf file already exists in the destination folder.
Currently I have to dig through the output and move files and specific folders by hand, just to avoid spending a few hours getting back to where it stopped.

I have tried to insert this code at row #43:

Code: PowerShell
if {($inFile.Substring([Math]::Max($inFile.Length - 3, 0))) = $outFile
next i
}

After testing it, alas, this proves that... I am still not a coder!! ;)

Many thanks in advance ;)

4wd

Just need to test for the existence of the output file:

Code: PowerShell
if (!(Test-Path $outFile)) {
...
}

Added, try it now, (haven't tested it but it should work).
« Last Edit: March 17, 2019, 07:31 AM by 4wd »
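
The same resume logic can be sketched in Python for anyone scripting the conversion outside PowerShell. This is a minimal sketch with made-up names (`pending`, `min_bytes`). The extra size check is an assumption that goes beyond the Test-Path test above: it treats an empty pdf, such as one that might be left behind when a run is closed mid-conversion, as not done.

```python
from pathlib import Path


def pending(jobs, min_bytes=1):
    """Return the (source, target) pairs whose target PDF still has to be created."""
    todo = []
    for source, target in jobs:
        pdf = Path(target)
        # Counted as done only if the PDF exists and holds at least min_bytes bytes
        done = pdf.is_file() and pdf.stat().st_size >= min_bytes
        if not done:
            todo.append((source, target))
    return todo
```

Filtering the job list once before converting keeps a restarted run from redoing files that were finished the day before.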

jity2

I have tested your update and this is so great 4wd ! ;)
Many thanks. ;)

Inspired by https://stackoverflo...fore-printing-to-pdf I have added the following for Chrome (see row #49):
Code: PowerShell
$args = "`"$($inFile)`" --headless --run-all-compositor-stages-before-draw --virtual-time-budget=10000 --print-to-pdf=`"$($outFile)`""
For me it doesn't seem to change anything! Maybe this will help someone in the future... ;)

Thank you again ;)
See ya

skwire

Thanks, 4wd.  Moving thread to the Finished section.