DONE: Mass convert already locally saved html (+htm +mht) files to pdf

DonationCoder.com Software > Finished Programs

jity2:
Dear all,

I would like to convert the html files in my archives to pdf (text OCRed + locally saved related images included) so I can run keyword searches on them with Google Drive.
I need the related images saved with each html file (usually in a companion folder) to be included, as well as the URLs present in the html. I would prefer the tool to work in an offline mode, since all the information I have is already saved in the local html files, so that it doesn't spend time trying to fetch every missing URL.

My archives cover about 20 years (many thousands of files). They were mostly saved with Internet Explorer (Maxthon) for a few years, then mostly with Firefox, and now with Google Chrome (and httrack).

The idea: I choose one big folder "Source" (usually a month's archives). The tool scans on its own for all the html, htm and mht files in all the subfolders, then creates all the converted pdf files in a big folder "Target", with the original names and in the same subfolders.
Example :
C:\Source\2009\2019-04\15\2009_04_15_075256.html
...
C:\Target\2009\2019-04\15\2009_04_15_075256.pdf
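
The Source-to-Target mapping in this example can be sketched in a few lines. This is a minimal Python sketch, not part of the AHK script discussed in this thread. The names `EXTENSIONS`, `is_convertible` and `target_pdf_path` are made up for illustration, and `PureWindowsPath` is used only so that the Windows paths above can be checked on any operating system.

```python
from pathlib import PureWindowsPath

# Extensions named in the request; compared case-insensitively.
EXTENSIONS = {".html", ".htm", ".mht"}


def is_convertible(path):
    """True if the file has one of the extensions listed in the request."""
    return PureWindowsPath(path).suffix.lower() in EXTENSIONS


def target_pdf_path(src, source_root, target_root):
    """Mirror src under target_root, keeping subfolders and swapping the suffix."""
    relative = PureWindowsPath(src).relative_to(PureWindowsPath(source_root))
    return str(PureWindowsPath(target_root) / relative.with_suffix(".pdf"))


example = target_pdf_path(
    r"C:\Source\2009\2019-04\15\2009_04_15_075256.html",
    r"C:\Source",
    r"C:\Target",
)
```

Because the relative subfolder path is kept, two files with the same name in different subfolders end up in different Target subfolders instead of overwriting each other.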

I have tried wkhtmltopdf, which is based on WebKit (like Safari; see https://github.com/wkhtmltopdf/wkhtmltopdf/issues/3163), with the following script (based on an old AHK script found here):

    ; Folder that contains wkhtmltopdf.exe
    WorKingDir := "C:\prog\wkhtmltox\bin"
    pdParams := "wkhtmltopdf.exe "

    ; Ask for the source folder; quit if the dialog is cancelled
    FileSelectFolder,SourcePath,,0,Select Source Folder
    If SourcePath =
        ExitApp

    ; Ask for the target folder, starting the dialog at the source folder
    FileSelectFolder,TargetPath,*%SourcePath%,0,Select Target Folder
    If TargetPath =
        ExitApp

    If (SourcePath==TargetPath){
        Msgbox 0x40000,, % "SourcePath and TargetPath cant be the same " TargetPath
        ExitApp
    }

    ; Visit every matching file under the source folder, recursively (R)
    Loop, Files, %SourcePath%\*.htm, R
    {
        SplitPath, A_LoopFileFullPath, , , , OutNameNoExt
        ; Quote both paths; every pdf is written directly into TargetPath
        pCmd := pdParams """" A_LoopFileFullPath """" " " """" TargetPath "\" OutNameNoExt "_.pdf" """"
        RunWait % comspec " /c " pCmd , % WorKingDir
    }
    ExitApp

The results seem OK, but here are the problems I have found:
- The output folders are not created: wkhtmltopdf puts all the pdf files it creates directly into the Target folder, without subfolders. This causes problems when several html files have the same name, because wkhtmltopdf overwrites them!
Feature request: create output folders if necessary
https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2421

- Sometimes it creates small, unnecessary pdf files (2 KB!). I can delete them later. I think they appear because saving a web page with CTRL+S also creates some small htm files in the companion folder.
Example:
C:\A\save01.html
C:\A\save01\image.gif
C:\A\save01\image.htm
...
So in fact this is normal and OK! ;)

- I have tried to implement the following trick :
offline mode: does not try to look for missing component online for locally saved html files
https://github.com/wkhtmltopdf/wkhtmltopdf/issues/3294

I replaced this line of code:

    pdParams := "wkhtmltopdf.exe "

with:

    pdParams := "wkhtmltopdf.exe --proxy=http://127.0.0.1:0 "

Alas, it didn't work: it still slowly tries to fetch the missing URLs online.

- There is no silent mode: the flashing cmd windows keep me from working on my computer while it runs!
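
Two of the problems listed above (the missing output subfolders and the flashing console windows) can be illustrated with a short sketch. It is written in Python rather than AHK, and it is only an assumption-laden outline, not a tested replacement for the script: `build_job`, `build_jobs` and `run_hidden` are hypothetical names, `wkhtmltopdf` is assumed to be on the PATH, and only the plain `wkhtmltopdf <input> <output>` form of the command is used. It does not address the offline-mode problem.

```python
import subprocess
from pathlib import Path

# CREATE_NO_WINDOW is only defined on Windows (Python 3.7+);
# 0 means "no special flags" on other systems.
NO_WINDOW = getattr(subprocess, "CREATE_NO_WINDOW", 0)


def build_job(src, source_root, target_root, exe="wkhtmltopdf"):
    """Create the output subfolder for src and return the command to run."""
    relative = Path(src).relative_to(source_root)
    dst = (Path(target_root) / relative).with_suffix(".pdf")
    dst.parent.mkdir(parents=True, exist_ok=True)  # mirror the source subfolder
    return [exe, str(src), str(dst)]


def build_jobs(source_root, target_root, extensions=(".htm", ".html")):
    """Return one command per matching file found anywhere under source_root."""
    return [
        build_job(src, source_root, target_root)
        for src in sorted(Path(source_root).rglob("*"))
        if src.is_file() and src.suffix.lower() in extensions
    ]


def run_hidden(cmd):
    """Run cmd without opening a console window and return its exit code."""
    done = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        creationflags=NO_WINDOW,
    )
    return done.returncode
```

A run would then look like `for cmd in build_jobs(r"C:\Source", r"C:\Target"): run_hidden(cmd)`. Passing the command as a list avoids the manual quoting done in the AHK script. Within AHK itself, `RunWait` accepts `Hide` as its third parameter (for example `RunWait % comspec " /c " pCmd, % WorKingDir, Hide`), which should suppress the cmd windows in a similar way.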

If I am not using the right tool, I'd be pleased to try other ideas (maybe based on other rendering engines?). ;)
I realize that no method is error-free, so text or images may not be rendered perfectly every time, but as long as I get most of them it will be fine. ;)

Many thanks in advance ;)
Jity2

Win 8.1 64bits home

Shades:
There is a piece of software called Pandoc.

It converts many text-based formats to other text-based formats, and HTML to PDF is one of them. It is available for all the major operating systems. It is freeware and really good at what it does. However, it is a command-line tool, which immediately makes it software non grata to some. A manual is included, and you will likely need to install GhostScript (also freeware for all major OSes) for PDF output. The number of parameters you can adjust is staggering, and while that may frighten you a bit, the default values have worked well for me on the occasions that I used Pandoc.

Both the offline and the online documentation are easy to follow. Because it is a command-line tool, you can script it to go through your whole collection automatically, even at timed intervals if you want that as well.
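
As a rough illustration of the scripting idea, here is a minimal Python sketch that only builds the Pandoc command line for one file. The function name `pandoc_command` and the example path are hypothetical. `-f`, `-o` and `--pdf-engine` are regular Pandoc options, but whether Pandoc renders a given saved page cleanly is a separate question that this sketch does not answer.

```python
from pathlib import PureWindowsPath


def pandoc_command(src, pdf_engine=None):
    """Build the argument list to convert one saved HTML file to a PDF next to it."""
    source = PureWindowsPath(src)
    cmd = ["pandoc", "-f", "html", str(source), "-o", str(source.with_suffix(".pdf"))]
    if pdf_engine is not None:
        # Optional: name the PDF engine explicitly instead of using Pandoc's default.
        cmd.append(f"--pdf-engine={pdf_engine}")
    return cmd


# Hypothetical path, for illustration only.
example = pandoc_command(r"C:\docs\page.htm", pdf_engine="xelatex")
```

A script would call this once per file found in the collection and hand each list to a process runner such as `subprocess.run`.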

tomos:
@jity2 you mention OCR - what for? Are there some images with text involved?

jity2:
Hi,
Thanks Shades. I am trying Pandoc right now!

@Tomos: It is just that if there is text content inside the html, it should stay real text in the pdf file, so that it can be read (or later indexed by Google Drive).
There is no need for a program to OCR the image files contained in the html files.
I am not sure I am being clear! But I don't want an image-only pdf file as a result.

Thanks in advance ;)

jity2:
I did some Pandoc tests but I keep getting the same error :

Code example:

    pandoc C:\prog\pandoc\pandoc-2.7.1-windows-x86_64\t5\20170611_074645.htm -t latex --pdf-engine=xelatex -s -o C:\prog\pandoc\pandoc-2.7.1-windows-x86_64\t5\20170611_074645.pdf

Error: (the error output is missing from this copy of the thread)
I did some Google searches but I am stuck!

Thanks in advance ;)
