Dear all,
After some tests: I will keep using ScreenGrab and save mht and (scrolled) png files locally. When I know in advance that I need the maximum data from a webpage, I will use the Google Chrome add-ons "Save Page WE" with all options and/or "WARCreate".
Then I use DtSearch to run keyword searches inside them.
For Google Drive (GD), the only correct option for me right now is to convert mht files into Docx files. But I am not sure that I want to spend time doing this each month. It would be much easier if Google Drive indexed the content of mht files like it does for html files. I'll keep asking them, but I am not very optimistic!
Mht files are, after all, similar to email files. See this interesting old link for several file format options:
"What's the best “file format” for saving complete web pages (images, etc.) in a single archive?"
https://stackoverflo...ages-images-etc-in-a
Also:
https://gsuite-devel...hable-in-google.html
https://developers.g...ference/files/insert
https://developers.g...dk#drive_integration
More generally:
https://support.goog...s/answer/35287?hl=en
https://www.google.c...ts/file_formats.html
Note: html files saved with "Save Page WE" in GD: html files are indexed, but only in the form you see when you preview an html file. The result is awful, as if it was read with a simple text reader. So if you search for two keywords that have some html code garbage between them, GD won't find them. The strange thing is that the thumbnail of those html files is displayed correctly, with images, if you click on "i" (info). It appears on the right part of your screen and displays only the top part of the html. IMHO this thumbnail is using some kind of sandbox? (I thought that GD didn't display html correctly because they were afraid of some JavaScript worm, or something similar, that would cause damage to GD?)
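Since an mht file is structured like an email (a MIME message with the html and images as parts), its text can in principle be pulled out with an ordinary MIME parser, without going through Word. Below is a minimal sketch in Python using only the standard library. The function name mht_to_text and the helper class are my own, for illustration only, and I have not checked it against pages saved by every browser or add-on.

```python
from email import message_from_bytes, policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def mht_to_text(raw: bytes) -> str:
    """Return the plain text of every text/html part in an mht file (hypothetical helper)."""
    msg = message_from_bytes(raw, policy=policy.default)
    texts = []
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            extractor = _TextExtractor()
            # get_content() undoes quoted-printable/base64 and decodes the charset
            extractor.feed(part.get_content())
            extractor.close()
            texts.append(" ".join(extractor.chunks))
    return "\n".join(texts)
```

Typical use would be something like mht_to_text(open("page.mht", "rb").read()), where page.mht is a placeholder file name. The resulting text could then be saved as a txt file next to the mht file for keyword searches.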
From mht to Docx:
I use a Word macro [ALT+F11] to convert mht files into Docx with Word 2013 (it can also convert them into non-OCRed pdf, etc.).
Here is the code (copied/adapted from http://muzso.hu/2013...-in-microsoft-office ):
Option Explicit
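' PtrSafe is required on 64-bit Office (VBA7). Sleep is only used by the commented-out image-embedding code below.
#If VBA7 Then
Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If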
Sub ConvertDocs()
Dim fs As Object
Dim oFolder As Object
Dim tFolder As Object
Dim oFile As Object
Dim strDocName As String
Dim intPos As Integer
Dim locFolder As String
Dim fileType As String
Dim office2007 As Boolean
Dim lf As LinkFormat
Dim oField As Field
Dim oIShape As InlineShape
Dim oShape As Shape
On Error Resume Next
locFolder = InputBox("Enter the path to the folder with the documents to be converted", "File Conversion", "C:\Users\your_username\Documents\mht_to_docx_results\")
' If Application.Version >= 12 Then
' office2007 = True
Do
fileType = UCase(InputBox("Enter one of the following formats (to convert to): TXT, RTF, HTML, DOC, DOCX or PDF", "File Conversion", "DOCX"))
'Loop Until (fileType = "HTML") 'fileType = "TXT" Or fileType = "RTF" Or Or fileType = "PDF" Or fileType = "DOC" Or fileType = "DOCX"
Loop Until (fileType = "TXT" Or fileType = "RTF" Or fileType = "HTML" Or fileType = "PDF" Or fileType = "DOC" Or fileType = "DOCX")
' Else
' office2007 = False
' Do
' fileType = UCase(InputBox("Enter one of the following formats (to convert to): TXT, RTF, HTML or DOC", "File Conversion", "TXT"))
' Loop Until (fileType = "TXT" Or fileType = "RTF" Or fileType = "HTML" Or fileType = "DOC")
'End Select
' End If
Application.ScreenUpdating = False
Set fs = CreateObject("Scripting.FileSystemObject")
Set oFolder = fs.GetFolder(locFolder)
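' Create the "Converted" subfolder only if it does not exist yet (BuildPath copes with a missing trailing backslash)
If Not fs.FolderExists(fs.BuildPath(locFolder, "Converted")) Then fs.CreateFolder fs.BuildPath(locFolder, "Converted")
Set tFolder = fs.GetFolder(fs.BuildPath(locFolder, "Converted"))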
For Each oFile In oFolder.Files
Dim d As Document
Set d = Application.Documents.Open(oFile.Path)
' put the document into print view
' If fileType = "RTF" Or fileType = "DOC" Or fileType = "DOCX" Then
' With ActiveWindow.View
' .ReadingLayout = False
' .Type = wdPrintView
' End With
' End If
' try to embed linked images from fields, shapes and inline shapes into the document
' (for some reason this does not work for all images in all HTML files I've tested)
' If Not fileType = "HTML" Then
' For Each oField In d.Fields
' Set lf = oField.LinkFormat
' If oField.Type = wdFieldIncludePicture And Not lf Is Nothing And Not lf.SavePictureWithDocument Then
' lf.SavePictureWithDocument = True
' Sleep (2000)
' lf.BreakLink()
' d.UndoClear()
' End If
' Next
' For Each oShape In d.Shapes
' Set lf = oShape.LinkFormat
' If Not lf Is Nothing And Not lf.SavePictureWithDocument Then
' lf.SavePictureWithDocument = True
' Sleep (2000)
' lf.BreakLink()
' d.UndoClear()
' End If
' Next
' For Each oIShape In d.InlineShapes
' Set lf = oIShape.LinkFormat
' If Not lf Is Nothing And Not lf.SavePictureWithDocument Then
' lf.SavePictureWithDocument = True
' Sleep (2000)
' lf.BreakLink()
' d.UndoClear()
' End If
' Next
' End If
strDocName = d.Name
intPos = InStrRev(strDocName, ".")
strDocName = Left(strDocName, intPos - 1)
ChangeFileOpenDirectory (tFolder)
' Check out these links for a comprehensive list of supported file formats and format constants:
' http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.wdsaveformat.aspx
' http://msdn.microsoft.com/en-us/library/office/bb238158.aspx
' (In the latter list you can see the values that the constants are associated with.
' Office 2003 only supported values up to wdFormatXML(=11). Values from wdFormatXMLDocument(=12)
' til wdFormatDocumentDefault(=16) were added in Office 2007, and wdFormatPDF(=17) and wdFormatXPS(=18)
' were added in Office 2007 SP2. Office 2010 added the various wdFormatFlatXML* formats and wdFormatOpenDocumentText.)
' If Not office2007 And fileType = "DOCX" Then
' fileType = "DOC"
' End If
Select Case fileType
Case Is = "TXT"
strDocName = strDocName & ".txt"
d.SaveAs FileName:=strDocName, FileFormat:=wdFormatText
'ActiveDocument.SaveAs FileName:=strDocName, FileFormat:=wdFormatText
Case Is = "RTF"
strDocName = strDocName & ".rtf"
d.SaveAs FileName:=strDocName, FileFormat:=wdFormatRTF
Case Is = "HTML"
strDocName = strDocName & ".html"
d.SaveAs FileName:=strDocName, FileFormat:=wdFormatFilteredHTML
Case Is = "DOC"
strDocName = strDocName & ".doc"
d.SaveAs FileName:=strDocName, FileFormat:=wdFormatDocument
Case Is = "DOCX"
strDocName = strDocName & ".docx"
' wdFormatDocumentDefault requires Word 2007 or later
d.SaveAs FileName:=strDocName, FileFormat:=wdFormatDocumentDefault
Case Is = "PDF"
strDocName = strDocName & ".pdf"
' ExportAsFixedFormat with wdExportFormatPDF requires Word 2007 SP2 or later
d.ExportAsFixedFormat OutputFileName:=strDocName, ExportFormat:=wdExportFormatPDF
End Select
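' Close without prompting; the converted copy has already been written
d.Close SaveChanges:=wdDoNotSaveChanges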
ChangeFileOpenDirectory (oFolder)
Next oFile
Application.ScreenUpdating = True
End Sub
This works OK. The Docx files are about half the size of the mht files!
But:
- It crashed on mht files with video links inside them!
- Alas, Word tends to freeze my computer while converting some files!
- Sometimes wide webpages are cut off (so not all the images are displayed properly)!
- Be careful: the folder must contain only mht files (no other kinds of files) before you convert them to docx, otherwise the Word macro won't work!
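For the last point, a quick pre-check of the folder before launching the macro can save a failed run. This is only a sketch in Python (standard library only). The function name split_mht_files is made up for illustration, and treating ".mhtml" as equivalent to ".mht" is my assumption.

```python
from pathlib import Path


def split_mht_files(folder, extensions=(".mht", ".mhtml")):
    """Return (mht_names, other_names) for the files directly inside folder.

    Subfolders (such as the macro's "Converted" folder) are ignored.
    """
    mht, others = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        if path.suffix.lower() in extensions:
            mht.append(path.name)
        else:
            others.append(path.name)
    return mht, others
```

If the second list is not empty, those files should be moved elsewhere before running the conversion.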
A long time ago I also tested this batch software:
https://www.coolutil...m/TotalHTMLConverter
https://www.coolutil...om/online/MHT-to-DOC
but it was too slow and converted mht into doc, not docx. Not all images from the html files were added to the doc files...
(I also tested this option, but it didn't work well:
From png to pdf:
- It seems to work with a trial version of Nuance Power PDF Advanced, but the OCR result is not very good (to check, open the OCRed pdf file and save it as a txt file)!
- Acrobat Pro 8: in two passes, one to create the pdf file and a second to do the OCR. Alas, OCR is not always possible because the 45”x45” size limit is reached (even for some png files with 200”x200” with this trick: https://acrobatusers...at-topics/ocr-error/ ). ;( )
In GD:
Word files are indexed. Limit: 1 million characters.
PNG files are indexed only if they are smaller than 2 MB, and only the title is OCRed if the saved page is too wide.
(Big pdf files are also not indexed in full. From memory it is something like: at most the first 100 pages of OCRed pdf files and at most the first 10 pages of non-OCRed files. There may also be a 50 MB indexing limit. For bigger files you can still preview the file and run a keyword search with CTRL+F, and it will find the correct pages in the full pdf file. But the full content of such a pdf won't be indexed automatically by GD.)
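To spot files that would hit these limits before uploading, something like the following Python sketch could be used. The thresholds are simply the figures quoted above (partly from memory), not official Google numbers. Reading "2 MB" as 2 * 1024 * 1024 bytes is my assumption, and the function name gd_indexing_warning is hypothetical.

```python
# Limits as reported in this post; treat them as approximate.
PNG_MAX_BYTES = 2 * 1024 * 1024   # png: indexed only if smaller than about 2 MB
WORD_MAX_CHARS = 1_000_000        # Word files: about 1 million characters


def gd_indexing_warning(name, size_bytes, char_count=None):
    """Return a warning string if a reported GD limit is hit, else None."""
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    if ext == "png" and size_bytes >= PNG_MAX_BYTES:
        return "png is 2 MB or larger: probably not indexed"
    if ext in ("doc", "docx") and char_count is not None and char_count > WORD_MAX_CHARS:
        return "over the reported 1 million character limit"
    if ext in ("mht", "mhtml"):
        return "mht content is not indexed: convert it first"
    return None
```

For example, gd_indexing_warning("page.png", 3 * 1024 * 1024) returns the png warning, while a small png returns None.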
Note: mht files can be neither previewed nor indexed in OneDrive!
The only remaining option I see would be for ScreenGrab to save the webpage as an OCRed pdf file (like some print screen drivers) + mht. That way it would work natively in GD for most files.
But I am sure this is technically doable !
Please let me know if you have other ideas !