Author Topic: [REQ] Web Extension : Silently save complete html + screenshot then close tab  (Read 7928 times)

jity2

  • Charter Member
  • Joined in 2006
  • Posts: 126
Dear all,

Main idea:
With one click on a button, the Web Extension (in Google Chrome or Firefox) silently saves the complete webpage as html and as a complete screenshot. Then it silently closes the tab.

Process in detail:
1) Save the complete webpage as if I had used the shortcut [CTRL+S], and name the page MONTH_DAY_MIN_SEC.html, plus its related folder ("htmlfilename_files", containing small images etc.).
All of that without prompting me for a filename and/or directory.

2) Save a complete screenshot of the page
This can be the equivalent of this extension : ScreenGrab "Save Complete page" :
https://chrome.googl...kihagkjmaomipdeegbpk
(or something like this trick: [CTRL+SHIFT+I], then [CTRL+SHIFT+P], then type "Capture full size screenshot" (see https://zapier.com/b...reenshots-in-chrome/). In my tests this doesn't always capture the whole page, but it may be better than nothing!)

3) close the tab
Like when I use [CTRL+W]
(or a webextension like https://chrome.googl...midmoifolgpjkmhdmalf )


Optional :
- the button can be a standard icon or a bookmark button. It can change color when the saving process is running.
- it would be great if the saved content were saved in a different folder each day (like: C:\NEWS\MONTH_DAY\MONTH_DAY_MIN_SEC.html + folder: C:\NEWS\MONTH_DAY\MONTH_DAY_MIN_SEC_files ...)
- the saved html webpage can be replaced by "Save Page WE" https://chrome.googl...geafimnjhojgjamoafof
but it creates only one html file containing the images, a little like a .mht file. It works ok but probably not 100% on all pages.
- save a WARC file of the webpage ("WARC is now recognized by most national library systems as the standard to follow for web archival" https://en.wikipedia...org/wiki/Web_ARChive). Example : WARCreate https://chrome.googl...lcbmckhiljgaabnpcaaa
- Force closing the tab if after 10 seconds not everything is saved ?
- maybe saving the screenshot first is faster than saving html first ?
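
The naming scheme from the optional points above (one folder per day, MONTH_DAY_MIN_SEC file names) can be sketched as follows. This is only an illustration in Python, not extension code: the function name `archive_paths` is made up, the root `C:\NEWS` is taken from the example path, and the `.png` name for the screenshot is an assumption.

```python
from datetime import datetime
from pathlib import PureWindowsPath


def archive_paths(now: datetime, root: str = r"C:\NEWS"):
    """Return (html_file, files_folder, screenshot) paths for one saved page."""
    day = now.strftime("%m_%d")            # MONTH_DAY, one folder per day
    stamp = now.strftime("%m_%d_%M_%S")    # MONTH_DAY_MIN_SEC
    folder = PureWindowsPath(root) / day
    html_file = folder / f"{stamp}.html"
    files_folder = folder / f"{stamp}_files"
    screenshot = folder / f"{stamp}.png"   # assumed name for the screenshot
    return html_file, files_folder, screenshot


if __name__ == "__main__":
    for path in archive_paths(datetime.now()):
        print(path)
```

Note that the pattern has no hour field, so two pages saved at the same minute and second of different hours on the same day would get the same name; adding `%H` to the format string would avoid that.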


Why am I asking for this ?
With the new Firefox Quantum released a few months ago, I am forced to change my habits. I save a lot of pages daily with one click on a bookmark, and I am very happy with it. But I currently use outdated versions of Firefox and the iMacros add-on in order to save webpages very fast. See https://www.reddit.c...d_web_pages/dvbnycf/ and https://www.donation...ex.php?topic=23782.0
Previously I used a FireGestures shortcut to save webpages. This resulted in too many saved webpages for my taste! So now I prefer one click on a bookmark in the top middle of my screen!
But why save the same webpage in different formats?
Because I realized that sometimes the html page is not saved properly. When I try to open it many years later with the current browser of the day, it does not display well because of some small inserted urls or something! Too bad for me, as there is then a small missing image that I either have to dig out of the related html folder or that was not saved at all for some reason!
So being able to see a screenshot might not be a bad idea (and in Windows Explorer I can also display screenshots as 'extra large icons'). ;)
Furthermore, years later, I now use DtSearch locally for indexing all my content, and online Google Drive, which automatically does OCR on screenshots (even if it is far from perfect!) ;)


Many thanks in advance ;)
« Last Edit: March 13, 2018, 12:32 PM by jity2 »

sphere

  • Participant
  • Joined in 2018
  • Posts: 176
I cannot help with your broader question, however I was hit with a similar issue when the Firefox Scrapbook add-on I was using to save pages was killed by the Quantum upgrade. I have been using Zotero, a research citation manager, that might meet some of your requirements. It downloads with a single click and has the ability to sort and search based on creation date. I found out that Zotero used the same engine as the Scrapbook plug-in to create webpage snapshots. I had been happy with the quality of those snapshots.

It used to be that Zotero kind of lived in the browser as a plugin. However, now it is a standalone application (database) with plugins (it calls them connectors) located in the browser. As you surf, the Zotero plugin attempts to categorize the type of site you are on. The icon it displays will change based on how it categorizes the webpage. When you press the icon, it automatically creates a snapshot of the page AND does its best to pull the information that would be needed if you were citing the page in a paper. A small pop-up in the right corner of the screen shows when it has started and when it has finished. If it is a pdf, it will download the pdf and try to pull all the relevant information from the pdf. Zotero indexes all of the web pages and pdfs so they are searchable. You can also navigate to the folder where all the saved web pages (or snapshots as they are called) are located. These are in MHTML format.

While the file name will not have the date created in it as you would like, you have the file created date. In fact Zotero has an interesting "timeline" tool where you can organize your saved links (citations) as a timeline using various variables, including I believe the date created. https://www.zotero.o..._tutorials/timelines

At first, I was a little apprehensive about using Zotero because I do not want another application in my workflow. However, it has been open source for quite some time, it has a big following, and I have enjoyed the interface.  It also has an export feature I could see being useful should I need it.

 
Hope this helps

IainB

  • Supporting Member
  • Joined in 2008
  • Posts: 7,544
  • @Slartibartfarst
    Reading over this page today, I wondered whether the requirement as stated was not obsolete.
    I mean, I have had much the same requirements/problems as described in the OP and subsequent comments, but they have largely all gone away.
    For example -
On the subject of webpage copying:
...
  • I have searched the Internet for years to find something that makes a decent copy of web pages, for archive/library reference purposes.
    The best I had found was the Firefox add-on Scrapbook.
    More recently, I found Zotero.
    Nothing else seems to come close.
    They are both very good indeed.
    Then I discovered that they both use the same engine: WebPageDump <http://www.dbai.tuwien.ac.at/user/pollak/webpagedump/>
    (Details partially copied below and with just the download file embedded hyperlinks.)

  • WizNote: Superb web page copying (and editable too), but I had the same qualms as you re Cloud-based and security, so do not use it. By the way, I did a review of WizNote on the DC Forum:
     WizNote (a PIM from China) - Mini-Review + Provisional User Forum.

  • See also comments here: Re: flamory <https://www.donationcoder.com/forum/index.php?topic=41879.msg392045#msg392045>

  • OneNote is not very good for webpage capture/viewing, but is good for partial webpage clips.

  • Wezinc could have been almost exactly what I was looking for as a PIM, though its performance for webpage copying was unreliable - it would occasionally miss some parts of pages. It also did rather nifty relationship mind-maps. I was a Beta tester for Wezinc and was disappointed when the developer seemed to just shut down without notice. Maybe he was ill/died. If you wanted it, I have the last Beta version that he provided to me. It was never published on his website, so it'd not be in Wayback.

  • Zotero remains current and supported and thus arguably the "best" proprietary option left by default.

  • .mhtml copies of webpages are probably the most "open" and non-proprietary way to go at present, under the circumstances, so that is where I have gone. This has the advantage of providing webpage copies that are self-contained single files that are readable by various browsers and able to be indexed by WDS (Windows Desktop Search) and GDS (Google Desktop Search) - the latter still being fully functional and best-in-class by default. Web pages saved by Scrapbook are also indexable by WDS/GDS, and are easily viewable if one goes to the index.html file for each Scrapbook page. This is of course a pain, but I have not yet figured out a way to batch convert the thousands of Scrapbooked web pages I have in my library (together with their nested lower levels and any embedded files).

Ergonomics and efficiency: Importantly, saving an .mhtml file with a preferred prefix - e.g., (say) 2018-06-02 1423hrs - is a trivial matter with AHK (AutoHotkey) or another macro/scripting tool, and saving files to the .mhtml standard is generally non-proprietary and thus not necessarily restricted to any given browser or browser extension. It can all be done rapidly, including closing the tab being copied.

Webpage image capture: This also is trivial, using various client-based tools - e.g., Screengrab - and even cloud-based services - e.g., ScreenshotGuru. However, though images (screenshots) of webpages and OCR/indexing of same was also (of course) a consideration for me, my mandatory requirements included the capture of all the associated data and metadata that is embedded in a webpage, so an image was never really likely to be a useful option for my needs - even as a belts-and-braces approach - and, furthermore, it would require storage space for essentially superfluous/redundant images.

By the way:
  • The reason I mentioned saving webpages as .mhtml files is that there is no longer any need for Scrapbook or Zotero's proprietary access to those webpages. They should be able to be read by any half-decent file browser.

  • The reason I mentioned WDS (Windows Desktop Search) and GDS (Google Desktop Search) is that they can search the content of these files, so the files don't need to be kept in a proprietary database or database format. The disk thus becomes the database. There are some tools out there that work in a collaborative fashion with that database - one notable example is Folder Viewer from MatirSoft - <http://www.matirsoft.com/>, which I am currently trying to get my head around. It is not a toy. It incorporates/collaborates with the IE browser (as far as I understand it, at present) and various file viewers. It also collaborates with GDS, if that is detected on installation (I have it installed). GDS is arguably still the best desktop search tool on the planet. Folder Viewer does lots of other stuff as well. This seems to make it potentially a most promising data management and inspection tool. Not sure yet. Still trialling it (it is $FREE).

  • Scrapbook --> Zotero migration may be feasible.
    I have not yet trialled this, but here are some really helpful discussion notes from someone in an offline discussion:
    Migrate Scrapbook files to Zotero:
    ...I was installing Zotero in preparation of importing my Scrapbook files from Firefox.  I found the following tool that you might be interested in:
    <https://forums.zotero.org/discussion/70812/scrapbook-scrapbook-x-to-zotero-migration-tool>

    I have to do a little brush up and check my system before running the python script, but it looks promising to get the Scrapbook files in a system that is still actively updated. Prior to this I had needed to use a portable installation of Firefox that was still able to run Scrapbook.  That instance of Firefox could NOT run at the same time as my desktop application. I am sharing as I imagine you might also have been left in the lurch some with what happened with Scrapbook and my hope is this might help you.

    Later: ...Hey I wanted to follow up about the Scrapbook to Zotero script.  I had to figure out the syntax, which I copied below.

    When I ran the script, there were about 30 orphan files where I got: "ERROR: failed to export 201702021... entry (#1261), no index.html Try to inspect directory C:  user......  to decide what to do about orphan files."
    Looking at some of those files, it does appear that something is wrong with the index files. Either missing or duplicated. My guess is that something probably went wrong when Scrapbook originally tried "downloading" those pages.

    After running the script I was provided with a tmp.rdf file to use to import into Zotero.  There was no data file accompanying the rdf file like there is when you export from Zotero.

    When I imported using this file, Zotero imported the urls into a "tmp" collection, with tags that came from Scrapbook's folder system (very nice to bring over some of the organizational structure), BUT it did not import Scrapbook's snapshots of the websites. I am not sure whether the script did not finish running (because of the errors above) or whether it is not designed to import the snapshots. For me, being able to search the snapshots' text is very important. I am going to try to go through and move the orphan files, then run it again and see if it creates a data file.

    I am not sure how well versed you are with python scripts, but just in case, here is the syntax for the script.
    scrapbook2zotero.exe "C:\Users\Your username\AppData\Roaming\Mozilla\Firefox\[Your Firefox profile]\Scrapbook" tmp.rdf

    You will need to make sure the path is the same for your installation of Firefox. The destination folder is the folder that the script is located in by default.
    ____________________________________

    Bookmarks - export from Firefox to Zotero:
    I have decided to also look into importing my bookmarks into zotero in order to have back up of those. To that end I have found this tool which might be of interest to you:
    <https://forums.zotero.org/discussion/160/import-firefox-bookmarks-file-as-zotero-collection/p3>
    ____________________________________
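
The "no index.html" errors in the migration notes above could be looked for before running the script. A minimal Python sketch, assuming that every Scrapbook entry is a subfolder of one data folder and should contain an `index.html`; that layout and the function name `find_orphans` are assumptions based on the notes, not something taken from the migration tool itself.

```python
from pathlib import Path


def find_orphans(data_dir):
    """List entry folders under data_dir that have no index.html."""
    orphans = []
    for entry in sorted(Path(data_dir).iterdir()):
        # Only subfolders are treated as Scrapbook entries (assumption).
        if entry.is_dir() and not (entry / "index.html").is_file():
            orphans.append(entry.name)
    return orphans
```

Calling `find_orphans` on the Scrapbook data folder returns the folder names to inspect by hand; it does not move or delete anything.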

jity2

Dear all,

Thanks for the comments.

Main idea:
With one click on a button, the Web Extension (in Google Chrome or Firefox) silently saves the complete webpage as html and as a complete screenshot. Then it silently closes the tab.

Fortunately for me, the author of ScreenGrab (https://chrome.googl...jmaomipdeegbpk?hl=en) kindly helped me and released a new version that can take a screenshot, save the webpage as an .mht file and close the tab, but only when I use a shortcut (like CTRL+Q for instance). I encourage you to make a donation to him, like I did. ;)

Notes:
- This works only in Google Chrome and not in Firefox.
- The file is named YEAR_DAY_TIME_.mht (only one file instead of several files. In the past, after a few million files, I hit some kind of limit - Windows or NTFS?? - which prevented me from adding more files to the partition where I had unzipped my files. This happened sooner when my html filenames were longer than just YEAR-DAY_TIME.html + its associated folder).
- The mht results can sometimes contain less, the same or more (yes, I didn't think it was possible!) information than the standard CTRL+S.
- The mht files are displayed better in Google Chrome than in Internet Explorer.
- I can index the mht files fine locally with DtSearch.
- Alas, online I can't search inside mht files with Google Drive (it doesn't index mht files. There is an external viewer but I am reluctant to use it.)
I am in the process of deciding what to do: convert all my old html files and new mht files to .docx or pdf (with Word or ..). I also noticed that the png (or jpg) files were not correctly OCRed in Google Drive (often only the title of a test article webpage is in fact OCRed). The OCR is ok only if I manually convert the png file to a Google Document (which doesn't count towards your Google Drive storage. But be careful if you have more than 1 million files in it, as its GUI will start to slow down - G-Suite is far more powerful). Surprisingly, previewing a docx in Google Drive and making a keyword search is also faster than doing the same with a Google Document.
- The mht and the png files from ScreenGrab are stored in "C:\Users\{username}\Downloads\ScreenGrab". If you need, like me, the data elsewhere you can use Syncback.



« Last Edit: June 02, 2018, 06:49 AM by jity2, Reason: Added more info in the last 2 paragraphs »

IainB

@jity2: Thanks for the feedback. Please keep us posted as to how you get on.

jity2

Dear all,

After some tests:
I will keep using ScreenGrab and save mht and (scrolled) png files locally. And when I know in advance that I need the maximum data of a webpage, I will use the Google Chrome addons "Save Page WE" with all options and/or "WARCreate".
Then I use DtSearch to make keyword searches inside them.

For Google Drive (GD), the only correct option for me right now is to convert the mht files into docx files. But I am not sure that I want to spend time doing this each month. It would be much easier if Google Drive indexed the content of mht files like it does for html files.
I'll keep asking them but I am not very optimistic!
Mht files are after all similar to email files. See this interesting old link for several file format options: "What's the best “file format” for saving complete web pages (images, etc.) in a single archive?" https://stackoverflo...ages-images-etc-in-a
Also:
https://gsuite-devel...hable-in-google.html
https://developers.g...ference/files/insert
https://developers.g...dk#drive_integration
More generally:
https://support.goog...s/answer/35287?hl=en
https://www.google.c...ts/file_formats.html
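
Since mht files are MIME documents, much like email messages, a generic MIME parser can read them. A minimal Python sketch using only the standard library `email` package; the function name `mht_html` and the file name in the comment are made up for illustration, and real mht files saved by a browser may need more care (for example several html parts when a page has frames).

```python
from email import message_from_bytes, policy


def mht_html(raw: bytes) -> str:
    """Return the first text/html part of an MHT (MIME) document as text."""
    msg = message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (quoted-printable
            # or base64) and decodes the text with the declared charset.
            return part.get_content()
    return ""


# Example use (hypothetical file name):
# with open("2018_153_1423_.mht", "rb") as fh:
#     html = mht_html(fh.read())
```

The extracted html could then be converted to a format that Google Drive does index; that step is not shown here.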

Note: html files saved with "Save Page WE" are indexed in GD, but only as they appear when you preview an html file there. The result is awful, as if the file were read with a simple text reader. So if you search for two keywords that have some html code garbage between them, GD won't find them. The strange thing is that the thumbnail of those html files is displayed correctly, with images, if you click on "i" (for info). It appears on the right part of your screen and displays only the top part of the html. IMHO this thumbnail is using some kind of sandbox?? I thought that GD didn't display html correctly because they were afraid of some javascript worm (or..?) that would cause damage to GD?

From mht to docx:
I use a Word macro [ALT+F11] to convert mht files to docx with Word 2013 (it can also convert them into non-OCRed pdf, etc.).
Here is the code, copied/adapted from http://muzso.hu/2013...-in-microsoft-office :

Option Explicit

' Converts every file in a folder with Word and saves the results
' in a "Converted" subfolder of that folder.
Sub ConvertDocs()
    Dim fs As Object
    Dim oFolder As Object
    Dim tFolder As Object
    Dim oFile As Object
    Dim d As Document
    Dim strDocName As String
    Dim intPos As Integer
    Dim locFolder As String
    Dim tPath As String
    Dim fileType As String

    ' Carry on with the next statement when Word fails on a file.
    ' Note that this hides all runtime errors, so check the results afterwards.
    On Error Resume Next

    locFolder = InputBox("Enter the path to the folder with the documents to be converted", "File Conversion", "C:\Users\your_username\Documents\mht_to_docx_results\")
    If locFolder = "" Then Exit Sub    ' cancelled

    Do
        fileType = UCase(InputBox("Enter one of the following formats (to convert to): TXT, RTF, HTML, DOC, DOCX or PDF", "File Conversion", "DOCX"))
        If fileType = "" Then Exit Sub    ' cancelled
    Loop Until (fileType = "TXT" Or fileType = "RTF" Or fileType = "HTML" Or fileType = "PDF" Or fileType = "DOC" Or fileType = "DOCX")

    Application.ScreenUpdating = False
    Set fs = CreateObject("Scripting.FileSystemObject")
    Set oFolder = fs.GetFolder(locFolder)
    ' BuildPath inserts the backslash if locFolder does not end with one
    tPath = fs.BuildPath(locFolder, "Converted")
    If Not fs.FolderExists(tPath) Then fs.CreateFolder tPath
    Set tFolder = fs.GetFolder(tPath)

    For Each oFile In oFolder.Files
        Set d = Application.Documents.Open(oFile.Path)
        ' Document name without its extension
        strDocName = d.Name
        intPos = InStrRev(strDocName, ".")
        strDocName = Left(strDocName, intPos - 1)
        ' The relative file names below are saved into the "Converted" folder
        ChangeFileOpenDirectory tFolder.Path
        ' Check out these links for a comprehensive list of supported file formats and format constants:
        ' http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.wdsaveformat.aspx
        ' http://msdn.microsoft.com/en-us/library/office/bb238158.aspx
        ' (In the latter list you can see the values that the constants are associated with.
        '  Office 2003 only supported values up to wdFormatXML(=11). Values from wdFormatXMLDocument(=12)
        '  til wdFormatDocumentDefault(=16) were added in Office 2007, and wdFormatPDF(=17) and wdFormatXPS(=18)
        '  were added in Office 2007 SP2. Office 2010 added the various wdFormatFlatXML* formats and wdFormatOpenDocumentText.)
        Select Case fileType
            Case "TXT"
                d.SaveAs FileName:=strDocName & ".txt", FileFormat:=wdFormatText
            Case "RTF"
                d.SaveAs FileName:=strDocName & ".rtf", FileFormat:=wdFormatRTF
            Case "HTML"
                d.SaveAs FileName:=strDocName & ".html", FileFormat:=wdFormatFilteredHTML
            Case "DOC"
                d.SaveAs FileName:=strDocName & ".doc", FileFormat:=wdFormatDocument
            Case "DOCX"
                ' Needs Word 2007 or later
                d.SaveAs FileName:=strDocName & ".docx", FileFormat:=wdFormatDocumentDefault
            Case "PDF"
                ' Needs Word 2007 SP2 or later
                d.ExportAsFixedFormat OutputFileName:=strDocName & ".pdf", ExportFormat:=wdExportFormatPDF
        End Select
        ' The converted copy is already written, so close without saving again
        d.Close SaveChanges:=wdDoNotSaveChanges
        ChangeFileOpenDirectory oFolder.Path
    Next oFile
    Application.ScreenUpdating = True
End Sub


This works ok. The docx files are about half the size of the mht files!
But:
- it crashed on mht files with video links inside them!
- And alas, Word tends to freeze my computer while converting some files!
- And sometimes wide webpages are cut off (so not all the images are displayed properly)!
- Be careful: the folder must contain only mht files (no other kinds of files) before you convert them to docx, otherwise the Word macro won't work!
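
Because the macro tries to open every file in the folder, it can help to check the folder first. A small Python sketch (not part of the macro) that lists anything that is not an .mht or .mhtml file; the function name `non_mht_files` is made up for illustration.

```python
from pathlib import Path


def non_mht_files(folder):
    """Return the names of files in folder that are not .mht/.mhtml files."""
    allowed = {".mht", ".mhtml"}
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() not in allowed
    )
```

An empty list means the folder holds only mht files; subfolders (such as the macro's own "Converted" folder) are ignored.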

I also tested this batch software a long time ago (https://www.coolutil...m/TotalHTMLConverter https://www.coolutil...om/online/MHT-to-DOC) but it was too slow and converted mht into doc, not docx. Not all images were carried over from the html files to the doc files...


(I also tested this option but it didn't work well:
From png to pdf:
- It seems to work with a trial version of Nuance Power PDF Advanced, but the OCR result is not very good (to check, open the OCRed pdf file and save it as a txt file)!
- Acrobat Pro 8: in two passes, one for creating the pdf file and a second for doing the OCR. Alas, OCR is not always possible because the 45”x45” size limit is reached (even for some png files with 200”x200” with this trick https://acrobatusers...at-topics/ocr-error/). ;(  )

In GD:
Word files are indexed. Limit: 1 million characters.
PNG files are indexed only if smaller than 2 MB. And only the title is OCRed if the saved page is too wide.
(Big pdf files are also not indexed in full. From memory it is something like: at most the first 100 pages of OCRed pdf files and the first 10 pages of non-OCRed files. There is maybe also a 50 MB indexing limit. For bigger files you can still preview the file and make a keyword search with CTRL+F, and it will find the correct pages in the full pdf file. But the full content of such a pdf won't be indexed automatically by GD.)

Note: Mht files can be neither previewed nor indexed in OneDrive!

The only remaining option that I see would be for ScreenGrab to save the webpage as an OCRed pdf file (like some print screen drivers) + mht. That way this would work fine natively in GD for most files.
But I am sure this is technically doable !

Please let me know if you have other ideas ! ;)
« Last Edit: June 04, 2018, 04:26 AM by jity2 »