

Commandline method of downloading a webpage


questorfla:
I need a way to download a webpage from a script.  I can get to the page with no problem in batch, VBS, etc.  Using Ctrl+S in Chrome, for example, lets me download and save the page as MHTML (once the correct flag options were set), and the resulting MHTML file is perfect.  It opens locally and all the contents work exactly as they would if I were on that site, including all images, links, and other visual HTML design.

I need to be able to do the same as Ctrl+S(ave) to MHTML in a script that can be run on demand to create this document as needed.

Using wget I end up with a variety of results depending on how I word the statement, but none of them is even close to the MHTML version I get with Ctrl+S.  If there is a command-line version of Save As for Chrome, I have not yet found it.  Something so simple in the GUI surely must have some way to accomplish the same result in what should be one line of a script.

wraith808:
That's because with wget (or curl) you're fetching the page you point it at directly, not a packaged version of the page.  When I've had to do this for work, I've had to dig pretty deep into wget's switches to get the external resources into the directory structure... a pain, but actually not that bad once you know what the commands do.

wget --recursive --domains [domain] --no-parent --page-requisites --html-extension --convert-links --restrict-file-names=windows --no-clobber [url]

where [domain] is the domain of the site that you're getting (so it doesn't go offsite), and [url] is the URL you want to start with.  I offer no guarantees; there might be some syntax errors, but that should get you started.
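For example, with the placeholders filled in (example.com here is just a stand-in domain to show the shape of the command):

wget --recursive --domains example.com --no-parent --page-requisites --html-extension --convert-links --restrict-file-names=windows --no-clobber http://example.com/

Note that this mirrors the page and everything it links to within example.com into a local folder tree rather than a single packaged file, so you end up opening the saved index page from that folder instead of one MHTML document.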

I know there's LWA by aignes, but I'm not sure if you can script that from the commandline: http://aignes.com/lwa.htm

questorfla:
Wow.  Talk about switches!  How about I tell you that the link is www.example.com/website.  It is hosted on localhost only, so it isn't an actual web site per se, but I open it by typing www.example.com/website into any browser.  Can you narrow down the switches a bit with that info?  I feel "whipped" :)
This is just a single page, like the home page the site opens to.  All I need is that one page.  As a manual download with Ctrl+S it is roughly a 2 MB MHTML file.  If I could find that old Mouse Keys programmer I would just record the mouse strokes into a script to do it :(

IainB:
WebPageDump may be of use/interest for this.
Refer to the DCF discussion thread: Re: Transform list of url links -> copies of actual web pages - try WebPageDump?

If you find a way to initiate WebPageDump (or any other script) to download a webpage from the command line, then I'd be very interested to know how you did it, please.

wraith808:
--- Quote from: questorfla on June 19, 2016, 07:10 PM ---
Wow.  Talk about switches!  How about I tell you that the link is www.example.com/website.  It is hosted on localhost only, so it isn't an actual web site per se, but I open it by typing www.example.com/website into any browser.  Can you narrow down the switches a bit with that info?  I feel "whipped" :)
This is just a single page, like the home page the site opens to.  All I need is that one page.  As a manual download with Ctrl+S it is roughly a 2 MB MHTML file.  If I could find that old Mouse Keys programmer I would just record the mouse strokes into a script to do it :(
--- End quote ---

I can tell you what they do... and you can take out what you need.  Since you're going over the HTTP protocol, it doesn't really matter whether you're running locally or not.

--recursive: get pages recursively, i.e. don't just stop with the queried page
--domains [domain]: don't go outside of the given domain
--no-parent: don't go upward in the directory tree, no matter what the links say
--page-requisites: get the extra files the page needs, like CSS, JS, and images
--html-extension: save downloaded HTML pages with an .html extension so they open properly offline
--convert-links: convert the links in the downloaded pages to point to the local copies, i.e. strip the domain so the files work offline
--restrict-file-names=windows: use Windows-compatible file names
--no-clobber: if you have to resume the download, it won't re-download files that already exist

I'm not sure how to limit it to only one level; my use for it was to download the whole website for offline extraction.  You might be able to add -l 1 (or --level=1) to cap the recursion depth, but I've never tried it.
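If it really is just that one page, a stripped-down variant along these lines might be closer to what you want (untested against your setup; www.example.com/website is the address you mentioned):

wget --page-requisites --convert-links --html-extension --restrict-file-names=windows --no-host-directories http://www.example.com/website

Dropping --recursive keeps wget from following links off the page, while --page-requisites still pulls in the images, CSS, and JS it needs; --no-host-directories just stops it nesting everything under a www.example.com folder.  You still get a small folder of files rather than a single MHTML document, though.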

All of these switches are detailed on the wget page (https://www.gnu.org/software/wget/), and they might have a better explanation of them.


