Author Topic: Commandline method of downloading a webpage (Read 11902 times)

questorfla · « **on:** June 19, 2016, 04:37 PM »

I need a way to download a webpage from a script. I can get to the page with no problem in batch, VBS, etc. Using Ctrl+S in Chrome, for example, lets me save it as MHTML (once the correct flag options were set), and the resulting MHTML file is perfect. It opens locally and all the contents work exactly as they would if I were on that site, including all images, links, and other visual HTML design.

I need to be able to do the same as ctrl+S(ave) to MHTML in a script that can be run on demand to create this document as needed.

Using Wget I end up with a variety of results depending on how I word the statement but none of them is even close to the same as the ctrl+S to get the MHTML version. If there is a command-line version of SAVE AS for Chrome I have not yet found it. It seems that something so simple in GUI surely must have some way to accomplish the same results in what should be one line in a script.

wraith808 · « **Reply #1 on:** June 19, 2016, 06:53 PM »

That's because with wget (or curl) you're directly getting the page you're pointing to, not a packaged version of the page. When I've had to do this for work, I've had to dig pretty deep into wget's switches to get the external resources into the directory structure... a pain, but actually not that bad once you know what the switches do.

wget --recursive --domains [domain] --no-parent --page-requisites --html-extension --convert-links --restrict-file-names=windows --no-clobber [url]

where [domain] is the domain of the site that you're getting (so it doesn't go offsite), and [url] is the url you want to start with. I offer no guarantee- there might be some syntax errors, but that should get you started.

I know there's LWA by aignes, but I'm not sure if you can script that from the commandline: http://aignes.com/lwa.htm

questorfla · « **Reply #2 on:** June 19, 2016, 07:10 PM »

Wow. Talk about switches! How about I tell you that the link is www.example.com/website. It is hosted on localhost only so it isn't an actual WEB site per se.
But I open it by typing www.example.com/website into any browser. Can you narrow down the switches a bit with that info? I feel "whipped"

This is just a single page like "the home page" that the site opens to. All I need is that one page. As a manual download with Ctrl+S it is an approx 2MB MHTML file. If I could find that old Mouse Keys programmer I would just program the mouse strokes into a script to do it

IainB · « **Reply #3 on:** June 20, 2016, 04:12 AM »

WebPageDump may be of use/interest for this.
Refer DCF discussion thread: Re: Transform list of url links -> copies of actual web pages - try WebPageDump?

If you find a way to initiate WebPageDump or any other script to download a webpage, from the command line, then I'd be very interested to know how you did it please.

wraith808 · « **Reply #4 on:** June 20, 2016, 07:27 AM »

Wow. Talk about switches! How about I tell you that the link is www.example.com/website. It is hosted on localhost only so it isn't an actual WEB site per se.
But I open it by typing www.example.com/website into any browser. Can you narrow down the switches a bit with that info? I feel "whipped"
This is just a single page like "the home page" that the site opens to. All I need is that one page. As a manual download with Ctrl+S it is an approx 2MB MHTML file. If I could find that old Mouse Keys programmer I would just program the mouse strokes into a script to do it
-questorfla (June 19, 2016, 07:10 PM)

I can tell you what they do... and you can take out what you need. Since you're going via the http protocol, it doesn't really matter if you're running local or not.

--recursive: get the page recursively, i.e. don't just stop with the queried page
--domains [domain]: don't go outside of the current domain
--no-parent: don't go upward, no matter what the links say
--page-requisites: get extra needed files, like css and js files
--html-extension: save HTML pages with an .html extension if they don't already have one (newer wget versions call this --adjust-extension)
--convert-links: converts the links on the page to point to the downloaded files, so the page works when viewed locally
--restrict-file-names=windows: use windows compatible file names
--no-clobber: if you have to resume the download, it won't download those files that already exist.

I'm not sure how to limit it to only one level- my use for it was to download the website for offline extraction. You might be able to add -l 1 (long form --level=1), but I've never tried it.

All of these switches are detailed on the wget page (https://www.gnu.org/software/wget/), and they might have a better explanation of them.
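
If only the one page is needed, a trimmed-down version might be enough: drop --recursive and keep --page-requisites, so wget fetches just the page plus its images, CSS and scripts. Below is a minimal Python sketch that builds such a command line. The function name, the page_copy output folder and the example URL are assumptions for illustration, and --adjust-extension is the newer name for --html-extension. It only prints the command and does not run it. Note that this gives a folder of files, not a single MHTML file.

```python
def single_page_wget_args(url, out_dir="page_copy"):
    """Build a wget command line for one page plus the files it needs.

    out_dir is a hypothetical folder name; change it to suit.
    """
    return [
        "wget",
        "--page-requisites",              # images, css, js used by the page
        "--convert-links",                # rewrite links for local viewing
        "--adjust-extension",             # add .html where it is missing
        "--restrict-file-names=windows",  # Windows-safe file names
        "--directory-prefix", out_dir,    # where the files are saved
        url,
    ]


# Print the command so it can be pasted into a batch file or a prompt.
print(" ".join(single_page_wget_args("http://www.example.com/website")))
```

Without --recursive there is no need for --no-parent or --domains, since wget is not following links to other pages.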

4wd · « **Reply #5 on:** June 20, 2016, 08:11 AM »

This VBS script from StackOverflow worked fine here from the command line.

NOTE: You do need IE to be installed.

eg.

cscript SaveAsMHT.vbs https://www.donationcoder.com K:\test.mht

Gave a .mht file that could be opened in IE with the DC home page.

Considering you're talking about a single page capture it might be easier than using wget.
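
For anyone wondering what ends up inside the .mht: it is a MIME multipart/related message, with the HTML as the first part and each image or stylesheet as a further part labelled by a Content-Location header. The script mentioned above is not reproduced in this thread, so the following is not its code. It is a rough Python sketch of the container format using only the standard library. The function name build_mhtml and the resources dictionary are made up for illustration, and nothing here fetches the page for you.

```python
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(page_url, html, resources):
    """Pack one HTML page and its resources into multipart/related bytes.

    resources maps an absolute URL to a (mime_type, raw_bytes) tuple.
    All names here are hypothetical; this is only a sketch of the format.
    """
    msg = MIMEMultipart("related", type="text/html")
    msg["Subject"] = "Saved page"

    # The page itself goes first and is tagged with its original URL.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    msg.attach(root)

    # Each resource becomes a base64 part tagged with the URL it came from.
    for url, (mime_type, data) in resources.items():
        maintype, subtype = mime_type.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(data)
        encoders.encode_base64(part)
        part["Content-Location"] = url
        msg.attach(part)

    return msg.as_bytes()
```

Writing the returned bytes to a file ending in .mht gives the general shape of a saved page. Whether a particular browser opens the result is untested here.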

wraith808 · « **Reply #6 on:** June 20, 2016, 10:39 AM »

Does that MHT go offsite? Or is it all contained? I'd guess that everything for DC is onsite, so maybe that's not apparent. When I used to use MHT, I'd end up dropping offsite resources.

4wd · « **Reply #7 on:** June 20, 2016, 07:02 PM »

Quite likely but he said it was a single page website on a local server he wanted to archive so it may be suitable, possibly no offsite links to worry about.
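
One way to check a particular capture is to look at the Content-Location header of each part in the .mht and count the hosts they came from. A minimal Python sketch, assuming the file is an ordinary MIME multipart message; embedded_hosts is a hypothetical helper name, not part of any tool mentioned here. It only reports what was embedded, so a resource that was skipped entirely will simply not show up.

```python
import email
from urllib.parse import urlsplit


def embedded_hosts(mht_bytes):
    """Count the parts of an MHT file by the host in Content-Location.

    Parts without a Content-Location header are ignored. A relative
    location has no host and is counted under the empty string.
    """
    msg = email.message_from_bytes(mht_bytes)
    hosts = {}
    for part in msg.walk():
        location = part.get("Content-Location")
        if not location:
            continue
        host = urlsplit(location).netloc
        hosts[host] = hosts.get(host, 0) + 1
    return hosts
```

Feed it the contents of the saved file read in binary mode. If only the site's own host appears, then nothing from offsite was packed into the file.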

questorfla · « **Reply #8 on:** June 21, 2016, 01:01 AM »

This VBS script from StackOverflow worked fine here from the command line.

NOTE: You do need IE to be installed.

eg.
cscript SaveAsMHT.vbs https://www.donationcoder.com K:\test.mht
Gave a .mht file that could be opened in IE with the DC home page.

Considering you're talking about a single page capture it might be easier than using wget.
-4wd (June 20, 2016, 08:11 AM)

Sigh~

line 1 char 23 expected end of statement
Duh, I missed the part about getting the original script; I figured it was one of your best one-liners.
Forget anything bad I ever said about you, 4wd (though I never did!). The darn thing worked!!

And yes, it beats the heck out of Wget, which I never could get to return everything exactly as it showed on the web. The script is perfect!

questorfla · « **Reply #9 on:** June 21, 2016, 01:23 AM »

IainB, this did the trick for me!

"If you find a way to initiate WebPageDump or any other script to download a webpage, from the command line, then I'd be very interested to know how you did it please."

I am sure you already tried 'httrack' as that is my preferred "full-court-press" site backup.
But I did not need a backup; I needed a single-file document, and this did it!!

What I had was a php script used as the index.php file in a website that was dedicated only to making a specific type of layout of images linked to webpages.
Like a visual website menu. But they wanted a document that could be emailed to the people who should have it, without sending them to the website that created the document in the first place.

Anyway, the odds of you needing it for what I needed it for are less than zero.
Suffice it to say that 4wd came through (as usual, though I think he has helpers at Stack Overflow, as I can never find the code he does).
Running the command line exactly as stated, swapping in the webpage you want and the place you want it stored, makes a perfect MHT file that includes the exact images along with their HTML presentation plus the correct hyperlinks.
In other words, it looks and acts exactly as if the user were on that page.

In all fairness, every element needed WAS on the page I created it from, so no offsite elements are included, as they are not needed.
But it would be worth experimenting to see if it might suit your needs as well.

Thanks 4wd.
With luck, maybe I can drop in in a few months and bring a few shrimp for the barby eh?
(More like a porterhouse though!)

questorfla · « **Reply #10 on:** June 21, 2016, 01:43 AM »

As for the other thing I emailed you:
If I could simply string together two SendTo options, the first being SendTo > compressed zip folder and then the output from that going to the email.vbs script I sent you, that would solve my other wish.
Can you pull another rabbit out of your magic hat on that one?
As in ... First select the files then right click>sendto> compressed.zip | email.vbs?
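
The two steps could also be chained in one script rather than two SendTo entries. A rough Python sketch of that idea, using only the standard library: zip the given files in memory, then build an email message with the zip attached. The function names, the addresses and the files.zip name are placeholders, it does not reproduce the email.vbs script (which isn't shown here), and actually sending the message (for example with smtplib) is left out because that depends on the mail server.

```python
import io
import os
import zipfile
from email.message import EmailMessage


def zip_files(paths):
    """Zip the given files in memory and return the archive as bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            # Store each file under its bare name, without the folder path.
            archive.write(path, arcname=os.path.basename(path))
    return buffer.getvalue()


def build_message(zip_bytes, sender, recipient, zip_name="files.zip"):
    """Build (but do not send) an email with the zip attached."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Files attached"
    msg.set_content("The selected files are in the attached zip.")
    msg.add_attachment(
        zip_bytes, maintype="application", subtype="zip", filename=zip_name
    )
    return msg
```

The list of paths would come from the script's command-line arguments, with build_message(zip_files(paths), ...) handing the result to whatever does the sending.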

4wd · « **Reply #11 on:** June 21, 2016, 02:20 AM »

Suffice it to say that 4wd came through (as usual, though I think he has helpers at Stack Overflow, as I can never find the code he does).
-questorfla (June 21, 2016, 01:23 AM)

Thanks

FYI: Google, then I quickly scan the results, stopping at Stack... (5th result in my part of the world)

I was originally going to do it in PowerShell, but apparently it can't be done without producing a requester, so VBScript was the next option for using IE's engine.

IainB · « **Reply #12 on:** June 21, 2016, 04:30 AM »

I'm glad you got it sorted for what you needed.
I'm seeking something that will enable me to port my humongous Scrapbook database from Firefox to a new browser that actually works, and to be able to view it all in the new browser - I'm using SlimJet (Chrome). The Scrapbook add-on is now about the only reason I still use Firefox.
What I really need are a stand-alone Scrapbook database browser and a management tool that know the slightly proprietary structure of the data store.
I had thought about integrating Scrapbook and Zotero data stores, but though Scrapbook and Zotero both use WebPageDump, the structure of their data stores is slightly different.
I was even tinkering around writing some VBS code to work with Excel to do a lot of that, but sadlement it's in the "too hard" basket and my code knowledge is poor anyway.

anandcoral · « **Reply #13 on:** June 21, 2016, 04:37 AM »

Thanks 4wd, your vbs script find is really very useful, when needed.

Regards,

Anand
