
Author Topic: Transform list of url links to copies of the actual web pages  (Read 10011 times)

nkormanik

  • Participant
  • Joined in 2010
  • Posts: 554
Folder A contains one simple text file with 10 website URLs: Sites.txt

Have a web browser (any) visit the 10 sites, and save the corresponding 'page' of each visited site to a separate file.

At the end of the process, Folder B will contain the 10 files, 'xyz_01.html' ... 'xyz_10.html', corresponding to the list in Folder A.

Any thoughts and help greatly appreciated.

Nicholas Kormanik

skwire

  • Global Moderator
  • Joined in 2005
  • Posts: 5,287
Re: Transform list of url links to copies of the actual web pages
« Reply #1 on: March 17, 2016, 07:54 AM »
This can easily be done. However, keep in mind that simply downloading the "page" of each site, unless it's a static site, doesn't always work that well when it comes to viewing it afterward. If you're okay with that, this can be done with just a few lines of AHK code. Let me know.

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 5,644
Re: Transform list of url links to copies of the actual web pages
« Reply #2 on: March 17, 2016, 08:05 AM »
In Powershell:

Code: PowerShell [Select]
# Read the list of URLs (one per line), then download each page's HTML
$urls = Get-Content K:\sites.txt

for ($i = 0; $i -lt $urls.Count; $i++) {
    $client = New-Object System.Net.WebClient
    $client.DownloadFile($urls[$i], "k:\Site_" + $i + ".html")
}

Change paths to suit (i.e. the K:\).

NOTE: Media content (images, etc.), if fetched from other sites, will not be downloaded; you'll just have the URI links. The page will still work when viewed in a browser - the content will be fetched from the external sites as needed, just like when viewing the original page. In fact, all that's downloaded is the page source, since that's all an HTML file is - plain text.
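A minimal variation of the above, as a sketch (assuming PowerShell 3.0 or later so Invoke-WebRequest is available, and the same K:\ paths), that zero-pads the output names to match the xyz_01 ... xyz_10 pattern from the original request:

Code: PowerShell [Select]
# Sketch: requires PowerShell 3.0+ for Invoke-WebRequest; adjust paths to suit.
$urls = Get-Content K:\sites.txt

for ($i = 0; $i -lt $urls.Count; $i++) {
    # Zero-pad the index so the files sort naturally: xyz_01.html, xyz_02.html, ...
    $name = "K:\xyz_{0:D2}.html" -f ($i + 1)
    Invoke-WebRequest -Uri $urls[$i] -OutFile $name
}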
« Last Edit: June 15, 2016, 12:33 AM by 4wd »

skwire

  • Global Moderator
  • Joined in 2005
  • Posts: 5,287
Re: Transform list of url links to copies of the actual web pages
« Reply #3 on: March 17, 2016, 08:52 AM »
In AutoHotkey:

Code: Autohotkey [Select]
mySitesFile   := "c:\tmp\15\Sites.txt"
mySitesFolder := "c:\tmp\16"

; Read the URL list, then download each line's URL to a numbered xyz_NN.html file.
FileRead, myData, % mySitesFile
Loop, Parse, myData, `n, `r
{
    UrlDownloadToFile, % A_LoopField, % mySitesFolder . "\xyz_" . A_Index . ".html"
}

Change path variables to suit.

relipse

  • Charter Honorary Member
  • Joined in 2005
  • Posts: 114
  • I love Jesus. Coding in PHP primarily.
Re: Transform list of url links to copies of the actual web pages
« Reply #4 on: March 17, 2016, 09:05 AM »
Maybe there is a way to make Chrome do a download of the site (that would get images, etc., so the web page is perfectly viewable).
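One possible route in that direction, as a sketch rather than Chrome itself (it assumes GNU Wget - wget.exe - is installed and on the PATH, plus the same Sites.txt list as above), is to let Wget fetch each page together with its page requisites (images, CSS) and rewrite the links so the saved copy views offline:

Code: PowerShell [Select]
# Sketch only: assumes GNU Wget (wget.exe) is on the PATH and K:\sites.txt exists.
$urls = Get-Content K:\sites.txt

foreach ($url in $urls) {
    # -p (--page-requisites)  also fetch the images/CSS/JS the page needs
    # -k (--convert-links)    rewrite links so the local copy works offline
    # -E (--adjust-extension) save HTML pages with an .html extension
    # -P                      output directory for the downloaded files
    & wget.exe -p -k -E -P K:\SavedSites $url
}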
Ex C++Builder coder, current PHP coder, and noob Qt Coder

IainB

  • Supporting Member
  • Joined in 2008
  • Posts: 7,544
  • @Slartibartfarst
I have searched the Internet for years to find something that makes a decent copy of web pages, for archive/library reference purposes.
The best I had found was the Firefox add-on Scrapbook.
More recently, I found Zotero.
Nothing else seems to come close.
They are both very good indeed.
Then I discovered that they both use the same engine: WebPageDump
(Details partially copied below, with just the download-file hyperlinks embedded.)
introduction
WebPageDump is a Firefox extension which allows you to save local copies of pages from the Web. It sounds simple, but it's not. The standard "Save page" function of web browsers fails with most web pages, and web site downloaders also don't work in a satisfactory manner. These shortcomings were a serious problem for our research.

Each web page is saved in an automatically named subdirectory, making it easy to create whole (shareable) web page collections. It is built upon the Scrapbook extension and enhances its capabilities regarding HTML entities, charsets and command-line/batch functionality, improving the visual exactness of the local copy. ...
...
using
WebPageDump can be used simply via the "WebPageDump" entry inside the Firefox "Tools" menu. The current web page will then be saved inside a WPD-named subdirectory after selecting the destination directory. This mode is going to be the "normal" mode for most web page collecting applications.

For batch processing, the following options can be used through the Firefox command line. These command-line options are mainly present for WebPageDump testing purposes but may be useful for some special applications. Be sure that a single batch command has ended before proceeding with another one. ...
...
downloads
WebPageDump v0.3 (beta) firefox extension
WebPageDump v0.3 (beta) source code
The extension is provided under the terms of the Mozilla Public License. If you want to install WebPageDump you will either have to manually allow extension installations from this url or save the xpi file with "save as". See changes.txt for the version information.
Tested web pages (~68 MB)
Because of copyright issues we have removed the package of test web pages. But we will make them available for serious scientific research. They were downloaded and modified with WebPageDump using the SmartCache Java Proxy.

nkormanik

  • Participant
  • Joined in 2010
  • Posts: 554
Re: Transform list of url links to copies of the actual web pages
« Reply #6 on: March 21, 2016, 05:21 PM »
Excellent suggestions all.  Thank you!

Though I will definitely keep your code and try it out, the solution I used for the little task was a Firefox extension called Shelve:

https://addons.mozil...irefox/addon/shelve/

That did the trick.

Thanks again!