
Author Topic: Any ideas for a save website crawler for offline reading?  (Read 16142 times)

Carol Haynes

  • Waffles for England (patent pending)
  • Global Moderator
  • Joined in 2005
  • Posts: 8,066
I am doing a university course at the moment and most of the course materials are available via a website.

I have to login to the website and programs like HTtrack don't seem to be able to download the pages for offline reading.

To make matters worse the website is optimised for Internet Explorer and doesn't seem to work well in other browsers.

Ideally I need an Internet Explorer 9 add-on that can download the website for offline reading (offline web pages were removed by MS from IE7 onwards).

There are too many pages to save them individually.

rjbull

  • Charter Member
  • Joined in 2005
  • Posts: 3,199
Re: Any ideas for a save website crawler for offline reading?
« Reply #1 on: April 02, 2012, 04:50 PM »
As long as you're willing to pay to register, it looks like Internet Download Manager (IDM) can.  From the "Grabber Wizard. Creating a project" section of its Help:
Step 1. Set a start page
 
On the first step of the wizard you should specify the start page. By default, the http protocol is assumed; other protocols like https must be specified explicitly. The start page also sets the current site. For example, if you specified http://www.tonec.com/support/index.html, the current site would be www.tonec.com, with all supported protocols (ftp, https, http) applied to this site name.
 
If a site requires authorization, you should also set the login and password in this step. Some sites allow browsing/downloading only after authentication on a certain page. In this case you should press the "Advanced>>" button, check the "Enter login and password manually" box, and specify the page used to log in to the site. Also, if the site has a logout button, you should specify here the logout pages that the Grabber should not open. If you set the login page, the Grabber will open a browser window after the fourth step and let you log in to the site manually before proceeding with exploring and downloading.
[...]
If you need to download all pictures, video or audio files from a website, or download a complete web site, you may select the appropriate template in the Project template listbox. Project templates make it easy to start your projects quickly, because all required settings are made automatically.
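
For what it's worth, the wizard's form-login-then-crawl flow can be sketched in plain Python with the requests and BeautifulSoup libraries. This is a rough sketch only, not IDM's implementation; the URLs and the form field names ("username", "password") are placeholders you would have to adapt to the real site:

    # Sketch: log in through a web form, then crawl pages with the same session.
    # Assumes: pip install requests beautifulsoup4. URLs/field names are placeholders.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    session = requests.Session()  # keeps the login cookie across requests

    # 1. Log in: POST the fields the site's login form expects.
    session.post("http://www.example.edu/login",
                 data={"username": "me", "password": "secret"})

    # 2. Crawl: fetch the start page and follow links on the same host.
    start = "http://www.example.edu/course/index.html"
    seen, queue = set(), [start]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        page = session.get(url)
        open("page_%04d.html" % len(seen), "wb").write(page.content)
        for a in BeautifulSoup(page.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if link.startswith("http://www.example.edu/"):
                queue.append(link)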

Sanity check - I run IDM all the time, but haven't tried that particular feature.

Carol Haynes

  • Waffles for England (patent pending)
  • Global Moderator
  • Joined in 2005
  • Posts: 8,066
Re: Any ideas for a save website crawler for offline reading?
« Reply #2 on: April 02, 2012, 06:07 PM »
Nice idea, but unfortunately it crashes every time I run it - after about 14 pages :-(

Renegade

  • Charter Member
  • Joined in 2005
  • Posts: 13,288
  • Tell me something you don't know...
Re: Any ideas for a save website crawler for offline reading?
« Reply #3 on: April 02, 2012, 07:52 PM »
Try Teleport Pro.
Slow Down Music - Where I commit thought crimes...

Freedom is the right to be wrong, not the right to do wrong. - John Diefenbaker

Carol Haynes

  • Waffles for England (patent pending)
  • Global Moderator
  • Joined in 2005
  • Posts: 8,066
Re: Any ideas for a save website crawler for offline reading?
« Reply #4 on: April 03, 2012, 09:08 AM »
Try Teleport Pro.
-Renegade (April 02, 2012, 07:52 PM)

Nope - it doesn't have the ability to go past a login page. It can use a username and password, but only for the very limited number of websites that allow username and password in the URL. It's also very basic, and expensive unless you want to spend a ridiculous amount on the non-'Pro' versions. HTTrack is free (open source) and is more fully featured than the Pro version.

IDM does what I need - just a shame it crashes every time I try to actually use it!

mwb1100

  • Supporting Member
  • Joined in 2006
  • Posts: 1,645
Re: Any ideas for a save website crawler for offline reading?
« Reply #5 on: April 03, 2012, 12:42 PM »
Maybe WebSite-Watcher will work? 

   - http://aignes.net/features.htm


Carol Haynes

  • Waffles for England (patent pending)
  • Global Moderator
  • Joined in 2005
  • Posts: 8,066
Re: Any ideas for a save website crawler for offline reading?
« Reply #6 on: April 03, 2012, 12:58 PM »
I used to have a copy of WebSite Watcher - I didn't know it could archive whole sites. I'll have another look.

Update: WSW uses an addon (Local Website Archive) to do this, but from reading the description even the 'Pro' version only seems to be able to archive individual pages. OK, the Pro version lets you queue them for download, but I need to collect hundreds of pages and retain the links for offline viewing. I don't see any way that Local Website Archive is set up to do that.

I could actually do that using OneNote.

mwb1100

  • Supporting Member
  • Joined in 2006
  • Posts: 1,645
Re: Any ideas for a save website crawler for offline reading?
« Reply #7 on: April 03, 2012, 01:27 PM »
Have you tried these suggestions from the HTTrack FAQ:

Q: I cannot access several pages (access forbidden, or redirected to another location), but I can with my browser - what's going on?
A: You may need cookies! Cookies are specific data (for example, your username or password) that are sent to your browser once you have logged in to certain sites, so that you only have to log in once. For example, after having entered your username on a website, you can view pages and articles, and the next time you go to this site you will not have to re-enter your username/password.
To "merge" your personal cookies into an HTTrack project, just copy the cookies.txt file from your Netscape folder (or the cookies located in the Temporary Internet Files folder for IE) into your project folder (or even the HTTrack folder).


Q: Can HTTrack perform form-based authentication?
A: Yes. See the URL capture abilities (--catchurl for command-line release, or in the WinHTTrack interface)


Q: Can I use username/password authentication on a site?
A: Yes. Use user:password@your_url (example: http://foo:bar@machine/private/mybox.html)
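
(Side note: if you go the cookies.txt route from the first answer, you can check whether the exported cookies actually get you past the login before running a full crawl. Here's a rough Python sketch using the standard library's Netscape cookie support - the file path and URL are placeholders:)

    # Sketch: reuse a browser-exported Netscape-format cookies.txt file.
    # Assumes: pip install requests. Path and URL are placeholders.
    import http.cookiejar
    import requests

    jar = http.cookiejar.MozillaCookieJar("cookies.txt")
    jar.load(ignore_discard=True, ignore_expires=True)  # include session cookies

    session = requests.Session()
    session.cookies = jar  # every request now sends the browser's cookies

    resp = session.get("http://www.example.edu/course/index.html")
    print(resp.status_code, len(resp.content))  # 200 + real content = logged in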

kfitting

  • Charter Member
  • Joined in 2005
  • Posts: 593
Re: Any ideas for a save website crawler for offline reading?
« Reply #8 on: April 03, 2012, 02:23 PM »
Local Website Archive is only for archiving single pages at a time.  I asked the author to handle multiple pages a year or two ago and he responded that he has no plans for it.

HTTrack... I am potentially an incompetent user, but it takes me forever to set up my crawl depth correctly.  I don't find it intuitive at all.  Not saying it isn't powerful - it certainly is!  But I just haven't taken the time needed to figure it out sufficiently.

Carol Haynes

  • Waffles for England (patent pending)
  • Global Moderator
  • Joined in 2005
  • Posts: 8,066
Re: Any ideas for a save website crawler for offline reading?
« Reply #9 on: April 03, 2012, 02:36 PM »
I tried HTTrack - it was the first one I tried - but I can't get it to work on webform password-protected sites. I don't think the site uses a session cookie; I think it is doing something with JavaScript, but I am not sure.

rjbull

  • Charter Member
  • Joined in 2005
  • Posts: 3,199
Re: Any ideas for a save website crawler for offline reading?
« Reply #10 on: April 03, 2012, 04:23 PM »
Obvious suggestion: contact Tonec about IDM crashing.  There are lots of other download managers, at least some of which have web crawling ability.  I'm pretty sure I've done it with Free Download Manager (FDM), but I don't have it installed here and can't work out from the web site whether it can do logins...  can't work it out for ReGet Deluxe either  :(

Maybe WebReaper (donationware)?  It does mention:
Proxy & website authentication, allowing websites with passwords or behind firewalls to be reaped.

cyberdiva

  • Supporting Member
  • Joined in 2006
  • Posts: 1,041
Re: Any ideas for a save website crawler for offline reading?
« Reply #11 on: April 04, 2012, 08:05 AM »
I used to use a program called Web Research that allowed me to save all or part of a web page.  One of its features is that it permits you to save some or all of the pages linked to the page you want to save.  (I didn't use that feature, since I was always interested in saving just all or part of a single page.)  Your needs may be more complex, but I thought I'd mention it.  Web Research has both a "Personal" and a "Professional" version - the former is quite modestly priced.  I don't know whether the two have similar features - I own the Professional version.

kfitting

  • Charter Member
  • Joined in 2005
  • Posts: 593
Re: Any ideas for a save website crawler for offline reading?
« Reply #12 on: April 04, 2012, 10:53 AM »
Web Research seems very interesting... how does it store the websites?  Local Website Archive stores the HTML, so you don't need LWA to view the articles you've downloaded.  Does Web Research use a proprietary format?

katykaty

  • Supporting Member
  • Joined in 2006
  • Posts: 224
Re: Any ideas for a save website crawler for offline reading?
« Reply #13 on: April 04, 2012, 02:50 PM »
What does the course tutor say? They may be prepared to share the source documents.

Carol Haynes

  • Waffles for England (patent pending)
  • Global Moderator
  • Joined in 2005
  • *****
  • Posts: 8,066
    • View Profile
    • Donate to Member
Re: Any ideas for a save website crawler for offline reading?
« Reply #14 on: April 04, 2012, 04:39 PM »
It is a distance learning course, and all the teaching/assessment materials are available via the university website (the course has its own page).

They have provided the course materials to download in the form of PDF files - but they only go so far. They are basically tarted-up web prints in PDF form, but the pages have lots of links to examples and extra asides and comments that aren't in the PDFs (the links are there but the actual pop-up content isn't). That is why I want an offline record of the site.

I think the course team would simply say use the PDFs.

NigelH

  • Charter Member
  • Joined in 2005
  • Posts: 210
Re: Any ideas for a save website crawler for offline reading?
« Reply #15 on: April 04, 2012, 07:13 PM »
Can you access any of the material using Firefox or Chrome?
Zotero has pretty good capabilities for archiving web page content.
An IE connector for the standalone version is apparently in development.

Store anything.

Zotero collects all your research in a single, searchable interface. You can add PDFs, images, audio and video files, snapshots of web pages, and really anything else. Zotero automatically indexes the full-text content of your library, enabling you to find exactly what you're looking for with just a few keystrokes.

cyberdiva

  • Supporting Member
  • Joined in 2006
  • Posts: 1,041
Re: Any ideas for a save website crawler for offline reading?
« Reply #16 on: April 04, 2012, 07:28 PM »
Web Research seems very interesting... how does it store the websites?  Local Website Archive stores the HTML, so you don't need LWA to view the articles you've downloaded.  Does Web Research use a proprietary format?
-kfitting (April 04, 2012, 10:53 AM)
To be honest, it has been a year or two since I last used Web Research, and I do not have it on the computer I bought this year.  I was never all that concerned about whether it used a proprietary format, since I was interested only in retrieving and consulting information I had saved, not exporting it.  I just went onto the Web Research website, and here's what it says about exporting documents:

Web Research offers various methods to export documents and folders:

    Export as files in original format
    Export as "Single File Web Page" (mht format)
    Export as album (chm format)
    Export as a Document Package
    Create a Web Page Presentation
    Copy the Web Research address of a document
    Print documents
    Transfer to Microsoft Word
    Programming interface (API)
    External linking via Web Research protocol handler

Perhaps I'm not understanding the above correctly, but a couple of the items seem to suggest that the material is not saved in a proprietary format.  I think you'd be best off writing to the company to find out for sure.

Carol Haynes

  • Waffles for England (patent pending)
  • Global Moderator
  • Joined in 2005
  • *****
  • Posts: 8,066
    • View Profile
    • Donate to Member
Re: Any ideas for a save website crawler for offline reading?
« Reply #17 on: April 04, 2012, 07:53 PM »
Can you access any of the material using Firefox or Chrome?
Zotero has pretty good capabilities for archiving web page content.
An IE connector for the standalone version is apparently in development.

Store anything.

Zotero collects all your research in a single, searchable interface. You can add PDFs, images, audio and video files, snapshots of web pages, and really anything else. Zotero automatically indexes the full-text content of your library, enabling you to find exactly what you're looking for with just a few keystrokes.
-NigelH (April 04, 2012, 07:13 PM)

I can access the site using Firefox and Chrome, but it is all a bit screwed up and doesn't work properly. If I am not using IE, the site actually pops up a message saying it only works with IE and that there are issues with any other browser.

I'm not sure, but I think part of the problem is that the page types are not .html - they are .cfm, and the site uses ColdFusion.

I have tried loads of downloaders/web spiders/archivers etc. now, and none seem to be able to get past the login page. Really frustrating.

Just upgraded Surfulater, as I used to use that, but even there, if I use the browser extension to save the page it just saves a link to the page - and that only goes to the login; it can't save the contents!

To add to the problem, the website is frame-based and quite a few of the links open things in a number of frames. None of the downloaders I have tried seem to like frames much!
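
(For what it's worth, frames aren't conceptually hard for a crawler: <frame> and <iframe> src attributes just have to be followed like ordinary links. A rough Python sketch of the idea, assuming an authenticated requests session as in the earlier sketches; the URL is a placeholder:)

    # Sketch: save a framed page plus the pages its frames point to.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def save_with_frames(session, url, depth=2):
        page = session.get(url)
        name = url.rstrip("/").split("/")[-1] or "index.html"
        open(name, "wb").write(page.content)
        if depth == 0:
            return
        soup = BeautifulSoup(page.text, "html.parser")
        # old frameset pages use <frame>; newer embedded pages use <iframe>
        for tag in soup.find_all(["frame", "iframe"], src=True):
            save_with_frames(session, urljoin(url, tag["src"]), depth - 1)

    # Usage (placeholder URL; pass a session that has already logged in):
    save_with_frames(requests.Session(), "http://www.example.edu/course/frameset.cfm")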

cyberdiva

  • Supporting Member
  • Joined in 2006
  • Posts: 1,041
Re: Any ideas for a save website crawler for offline reading?
« Reply #18 on: April 04, 2012, 10:46 PM »
Just upgraded Surfulater, as I used to use that, but even there, if I use the browser extension to save the page it just saves a link to the page - and that only goes to the login; it can't save the contents!
-Carol Haynes (April 04, 2012, 07:53 PM)
I think that Web Research will save the content of linked pages, though I don't know whether it plays nicely with frames.  I'm pretty sure you can download it for a free trial period.