Topic: Reliable web page capture...  (Read 71803 times)

johnk

  • Charter Member
  • Joined in 2005
  • Posts: 245
Re: Reliable web page capture...
« Reply #25 on: July 16, 2008, 12:58 PM »
How about mirroring the entire site and then picking out the page/pages you want?

"HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility...

If you're doing research on the web and darting around from site to site, mirroring entire sites isn't really practical. Take the sample page I used for the test in the first post -- I can't begin to imagine how many thousands of pages there are on the BBC News site.

No, for web research, grabbing single pages as you go is really the only efficient way to work.

Since doing the test, I've come at it from another angle -- I've been having a play with the web research "specialists", Local Website Archive (LWA) and WebResearch Pro (WRP) to see if they could compete with Ultra Recall and Surfulater as "data dumps".

WRP is the more fully-featured, and has the nicer, glossier UI. But it seems very low on the radar -- you see very little discussion about it on the web. And they don't have a user forum on their site, which always gets a black mark from me. One of the big features in the latest version is that its database format is supported by Windows Search, but I had problems getting the two to work reliably together. And my last two emails to their support department went unanswered...

LWA is a more spartan affair, but it does what it says on the tin. And with a bit of tweaking, you can add documents from a variety of sources, including plain text and html, and edit them, which makes it a possible all-round "data dump" for research purposes, on a small scale.

Of course LWA was never intended for such a purpose. It's a web page archive, pure and simple, and a good one. You can add notes to entries. The database is simply separate html and related files for each item. Files and content are not indexed for search. LWA doesn't have tags/categories (WRP does) so you're stuck with folders. And as LWA keeps file information (metadata) and file content in different places, it's problematic for desktop search programs to analyze it unless they know about the structure (I've already asked the Archivarius team if they'll support the LWA database structure).

LWA may be a keeper though...

cmpm

  • Charter Member
  • Joined in 2006
  • Posts: 2,026
Re: Reliable web page capture...
« Reply #26 on: July 16, 2008, 03:04 PM »
I understand, John.

Just posting for those interested in the other types of programs.

I'm glad you were specific about what you want in 'clickable links' and the rest -- I wasn't sure of that,
not being familiar with the programs listed in your first post.
I just posted what I had on the ones I know of.

tomos

  • Charter Member
  • Joined in 2006
  • Posts: 11,964
Re: Reliable web page capture...
« Reply #27 on: July 16, 2008, 03:57 PM »
LWA may be a keeper though...

I'm not actually familiar with Ultra Recall but I'm wondering - seeing as it now has good web capture, is it simply a case of you looking for the best out there, or is there something in particular missing in UR?

Tying in with that question, I'm also curious what you mean by "information management" as mentioned in your first post
Tom

cmpm

  • Charter Member
  • Joined in 2006
  • Posts: 2,026
Re: Reliable web page capture...
« Reply #28 on: July 16, 2008, 04:13 PM »
You can save a complete web page with the File menu in Firefox:
Save Page As > Web Page, complete.
I'm not sure you will get all you want, plus it needs an internet connection
(perhaps that is the problem).
But if you save it as a web page, the links are clickable.
Of course it will open in Firefox.
But you could put the file in any folder,
which may be the hindrance, being limited to saving in folders.

Just trying to understand.

No reply needed if irrelevant.

tomos

  • Charter Member
  • Joined in 2006
  • Posts: 11,964
Re: Reliable web page capture...
« Reply #29 on: July 16, 2008, 04:27 PM »
You can save a complete web page with the File menu in Firefox:
Save Page As > Web Page, complete.

me, I'm only really familiar with Surfulater & Evernote:
the advantage of these programmes (I think) is that you can save part (or all) of a web page; you also have a link to the page, and you can then organise it, maybe tag it, add a comment, maybe link it to another page, or link a file to it, etc.
I was curious myself what John was after in that sense (apart from the basic web-capture)

[attachment: SUL002.png]

Edit: added attachment
Tom
« Last Edit: July 16, 2008, 04:42 PM by tomos »

johnk

  • Charter Member
  • Joined in 2005
  • Posts: 245
Re: Reliable web page capture...
« Reply #30 on: July 16, 2008, 05:39 PM »
I'm not actually familiar with Ultra Recall but I'm wondering - seeing as it now has good web capture, is it simply a case of you looking for the best out there, or is there something in particular missing in UR?

Tying in with that question, I'm also curious what you mean by "information management" as mentioned in your first post

Good questions -- wish I could give a clear answer! I don't think my needs are very complicated. You're right -- now that Ultra Recall has sorted out web capture, it's a very strong contender. The only question mark over UR is speed, which I'd define here as "snappiness" (is that a word?). I used UR for quite a while in versions 1 and 2, and it always had niggling delays in data capture. Nothing horrendous, but saving web pages was a good example -- it would always take a few seconds longer than any other program to save a page.

I haven't used v3.5a long enough to make a decision, and I still have an open mind. But I have noticed, for example, that when you open some stored (archived) pages, loading them takes quite a few seconds. A little dialog pops up saying "please wait -- creating temporary item file". You have plenty of time to read it. Scrapbook or LWA load stored pages pretty much instantly (as they should).

I use information management as a slightly more elegant way of saying "data dump". Somewhere I can stick short, medium and long-term data, text and images, everything from project research to software registration data. I want that data indexed and tagged. I want the database to be scalable. Not industrial strength, but I want it to hold a normal person's data, work and personal, over several years without choking.

The more I search, the more I think that looking for one piece of software to do everything is silly, and maybe even counter-productive. When I think about the pieces of software I most enjoy using, they tend to do one simple task well.  AM-Notebook as my note-taker, for example. Not flawless, but a nice, small focused program (and interestingly, by the same person/team as LWA).

Slightly off the beaten track, but it may be of interest to some following this thread: one program that has been a "slow burn" for me in this area is Connected Text, a "personal wiki". That phrase alone will put some people off, and I know wiki-style editing is not for everyone. But it's a well-thought-out piece of software. I've used it for some research on long-term writing projects, and it's been reliable, with a good developer who fixes bugs quickly, and good forums.

Shades

  • Member
  • Joined in 2006
  • Posts: 2,939
Re: Reliable web page capture...
« Reply #31 on: July 16, 2008, 07:21 PM »
As far as I understand, the Zotero plugin stores everything it has downloaded to display the page in the browser when a snapshot is made. When I looked up some specific pages for a doctor here in Paraguay, it didn't take much time to collect all the necessary files and put them on his laptop so his browser showed exactly the same data as mine.

No, the Internet is definitely not available everywhere in this country... (lack of phone lines and cellular antennas), and that is mostly the terrain where this doctor has to use his specific skill (reconstructing bones so people are able to walk and/or use their hands again). That is a bad side effect that occurs when people who are too closely related have babies, but there are small communities like that here.


cmpm

  • Charter Member
  • Joined in 2006
  • Posts: 2,026
Re: Reliable web page capture...
« Reply #32 on: July 16, 2008, 08:20 PM »
Yes, I can see the intent of being able to do what John wants with what you posted, Shades.

In fact one could load a ton of info on a hard drive and mail it, and the receiver would have quite a bit of info ready to go.

J-Mac

  • Supporting Member
  • Joined in 2007
  • Posts: 2,918
Re: Reliable web page capture...
« Reply #33 on: July 17, 2008, 02:23 AM »
How about mirroring the entire site and then picking out the page/pages you want?

For me, that would be grabbing a whole lot extra that I don't want or need, just to get the one page that I do!

Thanks!

Jim

rjbull

  • Charter Member
  • Joined in 2005
  • Posts: 3,205
Re: Reliable web page capture...
« Reply #34 on: July 17, 2008, 03:24 AM »
me, I'm only really familiar with Surfulater & Evernote:
the advantage of these programmes (I think) is that you can save part (or all) of a web page

If I click the EverNote icon in Firefox, it pops up this message:

No text is selected. Do you want to add
an entire page to EverNote?

EverNote seems surprised that I might want to capture a complete page.  Sometimes I do, of course, and then I generally use LWA.  Yet I think EverNote's implication is sensible.  Do I really want to keep the fluff as well as the real content?  No, of course I don't.  In fact I mostly use EverNote at work, for capturing news items on work-related portals.  I only want the particular news article, not all the advertising or other uninteresting (to me) items.  Which makes me wonder: how many other people need complete capture all the time?

Another nice thing about EverNote is that it can output MHT files, so if I have to, I can send potted articles to other people, complete with images and clickable links.  I wish there were a universal standard for "compiled HTML" that Firefox and other browsers used, not just IE.
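
(For anyone curious what an MHT file actually is: it's just a MIME "multipart/related" message -- the same container format as an email with attachments -- with the page's HTML, images and stylesheets bundled as parts. A rough sketch in Python, using only the standard library; the file name is a placeholder:)

from email import policy
from email.parser import BytesParser

# Rough sketch: list the resources bundled inside an MHT/MHTML archive.
# "saved_page.mht" is a placeholder for any exported MHT file.
with open("saved_page.mht", "rb") as f:
    msg = BytesParser(policy=policy.default).parse(f)

for part in msg.walk():
    if part.is_multipart():
        continue
    # Each leaf part is one resource of the page: the HTML itself, an image, CSS...
    print(part.get_content_type(), part.get("Content-Location"))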


tomos

  • Charter Member
  • Joined in 2006
  • Posts: 11,964
Re: Reliable web page capture...
« Reply #35 on: July 17, 2008, 03:53 AM »
I use information management as a slightly more elegant way of saying "data dump". Somewhere I can stick short, medium and long-term data, text and images, everything from project research to software registration data. I want that data indexed and tagged. I want the database to be scalable. Not industrial strength, but I want it to hold a normal person's data, work and personal, over several years without choking.

The more I search, the more I think that looking for one piece of software to do everything is silly, and maybe even counter-productive. When I think about the pieces of software I most enjoy using, they tend to do one simple task well.  AM-Notebook as my note-taker, for example. Not flawless, but a nice, small focused program (and interestingly, by the same person/team as LWA).

I always used Surfulater for information management/"data dump"
Evernote I use for notes & short term web research (e.g. researching monitors)
I'm now using SQLNotes for information & the other two are by the wayside but still with loads of stuff in there

I think I have to take a long look at what I want to do myself & how/if I want to continue using all these programs
SQLNotes is sticking anyway - it's okay at web capture but nothing like what you want, but then it is in beta, especially in that respect
BTW, I agree with all your points. When you enter content, it should be this (simple) way.

Some of the complexities (which I'll resolve) are that the HTML pane can be used in other ways. For example, you can open an HTML file from disk. Then any changes to the content update the disk file (EN does not have this feature). You work on what looks like SN [SQLNotes] content, but it is really a local file (and eventually an FTP or other web file). You can also open a URL and view it; in this case, editing is disabled. HTM, MHT (and PDF) files are handled differently too, etc. It has many modes, and managing all of these... well... needs a bit of improvement  :(

With all three you can export - I haven't used Evernote that way, but as rj says it can export MHT files
Surfulater will mail selected articles for you (html) and exports html and MHT
SQLNotes currently exports to html
Tom
« Last Edit: July 17, 2008, 03:58 AM by tomos »

J-Mac

  • Supporting Member
  • Joined in 2007
  • Posts: 2,918
Re: Reliable web page capture...
« Reply #36 on: July 17, 2008, 10:48 PM »
To be honest, many times a pure and simple screenshot is all that I need.  It is only occasionally that I need a true and complete capture of all aspects on the web page.  For a full page screen capture it depends on the page itself as to which application I use.

For web pages that can be captured with one screenshot, I always use mouser's Screenshot Captor - you really can't beat that!  However, if the page is longer than one screenshot and must be scrolled, I then use SnagIt.  (For some odd reason, I cannot capture a scrolling page with Screenshot Captor - when I try, the window goes blank and gets very light/bright during the capture.  The browser becomes unresponsive, requiring me to end the Screenshot Captor process via the Windows Task Manager.  About half the time I also had to restart the browser, and on a few occasions I had to actually reboot!  I suspect it may be an incompatibility between Screenshot Captor and nVidia graphics cards - and possibly AMD dual core processors.)

When I need all objects on a web page, I use Local Website Archive.  More recently I have been trying to use Ultra Recall, but even with that latest fix I cannot capture most secure pages at sites where I am logged in.  Rather than just grabbing the page, UR tries to refresh it (never works, darn it!).

Jim

rjbull

  • Charter Member
  • Joined in 2005
  • Posts: 3,205
Re: Reliable web page capture...
« Reply #37 on: July 18, 2008, 09:53 AM »
When I need all objects on a web page, I use Local Website Archive.  More recently I have been trying to use Ultra Recall, but even with that latest fix I cannot capture most secure pages at sites where I am logged in.  Rather than just grabbing the page, UR tries to refresh it (never works, darn it!).

That happens to me when I try it on shareware registration sites and the like.  I assume it's because you have to be securely logged in with the current browser, and the site doesn't recognise UR as being that.  You might try using LWA with the "Send keystrokes" method, where it forces the browser to save a copy of the file to disk, then reads that, rather than trying to go directly to the original page.
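
(In rough terms, a "send keystrokes" capture just drives the browser's own Save dialog and then picks up the file it wrote. The sketch below only illustrates that general idea, not LWA's actual mechanism -- the pyautogui library, the timings and the save path are all assumptions.)

import time
from pathlib import Path

import pyautogui

# Illustration only: make the already-focused browser save the current page, then read
# the saved copy from disk. Delays and paths are guesses and would need tuning.
save_path = Path.home() / "captures" / "page.html"
save_path.parent.mkdir(parents=True, exist_ok=True)

pyautogui.hotkey("ctrl", "s")      # open the browser's Save dialog
time.sleep(1.5)                    # give the dialog time to appear
pyautogui.write(str(save_path))    # type the destination file name
pyautogui.press("enter")           # confirm the save
time.sleep(3.0)                    # wait for the browser to finish writing

html = save_path.read_text(encoding="utf-8", errors="replace")
print(f"Picked up {len(html)} characters from the browser's own copy of the page")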

Interesting note: Roboform recognises WebSite-Watcher as a mini-browser and attaches a Roboform taskbar when a WSW window appears.  WSW has an option to directly archive files to LWA - at least, I think it does - so you could log in with WSW and Roboform, then use WSW to transfer the page to LWA.  It doesn't look like Roboform sees LWA as a browser in itself, even though they're both from Martin Aignesberger, but I haven't checked thoroughly.


J-Mac

  • Supporting Member
  • Joined in 2007
  • Posts: 2,918
Re: Reliable web page capture...
« Reply #38 on: July 18, 2008, 02:18 PM »
That happens to me when I try it on shareware registration sites and the like.  I assume it's because you have to be securely logged in with the current browser, and the site doesn't recognise UR as being that.  You might try using LWA with the "Send keystrokes" method, where it forces the browser to save a copy of the file to disk, then reads that, rather than trying to go directly to the original page.

I used to do that, but when I reinstalled Windows on this computer I lost the ability.  You have to create an .ini file in order to allow that, and the last time I checked Martin had not done anything with that for FF3.

One thing about all of Martin's applications - he doesn't seem to like adding any niceties at all.  Most tasks have to be done the hard way or the long way.  One example is just this - having to create .ini files for sending keystrokes.  Also, if you try to select a folder in LWA to store your capture in and the one you would like to use doesn't exist, there is no standard "New Folder" button.  You have to stop the capture, open the main window of LWA, create and name the new folder, and only then go and do the capture again. A lot of little touches like that are missing, and he usually isn't really keen on adding them.

Which is one of the reasons I am looking for other ways to get this done.

Interesting note: Roboform recognises WebSite-Watcher as a mini-browser and attaches a Roboform taskbar when a WSW window appears.  WSW has an option to directly archive files to LWA - at least, I think it does - so you could log in with WSW and Roboform, then use WSW to transfer the page to LWA.  It doesn't look like Roboform sees LWA as a browser in itself, even though they're both from Martin Aignesberger, but I haven't checked thoroughly.

Ultra Recall is the same - listed on RF's browser page and the toolbar is there in UR.  Doesn't seem to help, though, regarding these capture issues.

Thanks!

Jim

johnk

  • Charter Member
  • Joined in 2005
  • Posts: 245
Re: Reliable web page capture...
« Reply #39 on: July 18, 2008, 06:17 PM »
One thing about all of Martin's applications - he doesn't seem to like adding any niceties at all.  Most tasks have to be done the hard way or the long way. 
I know what you mean -- I was quite amazed when I started using AM-Notebook that there were no shortcut keys either to start a new note or to restore the program from the system tray  -- two of the most basic and most used functions (and this was version 4!). I had to use AutoHotkey to create the shortcuts (thank goodness for AHK). To be fair to Martin, he did add a global restore hotkey when the issue was raised in his forums.

There are two sides to this, though. On one level, I actually like the .ini file approach to capturing information in LWA. It means that you can generate semi-automated capture from all kinds of programs. In the last couple of days I've created ini files for Word and Thunderbird, and they work fine. At least "the hard way" is better than "no way".

J-Mac

  • Supporting Member
  • Joined in 2007
  • Posts: 2,918
Re: Reliable web page capture...
« Reply #40 on: July 18, 2008, 11:13 PM »
John, if I knew how to properly create the ini files, I would.  But Martin doesn't have any instructions for this on his web site.  I guess he designs purely for programmer-types.

BTW, AM-Notebook, which I have owned since, I think, V.2, still requires you to name the note before you write it. I can't handle that!

Thanks!

Jim

johnk

  • Charter Member
  • Joined in 2005
  • Posts: 245
Re: Reliable web page capture...
« Reply #41 on: July 19, 2008, 07:08 AM »
John, if I knew how to properly create the ini files, I would.  But Martin doesn't have any instructions for this on his web site.  I guess he designs purely for programmer-types.

Jim -- I can assure you, I'm no programmer. But I know my way around a computer by now and I'm familiar with writing keyboard macros (which is the most difficult bit in creating LWA ini files). The ini files are not too difficult to put together. If you'd like some help, I'm happy to do it by PM. But I agree, Martin should at least offer a wizard to guide people through setting up an ini file. The section on ini files in the LWA help file is, well, not very helpful.

The ini files are actually LWA's trump card. While LWA's direct rival, WebResearch Pro, is much more powerful and advanced in many ways, it doesn't have an equivalent of LWA's ini files, so you can't create your own import filters. For example, WebResearch doesn't support Thunderbird natively, so you have to export to eml first, blah, blah. Swings and roundabouts...

nevf

  • Charter Honorary Member
  • Joined in 2005
  • Posts: 115
Better Web Page Capture coming in Surfulater Version 3
« Reply #42 on: August 19, 2008, 07:24 PM »
I have done extensive updates to the code that captures complete Web pages in Surfulater. The BBC News page, for example, now captures without the problems shown in this thread. You can see the results in my blog post, Better Web Page Capture coming in Surfulater Version 3.

Surfulater Version 3 is a major upgrade with many important new features. See our Blog for further information. V3 is planned for release in Sept 2008. Pre-release versions with the new Tagging capability are available for download on the blog.
Neville Franks, Clibu - a better way to collect, use, manage and share information across all devices.

cmpm

  • Charter Member
  • Joined in 2006
  • Posts: 2,026
Re: Reliable web page capture...
« Reply #43 on: October 18, 2008, 10:01 AM »
http://pagenest.com/index.html

A free complete web page capture utility with many options, from freedownloadaday.com -- might fit some criteria here.

kartal

  • Supporting Member
  • Joined in 2008
  • Posts: 1,529
Re: Reliable web page capture...
« Reply #44 on: November 01, 2008, 04:34 PM »
I am looking for a scrapbook solution that is designed mainly for images and multimedia, has nicer import/export functions, and can work with FF. It should fully support drag and drop (both in and out).  I am not talking about taking picture snapshots of the pages.  This app should mainly focus on images and multimedia that are embedded or represented in the pages, but an image screenshot feature could be a nice addition as well.

I love Scrapbook, but I hate the way it exports files, and you really have no control over the content of the export.

« Last Edit: November 01, 2008, 04:37 PM by kartal »

J-Mac

  • Supporting Member
  • Joined in 2007
  • Posts: 2,918
Re: Reliable web page capture...
« Reply #45 on: November 01, 2008, 09:49 PM »
I am looking for a scrapbook solution that is designed mainly for images and multimedia, has nicer import/export functions, and can work with FF. It should fully support drag and drop (both in and out).  I am not talking about taking picture snapshots of the pages.  This app should mainly focus on images and multimedia that are embedded or represented in the pages, but an image screenshot feature could be a nice addition as well.

I love Scrapbook, but I hate the way it exports files, and you really have no control over the content of the export.



Are you looking for just the images? Do you even need to have a rendering of the web page itself?

For images alone, SnagIt has a built-in profile that pulls all images from a web page, but the latest version, 9, is terrible. If you decide to use it, try to get version 8.2 instead.

Jim

kartal

  • Supporting Member
  • Joined in 2008
  • Posts: 1,529
Re: Reliable web page capture...
« Reply #46 on: November 01, 2008, 10:13 PM »
I am mainly interested in images. But what I really, really want is for the app, when it takes images from a site, to put all the info -- web site, time of capture, possible hyperlinks, etc. -- into the IPTC or EXIF data of the images (assuming that mainly JPEG images are captured). This would be great because I don't want to keep separate html files for the images. And PDF is not an option, because I mainly browse images via thumbnails; PDF capture of the pages would be overkill.
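
(The EXIF half of that is quite doable once an image is on disk -- a rough sketch with Pillow, where the file names, URL and tag text are placeholders and nothing is tied to any particular capture tool. Note that re-saving re-encodes the JPEG; a dedicated EXIF library could rewrite the tag in place instead.)

from datetime import datetime

from PIL import Image

# Rough sketch: stamp the source URL and capture time into a downloaded JPEG's
# ImageDescription EXIF tag (0x010E), so the provenance travels with the file.
source_url = "https://example.com/gallery/photo.jpg"    # placeholder
captured_at = datetime.now().isoformat(timespec="seconds")

img = Image.open("photo.jpg")                            # placeholder file name
exif = img.getexif()
exif[0x010E] = f"Captured {captured_at} from {source_url}"
img.save("photo_tagged.jpg", exif=exif)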

J-Mac

  • Supporting Member
  • Joined in 2007
  • Posts: 2,918
Re: Reliable web page capture...
« Reply #47 on: November 01, 2008, 11:50 PM »
I don’t know...

I generally use the Scrapbook extension in Firefox. I can save the images individually from the captured web page, and I can also use SnagIt to grab all the images from the Scrapbook capture of a web page.

Come to think of it, if I wanted to do what I believe you are looking to do, I'd just grab the page with Scrapbook and then perform a SnagIt capture using the "Images from Web page" profile. That's all I can think of right now, mainly because that works for me.

I understand that's possibly not what you are seeking though.

Jim

cmpm

  • Charter Member
  • Joined in 2006
  • Posts: 2,026
Re: Reliable web page capture...
« Reply #48 on: November 02, 2008, 08:44 AM »
A couple of web page rippers, for capturing portions of a web page:

https://addons.mozil.../search/?q=Clipmarks

http://file2hd.com/

Darwin

  • Charter Member
  • Joined in 2005
  • Posts: 6,984
Re: Reliable web page capture...
« Reply #49 on: November 02, 2008, 09:24 AM »
I've used Evernote 2, NetSnippets, and now Ask Sam 7. All of them do a good job. Ask Sam is the best of the bunch WRT retaining the exact formatting of the page being saved. Of course, it's VERY expensive (I got it when it was on sale)...