

How to make a local copy of an ancient Web forum?


MrCrispy:
Some speculation without having used the ripping software or knowing how the forum works -

1. If the various ways to get to the same post (or thread) actually resolve to the same URL, then the ripper should only download it once. Do they do this?

2. In some forums, expand/collapse/quick reply etc. are Ajax/JavaScript actions rather than page loads. Can these be ignored at the page level?

3. If #1 is not true, then I guess some form of content analysis could be used to detect duplicates (the ripper would notice that a page has already been downloaded because it has the same HTML). I'm sure no one does this, since it wouldn't be reliable, and even if it were, it would be slow.

4. Can you tell the ripper to exclude certain links, like those matching "expand"/"collapse" etc.? (Rough sketch of #1, #3 and #4 below.)
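
Something like this is what I have in mind (untested Python; the query parameters and the "action=" patterns are only guesses at how an SMF-style forum builds its URLs, not this site's real ones):

--- Code: ---
import hashlib
import re
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# 4. links that should never be followed at all
EXCLUDE = [re.compile(p) for p in (
    r"action=(expand|collapse|quickreply|printpage)",
    r"[?&;]sort=",
)]

# 1. collapse the different routes to the same thread into one key
def canonical(url):
    parts = urlsplit(url)
    # keep only the query parameters that actually select the content
    keep = [(k, v) for k, v in parse_qsl(parts.query)
            if k in ("topic", "board", "start")]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(sorted(keep)), ""))

def should_fetch(url, seen_keys):
    if any(rx.search(url) for rx in EXCLUDE):
        return False           # 4. skip expand/collapse/quick-reply links
    key = canonical(url)
    if key in seen_keys:
        return False           # 1. another route to a page we already have
    seen_keys.add(key)
    return True

# 3. last-resort duplicate detection on the downloaded HTML itself
def is_duplicate_content(html, seen_hashes):
    digest = hashlib.sha1(html.encode("utf-8", "replace")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
--- End code ---

As said in #3, the content-hash fallback would be fragile in practice: two copies of the same post can differ in timestamps, view counters and session IDs, so the hashes wouldn't match.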

f0dder:
Compared to some behemoths, DonationCoder is almost like a grain of sand. Just for reference, the IGN boards have exactly 189,401,031 messages at this exact moment. And they have a couple of users with more than 100,000 posts, so...
-Lashiec (November 26, 2008, 09:04 AM)
--- End quote ---
That's insane :-s

David.P:
Wow, I can't believe that you could simply do that! Thanks, I'll look into this database and see what I can extract from it.

Otherwise, what I have found out is that, as far as "spidering intelligence" goes, the software Blackwidow is one of the best, since it lets you specify

a) which pages are only scanned for the links you want; and
b) which pages are actually downloaded.

This way, Blackwidow actually manages to crawl from posting to posting in that forum, downloading only the posts and ignoring EVERYTHING else on the website. It also manages not to download every posting multiple times.
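
In other words, the two filter lists boil down to something like this (illustrative Python only, not Blackwidow's actual filter syntax; the URL patterns are invented examples for an SMF-style forum):

--- Code: ---
import re

# a) pages that are only scanned for further links, never saved
SCAN_ONLY = [re.compile(p) for p in (
    r"index\.php\?board=\d+",        # board listings
    r"index\.php\?topic=\d+\.\d+$",  # thread index pages
)]

# b) pages that are actually downloaded
DOWNLOAD = [re.compile(p) for p in (
    r"index\.php\?topic=\d+\.msg\d+",  # direct links to single postings
)]

def classify(url):
    if any(rx.search(url) for rx in DOWNLOAD):
        return "download"
    if any(rx.search(url) for rx in SCAN_ONLY):
        return "scan-only"
    return "ignore"
--- End code ---

The point is that the board and thread index pages are still crawled (otherwise the spider would never find the postings), but they never end up on disk.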

Blackwidow, however, seems to have a hard time rewriting the HTML code such that it actually becomes browsable offline  :(

The latter, however, is what the program WinHttrack does marvelously. Unfortunately, WinHttrack has the drawback that its filter settings are not as sophisticated as Blackwidow's: you can't differentiate between pages that are only scanned for links and pages that are actually downloaded. Therefore WinHttrack (which does a beautiful job of converting the pages for offline browsing) ends up downloading much more than what you are actually after.
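
Just to illustrate what I mean by rewriting the HTML for offline use: this is, much simplified, the kind of transformation WinHttrack applies to every saved page (the file-naming scheme here is something I made up; a real tool also handles images, stylesheets, frames and relative links):

--- Code: ---
import re
from urllib.parse import urlsplit, parse_qs

def local_name(url):
    """Map a forum URL to the file name the page was saved under."""
    query = parse_qs(urlsplit(url).query)
    if "topic" in query:
        return "topic_%s.html" % query["topic"][0].replace(".", "_")
    return None  # not something that was downloaded; leave the link alone

def rewrite_links(html):
    def repl(match):
        target = local_name(match.group(2))
        if target is None:
            return match.group(0)
        return '%s"%s"' % (match.group(1), target)
    # crude: only rewrites double-quoted href attributes
    return re.sub(r'(href=)"([^"]+)"', repl, html)
--- End code ---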

Thanks everyone,
David.P

PS: Filter settings in Blackwidow:

agentsteal:
It wasn't simple.
Why did you want a copy of this forum?

--Edit:
A moderator deleted my post  :mad:

We had a number of comments that this content is not appropriate for donationcoder.com.

It is not ethical to hack into websites and it is not something we condone or want to encourage.

I don't know where you are based, but in some countries admitting what you did could land you in a court case and possibly prison. (Certainly the UK and US are getting much tougher on people who hack.) Just because a forum appears to be abandoned wouldn't give you any protection.
--- End quote ---

The forum belongs to a company that went bankrupt in 2001. I wasn't doing anything malicious; I just made a copy of the forum database.

Carol Haynes:
Sorry, I didn't mean to upset anyone - even if the company went bust in 2001, someone owns the domain name and must be paying to host the site. I can't see how hacking into the back end of a website can really be condoned, unless it is your own site and you have lost the login credentials.
