Author Topic: Reocities: the GeoCities one-man rescue project  (Read 350629 times)

40hz

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 11,857
Reocities: the GeoCities one-man rescue project
« on: October 29, 2009, 03:22 PM »
As most of you know, GeoCities is now a part of Internet history.

Apparently Yahoo! was also intent on having it fade into obscurity despite numerous outside offers to assist in backing it up for the historic record.

What follows is the tale of one guy who decided to go a little further than just offer.

From our friends over at Download Squad comes this story:

Link: http://www.downloads...e-but-not-forgotten/

Reocities: because Geocities is gone, but not forgotten
by Jay Hathaway, Oct 29th 2009 at 1:00PM


When Yahoo! decided to close down GeoCities, a lot of us shed a single tear for our first home on the Internet and moved on. For one man called Jacques, though, that wasn't good enough.

He took it upon himself to save as much of GeoCities as possible, by writing scripts that pinged the site to find active pages, and then downloaded them to his personal storage space. The one-man project, called Reocities, rescued an estimated 600,000 GeoCities sites before the big shutdown.

The above article has links to the Reocities homepage (http://reocities.com):

Welcome To ReoCities...

Here lies what we could salvage from the ashes of GeoCities.

Yahoo! has done an amazing thing by keeping GeoCities alive for as long as they did, but we feel that it is a waste to leave the Internet with a hole of this magnitude. At a minimum, Yahoo! could have simply left GeoCities as a monument to the early days.
Maybe close it off from editing and simply make it static after getting rid of the spam pages once and for all.
Behind this minimalistic page stretches a wealth of Internet history. If any of it was yours and we have successfully recovered it, then we hope it makes you happy to see it restored.

We've rebuilt the walls to the Cities and the streets where a large part of the early settlers of the World Wide Web used to live in. You can still find them where they were before, but not all of the houses have been rebuilt yet.

As time passes, we will try to recover more and more of what was lost, at least as much as is technically possible. If you wish to help with this effort, and you have your old GeoCities content backed up, then please email us at [email protected], but *not* before we've stopped importing the data that we have right now.

- and the link to a "making of" page which gives some insight into what's involved in snagging a copy of something the size of GeoCities when the clock is running out:

Link: http://www.reocities...ewhome/makingof.html

Size

GeoCities is large. Very, very large. Not when compared to, say, the likes of MySpace or Facebook. But compared to your average garden variety website, it is huge. Given that, when GeoCities first launched in 1994, the average hard drive was somewhere around 500 MB, to store multiple hundreds of gigabytes must have been a complicated technological feat to achieve.

RAID was already around, but those 'inexpensive' disks were, for the most part, not that inexpensive. Storage technology was several orders of magnitude slower and had a smaller capacity than today. In spite of all that, you can't just go and make a copy like you could do with any other set of pages. Yesterday's giants are still pretty big.

Number Of Files

GeoCities comprises hundreds of millions of files in all kinds of formats, and the most important part of the link structure, the .html and .htm files, were made in an age when FrontPage was considered hot stuff.

To avoid overflowing the directory structure on the machines that GeoCities was using, they opted for a tree based format. This meant that any one of the Cities was subdivided into Neighborhoods, and each one had 10 000 accounts, maximum.

How's this for a website backup toolkit?

The ingredients:

    
  • 1 iconic website about to be erased
  • 21 pots of strong tea
  • more sugar than is probably healthy
  • very little sleep
  • some computing gear
  • one solid Internet connection
  • 6 days in October 2009
  • Some very good help (Thanks Abi!)



And if you think this project was nothing more than a raw download job, check this out:

21:00, Friday, 23 October 2009 - The Secret Weapon

At this point in time there are only 44 hours to go until it is permanently curtains for GeoCities. We're talking Friday to Saturday night, and I realize that if I don't do something drastic, then this effort is going to fail.

So, enter the secret weapon. A couple of years ago I wrote a small (about 1 billion pages) search engine. For that purpose I bought a cluster of 5 machines, which have since been upgraded with 4 TB storage each, and already had a fairly beefy CPU. They're also connected to the net with some good uplinks and have a 1Gb/s connection between them to a dedicated switch. Time to get those guys involved.

Now that we know the structure of GeoCities, it is possible to farm out the fetching of pieces to each of the cluster nodes. A small program figures out who is busy with what, and each cluster node can concentrate on one of the 721 shelves, and the 10 000 possible accounts on that shelf. In the past 4 days some of those shelves have already seen extensive coverage, so we mark those as done, leaving about half to be processed still. After a few more hours to get this all set up the cluster was humming along at 150 Mb/s inbound. That's a CD every 30 seconds or so!
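
For anyone curious what "farming out" the shelves might look like, here is a minimal Python sketch. It is not Jacques' actual program: the node names and function names are made up for illustration, and it uses a plain static round-robin split instead of the "who is busy with what" coordinator described above. The 721 shelves, the five machines and the 150 Mb/s figure come from the quote; the 700 MB CD size is my assumption.

```python
from collections import defaultdict

NUM_SHELVES = 721  # figure from the quote above
NODES = ["node0", "node1", "node2", "node3", "node4"]  # hypothetical names


def assign_shelves(shelves, done, nodes):
    """Split the shelves that are not yet done across the nodes, round-robin.

    This is a static split for illustration only; the original used a small
    program that tracked which node was busy with what.
    """
    pending = [s for s in shelves if s not in done]
    plan = defaultdict(list)
    for i, shelf in enumerate(pending):
        plan[nodes[i % len(nodes)]].append(shelf)
    return dict(plan)


def seconds_per_cd(mbit_per_s, cd_megabytes=700):
    """Seconds needed to pull one CD's worth of data at the given rate.

    Assumes decimal units: 1 MB = 8 Mb, and a 700 MB disc.
    """
    return cd_megabytes * 8 / mbit_per_s


if __name__ == "__main__":
    # Pretend every even-numbered shelf was already covered ("about half").
    already_done = set(range(0, NUM_SHELVES, 2))
    plan = assign_shelves(range(NUM_SHELVES), already_done, NODES)
    for node, shelves in plan.items():
        print(node, "gets", len(shelves), "shelves")
    print("seconds per CD at 150 Mb/s:", round(seconds_per_cd(150), 1))
```

At 150 Mb/s a 700 MB disc works out to roughly 37 seconds, so "a CD every 30 seconds or so" is in the right ballpark.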

Did he say some computing gear? A 5-machine cluster and a self-written search engine? He calls that some computing gear? Talk about a certified propeller-head!


Very cool stuff! 8)

----
P.S. Congratulations Jacques - whoever you are. Because of your efforts to salvage a piece of Internet history, you've made a place in its history books for yourself. Not too shabby for a one week project, hey?  ;D





« Last Edit: October 29, 2009, 03:31 PM by 40hz »

superboyac

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 6,347
Re: Reocities: the GeoCities one-man rescue project
« Reply #1 on: October 29, 2009, 04:54 PM »
Ah...Geocities.  I remember it.  I think I had a website on it.  The only thing I had on the site was that stupid animated gif of the hand coming out and grabbing something.  I thought that was so cool at the time.

[edit]
HAHA.  I found it...so stupid!
hands.gif
« Last Edit: October 29, 2009, 04:56 PM by superboyac »

f0dder

  • Charter Honorary Member
  • Joined in 2005
  • ***
  • Posts: 9,153
  • [Well, THAT escalated quickly!]
Re: Reocities: the GeoCities one-man rescue project
« Reply #2 on: October 29, 2009, 06:18 PM »
Oh, GeoCities... last update time on my page there is ~12 years ago. The memories :)
- carpe noctem

JavaJones

  • Review 2.0 Designer
  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 2,739
Re: Reocities: the GeoCities one-man rescue project
« Reply #3 on: October 29, 2009, 07:55 PM »
What about archive.org? Was Geocities as a rule not spidered? Note: I didn't read any of the articles, so my apologies if it's explained within. ;)

- Oshyan

app103

  • That scary taskbar girl
  • Global Moderator
  • Joined in 2006
  • *****
  • Posts: 5,884
Re: Reocities: the GeoCities one-man rescue project
« Reply #4 on: October 29, 2009, 07:58 PM »
Archive Team and textfiles.com managed to save the rest of it and crammed it all into a single page. It sure is a sight to behold.  :D

40hz

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 11,857
Re: Reocities: the GeoCities one-man rescue project
« Reply #5 on: October 29, 2009, 08:37 PM »
What about archive.org? Was Geocities as a rule not spidered? Note: I didn't read any of the articles, so my apologies if it's explained within. ;)

- Oshyan

Hi Oshyan!

Yes, you are correct. Reocities was not the only person or group attempting to archive GeoCities. Several other groups were also independently pursuing the same goal. But I never meant to imply that Reocities was the only player in the game.

What made Jacques' story so interesting was that it was essentially a solo initiative, and took a unique approach to getting the data. Using a search engine cluster was a pretty slick approach no matter how you look at it. And that 150Mb/s inbound rate was amazing. Remember this is one guy's personal rig - not some corporate data center that was doing this.


Fortunately, most of these archiving groups are cooperating with each other to share what they've been able to get in order to produce the most complete collection possible under the circumstances.

This from the archiveteam.org website:

There have been other parallel projects also mirroring Geocities besides Archive Team. These include Archive.Org, Reocities, geocities.ws, and Internet Archaeology. All groups appear to have gotten different amounts of the Geocities collection, and most are now sharing data to track down gaps and share copies.

 8)



JavaJones

  • Review 2.0 Designer
  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 2,739
Re: Reocities: the GeoCities one-man rescue project
« Reply #6 on: October 29, 2009, 09:08 PM »
Gotcha, yeah. I didn't mean to belittle the interest or value of your post at all, mostly just curious why he went on an independent project when Archive.org (and apparently others?) were already set up to do it. But it does sound like he did some cool stuff. :)

- Oshyan

rgdot

  • Supporting Member
  • Joined in 2009
  • **
  • Posts: 2,192
Re: Reocities: the GeoCities one-man rescue project
« Reply #7 on: October 29, 2009, 10:38 PM »
Great story :)

Also who could ever forget the annoying midi files that ran in the background of so many of those sites

tomos

  • Charter Member
  • Joined in 2006
  • ***
  • Posts: 11,959
Re: Reocities: the GeoCities one-man rescue project
« Reply #8 on: October 30, 2009, 05:14 AM »
Also who could ever forget the annoying midi files that ran in the background of so many of those sites

it's all new to me :)  Another Midi Page
Tom

Deozaan

  • Charter Member
  • Joined in 2006
  • ***
  • Points: 1
  • Posts: 9,747
Re: Reocities: the GeoCities one-man rescue project
« Reply #9 on: October 30, 2009, 12:00 PM »
Great story :)

Also who could ever forget the annoying midi files that ran in the background of so many of those sites

The only thing worse than that was when HTML in e-mail was fairly new and people started doing crazy "stationery" images and midi in the e-mail. )c:

rgdot

  • Supporting Member
  • Joined in 2009
  • **
  • Posts: 2,192
Re: Reocities: the GeoCities one-man rescue project
« Reply #10 on: October 30, 2009, 03:47 PM »
Looking at the page tomos linked, one also remembers why there were so many articles and tools on choosing color schemes.

app103

  • That scary taskbar girl
  • Global Moderator
  • Joined in 2006
  • *****
  • Posts: 5,884
Re: Reocities: the GeoCities one-man rescue project
« Reply #11 on: October 30, 2009, 04:30 PM »
My first website was hosted on AOL Hometown (also gone), primarily because it had more disk space than Geocities, unlimited bandwidth, and no hotlink protection on my files. (And for a while, it was ad free.)

I had a site that supplied backgrounds, clip art, and midi hotlink codes, for use on sites like Blogger, Live Journal, Dead Journal and Neopets, which didn't allow uploading files. This was back in the days before there were any free image hosting sites like Imageshack and Photobucket, and people had a lot of trouble finding places to host their files for use on other sites, for free.

I see that Reocities went only for the sites that had the traditional area + number URLs. There was a point where Yahoo stopped doing that and site URLs were based solely on username, without the area designation.

I guess all of those sites are not included in this archiving effort and it was just limited to the older ones.

Another thing that might have made it more difficult to archive these sites was that Yahoo instituted a 4 MB/hour bandwidth limit. If you tried to archive a site bigger than 4 MB all at once, you would end up knocking the site out for everyone and having to wait an hour to obtain more of the files. (I ran into this issue when trying to archive a friend's 10 MB Geocities site.)

This bandwidth limit was common knowledge and was often used to DoS attack sites by downloading them repeatedly until they hit the bandwidth limit. It was also why you couldn't submit any Geocities URLs to Digg.
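
To make the hourly limit concrete, here is a rough Python sketch of how someone could have planned a polite download of a site larger than the limit: group the files into batches that each stay under the budget, then fetch one batch per hour. The function name and the file list are hypothetical, I am assuming "4 MB" means 4 * 1024 * 1024 bytes, and the sketch ignores bandwidth used by other visitors in the same hour.

```python
def hourly_batches(files, budget_bytes=4 * 1024 * 1024):
    """Group (name, size_in_bytes) pairs, in order, into batches whose
    total size stays within the hourly budget.

    Raises ValueError if a single file is larger than the budget, since
    it could never be fetched inside one window.
    """
    batches = []
    current = []
    used = 0
    for name, size in files:
        if size > budget_bytes:
            raise ValueError(f"{name} alone exceeds the hourly budget")
        # Start a new batch when this file would push us over the limit.
        if current and used + size > budget_bytes:
            batches.append(current)
            current = []
            used = 0
        current.append((name, size))
        used += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    one_mb = 1024 * 1024
    # Hypothetical 10 MB site made of ten 1 MB files.
    site = [(f"file{i}.gif", one_mb) for i in range(10)]
    plan = hourly_batches(site)
    for hour, batch in enumerate(plan):
        print("hour", hour, "fetch", len(batch), "files")
```

With ten 1 MB files this gives batches of 4, 4 and 2 files, so a 10 MB site would need three hourly windows.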

40hz

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 11,857
Re: Reocities: the GeoCities one-man rescue project
« Reply #12 on: November 01, 2009, 08:20 PM »
I see that Reocities went only for the sites that had the traditional area + number URLs. There was a point where Yahoo stopped doing that and site URLs were based solely on username, without the area designation.

All may not be lost. They're not restricting the recovery effort to just what they could 'spider' and snag from the web. Eventually they plan to reach out to GeoCities site owners who succeeded in archiving their sites to help fill in some of the missing pages.

This from the Reocities homepage:

As time passes, we will try to recover more and more of what was lost, at least as much as is technically possible. If you wish to help with this effort, and you have your old GeoCities content backed up, then please email us at [email protected], but *not* before we've stopped importing the data that we have right now.

 8)


app103

  • That scary taskbar girl
  • Global Moderator
  • Joined in 2006
  • *****
  • Posts: 5,884
Re: Reocities: the GeoCities one-man rescue project
« Reply #13 on: November 08, 2009, 06:15 PM »
I have been lurking in the Archive Team IRC channel and I am pretty up to date with what is going on now.

It's quite impressive, the amount of data they have collected and are sync'ing. 

There are plans to build a "best of" page, linking to some of the real gems.

They need help with tagging, so if you want to submit some tags, pop them into the box at the top of each page and click the button. What would be especially helpful is if you come across spam sites to tag them as such so they can eventually be removed.

They are already talking about the next big projects (tinyurl & twitter). I am considering volunteering to help with the twitter project and have done a little bit for them already (archived as much as I could get for the 100 most popular users).