I like the concept of "Digital Heritage" as they call it.
I think preserving digital heritage may become an important issue in the next few years. As the world begins to recognize the historic legacy of the web, it may someday become as important as preserving physical historical landmarks.
It's time to accept that the internet may be the greatest legacy 21st century civilization leaves behind. It likely isn't going anywhere, and it is the most complete archive of our lives, one that may live on for generations after we're dead.
For all we know, the things we type here could be preserved longer than the Pyramids of Giza, should someone make sure to back them up regularly. I hope that someone does.
Charles Stross has an excellent novel, "Glasshouse", a major plot point of which is that due to shoddy digital preservation practices, far-future historians know virtually nothing about ≈1950 through ≈2050 or so. What little they know is pieced together from fragmentary bits and pieces of evidence, with results that are at turns hilarious and horrifying when "put into practice" by historical re-enactors. Kind of reminded me of David Macaulay's "Motel of the Mysteries" in some ways.
I'll second the positive review of "Glasshouse." The lack of knowledge portrayed in the book is also an interesting commentary on one of the often unsung evils of DRM. That license server certainly won't be running 300 years from now.
Historians learn most about ancient societies by digging through their trash. We have left more trash that takes ages to degrade than all prior societies put together.
History will only be at stake if they get serious about recycling.
I remember back in '95 building websites that were basically just animated GIFs (no judging, I was a kid). I used some "under construction" ones because I had seen them everywhere and I was constructing!
Heh. I'd say that comment made it to Hacker News last year, but I already linked it in the Mefi thread! And then I got server push animation working on my server, & yes, it's really as hideous as people say.
Keep in mind that the idea of a document that might be different every time you pull it up was still kinda new.
So maybe the "construction" part was a meme, but I think the idea that you needed to warn people that this homepage wasn't "finished" yet was universal.
Hard to say if it was the first or not. It may have been "Here's my list of interesting web pages (most of them were probably links to other people's lists of interesting pages, very recursive.)" I don't think the word meme existed then either.
Jason Scott is a pretty cool dude; random story:
He was doing the JetBlue unlimited month of travel and showed up in Pittsburgh a few weeks ago to talk at CMU. He was giving a talk at Seton Hill college the next day (about 45 mins away), and posted on twitter looking for a ride. A friend of mine forwarded the tweet to me, and I got in touch with Jason and ended up giving him a ride the next day. I had very little clue who he was before I met him, but he gave me a copy of Get Lamp and some good discussion in return.
I'm actually sad that I didn't put my first website on geocities now! I had my own web hosting from my ISP, so I had a fabulously easy to remember url of "homepages.tig.com.au/~liedra" which was lost as soon as my family upgraded to a cable connection from dialup. And no, archive.org didn't manage to catch it :( I think it had a page devoted to Nick Cave and some terrible poetry! Go go websites of a 17 year old! :)
The interesting thing was that at the time my friends and I (who had ISP-based homepages) looked down on Geocities because it was "lame" comparatively. Now I'm sad that I don't have any records of that original page (possibly on an ancient CD-R though? but most of those early ones have degraded now...)
My first pages were on my ISP which offered a subdomain! I paid £2 a month extra (on top of call charges) for the privilege. My friends thought I was crazy but I showed them when that very same subdomain impressed someone enough to give me a web dev job ("You have a subdomain? Impressive").
I also looked down on geocities/angelfire sites and I still think I got the better deal out of it - my first stuff was too embarrassing to live on for eternity in the depths of a torrent.
The early web is a treasure trove of an interesting time in history. It was the first time average people could just write public documents to express themselves.
Naturally the pages were terrible, covered in things that look good the first time you see them, pointless opinions and personal shrines to obscure relics of pop-culture.
The web is still the same, but more everyday. Companies work day and night to have a web presence, and "using the internet" is synonymous with replying to status and 'liking' things.
Geocities, AOL Homepages, and tripod are landmarks of the first time in history someone could just make a page about themselves, or something they liked and _anyone_ could see it. It was society making paintings on caves.
Unfortunately, these sites don't produce revenue, and never will, so from a corporate point of view, they are worthless.
The early era of the web is like trying to find rare music. Of course there is a modern site, a torrent, or some convenient way to find most of what you want. What you find is at best, the same thing everyone else finds. The old web is full of non-technical people earnestly trying to make something, not a startup, not to sell a book, just trying to put something together which is largely lost in the ease of "List your favorite bands"
Not that it was better, or more insightful, simply that it is a huge body of primitive work that is unlikely to be recreated. These things should be stored, if for no other reason than we can see the bloviated opinions of Mensans, the C-style poetry of '90s sysadmins, or just the insane ramblings of people who think like Gene Ray, but don't have the perseverance to keep up timecube.
The sites are a labor of love, no matter the revenue, and it annoys me to no end that AOL or Yahoo has the power to simply delete these old sites because they don't make business sense, to businesses that don't even know what they are doing.
Anyway, as someone who mirrored a few old HomePages and Geocities sites, and backs up pieces of the old internet whenever I can find them, this is a breath of fresh air.
Hey, everyone, Jason Scott (the textfiles.com guy) here.
Just wanted to address that reocities.com has even more than I do, and more than what's in the torrent. If you want to browse geocities, like ye old days, go visit reocities. This data release is never meant to be "all of geocities" just "a lot of geocities" (and all I have).
Skeletal Lovers
Two dead people
Embraced even in death
They lie there for eternity
Together forever
Memories turned to dust
Laughter and sin forgotten
Nothing but pale white bone
Nothing to complain about
Only the two of them
Forever together
I had someone email me a few days ago asking for code for an old webring script I'd written in 1998. I was a little amazed that 1. anyone would still want such a thing, and B. someone had found a listing for my 12yr old PHP script and thought it worth using. I haven't had the code in my possession for at least 10 years, and being so old, I'm sure it was full of all kinds of security loopholes. No idea where he came across it, but apparently some resource site somewhere on the internet still lists it.
They don't work; the second you start getting any serious traffic you get a warning. I had a video (Flash movie) that got popular, and they sent me a warning after only 2 gigs of bandwidth.
Granted, those 2 gigs were used up in something like 5 minutes, but nevertheless... you'd think they'd give you a little more to play with.
I remember creating my first site on geocities. A South Park fan site with links to download episodes (linked to another site hosting them, of course). That obviously became the most popular feature. Think I will have to get the torrent, or at least part of it, for a trip down memory lane. It is the digital equivalent of the '80s haircut.
I'd love to be able to search those contents. I'm pretty sure I had a few Geocities sites, but I'm not going to download a terabyte to see if it's in there.
I imagine at least somebody will download to a server and host them all there. Might grab a new 1TB drive into one of my servers and do it if I'm bored enough...
I'd love to know what sort of infrastructure you're running this on.
I'm in a course for my Masters for library school that deals with similar sorts of problems in maintaining and preserving for the "long term" digital materials.
From your site it sounds like you wrote a script using wget to harvest the files and another to check them against versions that were still up. What do you do on the server end now to ensure that the files are still working correctly? Are you running periodic checksums on them or the like? Finally, are you looking for any help from an interested novice?
I have a large database table that stores the md5 hashes of all the files and there is a script that can compare all of the contents of the site with the hashes in the files (and with a second copy if that's what it would come to).
Some bitrot is inevitable but I think it's under control for now.
As for help, yes, but right now I'm pretty swamped in other stuff, the next round of work on reocities will likely come after the new year.
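For anyone wondering what that kind of hash check might look like, here is a minimal sketch in Python. It is not the actual reocities script: the function names are made up for illustration, and the "database table" is stood in for by a plain dict mapping relative paths to MD5 hex digests.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Hypothetical helper: map each file's relative path to its MD5."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Compare files on disk with the stored hashes.

    Returns (missing, changed): relative paths that have vanished, and
    relative paths whose current MD5 no longer matches the manifest.
    """
    root = Path(root)
    missing, changed = [], []
    for rel, expected in manifest.items():
        candidate = root / rel
        if not candidate.is_file():
            missing.append(rel)
        elif md5_of_file(candidate) != expected:
            changed.append(rel)
    return missing, changed
```

Run `build_manifest` once while the copy is known to be good, store the result somewhere safe, and run `verify` periodically; anything reported as changed can then be restored from the second copy.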
Have you considered using something like MogileFS? It'd be perfect for this sort of situation.
Let me know if you're interested in this or have any questions -- I've dealt a good bit with systems like this in the past, and would love to give you a hand.
I checked, my very first website is thus far not there. I'll wait until you're finished, and if it doesn't appear I can send the offline backup. I've been carting it around for 10 years, laughing at the animated graphics and frames.
While I think the web is probably a better place with reocities than without it, slapping ads on this content feels a tad bit slimy to me. I'm curious to hear your take on the ethical and legal implications of it.
And let's suppose for a second that they were not (as they have not been in the past), reocities has cost a fairly large amount of money to date (instead of made money, as you suggest), not a single person that has asked me to remove the content has ever commented on the presence of the ads, and neither has anybody that has found their stuff again because I backed it up.
On the contrary, the reactions have been almost 100% positive with a very few exceptions.
I apologize if that's how you read my post; I certainly didn't intend to suggest that or to belittle the amount of work and money that you've put into this project.
This is from the Google stable of morals - if people don't know you're profiting from their copyright then it's OK.
Presumably you have no legal right to redistribute any of the Geocities stuff?
The fact you spent time and money on it is neither here nor there, that's not a moral|legal argument. Presumably there have been some exceptions (you intimate that), what sort of percentage does that amount to? If you calculate that as a true reflection of the whole population then how many people's copyright do you suppose you've infringed, knowingly, against their wishes?
If someone takes your published material and slaps ads on it and republishes without your knowledge is that all good with you? (I'm guessing you may say yes here!).
It's not just about economics, it's probably as much about moral rights. People believe that their Geocities content died and was buried.
Of course they are pretty minor "offences" (or at least appear so) across a large population - akin to being a spammer or somesuch. Actually, scratch that: this is simply like copying others' blog posts on a massive scale.
I copied my content from Geocities to another provider and then on to my own site (eventually). I'm sure I'm not alone. Very little of that content is still live anywhere but there is a little I think.
Also, why do you get to be the arbiter of whether others' content can be allowed to disappear or not? If I own the copyright then it's within my rights to have all copies destroyed, for example; you secretly (as you've not notified copyright holders AFAICT) keeping a copy is infringing my moral right to control that work.
Not to toot my own horn here, but let me just give you one sample of the kind of email I get about this project, I'll leave you to judge the rest of them by this single one (it just happens to be the last one and more in this vein would be excessive and embarrassing):
--
I don't know who you are, but I just want say you are an angel. I thought I had laid the very first website I ever built on geocities to rest, and words cannot describe my utter and complete surprise to find it resurrected on reocities.
You say this project is a labor of love -- and that is exactly how I felt about my own website. It was, for at least 10 years of my life, the best expression of who I was, and it means so much to be able to relive that era again.
You are truly doing a public service by preserving early relics of Internet culture. I can only imagine that generations from now, when people are digging into the history of the web, they will be fascinated by what you've saved.
Thank you SO MUCH for doing what you do!
webmistress
reocities.com/avcorner2000
--
I'm not going to argue morality here with you any further, you apparently have a bee up your bonnet about this, but as they say, no good deed goes unpunished, there is no reason why this would be an exception.
>I'm not going to argue morality here with you any further, you apparently have a bee up your bonnet about this, but as they say, no good deed goes unpunished, there is no reason why this would be an exception.
I'm arguing the application of copyright law (I'm not for the law as it stands incidentally) and for the moral rights of producers of copyrighted works. You appear to be arguing that it's fine to break the law because how you do it makes some people happy. The same rationale (at a different scale) makes speeding OK for teenagers if it impresses their mates and they're lucky enough not to have killed anyone yet.
The problem is that if we allow what you've done (which I don't dislike, indeed I'd consider myself an admirer in general of what I've seen of your work) for anyone then we allow copying others' blog posts, adding adverts and putting them on one's own website, we allow copying books that are still in copyright and republishing them, etc.
It's a technicality but important to the case in point IMO. The email from webmistress is grateful for you saving her from not having properly backed up her work, not from having infringed copyright. If you now copy her current website and display it as your own with ads, will she be happy? I'd warrant no, not until the point at which she deletes it all by accident and comes to you because she hasn't backed up. Is this an argument against copyright, probably, not a great one but still it is one.
The fact that you're hearing the positive results is going to be largely selection bias.
If you've bothered reading then thanks for your responses and for not going ad hominem on my ass.
(1) laws have been set aside many times when the net benefit for the common good outweighed the rights of the individual
(you can still disagree that that is a good thing though)
(2) such exemptions apply to libraries and other 'violators' that serve a different goal than piracy (for instance, preservation and access)
(3) in this case those that benefit the most are the original copyright holders
(4) there is a procedure in place to deal with those copyright holders that do not want their information out there
For the record, a fairly knowledgeable lawyer on copyright law in the Netherlands here has reviewed the whole thing and thinks there is absolutely no problem defending my actions (just in case there would have been, I would have done whatever his advice would have been).
It's been up for a year, the one time someone threatened to sue (of course, some hotshot lawyer with a corporate page on geocities :) ), he backed off and became real nice once he realized that no judge was ever going to sign off on him suing for damages and whatnot without first asking politely to remove the stuff.
Laws are there to be respected. In exceptional cases - such as the going out of business of a repository of this size - you can break them if you go about it nicely and try to limit the damage as much as you can.
There are other people out there that have also made copies of all this data that have turned the whole thing in to an adsense fest complete with SEO spam tactics. That might be a better target for your anger.
Lastly, how much would you give for a copy of the library of Alexandria ?
I'm sure that geocities can not on average be compared with the quality of what was stored there but you'd be surprised by some of the stuff that I've found amongst the wreckage and we'd all be culturally poorer if it had gone to waste.
>Laws are there to be respected. In exceptional cases - such as the going out of business of a repository of this size - you can break them if you go about it nicely and try to limit the damage as much as you can.
That's not how the law works here. You break the law whether you're held to account for it or not. Copyright law in Europe is stricter in many ways than in the US (WRT personal use for example).
>Lastly, how much would you give for a copy of the library of Alexandria ?
A lot. Probably not my first born though. This hits at the correct route for attacking poor law. Obviously in Alexandria there was no copyright, it was all PD.
People are welcome to mark their pages PD (or some other liberal license; this is the legal procedure for your #4) and HTML5 should (does? via microformats?) allow a license (CC, PD, C, CL, FDL, whatever) to be applied and readily parsed so that you could stay within the law and still do your white knight deal-y.
Moreover those who wish for people to be able to copy without restriction should petition for a change in the law.
The law is an ass but you're stuck with it. I don't consider the value of the stuff you've saved (as much as I've seen, certainly not an in depth study) to be that high that civil disobedience should be practised in order to preserve it.
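On the "readily parsed" point: HTML does define a `rel="license"` link type (it is also a microformat), so a crawler could in principle check a page for a declared license before mirroring it. A rough sketch in Python follows, using only the standard library; the class and function names are invented here for illustration, and a page declaring a license this way says nothing about whether the declaration is legally sufficient.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values of links/anchors marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link", "area"):
            return
        attr_map = dict(attrs)
        # rel is a space-separated token list, e.g. rel="license nofollow"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "license" in rel_tokens and href:
            self.licenses.append(href)


def find_licenses(html_text):
    """Hypothetical helper: return declared license URLs in document order."""
    parser = LicenseLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.licenses
```

An empty result would mean "no machine-readable license found", which under the argument above is the case where a mirror would have to ask first.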
> I don't consider the value of the stuff you've saved (as much as I've seen, certainly not an in depth study) to be that high that civil disobedience should be practised in order to preserve it.
And that's where we disagree. Talk to a researcher in 500 years or so to get the better reasons why compared to the ones that I can give to you today.
But what I would not give to have the NASA pages about the space shuttle flights that I helped put out on the net back.
Those are gone forever; I wish someone had broken copyright law to preserve them.
Yes, I put them up there as a joke initially because Zachary came up with his animated gifs but then I thought oh, what the heck and left them up. I'm sure he's not making much money of them - if any - but it looked like a perfect fit.
>I'm sure he's not making much money of[f] them - if any - but it looked like a perfect fit.
I think you're being disingenuous here. You've argued that you have a right to the content and to put your ads on it so why bother trying to spin that action as a minor benefit.
If it's no benefit then save yourself and everyone else the bandwidth and remove the adframe. If it is a benefit then keep it and stand by your conviction that that is justified.
PS: for your mate, the .logo {padding-left: } looks to be about 7px too much so it doesn't align with "Bring happiness [...]" strapline and the body text. I'm using FF3.6.11 on Kubuntu.
I would be very surprised if there is anything in there that is not in reocities yet but I will certainly do a comparison.
I'll have to run that against all the deletion requests as well for this content because chances are fairly large that lots of it has already been removed at the request of the owners.
Have you tried de-duping files to save space? I imagine there's a finite number of MIDI files and "under construction" GIFs that could just be symlinked to save a ton of space...
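To make the symlink idea concrete, here is a rough Python sketch. It is only an illustration under some assumptions: files are treated as identical when their SHA-256 digests match (a byte-for-byte comparison would be the more cautious choice), the first copy found in sorted order is the one kept, and the function names are made up. Try it on a scratch copy, since it deletes files.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_with_symlinks(root):
    """Replace duplicate regular files under root with relative symlinks.

    Returns the number of files that were replaced.
    """
    root = Path(root)
    first_seen = {}  # digest -> Path of the copy that is kept
    replaced = 0
    # sorted() materialises the listing before anything is modified.
    for path in sorted(root.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        keeper = first_seen.setdefault(file_digest(path), path)
        if keeper is path:
            continue  # first time this content has been seen
        # Relative target, so the whole tree can be moved later.
        target = os.path.relpath(keeper, start=path.parent)
        path.unlink()
        path.symlink_to(target)
        replaced += 1
    return replaced
```

Note that this saves bytes, not directory entries: each symlink is still a file as far as the filesystem is concerned, so it would not help if the total number of files is the bottleneck.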
Space is not the problem, I've got plenty of that. The bigger problem is that the underlying filesystem is having some problems dealing with the total number of files.
Total disk space usage on the array that holds reocities is about 10.5 T (that includes a master copy though) and runs to 233 million files.
It was a situation where the common good outweighed the benefits of respecting copyright law handily, besides that the only thing that changed is the machine where the content was hosted.
I do not pass it off as my own, have a very clearly established procedure for removing the data at the request of the copyright holder.
Think of it as a hosting provider going out of business and a new one taking over the corpse of the old one. That it happened without contacting the owners is simply because the vast majority of the owners does not have contact information to begin with.
I remember starting out in Geocities and then in Angelfire. Those were the days when you had to submit your site to the Yahoo directories :D I actually made money selling ads from Commission Junction back then. It wasn't much but it felt great.
I can't for the life of me remember what I put out on geocities but it was probably something to do with Star Wars. I think my username back then was Fett82. Good times...
For anyone in Australia, it would be cheaper to pay them to put the data on a HDD, wrap it in bacon and hand deliver it - rather than download the torrent.
and the discussion: http://news.ycombinator.com/item?id=903567
Does this torrent contain the fruits of the collaboration planned in that discussion?