
March 9, 2010

Filed under: journalism»new_media

Forget About It

Via Aleks Krotoski, web developer Jeremy Keith discusses the "truism" that The Internets Never Forget:

We seem to have a collective fundamental attribution error when it comes to the longevity of data on the web. While we are very quick to recall the instances when a resource remains addressable for a long enough time period to cause embarrassment or shame later on, we completely ignore all the link rot and 404s that are the fate of most data on the web.

There is an inverse relationship between the age of a resource and its longevity. You are one hundred times more likely to find an embarrassing picture of you on the web uploaded in the last year than to find an embarrassing picture of you uploaded ten years ago.

From there, Keith muses a bit on domain names, which are rented from ICANN: you can own your data (or own your name), but you can't own your domain in perpetuity. We've been dealing with content-management questions a bit at work lately, as any news organization transitioning from print to web must, so this kind of thing has been on my mind anyway. And I've reached the point, personally, where I take a fairly radical stance: not just that the web does lose content over time, but that it should do so. Permanence is unrealistic, if not actively harmful.

Now, I say this as someone who likes URLs, and who believes that basic URL literacy is not too much to expect from people. I also think URLs should be stable for a reasonable period of time--inversely proportional to their depth in the directory tree, for example, so that "www.domain.com/stories" should be much more stable than "www.domain.com/content/stories/about/buildings/and/food.html" or something like that. But the idea that you can have URLs that are stable forever? Or that you should expect all content to be equally preservation-worthy? That's just foolish.
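To put rough numbers on that intuition--and these are purely made-up numbers, just to illustrate the inverse relationship--here's the kind of rule of thumb I'm imagining, sketched in Python:

from urllib.parse import urlparse

# Toy illustration only: the shallower a URL sits in the path hierarchy,
# the longer we promise to keep it stable. The thresholds are invented.
def stability_horizon(url):
    depth = len([segment for segment in urlparse(url).path.split("/") if segment])
    if depth <= 1:
        return "years (section fronts should essentially never move)"
    elif depth <= 3:
        return "months (long enough to outlive the story's news cycle)"
    else:
        return "weeks (deep article URLs are allowed to rot)"

print(stability_horizon("http://www.domain.com/stories"))
print(stability_horizon("http://www.domain.com/content/stories/about/buildings/and/food.html"))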

Take a news organization, for example. Your average news site produces a truly galling amount of content every day: not just text stories, but also images, video, audio, slideshows, interactives, and so on. Keeping track of all of this is a monumental task, and the general feeling I get is that these companies are failing miserably at it. I cannot think of a single newspaper website (including CQ, no less) where it is easier for me to find a given item through their own navigation or search than it is to go to Google and type "washington post mccain obama timeline" (to pick a recent example).

And that's not a bad thing. Google spends a lot of time learning how to, in effect, read your mind. They (and their competitors at Bing, or wherever) employ a lot of smart people to do nothing but help you find what you're looking for, even if you don't spell it right or if the URL has changed. I say, let them do that. If it were up to me, I'd replace every in-site search engine with a custom Google query and then forget about it: the results would probably be better (they could hardly be worse) and newsroom tech departments could spend their time and money on actual journalism-related activities.
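For the curious, "a custom Google query" really is about as simple as it sounds. Something along these lines would do it--the site: operator is real Google syntax, but the domain here is obviously a stand-in, not any actual paper:

from urllib.parse import urlencode

# Hand the reader's search terms off to Google, restricted to our own site.
# "www.example-paper.com" is a placeholder domain for the sake of the sketch.
def google_site_search_url(query, site="www.example-paper.com"):
    return "https://www.google.com/search?" + urlencode({"q": "site:" + site + " " + query})

print(google_site_search_url("mccain obama timeline"))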

The thing is, the vast majority of content (particularly in journalism) has a set lifespan, and we should respect that. The window of time when stable URLs are crucial is limited to a couple of months or so: enough time for bloggers (micro- or otherwise) and social networkers to discuss those rare few articles that catch on with the Internet audience and have legs. After that, searchability is more important than stability, because people aren't going to dig up the old links. They're going to locate what they want via someone else's search engine. That's if they search for it at all, of course, because realistically speaking, most news has little in the way of legs--especially on the Internet, where readers expect breaking stories to adopt a blog-style hit-and-run update pattern. It's intensely valuable for about a day, and then it's digital hamster-cage lining. Don't throw it away haphazardly--but don't fool yourself about its long-term value, either.

This may sound like I'm saying that we should give up on archiving. I'm not--after all, I'm the world's biggest fan of Lexis-Nexis. I simply propose that fighting link rot can't be our top priority. When it comes to content management, my question is not "how do we store this at the same location forever?" but "how easy will it be to port this medium-term storage solution into another with a minimum of degradation for the content that actually matters?" It's that content that I really care about, not its address, because Google (or Bing, or whatever) will always be able to find the new location. That makes ease of migration much more important than URL fidelity. If you're thinking about a news CMS timeframe longer than, say, two years, I think you risk losing sight of that fact.
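If I had to sketch what "ease of migration" looks like in practice, it's something like this: one plain, CMS-neutral record per story that any future system could import. The field names and the JSON format are my own invention for the sake of the example, not any real CMS schema:

import json

# Export a story as a plain, portable record. What matters is the content
# and enough metadata to re-publish it; the old URL is kept as a note, not a promise.
def export_story(story):
    return json.dumps({
        "headline": story["headline"],
        "byline": story["byline"],
        "published": story["published"],  # ISO 8601 date string
        "body": story["body"],            # plain text or simple HTML
        "old_url": story.get("url"),
    }, indent=2)

print(export_story({
    "headline": "Example story",
    "byline": "A. Reporter",
    "published": "2010-03-09",
    "body": "Story text goes here.",
    "url": "http://www.domain.com/content/stories/example.html",
}))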

Ultimately, links break. Let them. Attempting to engineer for eternity is a great way to never finish building--or to lock yourself into a poor foundation when the technological ground shifts. And honestly, we're far enough behind as an industry as it is. We don't need to bury ourselves any deeper.
