When I only had 30 entries on Blosxom, this blog ran great. Over the last five years, however, I've written almost 2,000 posts, and the original Perl script has started to bog down a bit. Adding a plugin to cache the filesystem helped some, but it still takes a little more than 2 seconds to render the page. I think it has something to do with the Perl interpreter--there's some discussion online about how Blosxom doesn't like running under Apache's mod_perl, or something like that. As far as I'm concerned, Apache is a practical joke played on DIY-types by malicious shell coders, so I can't really be bothered to find out.
Long story short, last night I rewrote Blosxom as a PHP script. That sounds really hardcore, until you realize that A) it's only 16KB to begin with, and B) I took out all the features I don't use, like static rendering and complete plugin support (mine only supports entry and document processing plugins). It's running now at index.php instead of the old index.cgi, and I've redirected the domain default. I'll be leaving the old Perl script in place so that legacy links will continue to function, but anything that didn't specify an index script should now benefit from the speed boost.
If you're reading this via Google Reader, you probably don't need to do anything--Google doesn't care how slow my server is. On other RSS readers, you may notice a faster refresh and more accurate post times by switching to the new feed. And if you read via the actual page (or link to it) at the .cgi URL, you'll notice a real difference by switching to the new URL--it's about an order of magnitude faster, going from ~3 seconds to ~300ms in my tests, even without optimizations.
Four years ago, when I converted a spare domain into this blog, I started putting posts on technology into the random/tech directory (since it runs on Blosxom, which builds this page from the UNIX file system, that means that these posts are in a "tech" subcategory under the "random" top-level category). In retrospect, this was not a good idea. I've written a fair amount in that subfolder, to the point where it's not really random anymore. Today I moved it out to a top-level category of its own.
A category-based system on a blog with any kind of topical range has serious flaws. For example, there are a fair number of posts under the "bank" directory (for World Bank) about technology in development, and these could probably be better posted under "tech." But at the time when I wrote them, I didn't have that category, so I put them where they were best suited. And since the hierarchical system can't "share" posts between categories, I can't really cross-post them--especially since, if the URLs for permalinks are category based (and most of mine are), those links will break when the URL changes.
If I used a tag-based system, that wouldn't be a problem. Everything would live in a big database soup, and moving things around would be as simple as adding an additional tag--if I hadn't already done so. Indeed, people like Clay Shirky often use this as a reason why tag-based systems are superior to hierarchies, because you don't have to perfectly plan out your ontology, and because it becomes responsive to your subject matter instead of the reverse.
I understand that reasoning, but I was kind of counting on a system that dictated my categories--they're mine, after all, so I can't really complain about the imposition. I find that when I use tags for personal categorization I tend to use them in silly ways. I can't take them seriously. And I had hoped, frankly, that this wouldn't be a tech blog: putting "tech" under "random" was meant to keep me honest. It didn't work, but I still think it was worth a shot.
Actually moving the posts to their new place is pretty easy, but now anything linking to /random/tech is broken. There are three ways to deal with this. I could write a script that would go through and correct the links. I might do that, but it's not high on my list of priorities, for obvious reasons. I could leave the posts where they are, in which case browsing by category (something I do much more often than I link to myself) would remain unintuitive. Or I could just break the system (links by date, which until very recently was the default for my RSS feed, will continue to work correctly).
I'm going to do the last of these. First, because I've already been through a server change that probably destroyed half of my date-based URLs, and second, because I just don't really care that much. Honestly, I don't think it matters. A permalink and a good category are wonderful things. But 99% of the time, if I'm looking for a specific post either here or elsewhere, I use a third-party search engine to get there anyway. The benefits of categories or tagging are, I suspect, more for the writer than the reader.
If nothing else, think of it this way: Have you ever wished the Internet could forget? I certainly have. So I'm giving the Web a little bit of forgetfulness, limited to my tiny patch of it. Every link's an adventure, and possibly a problem-solving challenge!
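For anyone less willing to forget, the first option (a script that goes through and corrects the links) might look something like the sketch below. It's Python rather than Perl or PHP, and nothing like it actually runs on this server. The `/tech` destination, the `.txt` entry suffix, and the function names are all assumptions made for illustration. It also puts each file's timestamps back after rewriting, on the assumption that the blog dates posts by file modification time.

```python
import os
import re
from pathlib import Path

# Assumed old and new category prefixes (hypothetical; adjust to taste).
OLD_PREFIX = "/random/tech"
NEW_PREFIX = "/tech"

# Only match the old prefix when it is a complete path component, so that
# something like "/random/technology" is left alone.
_OLD_LINK = re.compile(re.escape(OLD_PREFIX) + r"(?=[/\"'#?\s]|$)")


def fix_links(text):
    """Return text with links under the old category pointed at the new one."""
    return _OLD_LINK.sub(NEW_PREFIX, text)


def fix_tree(root, suffix=".txt"):
    """Rewrite links in every entry file under root; return the changed paths."""
    changed = []
    for path in sorted(Path(root).rglob("*" + suffix)):
        if not path.is_file():
            continue
        original = path.read_text(encoding="utf-8")
        updated = fix_links(original)
        if updated != original:
            stat = path.stat()
            path.write_text(updated, encoding="utf-8")
            # Restore the original access/modification times so that
            # rewriting a post doesn't change its apparent date.
            os.utime(path, ns=(stat.st_atime_ns, stat.st_mtime_ns))
            changed.append(path)
    return changed
```

Running `fix_tree` against a copy of the entries directory first would be the cautious way to try it.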
Found a solution to the server move that nuked Mile Zero a few months back. Turns out that the tar command preserves file timestamps, which makes sense, seeing as how it dates back to the days of tape backup. An archiver that actually archives file information! Who knew?
Of course, I wouldn't need to know this if the server admin had used a .tar to transfer the files to the new server in the first place. I really hope there was a good reason that they used the regular copy command, since almost every page I've read includes a note along the following lines:
Tar is a great way to copy directories recursively.
Because that might as well just continue:
Especially on Thomas's server, so he doesn't spend three weeks fixing the dates on something like 400 files.
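To make the point concrete, here's a rough sketch of a tar round trip using Python's standard `tarfile` module in place of the command-line tool. The function name and the `site.tar` scratch file are made up for the example. The idea is simply that the archive records each file's modification time and puts it back on extraction, where a plain copy without a preserve option stamps the new file with the current time.

```python
import tarfile
import tempfile
from pathlib import Path


def copy_via_tar(src_dir, dest_dir):
    """Pack src_dir into a tar archive, unpack it under dest_dir, and
    return the path of the unpacked copy."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as scratch:
        archive = Path(scratch) / "site.tar"  # hypothetical scratch archive
        with tarfile.open(archive, "w") as tar:
            # The archive stores each member's modification time.
            tar.add(src_dir, arcname=src_dir.name)
        with tarfile.open(archive) as tar:
            # Newer Pythons support extraction filters; use the safe "data"
            # filter when it exists, and fall back to the default otherwise.
            extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            tar.extractall(dest_dir, **extra)
    return dest_dir / src_dir.name
```

After `copy_via_tar("entries", "/some/new/home")`, the files in the copy should carry the same modification times as the originals.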
Well, it's been ten days since Nadia started guarding the comments section, and I've had no spam during that time. I guess it's working.
Ladies and gentlemen, allow me to present the worst comment protection system ever, Nadia the hamster:
I'm just kind of curious whether spammers really have text recognition for nonsense words in faux-bold, real bold, italic Book Antiqua. I guess we'll find out.
Someone listing themselves as "jonny" and leaving bogus gmail accounts has started dropping redirect scripts into my comment threads. Right now it just opens Google, but it could have been used to launch malicious code for all I know. I added a line to Pollxn that destroys script tags, and searched for the relevant comments, so it should be safe now.
Anyone else seen this kind of behavior pop up?
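For the curious, a filter that destroys script tags can be sketched in a few lines. This is Python, not the actual line added to Pollxn, and the function name is invented for the example. It's also only a sketch of one narrow defense: it does nothing about event-handler attributes like `onclick` or `javascript:` URLs, so escaping all HTML in comments would be the safer general policy.

```python
import re

# A whole <script>...</script> element, contents included.
_SCRIPT_BLOCK = re.compile(
    r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL
)
# Any leftover opening or closing script tag without a partner.
_SCRIPT_TAG = re.compile(r"</?script\b[^>]*>", re.IGNORECASE)


def strip_scripts(comment):
    """Remove script elements and stray script tags from a comment."""
    previous = None
    # Repeat until nothing changes, so that removing one tag can't
    # splice the surrounding text together into a new one.
    while previous != comment:
        previous = comment
        comment = _SCRIPT_BLOCK.sub("", comment)
        comment = _SCRIPT_TAG.sub("", comment)
    return comment
```

Ordinary text passes through untouched, so the filter can be applied to every comment before it is stored or displayed.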
The scripts are hosted at usuc.us, which is listed as belonging to a James Sullivan living in Colorado Springs. He runs designcolorado.com--don't visit, it's a porn gateway. Looks pretty seedy to me. And now I'm paranoid about leaving security holes in Pollxn's code. I hate being paranoid.
So I did what everyone should do when a spammer is dumb enough to leave their tracks out in the open, and I called him. A woman answered the phone, said he's out of town until Thursday. I'll try again then, and ask him why he's trying to obstruct my content and mess with my server. I'm sure the answers will be enlightening.
I just got off the phone with Mr. James Sullivan, whose name has been used to plaster malicious cross-site scripts across the Internet and especially on Movable Type-based blogs. Unless he's an exceptional actor, Mr. Sullivan is not actually responsible for this spam. He's a victim of a particularly vicious identity theft, one which he seemed barely able to comprehend.
I introduced myself as a tech journalist from Washington, DC--technically true, and it's much less confrontational. Do you own usuc.us? I asked him. "I don't even know how to put up a web site," he said. "Why are all these people calling me?" Briefly, I tried to explain what was going on, including the porn site. "Don't visit it," I said, "it just opens straight to dirty pictures." Mr. Sullivan noted that he had no intentions of visiting a porn site--although, granted, his wife was in the room.
This led to the question of how he was going to fix this. He's going to the cops tomorrow, he said. "Well," I said, "this may be a federal matter, to be honest with you." "The feds?" he exclaimed with a big-government skepticism that I'm sure does Colorado proud. Yes, the feds, I said, and also said he'd probably have to check with InterNIC and ICANN.
"I thought I was going to have to call Al Gore!"
"No, Mr. Sullivan. He only invented the Internet, he doesn't fix it."
So there you have it. I'm sure it's a small consolation for the people who, unlike me, faced serious problems as a result of the scriptbots working in Mr. Sullivan's name. But at least all the scripts did is mess up a few web pages. Mr. Sullivan will probably be getting phone calls off and on for a while to come. I think he's gotten the shorter end of the stick.
For more than a year now, I've been writing here. I still couldn't tell you why. But I remember thinking that at some point I'd want to use it as a way to chart my obsessive cycles. I haven't quite figured out how to gather that information from the filesystem, but I have managed to create a category count, so at least we can see where I'm focusing my time.
Here's a handy visual guide:
Obviously, this doesn't actually tell us how much text I've written for each category (although I could measure that by kB if I needed to), and if we looked at it that way the slice for Random entries would be far smaller. But I do think it's interesting to see how I spend 1/5 of my time writing about music, almost as much about games, and almost 1/10 each on fiction, the World Bank, and the medium itself.
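A category count like this is easy to get when every post is a file and every category is a directory. The sketch below is a Python guess at the general shape, not the code that produced the chart: it assumes entries are `.txt` files and that the first directory under the root is the category, and the function names are invented. It tallies bytes as well as posts, which covers the measure-by-kB idea.

```python
from collections import Counter
from pathlib import Path


def category_counts(root, suffix=".txt"):
    """Count posts and total bytes per top-level category under root."""
    root = Path(root)
    posts, sizes = Counter(), Counter()
    for entry in root.rglob("*" + suffix):
        if not entry.is_file():
            continue
        parts = entry.relative_to(root).parts
        # The first path component is the top-level category; files sitting
        # directly in the root get a placeholder name.
        category = parts[0] if len(parts) > 1 else "(uncategorized)"
        posts[category] += 1
        sizes[category] += entry.stat().st_size
    return posts, sizes


def shares(counter):
    """Turn raw counts into fractions of the total, ready for a pie chart."""
    total = sum(counter.values())
    return {name: n / total for name, n in counter.items()} if total else {}
```

Calling `shares` on the post counts gives the slices by number of posts; calling it on the byte counts gives the slices by volume of text instead.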
I also think it's interesting to look at this and think about how much the categories actually overlap. After all, a significant number of those gaming posts are actually linked to music (at least 11 of them, by the extended count), and it wouldn't surprise me to find other ways that the categories are really just a starting point for organization. A lot like my desk, this is a messy system. I think that's one of the reasons I like it.