About a year back, Mile Zero's performance started to drag seriously, taking more than two seconds to render the page. The problem seemed to be a combination of things: the CGI interface was slow, it didn't run under mod_perl, and I had accumulated a vast number of posts it had to sift through--which, given that my Blosxom CMS uses the file system as its database, meant lots of drive I/O.
Since I needed to get my feet wet with server-side coding anyway, I rewrote Blosxom in PHP. There are a few PHP versions of the script online, but they seemed like a hassle to install, and none of them had support for the plugins I was using--it was almost easier to just do it myself. The result was faster, smaller, and proved to be a great first programming project. Since it's also proved basically stable over the last year, I've decided to go ahead and post the source code in case anyone else wants it (consider it released under the WTF Public License). I suspect the market for file-based blogging scripts is fairly small at this point, but you never know.
HOW IT WORKS
Essentially, both versions of Blosxom work the same way: they recurse through the contents of your blog folder, looking for text files with a certain extension (.txt by default) and building a list. Then they sort the list by reverse-chron date and insert the contents of the first N entries into a set of templates (head, foot, story, and date). Using a REST-like URL scheme, you can change templates (helpful for mobile or RSS) or filter entries by subfolder. It's primitive, but it's also practically unhackable, and it's an awesome way to blog if you like text files. Turns out that I like text files a lot.
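That core loop is simple enough to sketch in a few lines. This is a simplified illustration, not the actual script: the function names and the `{title}`/`{body}` template placeholders are my own inventions here, and the real version also handles flavors, plugins, and path filtering.

```php
<?php
// Sketch of the core Blosxom loop: recurse, collect, sort, render.
function find_entries($dir, $ext = 'txt') {
    $entries = array();
    $iter = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iter as $file) {
        if ($file->getExtension() === $ext) {
            // Keep the modification time for sorting.
            $entries[] = array('path'  => $file->getPathname(),
                               'mtime' => $file->getMTime());
        }
    }
    // Reverse-chronological: newest entries first.
    usort($entries, function ($a, $b) { return $b['mtime'] - $a['mtime']; });
    return $entries;
}

// Insert the first $n entries into a story template.
function render($entries, $n, $story_tpl) {
    $out = '';
    foreach (array_slice($entries, 0, $n) as $e) {
        $lines = file($e['path']);
        $title = trim(array_shift($lines));   // first line is the title
        $body  = implode('', $lines);
        $out .= str_replace(array('{title}', '{body}'),
                            array($title, $body), $story_tpl);
    }
    return $out;
}
```

Everything else in the script is elaboration on those two steps.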
Original Blosxom boasted an impressive plugin collection, which it implemented via a package system: plugins exposed a function for each stage of the page assembly process where they wanted to get involved, and the main script would call out to them during those actions, passing in various parameters depending on the task. This being Perl, the whole thing was a weird approximation of object-oriented code that looked like a string of constant cartoon profanity.
PHP provided, I think, better tools. So a plugin for my new version of Blosxom does three things: it sets up its class definition, which should include the appropriate methods for its type, as well as any class or static properties it might need, then it instantiates itself, and finally adds itself to one of several global arrays by plugin type. During execution, the main script iterates through these arrays at the proper time, calling each plugin object's processing method in turn. At least, that's how it works in theory. In practice, I've only implemented plugins for entry text manipulation, because that's all I needed. But the pattern should carry forward without problems to other parts of the process, although you might want to rename the existing process() API method to something more specific, like processEntry(). That way a single plugin could register to handle multiple stages of rendering.
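To make that concrete, here's a hypothetical entry plugin following the pattern described above. The names (`$entry_plugins`, `SmartDashes`, `process()`) are illustrative stand-ins, not the script's actual identifiers:

```php
<?php
// One global registry array per plugin type; the main script
// iterates this at the entry-processing stage.
$entry_plugins = array();

class SmartDashes {
    // Entry plugins expose a process() method that filters entry text.
    public function process($text) {
        // Example transformation: turn double hyphens into em-dash entities.
        return str_replace('--', '&#8212;', $text);
    }
}

// Each plugin file instantiates itself and registers by type.
$entry_plugins[] = new SmartDashes();

// In the main script, at the appropriate stage:
function apply_entry_plugins($text, $plugins) {
    foreach ($plugins as $p) {
        $text = $p->process($text);
    }
    return $text;
}
```

The nice part of this arrangement is that adding a plugin is just dropping a file in place; the main script doesn't need to know anything beyond the registry array and the method name.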
ENOUGH OF THAT, HOW DO I INSTALL IT?
Just copy the script to a publicly-accessible directory, and edit the configuration variables to point it toward your content directory. The part that tends to be confusing is the $datadir variable, which needs to be set to your internal server path (what you see if you log in via FTP or SSH), not the external URL path.
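The configuration block looks something like this. Apart from `$datadir`, which the original Blosxom also uses, the variable names and values here are examples--check your copy of the script for the exact names:

```php
<?php
// Internal filesystem path to your entries, NOT a URL. This is what
// you'd see logged in over FTP or SSH.
$datadir = '/home/thomas/blog/entries';

// Public address of this script, used when building links.
$url = 'http://example.com/index.php';

// Extension the script treats as a blog entry, and posts per page.
$file_ext    = 'txt';
$num_entries = 10;
```

If the page comes up empty, the first thing to check is almost always that `$datadir` is a server path rather than a web path.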
Next, you'll need to set up your templates. For each flavor, Blosxom loads a series of template files and inserts your content. These files are the head, foot, story, and date templates--one set per flavor.
At that point, when you put text files in the data directory, they'll be assembled into blog entries based on the file modification time. The first line becomes the title of the entry. You can categorize entries by putting them in subfolders, or subfolders of subfolders, and then appending the path after the Blosxom script URL.
I've included a couple of entry plugins, as well, just to show how they generally work. One is a comment counter for the old Pollxn comment system that I still use--the CGI script works fine for comments, but the Perl can't interface with the new PHP script to say how many comments there are on any given entry. The other is a port of the directorybrowse plugin, which creates the little broken-up paths at the end of each entry, so people can jump up to a different level of the category hierarchy. They're short and mostly self-explanatory.
At this point, I've been blogging on Blosxom, either the original or this custom version, for slightly more than five years. During that time, I've had all the dates wiped out during a bad server transition, I've moved hosts two or three times, and I've tweaked the site constantly. I think the level of effort is comparable to what people I know spend on more traditional blogging platforms like Wordpress or Movable Type. Of course, the art of writing online isn't really about the tools. But Blosxom does have its own quirks--for better and for worse.
The big hassle has been the folder system, especially for a personal blog like this one, where I may ramble across any number of loosely-connected topics from day to day. Basing a taxonomy on folders means that posts can't span multiple categories--I can't have something that's both /journalism and /gaming, for example, which is unfortunate when writing about something like incentive systems for news sites. And once it's been created, you're pretty much stuck with a category, since most of the old links will target the old folder. There aren't many reasons I would want to switch to a database CMS, but the ability to categorize by tags tops the list.
On the other hand, there's something to be said as a writer for the flat-file approach. It has an immediacy to it that a database layer can't duplicate. I don't have to sign into an admin page, visit the "create post" section, type my code into one of those text-mangling rich editing forms, select "publish," and then watch it validate and republish the whole blog. I just open a text editor and start typing, and when I save it somewhere, it's live. Working this way is great for eliminating distractions and obstacles. There's no abstraction between my fingers and the end product.
And while working via individual files is probably less safe or reliable compared to a SQL store, it benefits from easy hackability. I don't have to understand the CMS schema to fix anything that goes wrong, or add new features, or make wide changes. If I decided to change where my linked images are located tomorrow, I'd have all the power of UNIX's text-obsessed command line at my fingertips for propagating those changes. For an organization, that'd be insane. But it works pretty well as long as it's just me tinkering around on the server in my spare time.
I don't recommend that anyone else try running a web page this way. But as a learning experience, writing your own tiny server framework serves pretty well. It's a good challenge that covers the broadest parts of Internet programming--file access, data structures, sorting, filtering, caching, HTTP requests, and output. And hey, it works for me. Maybe it'll work for you too.
When I only had 30 entries on Blosxom, this blog ran great. Over the last five years, however, I've written almost 2,000 posts, and the original Perl script has started to bog down a bit. Adding a plugin to cache the filesystem helped some, but it still takes a little more than 2 seconds to render the page. I think it has something to do with the Perl interpreter--there's some discussion online about how Blosxom doesn't like running under Apache's mod_perl, or something like that. As far as I'm concerned, Apache is a practical joke played on DIY types by malicious shell coders, so I can't really be bothered to find out.
Long story short, last night I rewrote Blosxom as a PHP script. That sounds really hardcore, until you realize that A) it's only 16KB to begin with, and B) I took out all the features I don't use, like static rendering and complete plugin support (mine only supports entry and document processing plugins). It's running now at index.php instead of the old index.cgi, and I've redirected the domain default. I'll be leaving the old Perl script in place so that legacy links will continue to function, but anything that didn't specify an index script should now benefit from the speed boost.
If you're reading this via Google Reader, you probably don't need to do anything--Google doesn't care how slow my server is. On other RSS readers, you may notice a faster refresh and more accurate post times by switching to the new feed. And if you read via the actual page (or link to it) at the .cgi URL, you'll notice a real difference by switching to the new URL--it's about an order of magnitude faster, going from ~3 seconds to ~300ms in my tests, even without optimizations.
Four years ago, when I converted a spare domain into this blog, I started putting posts on technology into the random/tech directory (since it runs on Blosxom, which builds this page from the UNIX file system, that means that these posts are in a "tech" subcategory under the "random" top-level category). In retrospect, this was not a good idea. I've written a fair amount in that subfolder, to the point where it's not really random anymore. Today I moved it out to a top-level category of its own.
A category-based system on a blog with any kind of topical range has serious flaws. For example, there are a fair number of posts under the "bank" directory (for World Bank) about technology in development, and these could probably be better posted under "tech." But at the time when I wrote them, I didn't have that category, so I put them where they seemed best suited. And since the hierarchical system can't "share" posts between categories, I can't really cross-post them--especially since, if the permalink URLs were category-based (and most of mine are), those links would break when the URL changes.
If I used a tag-based system, that wouldn't be a problem. Everything would live in a big database soup, and moving things around would be as simple as adding another tag--assuming I hadn't already done so. Indeed, people like Clay Shirky often use this as a reason why tag-based systems are superior to hierarchies: you don't have to perfectly plan out your ontology, and it becomes responsive to your subject matter instead of the reverse.
I understand that reasoning, but I was kind of counting on a system that dictated my categories--they're mine, after all, so I can't really complain about the imposition. I find that when I use tags for personal categorization I tend to use them in silly ways. I can't take them seriously. And I had hoped, frankly, that this wouldn't be a tech blog: putting "tech" under "random" was meant to keep me honest. It didn't work, but I still think it was worth a shot.
Actually moving the posts to their new place is pretty easy, but now anything linking to /random/tech is broken. There are three ways to deal with this. I could write a script that would go through and correct the links. I might do that, but it's not high on my list of priorities, for obvious reasons. I could leave the posts where they are, in which case browsing by category (something I do much more often than I link to myself) would remain unintuitive. Or I could just break the system (links by date, which until very recently was the default for my RSS feed, will continue to work correctly).
I'm going to do the last. First, because I've already been through a server change that probably destroyed half of my date-based URLs, and second, because I just don't really care that much. Honestly, I don't think it matters. A permalink and a good category is a wonderful thing. But 99% of the time, if I'm looking for a specific post either here or elsewhere, I use a third-party search engine to get there anyway. The benefits of categories or tagging are, I suspect, more for the writer than the reader.
If nothing else, think of it this way: Have you ever wished the Internet could forget? I certainly have. So I'm giving the Web a little bit of forgetfulness, limited to my tiny patch of it. Every link's an adventure, and possibly a problem-solving challenge!
Found a solution to the server move that nuked Mile Zero a few months back. Turns out that the tar command preserves file timestamps, which makes sense, seeing as how it dates back to the days of tape backup. An archiver that actually archives file information! Who knew?
Of course, I wouldn't need to know this if the server admin had used a .tar to transfer the files to the new server in the first place. I really hope there was a good reason that they used the regular copy command, since almost every page I've read includes a note along the following lines:
Tar is a great way to copy directories recursively.
Because that might as well just continue:
Especially on Thomas's server, so he doesn't spend three weeks fixing the dates on something like 400 files.
Well, it's been ten days since Nadia started guarding the comments section, and I've had no spam during that time. I guess it's working.
Ladies and gentlemen, allow me to present the worst comment protection system ever, Nadia the hamster:
I'm just kind of curious whether spammers really have text recognition for nonsense words in faux-bold, real bold, italic Book Antiqua. I guess we'll find out.
Someone listing themselves as "jonny" and leaving bogus gmail accounts has started dropping redirect scripts into my comment threads. Right now it just opens Google, but it could have been used to launch malicious code for all I know. I added a line to Pollxn that destroys script tags, and searched for the relevant comments, so it should be safe now.
Anyone else seen this kind of behavior pop up?
The scripts are hosted at usuc.us, which is listed as belonging to a James Sullivan living in Colorado Springs. He runs designcolorado.com--don't visit, it's a porn gateway. Looks pretty seedy to me. And now I'm paranoid about leaving security holes in Pollxn's code. I hate being paranoid.
So I did what everyone should do when a spammer is dumb enough to leave their tracks out in the open, and I called him. A woman answered the phone, said he's out of town until Thursday. I'll try again then, and ask him why he's trying to obstruct my content and mess with my server. I'm sure the answers will be enlightening.
I just got off the phone with Mr. James Sullivan, whose name has been used to plaster malicious cross-site scripts across the Internet and especially on Movable Type-based blogs. Unless he's an exceptional actor, Mr. Sullivan is not actually responsible for this spam. He's a victim of a particularly vicious identity theft, one which he seemed barely able to comprehend.
I introduced myself as a tech journalist from Washington, DC--technically true, and it's much less confrontational. Do you own usuc.us? I asked him. "I don't even know how to put up a web site," he said. "Why are all these people calling me?" Briefly, I tried to explain what was going on, including the porn site. "Don't visit it," I said, "it just opens straight to dirty pictures." Mr. Sullivan noted that he had no intentions of visiting a porn site--although, granted, his wife was in the room.
This led to the question of how he was going to fix this. He's going to the cops tomorrow, he said. "Well," I said, "this may be a federal matter, to be honest with you." "The feds?" he exclaimed with a big-government skepticism that I'm sure does Colorado proud. Yes, the feds, I said, and also said he'd probably have to check with InterNIC and ICANN.
"I thought I was going to have to call Al Gore!"
"No, Mr. Sullivan. He only invented the Internet, he doesn't fix it."
So there you have it. I'm sure it's a small consolation for the people who, unlike me, faced serious problems as a result of the scriptbots working in Mr. Sullivan's name. But at least all the scripts did is mess up a few web pages. Mr. Sullivan will probably be getting phone calls off and on for a while to come. I think he's gotten the short end of the stick.
More than a year now, I've been writing here. I still couldn't tell you why. But I remember thinking that at some point I'd want to use it as a way to chart my obsessive cycles. I haven't quite figured out how to gather that information from the filesystem, but I have managed to create a category count, so at least we can see where I'm focusing my time.
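The count itself is straightforward, since categories are just top-level folders under the data directory. Something like this sketch would do it--the function name is mine, and it assumes the .txt-entries-in-folders layout described elsewhere on this site:

```php
<?php
// Tally blog entries per top-level category folder.
function category_counts($datadir) {
    $counts = array();
    $iter = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($datadir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iter as $file) {
        if ($file->getExtension() !== 'txt') continue;
        // The category is the first path segment under $datadir.
        $rel   = substr($file->getPathname(), strlen($datadir) + 1);
        $parts = explode(DIRECTORY_SEPARATOR, $rel);
        $cat   = count($parts) > 1 ? $parts[0] : '(root)';
        $counts[$cat] = isset($counts[$cat]) ? $counts[$cat] + 1 : 1;
    }
    arsort($counts);  // largest categories first
    return $counts;
}
```

Subcategories (like random/tech) get rolled up into their top-level parent here; splitting them out would just mean keeping more of the path segments.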
Here's a handy visual guide:
Obviously, this doesn't actually tell us how much text I've written for each category (although I could measure that by kB if I needed to), and if we looked at it that way the slice for Random entries would be far smaller. But I do think it's interesting to see how I spend 1/5 of my time writing about music, almost as much about games, and almost 1/10 each on fiction, the World Bank, and the medium itself.
I also think it's interesting to look at this and think about how much the categories actually overlap. After all, a significant number of those gaming posts are actually linked to music (at least 11 of them, by the extended count), and it wouldn't surprise me to find other ways that the categories are really just a starting point for organization. A lot like my desk, this is a messy system. I think that's one of the reasons I like it.