With Super Tuesday wrapped up, I feel pretty confident in writing about Betty, the new ArchieML parser that powered NPR's new election liveblogs. Language parsers are a pretty fundamental computer science discipline, which of course means that I never formally learned about them. Betty isn't a very advanced parser, compared to something that can handle a real programming language, but it's still pretty neat — and you can't say it's not battle-tested, given the tens of thousands of concurrent readers who unknowingly consumed its output last week.
ArchieML is a markup language created at the New York Times a few years back. It's designed to be easy to learn, error-tolerant, and well-suited to simultaneous editing in Google Docs. I've used it for several bigger story projects, and like it well enough. There are some genuinely smart features in there, and on a slower development cycle, it's easy enough to hand-fix any document bugs that come up.
Unfortunately, in the context of the NPR liveblog system, which deploys updated content on a constant loop, the original ArchieML had some weaknesses that weren't immediately obvious. For example, its system for marking up multi-line strings — signaling them with an :end token — proved fragile in the face of reporters and editors who were typing as fast as they could into a shared document. And ArchieML's key-value syntax is identical to common journalistic structures like "Sanders: 1,000", which could accidentally turn what the reporter thought was an itemized list into unexpected new data fields and an empty post body. I was spending a lot of time writing regular-expression pre-processors to catch errors at the input level, instead of handling them at the data level, where it would make sense.
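To see why that collision is so easy to hit, here's a deliberately simplified sketch (not the actual ArchieML parser, whose logic is more involved) of how a line-based key-value match swallows prose that happens to look like a field:

```javascript
// Simplified sketch: a key-value regex in the spirit of ArchieML's
// will happily match journalistic prose like "Sanders: 1,000".
const KEY_VALUE = /^\s*([A-Za-z0-9-_.]+)\s*:\s*(.*)$/;

const docLines = [
  "Here are the latest delegate counts:",
  "Sanders: 1,000", // meant as a list item, parsed as data
  "Biden: 1,200"
];

const fields = {};
const body = [];
for (const line of docLines) {
  const match = line.match(KEY_VALUE);
  if (match) {
    fields[match[1]] = match[2]; // "Sanders" becomes a data field...
  } else {
    body.push(line); // ...and disappears from the post body
  }
}
// fields → { Sanders: "1,000", Biden: "1,200" }
// body   → ["Here are the latest delegate counts:"]
```

The first line survives only because its colon doesn't immediately follow a single unspaced token — the reporter's intended list items don't get that luck.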
To fix these errors, I wanted to introduce a more explicit multi-line string syntax, as well as offer hooks for input validation and transformation (for example, a way to convert the default string values into native types during parsing). My original impulse was to patch the module offered by the Times to add these features, but that turned out to be more difficult than I'd expected.
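As a rough illustration of what such hooks might look like — the names and shape here are hypothetical, not Betty's actual API — callers could get a chance to validate or coerce each value before it lands in the output:

```javascript
// Hypothetical hook API (illustrative names, not Betty's real surface):
// a transform hook coerces values, a validate hook rejects bad ones.
function parseWithHooks(pairs, hooks = {}) {
  const output = {};
  for (const [key, raw] of pairs) {
    // transform: e.g. convert numeric strings into native numbers
    const value = hooks.transform ? hooks.transform(key, raw) : raw;
    // validate: drop bad values instead of silently storing them
    if (hooks.validate && !hooks.validate(key, value)) continue;
    output[key] = value;
  }
  return output;
}

// Example: coerce anything that looks like a plain integer
const result = parseWithHooks(
  [["headline", "Results are in"], ["delegates", "1991"]],
  { transform: (key, raw) => (/^\d+$/.test(raw) ? Number(raw) : raw) }
);
// result.delegates is now the number 1991, not the string "1991"
```

The point of hooking in during parsing, rather than post-processing the finished object, is that the hook still knows which key the value belongs to and can act before the data structure is assembled.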
Okay, I thought, how hard can it be to write my own parser? I was a fool. Four days later, I emerged from a trance state with Betty, which manages to pass all the original tests in the repo as well as some of my own for my new syntax. I'm also much more confident in our ability to maintain and patch Betty over time (the ArchieML module on NPM hasn't been updated since 2016).
Betty (who Wikipedia tells me was the mechanic in the comics, appropriately enough) is about twice as large as the original parser was. That size comes from the additional structure in its design: instead of a single pass through the text, Betty builds the final output from three escalating passes.
Essentially, Betty trades concision for clarity: during debugging, it was handy to be able to look at the intermediate output of each stage to see where something went wrong. Each pipeline stage is also much more readable, since it's concerned with only one step of the process: it keeps less global state and does less bookkeeping. The parser, for example, doesn't need to worry about the current object scope or array types, but can simply defer those to the assembler.
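The division of labor described above can be sketched schematically — this is an illustrative toy, not Betty's real code, and it handles only dotted key-value pairs — but it shows how each pass hands a simpler problem to the next:

```javascript
// Pass 1: tokenize raw text into line-level tokens
function tokenize(text) {
  return text.split("\n").map(line => {
    const kv = line.match(/^([\w.-]+)\s*:\s*(.*)$/);
    if (kv) return { type: "pair", key: kv[1], value: kv[2] };
    return { type: "text", value: line };
  });
}

// Pass 2: parse tokens into flat instructions; note that this stage
// knows nothing about object scope -- that's the assembler's job
function parse(tokens) {
  return tokens
    .filter(t => t.type === "pair")
    .map(t => ({ op: "set", path: t.key.split("."), value: t.value }));
}

// Pass 3: assemble the instructions into the final nested object
function assemble(instructions) {
  const root = {};
  for (const { path, value } of instructions) {
    let scope = root;
    for (const part of path.slice(0, -1)) {
      scope = scope[part] = scope[part] || {};
    }
    scope[path[path.length - 1]] = value;
  }
  return root;
}

const output = assemble(parse(tokenize("race.winner: unknown\nrace.reporting: 8%")));
// → { race: { winner: "unknown", reporting: "8%" } }
```

Because each stage's output is a plain array or object, you can log the token stream or the instruction list on its own when a document parses strangely, which is exactly the debugging affordance described above.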
If you'd told me a few years ago that I'd be writing something this complicated, I would have been extremely surprised. I don't have a formal CS background, nor did I ever want one. Parsing is often seen as black magic by self-taught developers. But as I have argued in the past, being able to write even simple parsers is an incredibly valuable skill for data journalism, where odd or proprietary data formats are not uncommon. I hope Betty will not just be a useful library for my work at NPR, but also a valuable teaching tool in the community.