
March 7, 2020

Filed under: tech»open_source

Call Me Al

With Super Tuesday wrapped up, I feel pretty confident in writing about Betty, the new ArchieML parser that powered NPR's election liveblogs. Parsing is a pretty fundamental computer science discipline, which of course means that I never formally learned about it. Betty isn't a very advanced parser, compared to something that can handle a real programming language, but it's still pretty neat — and you can't say it's not battle-tested, given the tens of thousands of concurrent readers who unknowingly consumed its output last week.

ArchieML is a markup language created at the New York Times a few years back. It's designed to be easy to learn, error-tolerant, and well-suited to simultaneous editing in Google Docs. I've used it for several bigger story projects, and like it well enough. There are some genuinely smart features in there, and on a slower development cycle, it's easy enough to hand-fix any document bugs that come up.

Unfortunately, in the context of the NPR liveblog system, which deploys updated content on a constant loop, the original ArchieML had some weaknesses that weren't immediately obvious. For example, its system for marking up multi-line strings — signalling them with an :end token — proved fragile in the face of reporters and editors who were typing as fast as they could into a shared document. And because ArchieML's key-value syntax is identical to common journalistic structures like Sanders: 1,000, what a reporter thought was an itemized list could accidentally become unexpected new data fields and an empty post body. I was spending a lot of time writing document pre-processors using regular expressions to try to catch errors at the input level, instead of handling them at the data level, where it would make sense.
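To make that collision concrete, here's a hypothetical post (not one of our actual documents). Under the original parsing rules, each "candidate: count" line matches the key-value pattern, so the delegate list leaks out of the post body into top-level fields:

```
headline: Delegate update
body: Here's where the count stands tonight:

Sanders: 1,000
Biden: 900
```

Instead of a body containing the list, the parsed result looks roughly like {headline: "...", body: "Here's where the count stands tonight:", Sanders: "1,000", Biden: "900"} — new fields the template never asked for, and a post that appears truncated.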

To fix these errors, I wanted to introduce a more explicit multi-line string syntax, as well as offer hooks for input validation and transformation (for example, a way to convert the default string values into native types during parsing). My original impulse was to patch the module offered by the Times to add these features, but it turned out to be more difficult than I'd thought:

  • The parser used many, many regular expressions to process individual lines, which made it more difficult to "read" what it was doing at any given time, or to add to the parsing in an extensible way.
  • It also used a lazy buffering strategy in a single pass, in which different values were added to the output only when the next key was encountered in the document, which made for short but extremely dense code.
  • Finally, it relied on a lot of global scope and nested conditionals in a way that seemed dangerous. In the worst case you'd end up with a condition like !isSkipping && arrayElement.exec(input) && stackScope && stackScope.array && (stackScope.arrayType !== 'complex' && stackScope.arrayType !== 'freeform') && stackScope.flags.indexOf('+') < 0, which I did not particularly want to untangle.

Okay, I thought, how hard can it be to write my own parser? I was a fool. Four days later, I emerged from a trance state with Betty, which manages to pass all the original tests in the repo as well as some of my own for my new syntax. I'm also much more confident in our ability to maintain and patch Betty over time (the ArchieML module on NPM hasn't been updated since 2016).

Betty (who Wikipedia tells me was the mechanic in the comics, appropriately enough) is about twice as large as the original parser was. That size comes from the additional structure in its design: instead of a single pass through the text, Betty builds the final output from three escalating passes.

  1. The tokenizer runs through the document character by character, and outputs a stream of tokens containing one or more characters of text bucketed into different types.
  2. The parser takes that stream of tokens, reassembles them into lines (as is required by the ArchieML spec), and then matches those against syntax patterns to emit a list of assembly instructions (such as setting a key, creating an array, or buffering text).
  3. Finally, the assembler runs the instructions through a final cleanup (consolidating and trimming values) and then uses them to build the output object.
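As a toy illustration of that three-pass structure (this is a sketch of the shape, not Betty's actual code), here's what the pipeline might look like for a bare-bones "key: value" subset of the syntax:

```javascript
// Pass 1: the tokenizer walks the document character by character and
// buckets runs of characters into typed tokens.
function tokenize(text) {
  const tokens = [];
  let buffer = "";
  for (const char of text) {
    if (char === ":" || char === "\n") {
      if (buffer) tokens.push({ type: "TEXT", value: buffer });
      tokens.push({ type: char === ":" ? "COLON" : "NEWLINE", value: char });
      buffer = "";
    } else {
      buffer += char;
    }
  }
  if (buffer) tokens.push({ type: "TEXT", value: buffer });
  return tokens;
}

// Pass 2: the parser reassembles tokens into lines and matches them
// against patterns, emitting assembly instructions instead of output.
function parse(tokens) {
  const instructions = [];
  let line = [];
  for (const token of tokens.concat({ type: "NEWLINE" })) {
    if (token.type === "NEWLINE") {
      const colon = line.findIndex(t => t.type === "COLON");
      if (colon > 0) {
        const key = line.slice(0, colon).map(t => t.value).join("").trim();
        const value = line.slice(colon + 1).map(t => t.value).join("").trim();
        instructions.push({ op: "SET_KEY", key, value });
      }
      line = [];
    } else {
      line.push(token);
    }
  }
  return instructions;
}

// Pass 3: the assembler runs the instructions to build the final object.
function assemble(instructions) {
  const output = {};
  for (const inst of instructions) {
    if (inst.op === "SET_KEY") output[inst.key] = inst.value;
  }
  return output;
}

const doc = "headline: Results are in\nstate: California";
console.log(assemble(parse(tokenize(doc))));
// { headline: "Results are in", state: "California" }
```

The real thing handles scopes, arrays, and multi-line buffering on top of this, but the payoff is the same: each stage's intermediate output (tokens, instructions) can be inspected on its own when something goes wrong.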

Essentially, Betty trades concision for clarity: during debugging, it was handy to be able to look at the intermediate outputs of each stage to see where something went wrong. Each pipeline section is also much more readable, since it only needs to be concerned with one stage of the process, so it uses less global state and does less bookkeeping. The parser, for example, doesn't need to worry about the current object scope or array types, but can simply defer those to the assembler.

But the real win is how easy it is to add new syntax to ArchieML, in ways the original parser simply couldn't accommodate. Our new multi-line type means that editors and reporters can write plain English in posts and not have to worry about colliding with the document syntax in unexpected ways. Switching to Betty also cleaned up the liveblog code substantially, since we can take advantage of the assembler's pipeline hooks: keys are automatically camel-cased (Google Docs likes to sentence-case keys if you're not careful), and values can be converted automatically to JavaScript Date objects or numbers at the earliest stage, rather than during output or templating.
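A hypothetical sketch of what those two transformations do (the function names here are mine for illustration, not Betty's actual hook API):

```javascript
// Normalize a sentence-cased Google Docs key: "Total votes" -> "totalVotes"
function camelCase(key) {
  return key
    .split(/\s+/)
    .map((word, i) =>
      i === 0
        ? word.toLowerCase()
        : word[0].toUpperCase() + word.slice(1).toLowerCase()
    )
    .join("");
}

// Convert numeric strings (including comma-grouped ones) to real numbers
// at parse time, instead of during output or templating.
function coerce(value) {
  const stripped = value.replace(/,/g, "");
  return /^-?\d+(\.\d+)?$/.test(stripped) ? Number(stripped) : value;
}

// Applied to every key/value pair as the output object is assembled:
const raw = { "Total votes": "1,000", "State name": "Vermont" };
const cleaned = {};
for (const [key, value] of Object.entries(raw)) {
  cleaned[camelCase(key)] = coerce(value);
}
console.log(cleaned); // { totalVotes: 1000, stateName: "Vermont" }
```

Doing this inside the parse pipeline means every consumer downstream sees consistent keys and native types, rather than each template re-implementing the cleanup.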

If you'd told me a few years ago that I'd be writing something this complicated, I would have been extremely surprised. I don't have a formal CS background, nor did I ever want one. Parsing is often seen as black magic by self-taught developers. But as I have argued in the past, being able to write even simple parsers is an incredibly valuable skill for data journalism, where odd or proprietary data formats are not uncommon. I hope Betty will not just be a useful library for my work at NPR, but also a valuable teaching tool in the community.
