
October 17, 2016

Filed under: culture»america»usa

Kill 'Em and Leave, by James McBride

"Can we hit it and quit?"

When it came to rhetorical questions, nobody beat James Brown. And the more you learn about him, the more layered his shouts during "Sex Machine" become: behind the "unscripted" banter, he was a harsh and unforgiving despot to his band, driving them relentlessly through a tightly-rehearsed show. No wonder their answer is always "yeah!" Yeah, you can hit it and quit. Whatever you say, James.

It is hard, as a white person born in the '80s, to fully appreciate the impact that James Brown had on America. Michael Jackson is easier for me to grasp: I grew up in a poor, racially diverse neighborhood in Lexington, Kentucky, and I can still remember going over to a friend's house with my brother and seeing portraits of Jackson on the wall, in much the same way that the father in The Commitments keeps a picture of Elvis hung in the living room (right above the pope). James Brown was before my time and out of my sphere, so my appreciation, while sincere, is always more cerebral than heartfelt.

But if you want to get a little closer to understanding, this decade has been a good one for books about the Godfather of Soul. R.J. Smith's endlessly quotable The One came out in 2012, and now James McBride has tackled his legacy with Kill 'Em and Leave. Despite McBride's background as a musician, this isn't a deep dive into soul music. It's also not really a biography, in part because, as McBride discovered, James Brown didn't let anyone into his confidence. He kept everyone at a distance, both fans and friends. The title is a quote, from Brown to the Rev. Al Sharpton: "kill 'em and leave," he'd say, before slipping out from his shows, unseen by the fans waiting outside.

McBride interviews everyone he can find who knew Brown, ranging from his distant cousin, to his tax lawyer, to Alfred "Pee Wee" Ellis, the bandleader during the era of Brown's greatest hits. Many of them are reluctant to talk about him, either because the memories are painful, or because he kept them at arm's length, or both. Instead, what emerges is a kind of portrait of how James Brown left his mark on American culture, by way of his friends, family, and business partners.

It's a tribute to McBride's skill that he's able to weave these disjointed, scattered viewpoints into a compelling narrative. But part of what makes the tale so gripping is that unlike a traditional biography, McBride doesn't stop with his subject's death. In his will, James Brown left millions to fund education scholarships in South Carolina and Georgia, but not a dime was spent: lawsuits from disgruntled family members, and interference from the South Carolina government, immediately tied up the fortune and eventually all but depleted it.

Before he died, Brown told his friends that they wouldn't want to be anywhere near his estate when the end came. Indeed, the fallout was colossal. His attorney and accountant, both men who had helped haul him out from under IRS investigation, were ruined in the process. For all his faults, Brown deeply cared about giving poor kids like him a leg up, and to watch the estate disintegrate this way is painful. It's a tough pill to swallow at the end of a biography.

But let's be clear: any biography that claims to frame its subject neatly for the reader is kind of a fraud anyway. Who was the real James Brown? I don't think McBride really knows — I think he'd say that nobody really knew, not even the man himself. More importantly, he hints that it may be the wrong question to ask. Kill 'Em and Leave chronicles the impact that James Brown had on those around him, how that rippled out through communities (black and otherwise), and how it continues to inspire Americans today. James Brown is gone, McBride argues, but he's still telling our story.

October 3, 2016

Filed under: journalism»articles

Designing news apps for humanity

It's been a busy few weeks, but I do at least have an article up on Source with an overview of my SRCCON session on creating more humane digital journalism.

September 9, 2016

Filed under: tech»web

Classless components

In early August, I delivered my talk on "custom elements in production" to the CascadiaFest crowd. We've been using these new web platform features at the Seattle Times for more than two years now, and I wanted to share the lessons we've learned and encourage others to give them a shot. Apart from some awkward technical problems with the projector, I actually think the talk went pretty well.

One of the big changes in the web component world, which I touched on briefly, is the transition from the V0 API that originally shipped in Chrome to the V1 spec currently being finalized. For the most part, the changeover is not a difficult one: some callbacks have been renamed, and there's a new function used to register the element definition.

There is, however, one aspect of the new spec that is deeply problematic. In V0, to avoid complicated questions around parser timing and integration, elements were only defined using a prototype object, with the constructor handled internally and inheritance specified in the options hash. V1 relies instead on an ES6 class definition, like so:

    class CustomElement extends HTMLElement {
      constructor() {
        super();
      }
    }
    customElements.define("custom-element", CustomElement);

When I wrote my presentation, I didn't think that this would be a huge problem. The conventional wisdom on classes in JavaScript is that they're just syntactic sugar for the existing prototype system — it should be possible to write a standard constructor function that's effectively identical, albeit more verbose.

The conventional wisdom, sadly, is wrong, as became clear once I started testing the V1 API currently available behind a flag in Chrome Canary. In fact, ES6 classes are not just a wrapper for prototypes: specifically, the super() call is not a straightforward translation to older inheritance models, especially when used to extend browser built-ins as it does here. No matter what workarounds I tried, Chrome's V1 custom elements implementation threw errors when passed an ES5 constructor with an otherwise valid prototype chain.
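
To make the problem concrete, here's a rough sketch of the kind of ES5 constructor I was testing (names are illustrative, not the exact code from the talk). The prototype chain is perfectly valid, but Chrome's V1 implementation rejects it anyway:

    // An ES5 "equivalent" of the class definition above. There is no super()
    // to call; invoking the parent constructor directly throws "Illegal
    // constructor", and even without that call, the V1 upgrade machinery
    // errors out when it constructs the element.
    function CustomElement() {
      HTMLElement.call(this);
    }
    CustomElement.prototype = Object.create(HTMLElement.prototype);
    CustomElement.prototype.constructor = CustomElement;
    customElements.define("custom-element", CustomElement);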

In a perfect world, we would just use the new syntax. But at the Seattle Times, we target Internet Explorer 10 and up, which doesn't support the class keyword. That means that we need to be able to write (or transpile to) an ES5 constructor that will work in both environments. Since the specification is written only in terms of classes, I did what you're supposed to do and filed a bug against the spec, asking how to write a backwards-compatible element definition.

It shouldn't surprise me, but the responses from the spec authors were wildly unhelpful. Apple's representative flounced off, insisting that it's not his job to teach people how to use new features. Google's rep closed the bug as irrelevant, stating that supporting older browsers isn't their problem.

Both of these statements are wrong, although only the second is wrong in an interesting way. Obviously, if you work on standards specifications, it is part of your job to educate developers. A spec isn't just for browsers to implement — if it were, it'd be written in a machine-readable language like WebIDL, or as a series of automated tests, not in stilted (but still recognizable) English. Indeed, the same Google representative who closed my issue had previously defended the "tutorial-like" introductory sections elsewhere. Personally, I don't think a little consistency is too much to ask.

But it is the dismissal of older browsers, and the spec's responsibility to them, that I find more jarring. Obviously, a spec for a new feature needs to be free to break from the past. But a big part of the Extensible Web Manifesto, which directly references web components and custom elements, is that the platform should be explainable, and driven by feedback from real web developers. Specifically, it states:

Making new features easy to understand and polyfill introduces a virtuous cycle:
  • Developers can ramp up more quickly on new APIs, providing quicker feedback to the platform while the APIs are still the most malleable.
  • Mistakes in APIs can be corrected quickly by the developers who use them, and library authors who serve them, providing high-fidelity, critical feedback to browser vendors and platform designers.
  • Library authors can experiment with new APIs and create more cow-paths for the platform to pave.

In the case of the V1 custom elements spec, feedback from developers is being ignored — I'm not the only person that has complained publicly about the way that the class-based definitions are a pain to use in a mixed-browser environment. But more importantly, the spec is actively hostile to polyfills in a way that the original version was not. Authors currently working to shim the V1 API into browsers have faced three problems:

  1. Calling super() invokes magic that's hard to reproduce in ES5, and needlessly so.
  2. HTMLElement isn't a callable function in older environments, and has to be awkwardly monkey-patched.
  3. Apple publicly opposes extending anything other than the generic HTMLElement, and has only allowed it into the spec so they can kill it later.

The end result is that you can write code that will work in old and new browsers, but it won't exactly look like real V1 code. It's not a true polyfill, more of a mini-framework that looks almost — but not exactly! — like the native API.
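
For the record, the workaround most shims settle on looks roughly like this sketch (it assumes Reflect.construct exists, which it doesn't in IE10, so the real polyfills have to patch even further):

    // Allocate a real HTMLElement via Reflect.construct instead of super().
    function CustomElement() {
      // Real shims pass new.target here so subclasses work; using the
      // constructor directly keeps this sketch ES5-only.
      return Reflect.construct(HTMLElement, [], CustomElement);
    }
    CustomElement.prototype = Object.create(HTMLElement.prototype);
    CustomElement.prototype.constructor = CustomElement;
    customElements.define("custom-element", CustomElement);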

I find this frustrating in part for its inelegance, but more so because it fundamentally puts the lie to the principles of the extensible web. You can't claim that you're explaining the capabilities of the platform when your API is polyfill-hostile, since a polyfill is the mechanism by which we seek to explain and extend those capabilities.

More importantly, there is no surer way to slow adoption of a web feature than to artificially restrict its usage, and to refuse to educate developers on how to use it. The spec didn't have to be this way: they could detail ES5 semantics, and help people who are struggling, but they've chosen not to care. As someone who literally stood on a stage in front of hundreds of people and advocated for this feature, that's insulting.

Contrast the bullying attitude of the custom elements spec authors with the advocacy that's been done on behalf of Service Worker. You couldn't swing a dead cat in 2016 without hitting a developer advocate talking up their benefits, creating detailed demos, offering advice to people trying them out, and talking about how they gracefully degrade in older browsers. As a result, chances are good that Service Worker will ship in multiple browsers, and see widespread adoption, by the end of next year.

Meanwhile, custom elements will probably languish in relative obscurity, as they've done for many years now. It's a shame, because I'd argue that the benefits of custom elements are strong enough to justify using them even via the old V0 polyfill. I still think they're a wonderful way to build and declare UI, and we'll keep using them at the Times. But whatever wider success they achieve will be despite the spec, not because of it. It's a disgrace to the idea of an extensible web. And the authors have only themselves to blame.

August 10, 2016

Filed under: tech»web

RIP Chrome apps

Update: Well, that was prescient.

At least once a day, I log into the Chrome Web Store dashboard to check on support requests and see how many users I've still got. Caret has held steady for the last year or so at about 150,000 active users, give or take ten thousand, and the support and feature requests have settled into a predictable rut:

  • People who can't run Caret because their version of Chrome is too old, and I've started using new ES6 features that aren't supported six browser versions back.
  • People who want split-screen support, and are out of luck barring a major rewrite.
  • People who don't like the built-in search/replace functionality, which makes sense, because it's honestly pretty terrible.
  • People who don't like the icons, and are just going to have to get over it.

In a few cases, however, users have more interesting questions about the fundamental capabilities of developer tooling, like file system monitoring or plugging into the OS in a deeper way. And there I have bad news, because as far as I can tell, Chrome apps are no longer actively developed by the Chromium team at all, and probably never will be again.

I don't think Chrome apps are going away immediately — they're still useful and used by a lot of third-party companies — but it's pretty clear from the dev side of things that Google's heart isn't in it anymore. New APIs have ceased to roll out, and apps don't get much play at conferences. The new party line is all about progressive web apps, with browser extensions for the few cases where you need more capabilities.

Now, progressive web apps are great, and anything that moves offline applications away from a single browser and out to the wider web is a good thing. But the fact remains that while a large number of Chrome apps can become PWAs with little fuss, Caret can't. Because it interacts with the filesystem so heavily, in a way that assumes a broader ecosystem of file-based tools (like Git or Node), there's actually no path forward for it using browser-only APIs. As such, it's an interesting litmus test for just how far web apps can actually reach — not, as some people have wrongly assumed, because there's an inherent performance penalty on the web, but because of fundamental limits in the security model of the browser.

Bounding boxes

What's considered "possible" for a web app in, say, 2020? It may be easier to talk about what isn't possible, which avoids the judgment call on what is "suitable." For example, it's a safe bet that the following capabilities won't ever be added to the web, even though they've been hotly debated in and out of standards committees for years:

  • Read/write file access (died when the W3C pulled the plug on the Directories part of the Filesystem API)
  • Non-HTTP sockets and networking (an endless number of reasons, but mostly "routers are awful")

There are also a bunch of APIs that are in experimental stages, but which I seriously doubt will see stable deployment in multiple browsers, such as:

  • Web Bluetooth (enormous security and usability issues)
  • Web USB (same as Bluetooth, but with added attacks from the physical connection)
  • Battery status (privacy concerns)
  • Web MIDI

It's tough to get worked up about a lot of the initiatives in the second list, which mostly read as a bad case of mobile envy. There are good reasons not to let a web page have drive-by access to hardware, and who's hooking up a MIDI keyboard to a browser anyway? The physical web is a better answer to most of these problems.

When you look at both lists together, one thing is clear: Chrome apps have clearly been a testing ground for web features. Almost all the not-to-be-implemented web APIs have counterparts in Chrome apps. And in the end, the web did learn from it — mainly that even in a sandboxed, locked-down, centrally distributed environment, giving developers that much power with so little install friction could be really dangerous. Rogue extensions and apps are a serious problem for Chrome, as I can attest: about once a week, shady people e-mail me to ask if they can purchase Caret. They don't explicitly say that they're going to use it to distribute malware and takeover ads, but the subtext is pretty clear.

The great thing about the web is that it can run code without any installation step, but that's also the worst thing about it. Even as a huge fan of the platform, the idea that any of the uncountable pages I visit in any given week could access USB directly is pretty chilling, especially when combined with exploits for devices that are plugged in, like hacking a phone (a nice twist on the drive-by jailbreak of iOS 4). Access to the file system opens up an even bigger can of worms.

Basically, all the things that we want as developers are probably too dangerous to hand out to the web. I wish that weren't true, but it is.

Untrusted computing

Let's assume that all of the above is true, and the web can't safely expand for developer tools. You can still build powerful apps in a browser; they just have to be supported by a server. For example, you can use a service like Cloud 9 (now an AWS subsidiary) to work on a hosted VM. This is the revival of the thin-client model: offline capabilities in a pinch, but ultimately you're still going to need an internet connection to get work done.

In this vision, we are leaning more on the browser sandbox: creating a two-tier system with the web as a client runtime, and a native tier for code that needs more trust on the local machine. But how far can that sandbox be trusted? Can the web be made safe? Is it safe now? The answer is, at best, "it depends." Every third-party embed or script exposes your users to risk — if you use an ad network, you don't have any real idea who could be reading their auth cookies or tracking their movements. The miracle of the web isn't that it is safe, it's that it manages to be useful despite how rampantly unsafe its defaults are.

So along with the shift back to thick clients has come a change in the browser vendors' attitude toward powerful API features. For example, you can no longer use geolocation or the camera/microphone in Chrome on pages that aren't served over HTTPS, with other browsers to follow. Safari already disallows third-party cookie access as a general rule. New APIs, like Service Worker, require HTTPS. And I don't think it's hard to imagine a world where an API also requires a strict Content Security Policy that bans third-party embeds altogether (another place where Chrome apps led the way).
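
To make that concrete, a header along these lines (a hypothetical example, not a requirement in any current spec) is the kind of policy that would ban third-party embeds outright:

    Content-Security-Policy: default-src 'self'; object-src 'none'; frame-src 'none'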

The packaged app security model was that if you put these safeguards into place and verified the package contents, you could trust the code to access additional capabilities. But trusting the client was a mistake when people were writing Quakebots, and it stayed a mistake in the browser. In the new model, those controls are the minimum just to keep what you had. Anything extra that lives solely on the client is going to face a serious uphill battle.

Mind the gap

The longer that I work on Caret, the less I'm upset by the idea that its days are numbered. Working on a moderately-successful open source project is exhausting: people have no problems making demands, sending in random changes, or asking the same questions over and over again. It's like having a second boss, but one that doesn't pay me or offer me any opportunities for advancement. It's good for exposure, but people die from exposure.

The one regret that I will have is the loss of Caret's educational value. Since its early days, there's been a small but steady stream of e-mail from teachers who are using it in classrooms, both because Chromebooks are huge in education and because Caret provides a pretty good editor with almost no fuss (you don't even have to be signed in). If you're a student, or poor, or a poor student, it's a pretty good starter option, with no real competition for its market niche.

There are alternatives, but they tend to be online-only (like Mozilla's Thimble) or they're not Chromebook friendly (Atom) or they're completely unacceptable in a just world (Vim). And for that reason alone, I hope Chrome keeps packaged apps around, even if they refuse to spend any time improving the infrastructure. Google's not great at end-of-life maintenance, but there are a lot of people counting on this weird little ecosystem they've enabled. It would be a shame to let that die.

August 1, 2016

Filed under: tech»web


On Thursday, I'll be giving a talk at CascadiaFest on using custom elements in production. It's kind of a sales pitch, to convince people that adopting web components is safe to do, despite the instability of the spec and the contentious politics between browsers. After all, we've been publishing with several components at the Times for almost two years now, with good results.

When I presented an early version of this talk at SeattleJS, I did it by scrolling through a single text file instead of slides, because I've always wanted to do that. But for Cascadia, I wanted to do something a little more special, so I built the presentation itself out of custom elements, with the goal that it would demonstrate how to write code that works with both versions of the spec. It's also meant to be a good example for someone who's just learning how web components function — I use pretty much every custom elements feature at one point or another in 300 lines of code. You can take a look at the source for it here.

There are several strategies that I ended up emphasizing while writing the <slide-show> elements, primarily the heavy use of events to tame asynchronicity. It turns out that between V0, V1, and the two major polyfills, elements and their attributes are resolved by the parser with entirely different timing. It's really important that child elements notify their parent when they upgrade, and parents shouldn't assume that children are ready at startup.
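
As a hedged sketch of that pattern (tag and event names are invented for illustration, and it's written against the V0 registerElement API for brevity):

    // Child slide: announce readiness once it has upgraded and attached.
    // (attachedCallback in V0; connectedCallback is the V1 equivalent.)
    var SlideProto = Object.create(HTMLElement.prototype);
    SlideProto.attachedCallback = function() {
      this.dispatchEvent(new CustomEvent("slide-ready", { bubbles: true }));
    };
    document.registerElement("text-slide", { prototype: SlideProto });

    // Parent show: listen for that signal instead of assuming children exist.
    var ShowProto = Object.create(HTMLElement.prototype);
    ShowProto.attachedCallback = function() {
      this.addEventListener("slide-ready", function(e) {
        // e.target is a fully upgraded child element at this point
      });
    };
    document.registerElement("slide-show", { prototype: ShowProto });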

One way to deal with asynchronous upgrades is just to put all your functionality in the parent element (our <leaflet-map> does this), but I wanted to make these slides easier to extend with new types (such as text, code, or image slides). In this case, the slide show looks for a parsedContent property on the current slide, and it's the child's job to populate and update that value. An earlier version called a parseContents() method, but using properties as "duck-typing" makes it much easier to handle un-upgraded elements, and moving the responsibility to the child also greatly simplified the process of watching slide contents for changes.
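
Continuing the hypothetical ShowProto from the sketch above (display() and currentIndex are invented bookkeeping, not the real slide-show internals), the duck-typing looks roughly like this:

    // The parent only cares that the current child exposes parsedContent.
    ShowProto.renderCurrent = function() {
      var slide = this.children[this.currentIndex];
      if (slide && "parsedContent" in slide) {
        // upgraded child, of whatever type: it has populated the property
        this.display(slide.parsedContent);
      } else if (slide) {
        // un-upgraded child: fall back to its raw text until it's ready
        this.display(slide.textContent);
      }
    };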

A nice side effect of using live properties and events is that it "feels" a lot more like a built-in element. The modern DOM API is built on similar primitives, so writing the glue code for the UI ended up being very pleasant, and it's possible to interact using the dev tools in a natural way. I suspect that well-built component libraries in the future will be judged on how well they leverage a declarative interface to blend in with existing elements.

Ironically, between child elements and Shadow DOM, it's actually much harder to move between different polyfills than it is to write an element definition for both the new and old specifications. We've always written for Giammarchi's registerElement shim at the Times, and it was shocking for me to find out that Polymer's shim not only diverges from its counterpart, but also differs from Chrome's native implementation. Coding around these differences took a bit of effort, but it's probably work I should have done at the start, and the result is quite a bit nicer than some of the hacks I've done for the Times. I almost feel like I need to go back now and update them with what I've learned.

Writing this presentation was a good way to make sure I was current on the new spec, and I'm actually pretty happy with the way things have turned out. When WebKit started prototyping their own API, I started to get a bit nervous, but the resulting changes are relatively minor: some property names have changed, the lifecycle is ordered a bit differently, and upgrade code is called in the constructor (to encourage using the class syntax) instead of from a createdCallback() method. Most of these are positive alterations, and while there are some losses going from V0 to V1 (no is attribute to subclass arbitrary elements), they're not dealbreakers. Overall, I'm more optimistic about the future of web components than I have been in quite a while, and I'm looking forward to telling people about it at Cascadia!

July 14, 2016

Filed under: gaming»perspective

Emu Nation

It's hard to hear news of Nintendo creating a tiny, $60 NES package and not think of Frank Cifaldi's provocative GDC talk on emulation. Cifaldi, who works on game remastering and preservation (most recently on a Mega Man collection), covers a wide span of really interesting industry backstory, but his presentation is mostly infamous for the following quote:

The virtual console is nothing but emulations of Nintendo games. And in fact, if you were to download Super Mario Brothers on the Wii Virtual Console...

[shows a screenshot of two identical hex filedumps]

So on the left there is a ROM that I downloaded from a ROM site of Super Mario Brothers. It's the same file that's been there since... it's got a timestamp on it of 1996. On the right is Nintendo's Virtual Console version of Super Mario Brothers. I want you to pay particular attention to the hex values that I've highlighted here.

[the highlighted sections are identical]

That is what's called an iNES header. An iNES header is a header format developed by amateur software emulators in the 90's. What's that doing in a Nintendo product? I would posit that Nintendo downloaded Super Mario Brothers from the internet and sold it back to you.

As Cifaldi notes, while the industry has taken a strong official anti-emulation stance for years, emulation has also become a regular revenue stream, for Nintendo in particular. In fact, Nintendo has used scaremongering about emulation to monopolize the market for any games that were published on its old consoles. In this case, the miniature NES coming to market in November is almost certainly running an emulator inside its little plastic casing. It's not so much that they're opposed to emulation as that they're opposed to emulation they can't milk for cash.

To fully understand how demented this has become, consider the case of Yoshi's Island, which is one of the greatest platformers of the 16-bit era. I am terrible at platformers, but I love this game so much that I've bought it at least three times: once in the Game Boy Advance port, once on the Virtual Console, and once as an actual SNES cartridge back when Belle and I lived in Arlington. Nintendo made money on at least two of those copies. But now that we've sold our Wii, if I want to play Yoshi's Island again, even though I have owned three legitimate copies of the game, I would still have to give Nintendo more money. Or I could grab a ROM and an emulator, which seems infinitely more likely.

By contrast, I recently bought a copy of Doom, because I'd never played through the other two episodes. It ran me about $5 on Steam, and consists of the original WAD files, the game executable, and a preconfigured version of DOSBox that hosts it. I immediately went and installed Chocolate Doom to run the game fullscreen with better sound support. If I want to play Doom on my phone, or on my Chromebook, or whatever, I won't have to buy it again. I'll just copy the WAD. And since I got it from Steam, I'll basically have a copy on any future computers, too.

(Episode 1 is definitely the best of the three, incidentally.)

Emulation is also at the core of the Internet Archive's groundbreaking work to preserve digital history. They've preserved thousands of games and pieces of software via browser ports of MAME, MESS, and DOSBox. That means I can load up a copy of Broderbund Print Shop and relive summer at my grandmother's house, if I want. But I can also pull up the Canon Cat, a legendary and extremely rare experiment from one of the original Macintosh UI designers, and see what a radically different kind of computing might look like. There's literally no other way I would ever get to experience that, other than emulating it.

The funny thing about demonizing emulation is that we're increasingly entering an era of digital entertainment that may be unpreservable with or without it. Modern games are updated over the network, plugged into remote servers, and (on mobile and new consoles) distributed through secured, mostly-inaccessible package managers on operating systems with no tradition of backward compatibility. It may be impossible, 20 years from now, to play a contemporary iOS or Android game, similar to the way that Blizzard themselves can't recreate a decade-old version of World of Warcraft.

By locking software up the way that Nintendo (and other game/device companies) have done, as a single-platform binary and not as a reusable data file, we're effectively removing them from history. Maybe in a lot of cases, that's fine — in his presentation, Cifaldi refers offhand to working on a mobile Sharknado tie-in that's no longer available, which is not exactly a loss for the ages. But at least some of it has to be worth preserving, in the same way even bad films can have lessons for directors and historians. The Canon Cat was not a great computer, but I can still learn from it.

I'm all for keeping Nintendo profitable. I like the idea that they're producing their own multi-cart NES reproduction, instead of leaving it to third-party pirates, if only because I expect their version will be slicker and better-engineered for the long haul. But the time has come to stop letting them simultaneously re-sell the same ROM to us in different formats, while insisting that emulation is solely the concern of pirates and thieves.

June 20, 2016

Filed under: culture»america»race_and_class

Under our skin

This week, we've launched a major project at the Times on the words people use when talking about race in America. Under our skin was spearheaded by a small group of journalists after the paper came under fire for some bungled coverage. I think they did a great job — the subjects are well-chosen, the editing is top-notch, and we're trying to supplement it with guest essays and carefully-curated comments (as opposed to our usual all-or-nothing approach to moderation). I mostly watched from the sidelines on this one, as our resident expert on forcing Brightcove video to behave in a somewhat-acceptable manner, and it was really fascinating watching it take shape.

May 26, 2016

Filed under: random»personal

Speaking schedule, 2016

After NICAR, I wasn't really sure I ever wanted to go to any conferences ever again — the travel, the hassle, the expense... who needs it? But I am also apparently unable to moderate my extracurricular activities in any way, even after leaving a part-time teaching gig, so: I'm happy to announce that I'll be speaking at a couple of professional conferences this summer, albeit about very different topics.

First up, I'll be facilitating a session at SRCCON in Portland about designing humane news sites. This is something I've been thinking about for a while now, mostly with regards to bots and "conversational UI" fads, but also as the debate around ads has gotten louder, and the ads themselves have gotten worse. I'm hoping to talk about the ways that we can build both individual interactives and content management systems so that we can minimize the amount of accidental harm that we do to our readers, and retain their trust.

My second talk will be at CascadiaFest in beautiful Semiahmoo, WA. I'll be speaking on how we've been using custom elements in production at the Times, and encouraging people to build their own. The speaker list at Cascadia is completely bonkers: I'll be sharing a stage with people who I've been following for years, including Rebecca Murphey, Nolan Lawson, and Marcy Sutton. It's a real honor to be included, and I've been nervously rewriting my slides ever since I got in.

Of course, by the end of the summer, I may never want to speak publicly again — I may burn my laptop in a viking funeral and move to Montana, where I can join our departing editor in some kind of backwoods hermit colony. But for right now, it feels a lot like the best parts of teaching (getting to show people cool stuff and inspire them to build more) without the worst parts (grading, the school administration).

May 10, 2016

Filed under: tech»web

Behind the Times

The paper recently launched a new native app. I can't say I'm thrilled about that, but nobody made me CEO. Still, the technical approach it takes is "interesting": its backing API converts articles into a linear stream of blocks, each of which is then hand-rendered in the app. That's the plan, at least: at this time, it doesn't support non-text inline content at all. As a result, a lot of our more creative digital content doesn't appear in the app, or is distorted when it does appear.

The justification given for this decision was speed, with the implicit statement being that a webview would be inherently too slow to use. But is that true? I can't resist a challenge, and it seemed like a great opportunity to test out some new web features I haven't used much, so I decided to try building a client. You can find the code here. It's currently structured as a Chrome app, but that's just to get around the CORS limit since our API doesn't have the Access-Control-Allow-Origin headers added.

The app uses a technique that's been popularized by Nolan Lawson, in which almost all of the time-consuming code runs in a Web Worker, and the main thread just handles capturing UI events and re-rendering. I started out with the worker process handling network and caching in IndexedDB (the poor man's Service Worker), and then expanded it to do HTML sanitization as well. There's probably other stuff I could move in, but honestly I think it's at a good balance now.

By putting all this stuff into a second script that runs independently, it frees up the browser to maintain a smooth frame rate in animations and UI response. It's not just the fact that I'm doing work elsewhere, but also that there's hardly any garbage collection on the main thread, which means no halting while the JavaScript VM cleans up. I thought building an app this way would be difficult, but it turns out to be mostly similar to writing any page that uses a lot of AJAX — structure the worker as a "server" and the patterns are pretty much the same.
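
A minimal sketch of that worker-as-server structure (file names, message types, and the loadArticle()/sanitize() helpers are all invented for illustration):

    // main.js: treat the worker like a tiny request/response server
    var worker = new Worker("worker.js");
    var pending = {};
    var counter = 0;

    function request(type, data, callback) {
      var id = counter++;
      pending[id] = callback;
      worker.postMessage({ id: id, type: type, data: data });
    }

    worker.onmessage = function(e) {
      var callback = pending[e.data.id];
      delete pending[e.data.id];
      if (callback) callback(e.data.result);
    };

    // worker.js: do the slow work (network, IndexedDB, sanitization) here
    self.onmessage = function(e) {
      var msg = e.data;
      if (msg.type == "article") {
        loadArticle(msg.data).then(function(article) {
          self.postMessage({ id: msg.id, result: sanitize(article) });
        });
      }
    };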

The other new technology that I learned for this project is Mithril, a virtual DOM framework that my old coworkers at ArenaNet rave about. I'm not using much of its MVC architecture, but its view rendering code is great at gradually updating the page as the worker sends back new data: I can generate the initial article list using just the titles that come from one network endpoint, and then add the thumbnails that I get from a second, lower-priority request. Readers get a faster feed of stories, and I don't have to manually synchronize the DOM with the new data.
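
As a rough sketch of what that gradual update looks like (assuming Mithril's m() and m.render(), with an invented data shape):

    // Render whatever data we have; calling it again just patches the DOM.
    function articleList(articles) {
      return m("ul.stories", articles.map(function(article) {
        return m("li", [
          article.thumbnail ? m("img", { src: article.thumbnail }) : null,
          m("span.headline", article.title)
        ]);
      }));
    }

    // First pass: titles only. Later passes: same call, now with thumbnails.
    m.render(document.querySelector("main"), articleList(articles));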

The metrics from this version of the app are (unsurprisingly) pretty good! The biggest slowdown is the network, which would also be a problem in native code: loading the article list for a section requires one request to get the article IDs, and then one request for each article in that section (up to 21 in total). That takes a while — about a second, on average. On the other hand, it means we have every article cached by the time that the user can choose something to read, so the time to request and load an individual article hovers around 150ms on my Chromebook.

That's not to say that there aren't problems, although I think they're manageable. For one thing, the worker and app bundles are way too big right now (700KB and 200KB, respectively), in part because they're pulling in a bunch of big NPM modules to do their processing. These should be lazy-loaded for speed as much as possible: we don't need HTML parsing right away, for example, which would cut a good 500KB off of the worker's initial size. Every kilobyte of script is roughly 1ms of load time on a mobile device, so spreading that out will drastically speed up the app's startup time.
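
One way to get there (a sketch, assuming the HTML parser can be split into its own bundle) is to have the worker pull in the heavy modules only on first use, via importScripts():

    // worker.js: defer the parser bundle until the first sanitize request
    var parserLoaded = false;

    function ensureParser() {
      if (!parserLoaded) {
        importScripts("parser-bundle.js"); // hypothetical split-out bundle
        parserLoaded = true;
      }
    }

    self.onmessage = function(e) {
      if (e.data.type == "sanitize") {
        ensureParser();
        // ...hand the markup to the now-loaded parser
      }
    };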

As an interesting side note, we could cut almost all that weight entirely if the document.implementation object was available in Web Workers. Weir, for example, does all its parsing and sanitization in an inert document. Unfortunately, the DOM isn't thread-safe, so nothing related to document is available outside the main process, and I suspect a serious sanitization pass would blow past our frame budget anyway. Oh well: htmlparser2 and friends it is.

Ironically, the other big issue is mostly a result of packaging this up as a Chrome app. While that lets me talk to the CMS without having CORS support, it also comes with a fearsome content security policy. The app shell can't directly load images or fonts from the network, so we have to load article thumbnails through JavaScript manually instead. Within Chrome's <webview> tag, we have the opposite problem: the webview can't load anything from the app, and it has a weird protocol location when loaded from a data URL, so all relative links have to be rewritten. It's not insurmountable, but you have to be pretty comfortable with the way browsers work to figure it out, and the debugging can get a little hairy.
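
The thumbnail workaround, roughly sketched (this assumes the remote host is whitelisted under "permissions" in the app manifest; names are illustrative):

    // The app's CSP blocks remote <img src> URLs, so request the bytes
    // ourselves and hand the element a local blob: URL instead.
    function loadThumbnail(img, url) {
      return fetch(url)
        .then(function(response) { return response.blob(); })
        .then(function(blob) {
          img.src = URL.createObjectURL(blob);
        });
    }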

So there you have it: a web app that performs like native, but includes support for features like DocumentCloud embeds or interactive HTML graphs. At the very least, I think you could use this to advocate for a hybrid native/web client on your news site. But there's a strong argument to be made that this could be your only app: add a Service Worker and (in Chrome and Firefox) it could load instantly and work offline after the first visit. It would even get a home screen icon and push notification support. I think the possibilities for progressive web apps in the news industry are really exciting, and building this client makes me think it's doable without a huge amount of extra work.

April 29, 2016

Filed under: journalism»education

Reporting with Python

This month, I'm teaching a class at the University of Washington on reporting with Python. This seems like an odd match for me, since I hardly ever work with Python, but I wanted to do a class that was more journalism-focused (as opposed to the front-end development that I normally teach), and teaching first-time programmers how to do data analysis in Node just isn't realistic. If you're interested in following along, the repository with the class materials is located here.

I'm not the Times' data reporter, so I don't get to do this kind of analysis often, but I always really enjoy it when I do. The danger when planning a class on a fun topic is that it's easy to over-stuff the curriculum in my eagerness to cover the techniques that I think are particularly interesting. To fight that impulse, I typically make a list of material I want to cover, then cut it in half, then think about cutting it in half again. As a result, there's a lot of stuff that didn't make it in — SQL and web scraping primarily among them.

What's left, however, is a pretty solid base for reporters who are interested in starting to use code to generate and explore stories. Last week, we cleaned and searched 1,000 text files for a string, and this week we'll look at doing analysis on CSV files. In the final session, I'm planning on taking a deep dive into regular expressions: so much of reporting is based around interrogating text files, and the nice thing about an education in regex is that it will travel into almost any programming language (as well as being useful for many command line tools like grep or sed).

If I can get anything across in this class, I'm hoping to leave students with an understanding of just how big digital scale can be, and how important it is to have tools for handling it. I was talking one night with one of the Girl Develop It organizers, who works for a local analytics company. Whereas millions of rows of data is a pretty big deal for me, for her it's a couple of hours on a Saturday — she's working at a whole other order of magnitude. I wouldn't even know where to start.

Right now, most record requests and data dumps operate more at my scale. A list of all animal imports/exports in the US for the last ten years is about 7 million records, for example. That's approachable with Python, although you'd be better off learning some SQL for the heavy lifting, but it's past the point where Excel is useful, and it certainly couldn't be explored by hand. If you can't code, or you don't have access to someone who does, you can't write that story.

At some point, the leaks and government records that reporters pore over may grow to a larger kind of scale (leaks, certainly; government data will probably stay aggregated as long as there are privacy concerns). When that happens, reporters will have to develop the kinds of skills that I don't have. We already see hints of this in the tremendous tooling and coordination required to investigate the Panama Papers. But in the meantime, I think it's tremendously important that students learn how to automate data at a basic level, and I'm really excited that this class will introduce them to it.
