this space intentionally left blank

January 5, 2011

Filed under: journalism»new_media»data_driven

Query, But Verify

About a month back, a prominent inside-the-Beltway political magazine ran a story on Tea Party candidates and earmarks, claiming that anti-earmark candidates were responsible for $1 billion in earmarks over 2010. I had just finished building a comprehensive earmark package based on OMB data, so naturally my editor sent me a link to the story and asked me to double-check their math. At first glance, the numbers generally matched--but on a second examination, the article's total double- and triple-counted earmarks co-sponsored by members of the Tea Party "caucus." Adjusting my query to remove non-distinct earmark IDs knocked about $100 million off the total--not really that much in the big picture (the sum still ran more than $900 million), but enough to fall below the headline-ready "more than $1 billion" mark. It was also enough to make it clear that the authors hadn't really understood what they were writing about.

In general, I am in favor of journalists learning how to leverage databases for better analysis, but it's an easy technology to misuse, accidentally--or even on purpose. There's a truism that the skills required to interpret statistics go hand in hand with the skills used to misrepresent them, and nowhere is that more pertinent than in the newsroom. Reporters and editors entering the world of data journalism need to hold onto the same critical skills they would use for any other source, not be blinded by the ease with which they can reach a catchy figure.

That said, journalists would do well to learn about these tools, especially in beats like economics and politics, if only to be able to spot their abuses. And there are three strong arguments for using databases (carefully!) for reporting: improving newsroom mathematical literacy, asking questions at modern scale, and making connections easier.

First, it's no secret that journalists and math are often uneasy bedfellows--a recent Washington Post ombudsman piece explored some of the reasons why numerical corrections are so common. In short: we're an industry of English majors whose eyes cross when confronted with simple sums, and so we tend to take numbers at face value even during the regular copy-editing process.

These anxieties are signs of a deeper problem that needs to be addressed, and there's nothing magical about SQL that will fix them overnight. But I think database training serves two purposes. First, it acclimatizes users to dealing with large sets of numbers, like treating nosocomephobia with a nice long hospital stay. Second, it reveals the dirty secret of programming, which is that it involves a lot of math process, but relatively little actual adding or subtracting, especially in query languages. Databases are a good way to get comfortable with numbers without having to actually touch them directly.

Ultimately, journalists need to be comfortable with numbers, because they're becoming an institutional hazard. While the state of government (and private-sector) data may still leave a lot to be desired from a programmer's point of view, it's practically flooded out over the last few years, with machine-readable formats becoming more common. This mass of data is increasingly unmanageable via spreadsheet: there are too many rows, too many edge cases, and too much filtering required. Doing it by hand is a pipe-dream. A database, on the other hand, is designed to handle queries across hundreds of thousands of rows or more. Languages like SQL let us start asking questions at the necessary scale.

Finally, once we've gotten over a fear of numbers and begun to take large data sets for granted, we can start using relational databases to make connections between data sets. This synthesis is a common visualization task that is difficult to do by hand--mapping health spending against immigration patterns, for example--but it's reasonably simple to do with a query in a relational database. The results of these kinds of investigations may not even be publishable, but they are useful--searching for correlation is a great jumping-off point for further reporting. One of the best things I've done for my team lately is set up a spare box running PostgreSQL, which we use for uploading, combining, searching, and then outputting translated versions of data, even in static form.

As always when I write these kinds of posts, remember that there is no Product X for saving journalism. Adding a database does not make your newsroom Web 2.0, and (see the example I opened with) it's not a magic bullet for better journalism. But new technology does bring opportunities for our industry, if we can avoid the Product X hype. The web doesn't save newspapers, but it can (and should) make sourcing better. Mobile apps can't save subscription revenues, but they offer better ways to think about presentation. And databases can't replace an informed, experienced editor, but they can give those journalists better tools to interrogate the world.

Past - Present