Once more with feeling: today, I'm happy to bring you my last CQ vote study interactive. This version is something special: although it lacks the fancy animations of its predecessor, it offers a full nine years of voting data, and it does so faster and in more detail. Previously, we had only offered data going back to 2009, or a separate interactive showing the Bush era composite scores.
We had talked about this three-pane presentation at CQ as far back as two years ago, in a discussion with the UX team on how they could work together with my multimedia team. Our goal was to lower the degree to which a user had to switch manually between views, and to visually reinforce what the scatter plot represents: a spatial view of party discipline. I think it does a pretty good job, although I do miss the pretty transitions between different graph types.
Technically speaking, loading nine years of votestudy data was a challenge: that's almost 5,000 scores to collect, organize, and display. The source files necessarily separate member biodata (name, district, party, etc.) from the votestudy data, since putting the two into the same data structure would bloat the file size from repetition (many members served in multiple years). But keeping them separate causes a lag problem while interacting with the graphic: doing lookups based on XML queries tends to be very slow, particularly over 500K of XML.
I tried a few tricks to find a balance between real-time lookup (slow interaction, quick initial load) and a full preprocessing step (slow initial load, quick interactions). In the end, I went with an approach that processes each year when it's first displayed, adding biodata to the votestudy data structure at that time, and caching member IDs to minimize the lookup time on members who persist between years. The result is a slight lag when flipping between years or chambers for the first time, but it's not enough to be annoying and the startup time remains quick.
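As a minimal sketch of that lazy approach (not the interactive's actual code), the idea looks roughly like this. The data shapes, the names `members`, `scoresByYear`, `findMember`, and `getYear`, and the sample values are all assumptions for illustration; the real source files are XML.

```javascript
// Hypothetical stand-ins for the two separate source files.
const members = [
  { id: "M1", name: "Member One", party: "D" },
  { id: "M2", name: "Member Two", party: "R" }
];
const scoresByYear = {
  2009: [{ memberId: "M1", unity: 92 }, { memberId: "M2", unity: 88 }],
  2010: [{ memberId: "M1", unity: 95 }]
};

const memberCache = {};    // member ID -> biodata, filled on first lookup
const processedYears = {}; // year -> merged rows, filled on first display

function findMember(id) {
  if (!(id in memberCache)) {
    // Stand-in for the slow XML query: a linear scan, done once per member.
    memberCache[id] = members.find(function (m) { return m.id === id; }) || null;
  }
  return memberCache[id];
}

function getYear(year) {
  if (!processedYears[year]) {
    // First display of this year: merge biodata into each score record.
    processedYears[year] = (scoresByYear[year] || []).map(function (score) {
      return Object.assign({}, findMember(score.memberId), score);
    });
  }
  return processedYears[year];
}
```

The first call to `getYear` for a given year pays the merge cost; later calls return the cached rows, and members who appear in several years are only looked up once.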
(In a funny side note, working with just the score data is obscenely quick. It's fast enough, in fact, that I can run through all nine years to find the bounds for the unity part of the graph, to keep it consistent from year to year, in less than a millisecond. That's fast enough that I can be lazy and do that before every re-render--as long as I don't need any names. Don't optimize prematurely, indeed.)
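For the curious, the bounds pass is about as simple as it sounds. This is only a sketch with made-up scores and a hypothetical `unityBounds` helper, assuming the scores are already in memory as plain numbers grouped by year:

```javascript
// Hypothetical unity scores by year; the real data covers nine years.
const unityScores = {
  2002: [71, 88, 99],
  2003: [64, 90],
  2004: [80, 97]
};

// Scan every score in every year to find the overall min and max,
// so the axis can stay fixed while the user flips between years.
function unityBounds(byYear) {
  let min = Infinity;
  let max = -Infinity;
  for (const year of Object.keys(byYear)) {
    for (const score of byYear[year]) {
      if (score < min) min = score;
      if (score > max) max = score;
    }
  }
  return { min: min, max: max };
}
```

No names are needed, so no lookups happen, which is why this pass stays cheap enough to repeat on every render.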
The resulting graphic is typical of CQ interactives, in that it's a direct view on our data without a strong editorial perspective--we don't try to hammer a story through here. That said, I think there's some interesting information that emerges when you can look at single years of data going back to 2002:
Finally, I did mention that this is my last CQ votestudy interactive. It's been a fantastic ride at Congressional Quarterly, and I'm grateful for the opportunities and education I received there. But it's time to move on, and to find something closer to home here in Seattle: at the end of this month, I'll be starting in a new position, doing web development at Big Fish Games. Wish me luck!
As the deadlines creep forward for the Joint Special Committee on Deficit Reduction, my team at CQ has put together a package of new and recent debt interactives covering the automatically-triggered budget cuts, the proposals on the table, the schedule set for committee action, and more.
The centerpiece of the package is a "reactive document" showing how the automatic cuts will go into effect if Congress does not pass cuts totalling $1.2 trillion by January 15. A series of sliders set the size of the hypothetical cuts, and the text and diagrams of the document adjust themselves to match. It's a neat idea, and one that's kind of a natural match for CQ: wordy, but still wonky.
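Stripped of the sliders and diagrams, the core of a reactive document is a small model plus a function that turns its output into prose. Here is a rough sketch; the rule in `automaticCuts` (automatic cuts simply make up any shortfall against the target) is a deliberate simplification I'm assuming for illustration, not the actual budget mechanics, and the function names are hypothetical.

```javascript
const TARGET = 1200; // billions of dollars: the $1.2 trillion goal

// Assumed, simplified rule: whatever Congress fails to cut is cut automatically.
function automaticCuts(enacted) {
  return Math.max(0, TARGET - enacted);
}

// Re-render the sentence whenever the slider value (enacted cuts) changes.
function render(enacted) {
  const shortfall = automaticCuts(enacted);
  if (shortfall === 0) {
    return "Congress passes $" + enacted +
      " billion in cuts, so no automatic cuts are triggered.";
  }
  return "Congress passes $" + enacted + " billion in cuts, leaving $" +
    shortfall + " billion to be cut automatically.";
}
```

In a browser, a slider's change handler would call `render` and write the result back into the paragraph.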
Like a lot of people, I encountered the idea of reactive documents through Bret Victor's essay Explorable Explanations. Victor is an ex-Apple UI designer who wants to re-think the way people teach math, and reactive documents are one of the tools he wants to use. His explorations of learning design via reactive documents, such as Up and Down the Ladder of Abstraction, are breathtaking. As he writes,
There's nothing new about scenario modeling. The authors of this proposition surely had an Excel spreadsheet which answered the same questions. But a spreadsheet is not an explanation. It is merely a dataset and model; it cannot be read. An explanation requires an author, to interpret the results of the model, and present them to the reader via language and graphics.
The reactive document integrates spreadsheet-like models into authored text. It can be read at multiple levels, depending on the reader's level of interest. The hurried reader can skim it. The casual reader can read it as-is. The curious reader can adjust the author's scenarios. The engaged reader can explore scenarios of his own devising.
Unlike a spreadsheet, the barrier to exploration here is extremely low -- simply click and drag. This invites casual readers to become engaged and start exploring. It transforms readers from passive to active.
Victor's idea is a clever one, and as someone who often describes interactives using the same "layered reading" mechanism, it appeals to my storytelling sense. I also like that it embraces the original purpose of the web--to present hypertext documents--without sacrificing the rich interactions that browser applications have developed. That said, I'm not entirely convinced that reactive documents like this are actually terribly useful or novel.
The main problem with this method of presenting interactive information is that it's actually really burdensome for the playful user. It's easy to read, but if you change anything, you have to basically either read and process the entire paragraph again, or you have to learn to pick out individual changes and their meaning from a jumble of words. Besides, sometimes words are not a very good description of an effect or process--imagine describing complex machinery only in paragraph form.
Victor also has some examples that avoid this flaw by making the reactive document incorporate diagrams and graphs alongside his formulas. These are great, but they also illustrate the fact that, once you make reactive "documents" more visual and take away the intertextual trickery, they're really just regular interactives. They're stunningly designed, and I'm always in favor of more multimedia, but there's nothing new about them.
This probably comes off as a little more adversarial to the concept of reactive documents than I actually am, most of which is just my rhetorical background leaking out. I think they're neat, and I would guess that Victor himself thinks of them less as a complete solution and more as a different shade in his teaching palette. In some places, they're helpful, in others not so much.
As an Excel enthusiast, though, I do take exception to Victor's description of spreadsheets as something that "cannot be read," with a high barrier to entry. People read and create spreadsheets all the time, although (to my frustration) they often use them as layout tools. But a spreadsheet that's already set up for someone and locked up to prevent mistakes is barely any more difficult to use than his draggable text--the only real difference is the need to type a number. Regular people may find spreadsheet formulas difficult to connect with cells, but those same people are unlikely to be creating Victor's reactive documents either.
Ultimately, I'm wary of claims that any tool is a silver bullet for education or explainer journalism. It's easy to be blinded by slick UX, and to forget that we're basically just re-inventing storytelling tools used by great teachers for centuries. That shouldn't eliminate interactive games and illustrations from our kit. But reading Victor's site, it's easy to give the technology credit for its thought-provoking qualities, when the credit really goes to his lucid, considered reasoning and clear writing (both of which mean that the technology is well-applied). Sadly, there's no script for that.
Recently my team worked on an interactive for a CQ Weekly Outlook on contracts. Government contracting is, of course, a big deal in these economic times, and the government spent $538 billion on contractors in FY2010. We wanted to show people where the money went.
I don't think this is one of our best interactives, to be honest. But it did raise some interesting challenges for us, simply because the data set was so huge: the basic table of all government contracts for a single fiscal year from USA Spending is around 3.5 million rows, or about 2.5GB of CSV. That's a lot of data for the basic version: the complete set (which includes classification details for each contract, such as whether it goes to minority-owned companies) is far larger. When the input files are that big, forget querying them: just getting them into the database becomes a production.
My first attempt was to write a quick PHP script that looped through the file and loaded it into the table. This ended up taking literally ten or more hours for each file--we'd never get it done in time. So I went back to the drawing board and tried using PostgreSQL's COPY command. COPY is very fast, but the destination has to match the source exactly--you can't skip columns--which is a pain, especially when the table in question has so many columns.
To avoid hand-typing 40-plus columns for the table definition, I used a combination of some command line tools, head and sed mostly, to dump the header line of the CSV into a text file, and then added enough language for a working CREATE TABLE command, everything typed as text. With a staging table in place, COPY loaded millions of rows in just a few minutes, and then I converted a few necessary columns to more appropriate formats, such as the dollar amounts and the dates. We did a second pass to clean up the data a little (correcting misspelled or inconsistent company names, for example).
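I did this with head and sed, but the same trick is easy to sketch in JavaScript. The `createTableFromHeader` function below is hypothetical, as are the column names in the usage example, and it assumes a simple header line with no commas inside quoted column names.

```javascript
// Turn a CSV header line into a CREATE TABLE statement,
// typing every column as text for a staging table.
function createTableFromHeader(tableName, headerLine) {
  const columns = headerLine
    .trim()
    .split(",")
    .map(function (name) {
      // Drop surrounding quotes, then normalize to a safe identifier.
      return name.trim().replace(/^"|"$/g, "")
        .toLowerCase()
        .replace(/[^a-z0-9_]/g, "_");
    });
  const definitions = columns.map(function (name) {
    return '  "' + name + '" text';
  });
  return "CREATE TABLE " + tableName + " (\n" + definitions.join(",\n") + "\n);";
}
```

Calling `createTableFromHeader("staging", 'Vendor Name,"Dollars Obligated",state')` produces a three-column definition. Once the COPY has run, the handful of columns that matter can be converted to numeric or date types.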
Once we had the database in place, and added some indexes so that it wouldn't spin its wheels forever, we could start to pull some useful data, like the state-by-state totals for a basic map. It's not surprising that the beltway bandits in DC, Maryland, and Virginia pull an incredible portion of contracting money--I had to clamp the maximum values on the map to keep DC's roughly $42,000 contract dollars per resident from blowing out the rest of the country--but there are some other interesting high-total states, such as New Mexico and Connecticut.
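The clamping itself is a one-liner. In this sketch the cap of 5,000 dollars per resident is an arbitrary value picked for illustration, and `shade` is a hypothetical helper that returns a 0-to-1 intensity for the map color:

```javascript
// Restrict a value to the range [min, max].
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Map per-capita dollars to a 0..1 color intensity, saturating at the cap
// so that a single outlier can't flatten every other state to near zero.
function shade(perCapita, cap) {
  return clamp(perCapita, 0, cap) / cap;
}
```

With a cap of 5,000, DC's roughly 42,000 per resident renders at full intensity (1), while a state at 2,500 still shows up at a visible 0.5.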
Now we wanted to see where the money went inside each state: what were the top five companies, funding agencies, and product codes? My initial attempts, using a series of subqueries and count() functions, were tying up the server with nothing to show for it, so I tossed the problem over to another team member and went back to working on the map, thinking I wanted to have something to show for our work. He came back with a great solution--PostgreSQL's window functions, using a PARTITION BY clause to split the rows into groups, combined with the rank() function for filtering--and we were able to find the top categories easily. A variation on that template gave us per-agency totals and top fives.
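The query itself isn't reproduced here, but the underlying idea (group the rows, rank within each group, keep the top few) can be sketched in memory with JavaScript. This is an analogue, not the SQL we ran: `topPerGroup` and the sample rows are hypothetical, and unlike rank() it doesn't give tied rows the same position.

```javascript
// Group rows by one key, sort each group by a value (descending),
// and keep the first n rows of each group.
function topPerGroup(rows, groupKey, valueKey, n) {
  const groups = {};
  for (const row of rows) {
    const key = row[groupKey];
    (groups[key] = groups[key] || []).push(row);
  }
  const result = {};
  for (const key of Object.keys(groups)) {
    result[key] = groups[key]
      .slice() // copy, so the caller's arrays aren't reordered
      .sort(function (a, b) { return b[valueKey] - a[valueKey]; })
      .slice(0, n);
  }
  return result;
}
```

Given rows like `{ state: "NM", company: "Company A", dollars: 30 }`, calling `topPerGroup(rows, "state", "dollars", 5)` returns the five biggest rows for each state. At 3.5 million rows this belongs in the database, of course; the sketch is only meant to show the shape of the operation.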
There are a couple of interesting lessons to be learned from this experience, the most obvious of which is the challenge of journalism at scale. Certain stories, particularly on huge subjects like the federal budget, are too big to be feasibly investigated without engaging in computer-assisted reporting, and yet they require skills beyond the usual spreadsheet-juggling.
I don't think that's going away. In fact, I think scale may be the defining quality of the modern information age. A computer is just a machine for performing simple operations at incredibly high speeds, to the point where they seem truly miraculous--changing thousands (or millions) of pixels each second in response to input, for example. The Internet expands that scale further, to millions of people and computers interacting with each other. Likewise, our reach has grown with our grasp. It seems obvious to me that our governance and commerce have become far more complex as a result of our ability to track and interact with huge quantities of data, from contracting to high-speed trading to patent abuse. Journalists who want to cover these topics are going to need to be able to explore them at scale, or be reliant on others who can do so.
Which brings us to the second takeaway from this project: in computer-assisted journalism, speed matters. If hours are required to return a query, asking questions becomes too expensive to waste on undirected investigation, and fact-checking becomes similarly burdensome. Getting answers needs to be quick, so that you can easily continue your train of thought: "Who are the top foreign contractors? One of them is the Canadian government? What are we buying from them? Oh, airplane parts--interesting. I wonder why that is?"
None of this is a substitute for domain knowledge, of course. I am lucky to work with a great graphics reporter and an incredibly knowledgeable editor, a combination that often saves me from embarrassing myself by "discovering" stories in the data that are better explained by external factors. It is very easy to see an anomaly, such as the high level of funding in New Mexico from the Department of Energy, and begin to speculate wildly, while someone with a little more knowledge would immediately know why it's so (in this case, the DoE controls funding for nuclear weapons, including the Los Alamos research lab in New Mexico).
Performing journalism with large datasets is therefore a three-fold problem. First, it's difficult to prepare and process. Second, it's tough to investigate without being overwhelmed. And finally, the sheer size of the data makes false patterns easier to find, requiring extra care and vigilance. I complain a lot about the general state of data journalism education, but this kind of exercise shows why it's a legitimately challenging mix of journalism and raw technical hackery. If I'm having trouble getting good results from sources with this kind of scale, and I'm a little obsessed with it, what's the chance that the average, fresh-out-of-J-school graduate will be effective in a world of big, messy data?
If I have a self-criticism of the work I'm doing at CQ, it's that I mostly make flat tools for data-excavation. We rarely set out with a narrative that we want to tell--instead, we present people with a window into a dataset and give them the opportunity to uncover their own conclusions. This is partly due to CQ's newsroom culture: I like to think we frown a bit on sensationalism here. But it is also because, to a certain extent, my team is building the kinds of interactives we would want to use. We are data-as-playground people, less data-as-theme-park.
It's also easier to create general purpose tools than it is to create a carefully-curated narrative. But that sounds less flattering.
In any case, our newest project does not buck this trend, but I think it's pretty fascinating anyway. "Against the Grain" is a browseable database of dissent on party unity votes in the House and Senate (party unity votes are defined by CQ as those votes where a majority of Republicans and a majority of Democrats took opposing sides on a bill). Go ahead, take a look at it, and then I'd like to talk about the two sides of something like this: the editorial and the technical.
Even when you're building a relatively straightforward data-exploration application like this one, there's still an editorial process in play. It comes through in the flow of interaction, in the filters that are made available to the user, and the items given particular emphasis by the visual design.
Inescapably, there are parallels here to the concept of "objective" journalism. People are tempted to think of data as "objective," and I guess at its most pure level it might be, but from a practical standpoint we don't ever deal with absolutely raw data. Raw data isn't useful--it has to be aggregated to have value (and boy, if there's a more perilous-but-true phrase in journalism these days than "aggregation has value," I haven't heard it). Once you start making decisions about how to combine, organize, and display your set, you've inevitably committed to an editorial viewpoint on what you want that data to mean. That's not a bad thing, but it has to be acknowledged.
Regardless, from an editorial perspective, we had a pretty specific goal with "Against the Grain." It began as an offshoot of a common print graphic using our votestudy data, but we wanted to be able to take advantage of the web's unlimited column inches. What quickly emerged as our showcase feature--what made people say "ooooh" when we talked it up in the newsroom--was to organize a given member's dissenting votes by subject code. What are the policy areas on which Member X most often breaks from the party line? Is it regulation, energy, or financial services? How are those different between parties, or between chambers? With an interactive presentation, we could even let people drill down from there into individual bills--and jump from there back out to other subject codes or specific members.
To present this process, I went with a panel-oriented navigation method, modeled on mobile interaction patterns (although, unfortunately, it still doesn't work on mobile--if anyone can tell me why the panels stack instead of floating next to each other on both Webkit and Mobile Firefox, I'd love to know). By presenting users with a series of rich menu options, while keeping the previous filters onscreen if there's space, I tried to strike a balance between query-building and giving room for exploration. Users can either start from the top and work down, by viewing the top members and exploring their dissent; from the bottom up, by viewing the most contentious votes and seeing who split from the party; or somewhere in the middle, by filtering the two main views through a vote's subject code.
We succeeded, I think, in giving people the ability to look at patterns of dissent at a member and subject level, but there's more that could be done. Congressional voting is CQ's raison d'etre, and we store a mind-boggling amount of legislative information that could be exploited. I'd like to add arbitrary member lookup, so people could find their own senator or representative. And I think it might be interesting to slice dissent by vote type--to see if there's a stage in the legislative process where discipline is particularly low or high.
So sure, now that we've got this foundation, there are lots of stories we'd like it to handle, and certain views that seem clunkier than necessary. It's certainly got its flaws and its oddities. But on the other hand, this is a way of browsing through CQ's vote database that people outside of CQ (and most of the people inside) have never had before. Whatever its limitations, it enables people to answer questions they couldn't have asked prior to its creation. That makes me happy, because I think a certain portion of my job is simply to push the organization forward in terms of what we consider possible.
So with that out of the way, how did I do it?
"Against the Grain" is probably the biggest JavaScript application I've written to date. It's certainly the best-written--our live election night interactive might have been bigger, but it was a mess of display code and XML parsing. With this project, I wanted to stop writing JavaScript as if it was the poor man's ActionScript (even if it is), and really engage on its own peculiar terms: closures, prototypal inheritance, and all.
I also wanted to write an application that would be maintainable and extensible, so at first I gave Backbone.js a shot. Backbone is a Model-View-Controller library of the type that's been all the rage with the startup hipster crowd, particularly those who use obstinately-MVC frameworks like Ruby on Rails. I've always thought that MVC--like most design patterns--feels like a desperate attempt to convert common sense into jargon, but the basic goal of it seemed admirable: to separate display code from internal logic, so that your code remains clean and abstracted from its own presentation.
Long story short, Backbone seems designed to be completely incomprehensible to someone who hasn't been writing formal MVC applications before. The documentation is terrible, there's no error reporting to speak of, and the sample application is next to useless. I tried to figure it out for a couple of hours, then ended up coding my own display/data layer. But it gave me a conceptual model to aim for, and I did use Backbone's underlying collections library, Underscore.js, to handle some of the filtering and sorting duties, so it wasn't a total loss.
One feature I appreciated in Backbone was the templating it inherits from Underscore (and which they got in turn from jQuery's John Resig). It takes advantage of the fact that browsers will ignore the contents of <script> tags with a type set to something other than "text/javascript"--if you set it to, say, "text/html" or "template," you can put arbitrary HTML in there. I created a version with Mustache-style support for replacing tags from an optional hash, and it made populating my panels a lot easier. Instead of manually searching for <span> IDs and replacing them in a JavaScript soup, I could simply pass my data objects to the template and have panels populated automatically. Most of the vote detail display is done this way.
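My own version isn't shown here, but a minimal sketch of Mustache-style tag replacement looks something like the following. The `renderTemplate` name, the element ID in the comment, and the sample data are assumptions for illustration, and unlike a real templating library this does no HTML escaping.

```javascript
// Replace {{key}} tags in a template string with values from a data hash.
// Tags with no matching key are left untouched.
function renderTemplate(template, data) {
  data = data || {};
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, function (match, key) {
    return Object.prototype.hasOwnProperty.call(data, key)
      ? String(data[key])
      : match;
  });
}

// In a browser, the template text would come from an inert script tag, e.g.
//   <script type="text/html" id="vote-template">...</script>
//   var template = document.getElementById("vote-template").innerHTML;
```

Calling `renderTemplate("<h2>{{title}}</h2>", { title: "H.R. 1" })` fills in the heading, and the resulting string can be dropped straight into a panel.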
I also wanted to implement some kind of inheritance to simplify my code. After all, each panel in the interactive shares a lot of functionality: they're basically all lists, most of them have a cascading "close" button, and they trigger new panels of information based on interaction. Panels are managed by a (wait for it...) PanelManager singleton that handles adding, removing, and positioning them within the viewport. The panels themselves take care of instantiating and populating their descendants, but in future versions I'd like to move that into the PanelManager as well and trigger it using custom events.
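To make that structure a little more concrete, here is a stripped-down sketch of the pattern with no DOM code at all. It is an illustration under my own assumptions rather than the application's source: `Panel`, `ListPanel`, and the method names are hypothetical, and this `PanelManager` only tracks which panels are open instead of positioning them.

```javascript
// Base constructor: behavior shared by every panel.
function Panel(title) {
  this.title = title;
  this.children = [];
}

// Open a child panel and register it with the manager.
Panel.prototype.open = function (child) {
  this.children.push(child);
  PanelManager.add(child);
  return child;
};

// Cascading close: shut any descendants first, then this panel.
Panel.prototype.close = function () {
  this.children.forEach(function (child) { child.close(); });
  this.children = [];
  PanelManager.remove(this);
};

// A list panel inherits from Panel through the prototype chain.
function ListPanel(title, items) {
  Panel.call(this, title); // run the parent constructor on this instance
  this.items = items;
}
ListPanel.prototype = Object.create(Panel.prototype);
ListPanel.prototype.constructor = ListPanel;

// Singleton: a single object literal that tracks the open panels.
const PanelManager = {
  panels: [],
  add: function (panel) { this.panels.push(panel); },
  remove: function (panel) {
    this.panels = this.panels.filter(function (p) { return p !== panel; });
  }
};
```

Closing a panel near the root takes everything it spawned with it, which is the cascading behavior described above.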
Unfortunately, out-of-the-box JavaScript inheritance is deeply weird, and it's tangled up in the biggest flaw of the language: terrible variable scoping. I never realized how important scope is until I saw how many frustrations JavaScript's bad implementation creates (no real namespaces! overuse of the "this" keyword! closures over loop values! ARGH IT BURNS).
Scope in JavaScript is eerily like Inception: at every turn, the language drops into a leaky subcontext, except that instead of slow-motion vans and antigravity hotels and Leonardo DiCaprio's dead wife, every level change is a new function scope. With each closure, the meaning of the "this" keyword changes to something different (often to something ridiculous like the Window object), a tendency worsened in a functional library like Underscore. In ActionScript, the use of well-defined Event objects and real namespaces meant I'd never had trouble untangling scope from itself, but in JavaScript it was a major source of bugs. In the end I found it helpful, in any function that uses "this" (read: practically everything you'll write in JavaScript), to immediately cache it in another variable and then only use that variable if possible, so that even inside callbacks and anonymous functions I could still reliably refer to the parent scope.
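A tiny example of that caching habit, using a hypothetical `VoteList` constructor:

```javascript
function VoteList(votes) {
  this.votes = votes;
  this.prefix = "Vote ";
}

VoteList.prototype.labels = function () {
  var self = this; // cache the instance before entering the callback
  return this.votes.map(function (vote) {
    // Inside this callback, "this" is no longer the VoteList instance
    // (it is undefined or the global object, depending on strict mode),
    // so refer to the cached variable instead.
    return self.prefix + vote;
  });
};
```

Because `self` is an ordinary variable captured by the closure, it keeps pointing at the instance no matter how the callback is eventually invoked.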
After this experience, I still like JavaScript, but some of the shine has worn off. The language has some incredibly powerful features, particularly its first-class functions, that the community uses to paper over the huge gaps in its design. Like Lisp, it's a small language that everyone can extend--and like Lisp, the downside is that everyone has to do so in order to get anything done. The result is a million non-standard libraries re-implementing basic necessities like classes and dependencies, and no sign that we'll ever get those gaps filled in the language itself. Like it or not, we're largely stuck with JavaScript, and I can't quite be thrilled about that.
This has been a long post, so I'll try to wrap up quickly. I learned a lot creating "Against the Grain," not all of it technical. I'm intrigued by the way these kinds of interactives fit into our wider concept of journalism: by operating less as story presentations and more as tools, do they represent an abandonment of narrative, of expertise, or even a kind of "sponsored" citizen journalism? Is their appearance of transparency and neutrality dangerous or even deceptive? And is that really any less true of traditional journalism, which has seen its fair share of abused "objectivity" over the years?
I don't know the answers to those questions. We're still figuring them out as an industry. I do believe that an important part of data journalism in the future is transparency of methodology, possibly incorporating open source. After all, this style of interactive is (obviously, given the verbosity on display above) increasingly complex and difficult for laymen to understand. Some way for the public to check our math is important, and open source may offer that. At the same time, the role of the journalist is to understand the dataset, including its limitations and possible misuses, and there is no technological fix for that. Yet.
Here are a few challenges I've started tossing out to prospective new hires, all of which are based on common, real-world multimedia tasks:
I learned this the hard way over the last four years. When I started working with ActionScript in 2007, it was the first serious programming I'd done since college, not counting some playful Excel macros. Consequently I had a lot of bad habits: I left a lot of variables in the global scope, stored data in ad-hoc parallel arrays, and embedded a lot of "magic number" constants in my code. Some of those are easy to correct, but the shift in thinking from "write a program that does X" to "design data structure Y, then write a program to operate on it" is surprisingly profound. And yet it makes a huge difference: when we created the Economic Indicators project, the most problematic areas in our code were the ones where the underlying data structures were badly-designed (or at least, in the case of the housing statistics, organized in a completely different fashion from the other tables).
Oddly enough, I think what caused the biggest change in my thinking was learning to use jQuery. Much like other query languages, the result of almost any jQuery API call is a collection of zero or more objects. You can iterate over these as if they were arrays, but the language provides a lot of functional constructs (each(), map(), filter(), etc.) that encourage users to think more in terms of generic operations over units of data (the fact that those units are expressed in JavaScript's lovely hashmap-like dynamic objects is just a bonus).
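The same style works on plain arrays, which is where the habit pays off. In this sketch the records and the `namesBelow` helper are made up for illustration, and the built-in array methods stand in for their jQuery counterparts:

```javascript
// One object per member, instead of parallel arrays of names and scores.
const members = [
  { name: "Member A", party: "D", unity: 97 },
  { name: "Member B", party: "R", unity: 71 },
  { name: "Member C", party: "R", unity: 93 }
];

// Generic operations over units of data: filter the records, then map
// each surviving record to the one field we want to display.
function namesBelow(list, threshold) {
  return list
    .filter(function (m) { return m.unity < threshold; })
    .map(function (m) { return m.name; });
}
```

There are no index variables to keep in sync, and adding a field to each record doesn't disturb any existing code.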
I suspect that data-orientation makes for better programmers in any field (and I'm not alone), but I'm particularly interested in it on my team because what we do is essentially to turn large chunks of data (governmental or otherwise) into stories. From a broad philosophical perspective, I want my team thinking about what can be extracted and explained via data, and not how to optimize their loops. Data first, code second--and if concentrating on the former improves the latter, so much the better.
I have argued vociferously in the recent past that the journalistic craze for native clients--an enthusiasm seemingly rekindled by Rupert Murdoch's ridiculous Daily iPad publication--is a bad idea from a technical standpoint. They're clumsy, require a lot of platform-specific work, and they're not exactly burning up the newsstands. It continues to amaze me that, despite the ubiquity of Webkit as a capable cross-platform hypertext runtime, people are still excited about recreating the Multimedia CD-ROM.
But beyond the technical barriers, publishing your news in a walled-garden application market raises some serious questions of professional journalistic ethics. Curation (read: a mandatory, arbitrary approval process) exacerbates the dilemma, but even relatively open app stores are, in my opinion, on shaky ground. These problems emerge along three axes: accountability, editorial independence, and (perhaps most importantly) the ideology of good journalism.
Accountability
One of the hallmarks of the modern web is intercommunication based on a set of simple, high-level protocols. From a system of URLs and HTTP, a whole Internet culture of blog commentary, trackbacks, Rickrolls, mashups, and embedded video emerged. Most recently, Twitter created a new version of the linkblog (and added a layer of indirection via link shortening). For a journalist, this should be exciting: it's a rich soup of comments and community swarming around your work. More importantly, it's a constant source of accountability. What, you thought corrections went away when we went online?
But that whole ecosystem of viral sharing and review gets disconnected when you lock your content into a native client. At least on Android, you can send content to other applications via the powerful Intent mechanism (the iOS situation is much less well-constructed, and I have no idea how Windows Mobile now handles this), but even that has unpredictable results--what are you sharing, after all? A URL to the web version? The article text? Can the user choose? And when it comes to submitting corrections or feedback, native apps default to difficult: of the five major news clients I tried on Android this morning (NPR, CBS, Fox, New York Times, and USA Today), not one of them had an in-app way to submit a correction. Regret the error, indeed.
Editorial Independence
Accountability is an important part of professional ethics in journalism. But so is editorial independence, and in both cases the perception of misbehavior can be even more damaging than any actual foul play. The issue as I see it is: how independent can you be, if your software must be approved during each update by a single, fickle gatekeeper?
As Dan Gillmor points out, selling journalism through an app store is a partnership, and that raises serious questions of independence. Are news organizations less likely to be critical of Google, Apple, and Microsoft when their access to the platform could be pulled at any time from the virtual shelves? Do the content-restrictions on both mobile app stores change the stories that they're likely to publish? Will app stores stand behind journalists operating under governments with low press freedom, or will they buckle to a "terms of service" attack? On the web, a paper or media outlet can largely write whatever they want. Physical distribution is so diverse, a single retail entity can't really shut you down. But in an app store, you publish at the pleasure of the platform owner--terms subject to revision. That kind of scenario should give journalists pause.
Ideology and Solidarity
Organizing the news industry is like herding cats: it's a cutthroat business traditionally fueled by intra-city competition, and it naturally attracts argumentative, over-critical personality types. But it's time that newsrooms start to stick up for the basic ideology of journalism. That means that when the owners of an app store start censoring applications based on content, as happened to political cartoonist Mark Fiore or the Eucalyptus e-book reader, we need to make it clear that we consider that behavior unacceptable--pulling apps, refusing to partner for big launch events, and pursuing alternative publication channels.
There's a reason that freedom of the press is included next to speech, religion, and assembly in the Bill of Rights' first amendment. It's an important part of the feedback loop between people, events, and government in a democracy. And journalists have traditionally been pretty hardcore about freedom of the press: see, for example, the lawsuit over the publication of the Pentagon Papers, as well as the entirety of Reporters Without Borders. If the App Store were a country, its ranking for press freedom would be middling at best, and newspapers wouldn't be nearly as eager to jump into bed with it. The fact that these curated markets retain widespread publication support, despite their history of censorship and instability, is a shame for the industry as a whole.
Act, Don't React
Journalists have a responsibility to react against censorship when they see it, but we should also consider going on the offensive. While I don't actually think native news clients make sense when compared to a good mobile web experience, it is still possible to minimize or eliminate some of the ethical concerns they raise, through careful design and developer lobbying.
While it's unlikely that a native application could easily offer the same kind of open engagement as a website, designers can at least address accountability. News clients should offer a way to either leave comments or send corrections to the editors entirely within the application. A side effect of this would be cross-industry innovation in computerized correction tracking and display, something that few publications are really taking advantage of right now.
Simultaneously, journalists should be using their access to tech companies (who love to use newspapers and networks as keynote demos) to push for better policies. This includes more open, uncensored app stores, but it also means pushing for tools that make web apps first-class citizens in an app-centric world, such as:
We have so many interesting debates surrounding the business of American journalism--paywalls, ad revenue, user-generated content--can't we just call this one off? The HTML document, originally designed to publish academic papers, may be a frustrating technology for rich UIs, but it's perfectly suited for the task of presenting the news. It's as close as you can get to write-once-run-anywhere, making it the cheapest and most efficient option for mobile development. And it's ethically sound! Isn't it time we stood up for ourselves, and as an industry backed a platform that doesn't leave us feeling like we've sold out our principles for short-term gains? Come on, folks: let's leave that to the op-ed writers.
About a month back, a prominent inside-the-Beltway political magazine ran a story on Tea Party candidates and earmarks, claiming that anti-earmark candidates were responsible for $1 billion in earmarks over 2010. I had just finished building a comprehensive earmark package based on OMB data, so naturally my editor sent me a link to the story and asked me to double-check their math. At first glance, the numbers generally matched--but on a second examination, the article's total double- and triple-counted earmarks co-sponsored by members of the Tea Party "caucus." Adjusting my query to remove non-distinct earmark IDs knocked about $100 million off the total--not really that much in the big picture (the sum still ran more than $900 million), but enough to fall below the headline-ready "more than $1 billion" mark. It was also enough to make it clear that the authors hadn't really understood what they were writing about.
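To illustrate how this kind of double-counting arises, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, member names, and dollar amounts are all invented for illustration; they are not the OMB data or the original query. The assumption is a table with one row per earmark per sponsor, so a co-sponsored earmark appears once for each member who signed onto it.

```python
import sqlite3

# Hypothetical schema: one row per (earmark, sponsoring member).
# A co-sponsored earmark therefore appears several times with the same amount.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE earmark_sponsors (earmark_id INTEGER, member TEXT, amount INTEGER);
INSERT INTO earmark_sponsors VALUES
  (1, 'Member A', 500),
  (1, 'Member B', 500),   -- same earmark, second sponsor
  (2, 'Member A', 300),
  (3, 'Member C', 200),
  (3, 'Member A', 200),   -- same earmark, second sponsor
  (3, 'Member B', 200);   -- same earmark, third sponsor
""")

# Naive total: sums every sponsor row, so shared earmarks count two or three times.
naive_total = conn.execute(
    "SELECT SUM(amount) FROM earmark_sponsors"
).fetchone()[0]

# Corrected total: collapse to one row per earmark ID before summing.
distinct_total = conn.execute("""
    SELECT SUM(amount) FROM (
        SELECT DISTINCT earmark_id, amount FROM earmark_sponsors
    ) AS unique_earmarks
""").fetchone()[0]

print(naive_total, distinct_total)
```

With this toy data the naive sum is nearly double the de-duplicated one. The gap in the real story was much smaller, but the mechanism is the same: the join multiplies rows, and the sum quietly goes along with it.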
In general, I am in favor of journalists learning how to leverage databases for better analysis, but it's an easy technology to misuse, accidentally--or even on purpose. There's a truism that the skills required to interpret statistics go hand in hand with the skills used to misrepresent them, and nowhere is that more pertinent than in the newsroom. Reporters and editors entering the world of data journalism need to hold onto the same critical skills they would use for any other source, not be blinded by the ease with which they can reach a catchy figure.
That said, journalists would do well to learn about these tools, especially in beats like economics and politics, if only to be able to spot their abuses. And there are three strong arguments for using databases (carefully!) for reporting: improving newsroom mathematical literacy, asking questions at modern scale, and making connections easier.
First, it's no secret that journalists and math are often uneasy bedfellows--a recent Washington Post ombudsman piece explored some of the reasons why numerical corrections are so common. In short: we're an industry of English majors whose eyes cross when confronted with simple sums, and so we tend to take numbers at face value even during the regular copy-editing process.
These anxieties are signs of a deeper problem that needs to be addressed, and there's nothing magical about SQL that will fix them overnight. But I think database training serves two purposes. First, it acclimatizes users to dealing with large sets of numbers, like treating nosocomephobia with a nice long hospital stay. Second, it reveals the dirty secret of programming, which is that it involves a lot of math process, but relatively little actual adding or subtracting, especially in query languages. Databases are a good way to get comfortable with numbers without having to actually touch them directly.
Ultimately, journalists need to be comfortable with numbers, because they're becoming an institutional hazard. While the state of government (and private-sector) data may still leave a lot to be desired from a programmer's point of view, it's practically flooded out over the last few years, with machine-readable formats becoming more common. This mass of data is increasingly unmanageable via spreadsheet: there are too many rows, too many edge cases, and too much filtering required. Doing it by hand is a pipe-dream. A database, on the other hand, is designed to handle queries across hundreds of thousands of rows or more. Languages like SQL let us start asking questions at the necessary scale.
Finally, once we've gotten over a fear of numbers and begun to take large data sets for granted, we can start using relational databases to make connections between data sets. This synthesis is a common visualization task that is difficult to do by hand--mapping health spending against immigration patterns, for example--but it's reasonably simple to do with a query in a relational database. The results of these kinds of investigations may not even be publishable, but they are useful--searching for correlation is a great jumping-off point for further reporting. One of the best things I've done for my team lately is set up a spare box running PostgreSQL, which we use for uploading, combining, searching, and then outputting translated versions of data, even in static form.
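As a rough sketch of what "making connections" looks like in practice, the example below joins two small tables on a shared key. It uses Python's sqlite3 module rather than PostgreSQL so that it runs anywhere, and the table names, state codes, and figures are all made up for illustration.

```python
import sqlite3

# Two hypothetical data sets that share a "state" column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE health_spending (state TEXT PRIMARY KEY, per_capita_dollars INTEGER);
CREATE TABLE immigration (state TEXT PRIMARY KEY, new_residents INTEGER);
INSERT INTO health_spending VALUES ('AA', 7000), ('BB', 6500), ('CC', 8000);
INSERT INTO immigration VALUES ('AA', 12000), ('BB', 3000), ('DD', 900);
""")

# An inner join keeps only the states present in both data sets,
# lining the two measures up side by side for comparison.
rows = conn.execute("""
    SELECT h.state, h.per_capita_dollars, i.new_residents
    FROM health_spending AS h
    JOIN immigration AS i ON i.state = h.state
    ORDER BY h.state
""").fetchall()

for state, spending, arrivals in rows:
    print(state, spending, arrivals)
```

Note that the states appearing in only one table ('CC' and 'DD' here) silently drop out of an inner join. That is exactly the kind of edge case a reporter needs to check before drawing conclusions from the combined result.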
As always when I write these kinds of posts, remember that there is no Product X for saving journalism. Adding a database does not make your newsroom Web 2.0, and (see the example I opened with) it's not a magic bullet for better journalism. But new technology does bring opportunities for our industry, if we can avoid the Product X hype. The web doesn't save newspapers, but it can (and should) make sourcing better. Mobile apps can't save subscription revenues, but they offer better ways to think about presentation. And databases can't replace an informed, experienced editor, but they can give those journalists better tools to interrogate the world.
Once again, I present CQ's annual vote studies in handy visualization form, now updated with the figures for 2010. This version includes some interesting changes from last year:
The vote studies are one of those quintessentially CQ products: reliable, wonky, and relentlessly non-partisan. We're still probably not doing justice to it with this visualization, but we'll keep building out until we get there. Take a look, and let me know what you think.
If you're interested in working in data-driven journalism, or you know someone who is, my team at CQ is hiring. You can check out the listing at Ars. For additional context, this opening is for the server-side/database role on the team--someone who can set up a database for a reporting project, mine it for relevant data, and then present that information to either the newsroom or the public as a modern, standards-compliant web page.
To be honest, we're having a really difficult time filling this position. It's an odd duck: we need someone who's comfortable with computer science-y stuff like data structures and SQL, but also someone who can apply those skills towards journalism, which has its own distinct character traits: news sense, storytelling, and a peculiar tendency to pull at intellectual loose ends. A tough combination to begin with, even without taking into account the fact that anyone with both aptitudes can probably make a lot more money with the former than with the latter. So let's add a third requirement: they've got to be a true believer about what we do here.
As far as I can tell, the most reliable way to get someone with these three traits is to start with a journalist, then teach them how to code. In theory, that should be exactly what happens in a journalism school's "new media" or "interactive" program. And yet my experience with graduates of these MA programs is that they're woefully unprepared for the job my team is trying to do.
I should note here, I think, that I never attended J-school myself. GMU didn't have a journalism program, and I ended up in a different specialization in the communication department anyway. So it's possible that I'm a little bitter, given that I had to work my way into the news business via extensive freelancing, entry-level web production, and a lot of bloody-minded persistence. But I think my gripes are reasonable, and they're shared with coworkers from more traditional journalistic backgrounds.
Here's the crux of the problem, as I see it: programs in new media journalism are still teaching the Internet in the context of traditional print or television news, which stalls their graduates in two ways. First, it means the programs approach online media as outsiders, teaching classes in "blogging for journalists" or "media website design" as if they were alien artifacts to be unpuzzled instead of the native publishing platform for a whole generation now. It's the web, people: it's not going anywhere, and it's not something you should have to spend a semester introducing to your students. A whole class on blogging isn't education--it's coddling.
Second, these schools seem to be too focused on specific technologies or platforms instead of teaching rudimentary, generalizable computer engineering. There are classes on Flash, or on basic HTML, or using a given blog platform--and those are all good skills to have, but they're not sufficient. What we really need are people who know the general principles behind those skills: how do you structure data effectively for the story? How do you debug something? What's object-oriented design? Technology moves so fast in this business, someone without those fundamentals won't be able to keep up with the pace of change we need to maintain.
Maybe I'm just hardcore, but when I look at something like the Medill Graduate Curriculum (just to pick on someone at random), the interactive track looks lightweight to me. There's a lot of emphasis on industry inside baseball ("How 21st Century Media Works" or "Building Networked Audiences"), and not nearly enough on getting your hands dirty. "Digital Frameworks for Reporting" is only taught in DC? (Are government websites not available in Chicago?) "Database Reporting" is an optional elective? Not a single class taken from the graduate or undergraduate computer science curriculum, like "Fundamentals of Computer Programming I?" It looks to me like a program where you could emerge as a valuable data journalist, but it's just as likely that you'd be another Innovation Editor. And trust me, the world does not need any more of those.
I sympathize with the people who have to design these programs, I really do. The web is a big topic to cover. And worse, it's hard to teach people how to think critically--to understand how they think, instead of just telling them what to think--but good programming has a lot in common with that level of metacognition. For the kind of data journalism we're trying to do at CQ, you've got to at least be able to think a little like a programmer, a little like a journalist, and a little like something new. If you think you can do that, we'd love to hear from you.
I don't know how long this'll be available to the general public, so take a look while you can: CQ Economy Tracker (formerly the Economic Indicators project) is now live. It's the product of more than a year of off-and-on development, and I'm thrilled to finally have it out in the wild.
Economy Tracker collects six big economic data sets (GDP, inflation, employment and labor, personal income and savings, home sales and pricing, and foreclosure rates) across the national, regional, and state levels, extended back as far as we could get data--sometimes almost a hundred years. The data is graphed, mapped, available in a sortable table, and also made available as Excel spreadsheets. As far as we're aware, we're the only organization that's collecting all of this information and putting it together in one easy-to-read package. It's a great resource for our own reporters when they go looking for vetted economic data, as well as a handy tool for readers.
But more than that, Economy Tracker has been my team's bid for some fundamental ideas about data journalism. The back end is a fairly simple PHP/PostgreSQL database, with the emphasis on A) making it easy for non-technical reporters to update by accepting Excel spreadsheets in a very tolerant way, and B) returning results in the web-standard JSON format for consumption by either Flash or JavaScript. The current dashboard applet is a full-service showcase for the collection, but using a standards-based API, it should be easy for my team to build new visualizations based on our economic data--including smaller, single-purpose widgets or mash-ups with political or demographic data--or for our customers and readers to do so.
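For a sense of what "tolerant" ingestion and JSON output can mean, here is a minimal sketch in Python. It is not the actual PHP back end: the function names, column headers, and values are hypothetical, and it assumes the spreadsheet has already been read into a list of rows, with a header row first and a label (such as a year) in the first column.

```python
import json


def clean_number(cell):
    """Turn cells like '1,234.5', ' $12 ', '-2.5%', or 3 into a float.

    Returns None for blank cells or text that cannot be parsed as a number.
    """
    if cell is None:
        return None
    text = str(cell).strip().replace(",", "").replace("$", "").rstrip("%")
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def rows_to_json(rows):
    """Convert spreadsheet-style rows (header first) into a JSON array of records."""
    header, *body = rows
    # Normalize headers so ' GDP Growth ' and 'gdp growth' map to the same key.
    keys = [str(h).strip().lower().replace(" ", "_") for h in header]
    records = []
    for row in body:
        # Skip rows that are entirely blank, a common spreadsheet artifact.
        if all(c is None or str(c).strip() == "" for c in row):
            continue
        label, *values = row
        record = {keys[0]: str(label).strip()}
        for key, value in zip(keys[1:], values):
            record[key] = clean_number(value)
        records.append(record)
    return json.dumps(records)


# Invented sample input with the kind of mess reporters' spreadsheets contain.
sample = [
    [" Year ", "GDP Growth", "Unemployment Rate"],
    ["2001", "-2.5%", " 9.5 "],
    ["", None, ""],
    ["2002", "1,003.0", "n/a"],
]
print(rows_to_json(sample))
```

The design choice worth noting is that bad cells become null rather than raising an error, so one stray "n/a" doesn't block an update; the trade-off is that someone still has to check the nulls.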
I think the last few years have shown how this strategy--building a news API for both internal and external use--has had real benefits for the newsrooms that have boldly led the way, like NPR and the New York Times. Not only does it engage the segment of the audience that's willing to dig into their data (free publicity!), but it grants newsroom developers a fleetness of foot that's hard to beat. It's a lot easier, for example, for NPR to turn on a dime and toss off a tablet-optimized website, or create a new native mobile client, because their content is already mostly decoupled from presentation and available in a machine-readable format. That's kind of a big deal, especially as we wait to see how this whole mobile Internet thing is going to shake out.
Whether or not this approach takes off, I'm enormously proud of the work that my team has done on this project. It's been a massive undertaking: building our own custom graphing framework, creating an internal event scheme for coordinating the two panels (pick a year on the National pane and it synchronizes with the Regional/State pane, and vice versa), and figuring out how to remain responsive while still displaying up to 40,000 rows of labor statistics (a combination of caching and delayed processing). Most importantly, the Economy Tracker stands as a monument to a partnership between the multimedia team, researchers, and our economics editor, in the best tradition of CQ journalism.
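The two ideas mentioned above, an event scheme that keeps the panes in sync and caching with delayed processing, can be sketched together in a few lines. This is an illustrative Python sketch, not the Economy Tracker code: the class names, the "year_selected" event, and the loader function are all assumptions made for the example.

```python
class EventBus:
    """A tiny publish/subscribe hub shared by every panel."""

    def __init__(self):
        self._handlers = {}

    def subscribe(self, event, handler):
        self._handlers.setdefault(event, []).append(handler)

    def publish(self, event, payload, source=None):
        for handler in self._handlers.get(event, []):
            handler(payload, source)


class Panel:
    """One pane of the dashboard; loads a year's rows only when first shown."""

    def __init__(self, name, bus, load_rows):
        self.name = name
        self.bus = bus
        self.load_rows = load_rows  # hypothetical (and expensive) data loader
        self.year = None
        self.rows = []
        self._cache = {}
        bus.subscribe("year_selected", self._on_year_selected)

    def select_year(self, year):
        """Called when the user picks a year on this panel."""
        self._show(year)
        self.bus.publish("year_selected", year, source=self)

    def _on_year_selected(self, year, source):
        # Ignore our own announcements so the panels don't echo forever.
        if source is not self:
            self._show(year)

    def _show(self, year):
        # Delayed processing: do the work on first display, then reuse it.
        if year not in self._cache:
            self._cache[year] = self.load_rows(year)
        self.year = year
        self.rows = self._cache[year]


load_calls = []


def fake_loader(year):
    """Stand-in for real data processing; records each call it receives."""
    load_calls.append(year)
    return [f"row for {year}"]


bus = EventBus()
national = Panel("National", bus, fake_loader)
regional = Panel("Regional/State", bus, fake_loader)

national.select_year(2008)   # regional follows along
regional.select_year(2009)   # and vice versa
national.select_year(2008)   # served from cache, no new loading
print(national.year, regional.year, load_calls)
```

Each panel pays the loading cost once per year, the first time that year is displayed, which is the same trade the dashboard makes: a slight lag on first view in exchange for a quick startup.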