this space intentionally left blank

April 26, 2011

Filed under: journalism»new_media»data_driven

Structural Adjustment

Here are a few challenges I've started tossing out to prospective new hires, all of which are based on common, real-world multimedia tasks:

  • Pretend you're building a live election graphic. You need to be able to show the new state-by-state rosters, as well as the impact on each committee. Also, you need to be able to show an updated list of current members who have lost their races for reelection. You'll get this data in a series of XML feeds, but you have the ability to dictate their format. How do you want them structured?
  • You have a JSON array of objects detailing state GDP data (nominal, real, and delta) over the last 40 years. Using that data, give me a series of state-by-state lists of years for each state in which they experienced positive GDP growth.
  • The newsroom has produced a spreadsheet of member voting scores. You have a separate XML file of member biographical data--i.e., name, seat, date of birth, party affiliation, etc. How would you transform the spreadsheet into a machine-readable structure that can be matched against the biodata list?
What do these have in common? They're aimed at ferreting out the process by which people deal with datasets, not asking them to demonstrate knowledge of a specific programming language or library. I'm increasingly convinced, as we have tried to hire people to do data journalism at CQ, that the difference between a mediocre coder and a good one is that the good ones start from quality data structures and build their program outward, instead of starting with program flow and tacking data on like decorations on a Christmas tree.

I learned this the hard way over the last four years. When I started working with ActionScript in 2007, it was the first serious programming I'd done since college, not counting some playful Excel macros. Consequently I had a lot of bad habits: I left a lot of variables in the global scope, stored data in ad-hoc parallel arrays, and embedded a lot of "magic number" constants in my code. Some of those are easy to correct, but the shift in thinking from "write a program that does X" to "design data structure Y, then write a program to operate on it" is surprisingly profound. And yet it makes a huge difference: when we created the Economic Indicators project, the most problematic areas in our code were the ones where the underlying data structures were badly-designed (or at least, in the case of the housing statistics, organized in a completely different fashion from the other tables).

Oddly enough, I think what caused the biggest change in my thinking was learning to use JQuery. Much like other query languages, the result of almost any JQuery API call is a collection of zero or more objects. You can iterate over these as if they were arrays, but the language provides a lot of functional constructs (each(), map(), filter(), etc.) that encourage users to think more in terms of generic operations over units of data (the fact that those units are expressed in JavaScript's lovely hashmap-like dynamic objects is just a bonus).

I suspect that data-orientation makes for better programmers in any field (and I'm not alone), but I'm particularly interested in it on my team because what we do is essentially to turn large chunks of data (governmental or otherwise) into stories. From a broad philosophical perspective, I want my team thinking about what can be extracted and explained via data, and not how to optimize their loops. Data first, code second--and if concentrating on the former improves the latter, so much for the better.

Future - Present - Past