Mile Zero :: Latest posts in /journalism/new

July 23, 2014

The Landslide

We've just released a new interactive I've been working on for a couple of weeks, this time exploring the Oso landslide earlier this year. Our timeline (source) shows... well, I'll let the intro text explain it:

The decades preceding the deadly landslide near Oso reflect a shifting landscape with one human constant: Even as warnings mounted, people kept moving in. This interactive graphic tells that story, starting in 1887. Thirteen aerial photographs from the 1930s on capture the geographical changes; the hill is scarred by a succession of major slides while the river at its base gets pushed away, only to fight its way back. This graphic lets you go back in time and track the warnings from scientists; the failed attempts to stabilize the hill; the logging on or near the unstable slope; and the 37 homes that were built below the hill only to be destroyed.

The design of this news app is one of those cases where inspiration struck only after the idea had percolated for a while. We really wanted to showcase the aerial photos, originally intending to sync them up with a horizontal timeline. I don't particularly care for timelines (they're basically listicles that you can't scan easily), so I wasn't thrilled with this solution. It also didn't work well on mobile, and that's a no-go for my Seattle Times projects.

One day, while reading through the patterns at Bocoup's Mobile Vis site, it occurred to me that a vertical timeline would answer many of these problems. On mobile, a vertical scroll is a natural, inviting motion. On desktop, it was easier to arrange the elements side-by-side than stacked vertically. Swapping the axes turned out to be a huge breakthrough for the "feel" of the interactive — on phones and tablets that support inertial scrolling for overflow (Chrome and IE), users can even "throw" the timeline up the page to rapidly jump through the images, almost like a flipbook. On desktop, the mouse wheel serves much the same purpose.

On a technical level, this project made heavy use of the app template's ability to read and process CSV files. The reporters could work mostly in Excel, and their changes would be seamlessly integrated into the presentation, which made copy editing a cinch. I also added live reload to the scaffolding on this project. It's a small change, but in group design sessions it's much easier to keep my focus on the editor while the browser refreshes on another monitor for feedback. I used Ractive to build the timeline itself, mostly for ease of templating and to get a feel for it; my next projects will probably return to Angular.
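For readers who haven't seen the CSV-to-template pattern, here is a minimal sketch of the idea. It is written in Python rather than the template's own JavaScript, and the column names, placeholder rows, and markup are all assumptions made up for illustration, not the project's actual data or code.

```python
import csv
import html
import io
from string import Template

# Hypothetical CSV standing in for a spreadsheet the reporters maintain.
EVENTS_CSV = """year,headline
1887,Placeholder headline A
2014,Placeholder headline B
"""

# Hypothetical markup for a single timeline entry.
ENTRY = Template('<li class="event"><b>$year</b> $headline</li>')


def render_timeline(csv_text: str) -> str:
    """Turn each CSV row into one templated, HTML-escaped list item."""
    rows = csv.DictReader(io.StringIO(csv_text))
    items = [
        ENTRY.substitute({key: html.escape(value) for key, value in row.items()})
        for row in rows
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


print(render_timeline(EVENTS_CSV))
```

Because the page is regenerated from the spreadsheet on every build, a copy edit in the CSV shows up in the output without anyone touching the markup.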

All in all, I'm extremely happy with the way this feature turned out. The reporting is deep (in a traditional story, it would probably be at least 5,000 words), but we've managed to tell this story visually in an intuitive, at-a-glance format, across multiple device formats. Casual readers can flip through the photos and see the movement of the river (as well as the 2014 devastation), while the curious can dig into individual construction events and warning signs. It's a pretty serious chunk of interactive storytelling, but we're just getting started. If you or someone you know would like to work on projects like this, feel free to apply to our open news app designer and developer positions.

11:33 x permalink

June 19, 2014

Move Fast, Make News

As I mentioned last week, the project scaffolding I'm using for news apps at the Seattle Times has been open sourced. It assumes some proficiency with NodeJS, and is built on top of the grunt-init command.

There are many other newsrooms that have their own scaffolding: NPR has one, and the Tribune often builds its projects on top of Tarbell. Common threads include the ability to load data from CSV or Google Sheets, minifying and templating HTML with that data, and publishing to S3. My template also does those things, but with some slight differences.

It runs on top of NodeJS. This means, in turn, that it runs everywhere, unlike Tarbell, which will not work on Windows.
It has no dependencies outside of itself. I find this helpful--where the NPR template has to call out to an external lessc command to do its CSS processing, I can just load the LESS module directly.
It is opinionated, but flexible. It assumes you're using AMD modules for your JavaScript, and starts its build at predetermined paths. But it comes with no libraries, for example: instead, Bower is set up so that each project can pull only what it needs, and always have the latest versions.

What do you get from the scaffolding? Out of the box, it sets up a project folder that loads local data, feeds it to powerful templating, starts up a local development server, and watches all your files, rebuilding them whenever you make changes. It'll compile your JavaScript into a single file, with practically no work on your part, and do the same for your LESS files. Once you're done, it'll publish it to S3 for you, too. I've been using it for a project this week, and honestly: it's pretty slick.

If you're working on newsroom development, or static app development in general, please feel free to check it out, and I'd appreciate any feedback you might have.

16:14 x permalink

June 10, 2014

Top Companies 2014

My first interactive feature for the Seattle Times just went live: our Top Northwest Companies features some of the most successful companies from the Pacific Northwest. It's not anything mind-blowing, but it's a good start, and it helped me test out some of the processes I'm planning on using for future news applications. It also has a few interesting technical tricks of its own.

When this piece was originally prototyped by one of the web producers, it used an off-the-shelf library to do the parallax effect via CSS background positions. We quickly found out that it didn't let us position the backgrounds effectively so that you could see the whole image, partly because of the plugin and partly because CSS backgrounds are a pain. We thought about just dropping the parallax, but that bugged me. So I went home, looked around at how other sites (particularly Medium) were accomplishing similar effects, and came up with a different, potentially more interesting solution.

When you load the page in a modern desktop browser now, there aren't actually any images at all. Instead, there's a fixed-position canvas backdrop, and the images are drawn to it via JavaScript as you scroll. Since these are simple blits, with no filtering or fancy effects, this is generally fast enough for a smooth experience, although it churns a little when transferring between two images. I suspect I could have faster rendering in those portions if I updated the code to only render the portions of the image that are visible, or rescaled the image beforehand, but considering that it works well enough on a Chromebook, I'm willing to leave well enough alone.
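The "only render the visible portion" idea comes down to clipping an image's rectangle against the viewport. The sketch below shows just that arithmetic, in Python for brevity. It assumes each image occupies a known vertical span of the page, which may not match how the real graphic lays things out, and the function and parameter names are hypothetical.

```python
def visible_slice(image_top, image_height, scroll_y, viewport_height):
    """Clip an image's vertical span to the viewport.

    Returns (source_y, dest_y, height) in pixels, or None if the image
    is entirely off screen. All coordinates are page coordinates.
    """
    top = max(image_top, scroll_y)
    bottom = min(image_top + image_height, scroll_y + viewport_height)
    if bottom <= top:
        return None  # nothing to draw
    source_y = top - image_top  # where to start reading from the image
    dest_y = top - scroll_y     # where to start drawing on the canvas
    return source_y, dest_y, bottom - top


print(visible_slice(1000, 600, 800, 500))   # (0, 200, 300)
print(visible_slice(1000, 600, 1200, 500))  # (200, 0, 400)
print(visible_slice(1000, 600, 0, 500))     # None
```

In the browser, those three numbers would feed the source and destination arguments of the nine-argument form of the canvas `drawImage` call, so only the visible strip gets copied.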

The table at the bottom of the page is written as an Angular app, and is kind of a perfect showcase for what Angular does well. Wiring up the table to be sortable and filterable was literally only a few minutes of work. The sparklines in the last column are custom elements, and Angular's filters make presenting formatted data a snap. Development for this table was incredibly fast, and the performance is really very good. There are still some issues with this presentation, such as the annoying sticky header, but it was by far the most painless part of the development process.

The most important part of this graphic, however, is not in the scroll or in the table. It's in the workflow. As I've said before, one of the things that I really learned at ArenaNet was the importance of a good build process. You don't build games like Guild Wars 2 without a serious build toolchain, and the web team was no different. The build tool that we used, Dullard, was used to compile templates, create CSS from LESS, hash filenames for CDN busting, start and stop development servers, and generate the module definitions for our JavaScript loader. When all that happens automatically, you get better pages and faster development.
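Of the build steps listed above, filename hashing is probably the least familiar, so here is a rough sketch of the general technique. This is not how Dullard does it (I'm not reproducing its code here); the function name, hash choice, and digest length are all assumptions.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_name(filename: str, content: bytes, digest_chars: int = 8) -> str:
    """Insert a short content hash before the file extension.

    The name changes only when the bytes change, so a CDN can cache
    each version indefinitely and new builds are picked up right away.
    """
    digest = hashlib.sha256(content).hexdigest()[:digest_chars]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


old = hashed_name("css/site.css", b"body { color: black; }")
new = hashed_name("css/site.css", b"body { color: navy; }")
print(old)
print(new)
```

The two printed names differ only in the hash segment, which is the whole point: the build rewrites references in the templates to the new name, and stale cached copies are never requested again.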

I'm not planning on using Dullard at the Times (sorry, Pat!) only because I want to be able to bring people onboard quickly. So I'm going with the standard Grunt task runner, but breaking up its tasks in a very Dullard-like way and using it to automate as much as possible. There's no hand-edited code in the Top Companies graphic — only templates and data merged via the build process. Reproducing these stories, or updating them later, is as simple as pulling the repo (or, in this case, both repos) and running the Grunt task again.

That simplicity also extends to the publication process. Fast deployment means fast development and fewer mistakes hanging out in the wild when bugs occur. For Seattle Times news apps, I'm planning to host them as flat files on Amazon S3, which is dirt-cheap and rock-solid (NPR and the Chicago Tribune use the same model). Running a deployment is as simple as grunt publish. In testing last night, I could deploy a fixed version of the page faster than people could switch to their browser and press refresh. As a client-side kind of person, I'm a huge fan of the static app model anyway, but the speed and simplicity of this solution exceeded even my expectations.

Going forward, I want all my news apps to benefit from this kind of automation, without having to copy a bunch of files around. I looked at Yeoman for creating app skeletons, but it seemed like overkill, so I'm setting up a template with Grunt's project scaffolding with all the boilerplate already installed. Once that's done, I'll be able to run one command and create a blank project for news apps that includes LESS compilation, JavaScript concatenation, minification, templating, and S3 publishing. Automating all of that boilerplate means faster startup time, and that means more projects to make the newsroom happy.

As I work on these story templates, I'll be open-sourcing them and sharing my ideas. The long and the short of it is that working in a newsroom is unpredictable: crazy deadlines, no requirements to speak of, and wildly different subject matter. This kind of technical architecture may seem unrelated to the act of journalism, but its goal is to lay the groundwork so that there are no distractions from the hard part: telling creative news stories online. I want to worry about making our online journalism better, not debugging servers. And while I don't know what the final solution for that is, I think we're off to a good start.

20:01 x permalink

January 10, 2014

App-y New Year

At the end of January, I'll be teaching a workshop at the University of Washington on "news apps," thanks to an offer from the outgoing news app editor at the Seattle Times. It's a great opportunity, and a chance to revisit my more editorial skills. From the description:

This bootcamp will introduce students to the basic components of creating news applications, which are data-powered digital stories tied together through design, programming and journalism. We’ll walk through all the components of creating a news application, look at industry examples of what works and what doesn’t, and learn the basic coding skills required to build a news app.

Sounds cool, but it's still a wide-open field — "data-powered digital stories" covers a huge range of approaches. What do you even teach, and how do you do it in two 4-hour workshops?

It turns out that for almost any definition of "news app," there's an exception. NPR's presidential election board is a data-powered news app, but it's not interactive beyond an auto-update. Snow Fall is certainly a news app, but it's hard to call it "data-powered." How can we craft a category that includes these, but also includes traditional, data-oriented interactives like The Atlantic's Netflix Genre Generator and the Seattle Times mayoral race comparison? More importantly, how do we get young journalists to be able to think both expansively and productively about telling stories online?

That said, I think there is, actually, a unifying principle for news apps. In fact, I think it cuts to the heart of what draws me to web journalism, and the web in general. News apps are journalistic stories told via hypermedia — or, to put it simply, they have links.

A link seems like a small thing after years on the web, so it's good to revisit just how fundamentally groundbreaking they are. Links can support or subvert their anchor, creating new rhetorical devices of their own. At the most basic level, they contextualize a story. More abstractly, they create non-linearity: users explore a news app at their own pace and with their own priorities, rather than the direct stream of narrative from a text story.

A link is a simple starting place. But it starts us down a path of thinking about more complicated applications and usage. I'm fond of saying that an interactive visualization is constructed in many layers, with users peeling open the onion as far as they may want. If we're thinking in terms of other hypertext documents (a.k.a., the TV Tropes Rabbit Hole) from the start, we're already prepared when readers use similar interaction patterns to browse data-based interactives — either by shallowly skipping around, or diving in depth for a specific feature.

By reconceptualizing news apps as being hypermedia instead of a specific technology or group of technologies, such as mapping or graphing, introducing students to web storytelling gets a lot easier — particularly since I won't have time to teach them much beyond some basic HTML and CSS (in the first workshop) and a little scripting (in the second).

It also leaves them plenty of room to think creatively when presenting stories. I'd love for budding news app developers to be as interested in wikis and Twine as they are in D3 and PostGIS. Most importantly, I'd love for an appreciation of hypertext to leak into their writing in general, if only to reduce the number of print die-hards in newsrooms around the country. You don't have to end up a programmer to create new, interesting journalism that's really native to the web.

18:58 x permalink

July 24, 2013

The Narrative

I had planned on writing a post about Nate Silver's departure from the New York Times this week, but Lance pretty much beat me to it:

Silver is now legendary for being a numbers guy. But there aren't going to be any useful numbers for analyzing the next Presidential election until the middle of 2015 at the earliest. The circumstances under which the election will take place---the state of the economy, whether we're at war or peace, the President's popularity and if and how that will transfer to the Democratic nominee, what issues are galvanizing which voters, etc.---won't make themselves known and so won't show up as numbers in polls at least until then. And until then, everything said about the election is idle speculation, and we know how Silver feels about idly speculating.
But we also know that the most incorrigible idle speculators believe idle speculation is the point.

It's well worth the time to read the whole thing.

I've seen some people assert, in light of this departure, that lots of people could do what Silver did for the Times: his models weren't that complicated, after all, and how hard can it be to write about them? I think this dramatically underestimates the uniqueness of FiveThirtyEight and, to some extent, signifies how threatening it really was to political pundits.

There are, no doubt, a few journalists who could put together Nate Silver's models, and then write about them with clarity. I don't think anyone doubted that evidence-driven political reporting was possible. What he did was show that it could be successful, and that it could draw eyeballs. I think it was John Rogers who said that the best thing about blogging was not the enabling effect for amateurs, but for experts. Suddenly people with actual skills--economists, historians, political scientists, statisticians--could have the kind of audience that op-ed pages commanded.

This should not have been a surprise for newspapers, except that the industry has spent years convincing itself that investigative teams and deep expertise in a beat aren't worth funding. To be fair, the New York Times has put money behind a lot of data journalism in the past few years. If they can't keep the attention of someone like Silver, who can? I guess we're going to find out.

23:36 x permalink

November 13, 2012

Nate Silver: Not a Witch

In retrospect, Joe Scarborough must be pretty thrilled he never took Nate Silver's $1,000 bet on the outcome of the election. Silver's statistical model went 50 for 50 states, and came close to the precise number of electoral votes, even as Scarborough insisted that the presidential campaign was a tossup. In doing so, Silver became an inadvertent hero to people who (unlike Joe Scarborough) are not bad at math, inspiring a New Yorker humor article and a Twitter joke tag ("#drunknatesilver", who only attends the 50% of weddings that don't end in divorce).

There are two things that are interesting about this. The first is the somewhat amusing fact that Silver's statistical model, strictly speaking, isn't actually that sophisticated. That's not to take anything away from the hard work and mathematical skills it took to create that model, or (probably more importantly) Silver's ability to write clearly and intelligently about it. I couldn't do it, myself. But when it all comes down to it, FiveThirtyEight's methodology is just to track state polls, compare them to past results, and organize the results (you can find a detailed--and quite readable--explanation of the entire methodology here). If nobody has done this before, it's not because the idea was an unthinkable revolution or the result of novel information technology. It's because they couldn't be bothered to figure out how.

The second interesting thing about Silver's predictions is how incredibly hard the pundits railed against them. Scarborough was most visible, but Politico's Dylan Byers took a few potshots himself, calling Silver a possible "one-term celebrity." You can almost smell sour grapes rising from Byers' piece, which presents on the one side Silver's math, and on the other side David Brooks. It says a lot about Byers that he quoted Brooks, the rodent-like New York Times columnist best known for a series of empty-headed books about "the American character," instead of contacting a single statistician for comment.

Why was Politico so keen on pulling down Silver's model? Andrew Beaujon at Poynter wrote that the difference was in journalism's distaste for the unknown--that reporters hate writing about things they can't know. There's an element of truth to that sentiment, but in this case I suspect it's exactly wrong: Politico attacked because its business model is based entirely on the cultivation of uncertainty. A world where authority derives from more than the loudest megaphone is a bad world for their business model.

Let's review, just for a second, how Politico (and a whole host of online, right-leaning opinion journals that followed in its wake) actually work. The oft-repeated motto, coming from Gabriel Sherman's 2009 profile, is "win the morning"--meaning, Politico wants to break controversial stories early in order to work its brand into the cable and blog chatter for the rest of the day. Everything else--accuracy, depth, other journalistic virtues--comes second to speed and infectiousness.

To that end, a lot of people cite Mike Allen's Playbook, a gossipy e-mail compendium of aggregated fluff and nonsense, as the exemplar of the Politico model. Every morning and throughout the day, the paper unleashes a steady stream of short, insider-ey stories. It's a rumor mill, in other words, one that's interested in politics over policy--but most of all, it's interested in Politico. Because if these stories get people talking, Politico will be mentioned, and that increases the brand's value to advertisers and sources.

(There is, by the way, no small amount of irony in the news industry's complaints about "aggregators" online, given the long presence of newsletters like Playbook around DC. Everyone has one of these mobile-friendly link factories, and has for years. CQ's is Behind the Lines, and when I first started there it was sent to editors as a monstrous Word document, filled with blue-underlined hyperlink text, early every morning for rebroadcast. Remember this the next time some publisher starts complaining about Gawker "stealing" their stories.)

Politico's motivations are blatant, but they're not substantially different from any number of talking heads on cable news, which has a 24-hour news hole to fill. Just as the paper wants people talking about Politico to keep revenue flowing, pundits want to be branded as commentators on every topic under the sun so they can stay in the public eye as much as possible. In a sane universe, David Brooks wouldn't be trusted to run a frozen yoghurt stand, because he knows nothing about anything. Expertise--the idea that speaking knowledgeably requires study, sometimes in non-trivial amounts--is a threat to this entire industry (probably not a serious threat, but then they're not known for underreaction).

Election journalism has been a godsend to punditry precisely because it is so chaotic: who can say what will happen, unless you are a Very Important Person with a Trusted Name and a whole host of connections? Accountability has not traditionally been a concern, and because elections hinge on any number of complicated policy questions, this means that nothing is out of bounds for the political pundit. No matter how many times William Kristol or Megan McArdle are wrong on a wide range of important issues, they will never be fired (let's not even start on poor Tom Friedman, a man whose career consists of endlessly sorting the wheat from the chaff and then throwing away the wheat). But FiveThirtyEight undermines that thought process, by saying that there is a level of rigor to politics, that you can be wrong, and that accountability is important.

The optimistic take on this disruption is, as Nieman Journalism Lab's Jonathan Stray argues, that specialist experts will become more common in journalism, including in horse race election coverage. I'm not optimistic, personally, because I think the current state of political commentary owes as much to industry nepotism as it does to public opinion, and because I think political data is prone to intentional obfuscation. But it's a nice thought.

The real positive takeaway, I think, is that Brooks, Byers, Scarborough, and other people of little substance took such a strong public stance against Silver. By all means, let's have an open conversation about who was wrong in predicting this election--and whose track record is better. Let's talk about how often Silver is right, and how that compares to everyone calling him (as Brooks did) "a wizard" whose predictions were "not possible." Let's talk about accountability, and expertise, and whether we should expect better. I suspect Silver's happy to have that talk. Are his accusers?

20:42 x permalink

January 18, 2012

Your Scattered Congresses

Once more with feeling: today, I'm happy to bring you my last CQ vote study interactive. This version is something special: although it lacks the fancy animations of its predecessor, it offers a full nine years of voting data, and it does so faster and in more detail. Previously, we had only offered data going back to 2009, or a separate interactive showing the Bush era composite scores.

We had talked about this three-pane presentation at CQ as far back as two years ago, in a discussion with the UX team on how they could work together with my multimedia team. Our goal was to lower the degree to which a user had to switch manually between views, and to visually reinforce what the scatter plot represents: a spatial view of party discipline. I think it does a pretty good job, although I do miss the pretty transitions between different graph types.

Technically speaking, loading nine years of votestudy data was a challenge: that's almost 5,000 scores to collect, organize, and display. The source files necessarily separate member biodata (name, district, party, etc) from the votestudy data, since putting the two into the same data structure would bloat the file size from repetition (many members served in multiple years). But keeping them separate causes a lag problem while interacting with the graphic: doing lookups based on XML queries tends to be very slow, particularly over 500K of XML.

I tried a few tricks to find a balance between real-time lookup (slow interaction, quick initial load) and a full preprocessing step (slow initial load, quick interactions). In the end, I went with an approach that processes each year when it's first displayed, adding biodata to the votestudy data structure at that time, and caching member IDs to minimize the lookup time on members who persist between years. The result is a slight lag when flipping between years or chambers for the first time, but it's not enough to be annoying and the startup time remains quick.
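The shape of that compromise is easier to see in code than in prose. This is a sketch only, in Python rather than the graphic's own code: the records, field names, and the linear scan standing in for a slow XML query are all hypothetical.

```python
from typing import Dict, List

# Hypothetical stand-ins for the two source files: biodata keyed by member ID,
# and per-year score records that reference members only by ID.
BIODATA = [
    {"id": "M001", "name": "Member One", "party": "D"},
    {"id": "M002", "name": "Member Two", "party": "R"},
]
SCORES = {
    2010: [{"id": "M001", "unity": 91}, {"id": "M002", "unity": 88}],
    2011: [{"id": "M001", "unity": 93}],
}

lookups = 0  # counts slow lookups, to show the cache working
member_cache: Dict[str, dict] = {}
processed_years: Dict[int, List[dict]] = {}


def slow_lookup(member_id: str) -> dict:
    """Linear scan standing in for a slow XML query."""
    global lookups
    lookups += 1
    return next(m for m in BIODATA if m["id"] == member_id)


def member(member_id: str) -> dict:
    """Return biodata for a member, hitting the slow path only once per ID."""
    if member_id not in member_cache:
        member_cache[member_id] = slow_lookup(member_id)
    return member_cache[member_id]


def year_view(year: int) -> List[dict]:
    """Join biodata onto a year's scores the first time that year is shown."""
    if year not in processed_years:
        processed_years[year] = [{**member(s["id"]), **s} for s in SCORES[year]]
    return processed_years[year]


year_view(2010)
year_view(2011)  # M001 is already cached, so no new slow lookup happens
print(lookups)   # 2
```

Nothing is joined at startup, each year pays its cost once, and members who persist between years only ever trigger one slow lookup.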

(In a funny side note, working with just the score data is obscenely quick. It's fast enough, in fact, that I can run through all nine years to find the bounds for the unity part of the graph, keeping it consistent from year to year, in less than a millisecond. That's fast enough that I can be lazy and do that before every re-render--as long as I don't need any names. Don't optimize prematurely, indeed.)
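The bounds scan itself is about as simple as code gets. A sketch in Python, with made-up scores and a hypothetical function name:

```python
# Hypothetical unity scores by year; the real data covers nine years.
SCORES = {
    2009: [91, 88, 76],
    2010: [93, 85],
    2011: [97, 62, 80],
}


def unity_bounds(scores_by_year):
    """Scan every year's scores and return the overall (low, high)."""
    low = min(min(scores) for scores in scores_by_year.values())
    high = max(max(scores) for scores in scores_by_year.values())
    return low, high


print(unity_bounds(SCORES))  # (62, 97)
```

Using one pair of bounds for every year is what keeps the axis from jumping around as you flip between years.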

The resulting graphic is typical of CQ interactives, in that it's a direct view on our data without a strong editorial perspective--we don't try to hammer a story through here. That said, I think there's some interesting information that emerges when you can look at single years of data going back to 2002:

The Senate is generally much more supportive of the president than the House is. While you can't directly compare scores across chambers (because the votes are different), the trend is striking. It's well known that House members tend to be more radical than senators, but I suspect the difference is also procedural: in the House, the leadership controls the agenda much more tightly than in the Senate, which can be held up by filibuster. As a result, the House may vote on bills that would never reach the Senate floor, just because the majority party can force the issue.
Although the conventional wisdom on the left since the Gingrich years has been that Republican discipline is stronger for political reasons, I'm not sure that's entirely borne out by these graphics. Party unity over the last nine years appears roughly symmetrical most of the time, while presidential support (and opposition) appears to shift in direct response to the strength of the White House due to popularity and/or election status. 2007-2009 was a particularly strong time for the Democrats in terms of uniting around or against a presidential agenda, for obvious reasons. This year the Republicans rallied significantly, particularly in the House.
There is one person who's explicitly taken out of the graphs (and not removed due to lack of participation or other technical reasons). That person is Zell Miller, everyone's favorite Bush-era iconoclast. If you're like me, you haven't thought about Zell Miller in 6 or 7 years, but there he was when I loaded the Senate file for the first time. Miller voted against his party so often that he had ridiculously low scores in 2003 and 2004, resulting in a vast expanse of white space on the plots with one lonely blue dot at the bottom. Rather than let him make everyone too small to click, I dropped him from the dataset as an outlier.

All of this, of course, is just my amateur political analysis. While I'm arguably more informed (possibly too informed!) about congressional practice than the average person, I'm no expert. For that, you may want to check out CQ's always-fantastic editorial graphics on the votestudies, which show in more detail the legislative trends of the last few decades. It's very cool stuff.

Finally, I did mention that this is my last CQ votestudy interactive. It's been a fantastic ride at Congressional Quarterly, and I'm grateful for the opportunities and education I received there. But it's time to move on, and to find something closer to home here in Seattle: at the end of this month, I'll be starting in a new position, doing web development at Big Fish Games. Wish me luck!

11:35 x permalink

November 9, 2011

Reaction

As the deadlines creep forward for the Joint Special Committee on Deficit Reduction, my team at CQ has put together a package of new and recent debt interactives covering the automatically-triggered budget cuts, the proposals on the table, the schedule set for committee action, and more.

The centerpiece of the package is a "reactive document" showing how the automatic cuts will go into effect if Congress does not pass cuts totalling $1.2 trillion by January 15. A series of sliders set the size of the hypothetical cuts, and the text and diagrams of the document adjust themselves to match. It's a neat idea, and one that's kind of a natural match for CQ: wordy, but still wonky.
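Stripped of the sliders and styling, a reactive document is just text re-rendered from a small model whenever an input changes. Here is a toy sketch in Python. The $1.2 trillion target comes from the paragraph above, but the rule that automatic cuts simply cover the shortfall is a simplifying assumption for illustration, not the actual budget mechanics, and the function name is made up.

```python
TARGET = 1200  # billions of dollars: the $1.2 trillion target


def render(enacted_cuts: int) -> str:
    """Re-render the sentence for one slider value (billions of dollars)."""
    # Assumption for illustration only: automatic cuts cover the shortfall.
    shortfall = max(0, TARGET - enacted_cuts)
    if shortfall == 0:
        return (f"Congress passes ${enacted_cuts:,} billion in cuts, "
                "so no automatic cuts are triggered.")
    return (f"Congress passes ${enacted_cuts:,} billion in cuts, "
            f"leaving ${shortfall:,} billion to automatic cuts.")


# Each call stands in for one drag of the slider.
for value in (0, 800, 1200):
    print(render(value))
```

Every change to the input rewrites the whole sentence, and in a real document, every sentence and diagram that depends on it.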

Like a lot of people, I encountered the idea of reactive documents through Bret Victor's essay Explorable Explanations. Victor is an ex-Apple UI designer who wants to re-think the way people teach math, and reactive documents are one of the tools he wants to use. His explorations of learning design via reactive documents, such as Up and Down the Ladder of Abstraction, are breathtaking. As he writes,

There's nothing new about scenario modeling. The authors of this proposition surely had an Excel spreadsheet which answered the same questions. But a spreadsheet is not an explanation. It is merely a dataset and model; it cannot be read. An explanation requires an author, to interpret the results of the model, and present them to the reader via language and graphics.
The reactive document integrates spreadsheet-like models into authored text. It can be read at multiple levels, depending on the reader's level of interest. The hurried reader can skim it. The casual reader can read it as-is. The curious reader can adjust the author's scenarios. The engaged reader can explore scenarios of his own devising.
Unlike a spreadsheet, the barrier to exploration here is extremely low -- simply click and drag. This invites casual readers to become engaged and start exploring. It transforms readers from passive to active.

Victor's idea is a clever one, and as someone who often describes interactives using the same "layered reading" mechanism, it appeals to my storytelling sense. I also like that it embraces the original purpose of the web--to present hypertext documents--without sacrificing the rich interactions that browser applications have developed. That said, I'm not entirely convinced that reactive documents like this are actually terribly useful or novel.

The main problem with this method of presenting interactive information is that it's actually really burdensome for the playful user. It's easy to read, but if you change anything, you have to basically either read and process the entire paragraph again, or you have to learn to pick out individual changes and their meaning from a jumble of words. Besides, sometimes words are not a very good description of an effect or process--imagine describing complex machinery only in paragraph form.

Victor also has some examples that avoid this flaw by making the reactive document incorporate diagrams and graphs alongside his formulas. These are great, but they also illustrate the fact that, once you make reactive "documents" more visual and take away the intertextual trickery, they're really just regular interactives. They're stunningly designed, and I'm always in favor of more multimedia, but there's nothing new about them.

This probably comes off as a little more adversarial to the concept of reactive documents than I actually am, most of which is just my rhetorical background leaking out. I think they're neat, and I would guess that Victor himself thinks of them less as a complete solution and more as a different shade in his teaching palette. In some places, they're helpful, in others not so much.

As an Excel enthusiast, though, I do take exception to Victor's description of spreadsheets as something that "cannot be read," with a high barrier to entry. People read and create spreadsheets all the time, although (to my frustration) they often use them as layout tools. But a spreadsheet that's already set up for someone and locked up to prevent mistakes is barely any more difficult to use than his draggable text--the only real difference is the need to type a number. Regular people may find spreadsheet formulas difficult to connect with cells, but those same people are unlikely to be creating Victor's reactive documents either.

Ultimately, I'm wary of claims that any tool is a silver bullet for education or explainer journalism. It's easy to be blinded by slick UX, and to forget that we're basically just re-inventing storytelling tools used by great teachers for centuries. That shouldn't eliminate interactive games and illustrations from our kit. But reading Victor's site, it's easy to give the technology credit for its thought-provoking qualities, when the credit really goes to his lucid, considered reasoning and clear writing (both of which mean that the technology is well-applied). Sadly, there's no script for that.

11:17 x permalink

October 12, 2011

The Big Contract

Recently my team worked on an interactive for a CQ Weekly Outlook on contracts. Government contracting is, of course, a big deal in these economic times, and the government spent $538 billion on contractors in FY2010. We wanted to show people where the money went.

I don't think this is one of our best interactives, to be honest. But it did raise some interesting challenges for us, simply because the data set was so huge: the basic table of all government contracts for a single fiscal year from USA Spending is around 3.5 million rows, or about 2.5GB of CSV. That's a lot of data for the basic version: the complete set (which includes classification details for each contract, such as whether it goes to minority-owned companies) is far larger. When the input files are that big, forget querying them: just getting them into the database becomes a production.

My first attempt was to write a quick PHP script that looped through the file and loaded it into the table. This ended up taking literally ten or more hours for each file--we'd never get it done in time. So I went back to the drawing board and tried using PostgreSQL's COPY command. COPY is very fast, but the destination has to match the source exactly--you can't skip columns--which is a pain, especially when the table in question has so many columns.

To avoid hand-typing 40-plus columns for the table definition, I used a combination of some command line tools, head and sed mostly, to dump the header line of the CSV into a text file, and then added enough language for a working CREATE TABLE command, everything typed as text. With a staging table in place, COPY loaded millions of rows in just a few minutes, and then I converted a few necessary columns to more appropriate formats, such as the dollar amounts and the dates. We did a second pass to clean up the data a little (correcting misspelled or inconsistent company names, for example).
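
As a rough illustration of that header-to-table step, here is a minimal sketch in JavaScript rather than head and sed. The function name, table name, file path, and sample column names are all hypothetical, and the naive comma split assumes no header name contains a comma. It only builds the SQL text; it does not talk to a database.

```javascript
// Hypothetical helper: turn a CSV header line into a staging-table definition
// where every column is typed as text, plus a matching COPY statement.
function buildStagingSql(headerLine, tableName, csvPath) {
  const columns = headerLine
    .trim()
    .split(",") // assumes header names contain no commas
    // Strip surrounding quotes, then normalize to a lowercase identifier.
    .map((name) => name.trim().replace(/^"|"$/g, ""))
    .map((name) => name.toLowerCase().replace(/[^a-z0-9_]+/g, "_"));

  const columnDefs = columns.map((name) => `  ${name} text`).join(",\n");
  const createTable = `CREATE TABLE ${tableName} (\n${columnDefs}\n);`;
  // COPY needs the table to match the file column-for-column.
  const copy = `COPY ${tableName} FROM '${csvPath}' WITH (FORMAT csv, HEADER true);`;
  return { columns, createTable, copy };
}

// Made-up header line and names, for illustration only.
const stagingSql = buildStagingSql(
  'unique_transaction_id,"Dollars Obligated",vendorName,signeddate',
  "contracts_staging",
  "/tmp/contracts_fy2010.csv"
);
console.log(stagingSql.createTable);
console.log(stagingSql.copy);
```

Typing everything as text keeps COPY from rejecting rows over a single malformed value; the handful of columns that matter can be converted afterwards.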

Once we had the database in place, and added some indexes so that it wouldn't spin its wheels forever, we could start to pull some useful data, like the state-by-state totals for a basic map. It's not surprising that the beltway bandits in DC, Maryland, and Virginia pull an incredible portion of contracting money--I had to clamp the maximum values on the map to keep DC's roughly $42,000 in contract dollars per resident from blowing out the rest of the country--but there are some other interesting high-total states, such as New Mexico and Connecticut.
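
Clamping a color scale can be sketched in a few lines. This is an assumption about the general technique, not the map's actual code: the ceiling and the two placeholder regions are invented, and only DC's rough per-resident figure comes from the text above.

```javascript
// Clamp a per-capita value to a ceiling, then scale it to a 0..1 shade.
function shadeFor(value, ceiling) {
  const clamped = Math.min(Math.max(value, 0), ceiling);
  return clamped / ceiling;
}

// Placeholder values except for DC; the ceiling is a hypothetical cutoff.
const perCapita = { DC: 42000, placeholderA: 4000, placeholderB: 1000 };
const ceiling = 8000;

for (const [region, dollars] of Object.entries(perCapita)) {
  console.log(region, shadeFor(dollars, ceiling).toFixed(2));
}
```

Without the clamp, one outlier sets the top of the scale and every other region ends up nearly the same pale shade.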

Now we wanted to see where the money went inside each state: what were the top five companies, funding agencies, and product codes? My initial attempts, using a series of subqueries and count() functions, were tying up the server with nothing to show for it, so I tossed the problem over to another team member and went back to working on the map, thinking I wanted to have something to show for our work. He came back with a great solution--PostgreSQL's PARTITION BY clause, which splits a query's rows into groups for window functions, combined with the rank() function for filtering--and we were able to find the top categories easily. A variation on that template gave us per-agency totals and top fives.
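
The partition-then-rank idea can be sketched in memory as well. The SQL in the comment shows roughly what such a query looks like; the table and column names there are hypothetical, as are the function name and the sample rows in the JavaScript. One difference worth flagging: this sketch numbers rows in sort order, which behaves like row_number() rather than rank() when totals tie.

```javascript
// Roughly the SQL shape (hypothetical table and column names):
//   SELECT * FROM (
//     SELECT state, vendor, SUM(dollars) AS total,
//            rank() OVER (PARTITION BY state ORDER BY SUM(dollars) DESC) AS rnk
//     FROM contracts GROUP BY state, vendor
//   ) ranked WHERE rnk <= 5;

function topByGroup(rows, groupKey, itemKey, valueKey, limit) {
  // Sum values for each (group, item) pair, like GROUP BY state, vendor.
  const totals = new Map();
  for (const row of rows) {
    const group = row[groupKey];
    if (!totals.has(group)) totals.set(group, new Map());
    const items = totals.get(group);
    items.set(row[itemKey], (items.get(row[itemKey]) || 0) + row[valueKey]);
  }
  // Within each partition, sort descending and keep the first `limit` entries.
  const result = {};
  for (const [group, items] of totals) {
    result[group] = [...items.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, limit)
      .map(([item, total], index) => ({ item, total, rank: index + 1 }));
  }
  return result;
}

// Made-up rows, for illustration only.
const sampleRows = [
  { state: "NM", vendor: "A", dollars: 10 },
  { state: "NM", vendor: "B", dollars: 30 },
  { state: "NM", vendor: "A", dollars: 25 },
  { state: "CT", vendor: "C", dollars: 5 },
];
console.log(topByGroup(sampleRows, "state", "vendor", "dollars", 5));
```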

There are a couple of interesting lessons to be learned from this experience, the most obvious of which is the challenge of journalism at scale. Certain stories, particularly on huge subjects like the federal budget, are too big to be feasibly investigated without computer-assisted reporting, and yet they require skills beyond the usual spreadsheet-juggling.

I don't think that's going away. In fact, I think scale may be the defining quality of the modern information age. A computer is just a machine for performing simple operations at incredibly high speeds, to the point where they seem truly miraculous--changing thousands (or millions) of pixels each second in response to input, for example. The Internet expands that scale further, to millions of people and computers interacting with each other. Likewise, our reach has grown with our grasp. It seems obvious to me that our governance and commerce have become far more complex as a result of our ability to track and interact with huge quantities of data, from contracting to high-speed trading to patent abuse. Journalists who want to cover these topics are going to need to be able to explore them at scale, or be reliant on others who can do so.

Which brings us to the second takeaway from this project: in computer-assisted journalism, speed matters. If hours are required to return a query, asking questions becomes too expensive to waste on undirected investigation, and fact-checking becomes similarly burdensome. Getting answers needs to be quick, so that you can easily continue your train of thought: "Who are the top foreign contractors? One of them is the Canadian government? What are we buying from them? Oh, airplane parts--interesting. I wonder why that is?"

None of this is a substitute for domain knowledge, of course. I am lucky to work with a great graphics reporter and an incredibly knowledgeable editor, the combination of which often saves me from embarrassing myself by "discovering" stories in the data that are better explained by external factors. It is very easy to see an anomaly, such as the high level of funding in New Mexico from the Department of Energy, and begin to speculate wildly, while someone with a little more knowledge would immediately know why it's so (in this case, the DoE controls funding for nuclear weapons, including the Los Alamos research lab in New Mexico).

Performing journalism with large datasets is therefore a three-fold problem. First, it's difficult to prepare and process. Second, it's tough to investigate without being overwhelmed. And finally, the sheer size of the data makes false patterns easier to find, requiring extra care and vigilance. I complain a lot about the general state of data journalism education, but this kind of exercise shows why it's a legitimately challenging mix of journalism and raw technical hackery. If I'm having trouble getting good results from sources with this kind of scale, and I'm a little obsessed with it, what's the chance that the average, fresh-out-of-J-school graduate will be effective in a world of big, messy data?

9:34 x permalink

June 22, 2011

Against the Grain

If I have a self-criticism of the work I'm doing at CQ, it's that I mostly make flat tools for data-excavation. We rarely set out with a narrative that we want to tell--instead, we present people with a window into a dataset and give them the opportunity to uncover their own conclusions. This is partly due to CQ's newsroom culture: I like to think we frown a bit on sensationalism here. But it is also because, to a certain extent, my team is building the kinds of interactives we would want to use. We are data-as-playground people, less data-as-theme-park.

It's also easier to create general purpose tools than it is to create a carefully-curated narrative. But that sounds less flattering.

In any case, our newest project does not buck this trend, but I think it's pretty fascinating anyway. "Against the Grain" is a browseable database of dissent on party unity votes in the House and Senate (party unity votes are defined by CQ as those votes where a majority of Republicans and a majority of Democrats took opposing sides on a bill). Go ahead, take a look at it, and then I'd like to talk about the two sides of something like this: the editorial and the technical.

The Editorial

Even when you're building a relatively straightforward data-exploration application like this one, there's still an editorial process in play. It comes through in the flow of interaction, in the filters that are made available to the user, and the items given particular emphasis by the visual design.

Inescapably, there are parallels here to the concept of "objective" journalism. People are tempted to think of data as "objective," and I guess at its most pure level it might be, but from a practical standpoint we don't ever deal with absolutely raw data. Raw data isn't useful--it has to be aggregated to have value (and boy, if there's a more perilous-but-true phrase in journalism these days than "aggregation has value," I haven't heard it). Once you start making decisions about how to combine, organize, and display your set, you've inevitably committed to an editorial viewpoint on what you want that data to mean. That's not a bad thing, but it has to be acknowledged.

Regardless, from an editorial perspective, we had a pretty specific goal with "Against the Grain." It began as an offshoot of a common print graphic using our votestudy data, but we wanted to be able to take advantage of the web's unlimited column inches. What quickly emerged as our showcase feature--what made people say "ooooh" when we talked it up in the newsroom--was to organize a given member's dissenting votes by subject code. What are the policy areas on which Member X most often breaks from the party line? Is it regulation, energy, or financial services? How are those different between parties, or between chambers? With an interactive presentation, we could even let people drill down from there into individual bills--and jump from there back out to other subject codes or specific members.
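
To make the "dissent by subject code" idea concrete, here is a small sketch. The record shape (per-party yes/no tallies, a subject field, and a map of how each member voted) is entirely hypothetical and is not CQ's schema; it also assumes "majority" means a majority of those voting yes or no.

```javascript
// A party unity vote: a majority of each party lands on opposite sides.
function isPartyUnity(vote) {
  const repYes = vote.rep.yes > vote.rep.no;
  const repNo = vote.rep.no > vote.rep.yes;
  const demYes = vote.dem.yes > vote.dem.no;
  const demNo = vote.dem.no > vote.dem.yes;
  return (repYes && demNo) || (repNo && demYes);
}

// Count how often one member broke from the party majority, by subject code.
function dissentBySubject(votes, member) {
  const counts = {};
  for (const vote of votes.filter(isPartyUnity)) {
    const cast = vote.cast[member.id];
    if (cast !== "yes" && cast !== "no") continue; // skip absences
    const party = vote[member.party];
    const partyLine = party.yes > party.no ? "yes" : "no";
    if (cast !== partyLine) {
      counts[vote.subject] = (counts[vote.subject] || 0) + 1;
    }
  }
  return counts;
}

// Made-up tallies, for illustration only.
const sampleVotes = [
  { subject: "energy", rep: { yes: 200, no: 30 }, dem: { yes: 20, no: 170 }, cast: { m1: "no" } },
  { subject: "energy", rep: { yes: 220, no: 10 }, dem: { yes: 150, no: 40 }, cast: { m1: "no" } },
  { subject: "regulation", rep: { yes: 50, no: 180 }, dem: { yes: 180, no: 10 }, cast: { m1: "no" } },
];
console.log(dissentBySubject(sampleVotes, { id: "m1", party: "rep" }));
```

The second sample vote is ignored because both parties' majorities voted yes, so it is not a party unity vote at all.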

To present this process, I went with a panel-oriented navigation method, modeled on mobile interaction patterns (although, unfortunately, it still doesn't work on mobile--if anyone can tell me why the panels stack instead of floating next to each other on both Webkit and Mobile Firefox, I'd love to know). By presenting users with a series of rich menu options, while keeping the previous filters onscreen if there's space, I tried to strike a balance between query-building and giving room for exploration. Users can either start from the top and work down, by viewing the top members and exploring their dissent; from the bottom up, by viewing the most contentious votes and seeing who split from the party; or somewhere in the middle, by filtering the two main views through a vote's subject code.

We succeeded, I think, in giving people the ability to look at patterns of dissent at a member and subject level, but there's more that could be done. Congressional voting is CQ's raison d'etre, and we store a mind-boggling amount of legislative information that could be exploited. I'd like to add arbitrary member lookup, so people could find their own senator or representative. And I think it might be interesting to slice dissent by vote type--to see if there's a stage in the legislative process where discipline is particularly low or high.

So sure, now that we've got this foundation, there are lots of stories we'd like it to handle, and certain views that seem clunkier than necessary. It's certainly got its flaws and its oddities. But on the other hand, this is a way of browsing through CQ's vote database that nobody outside of CQ (and hardly anyone inside) has ever had before. Whatever its limitations, it enables people to answer questions they couldn't have asked prior to its creation. That makes me happy, because I think a certain portion of my job is simply to push the organization forward in terms of what we consider possible.

So with that out of the way, how did I do it?

The Technical

"Against the Grain" is probably the biggest JavaScript application I've written to date. It's certainly the best-written--our live election night interactive might have been bigger, but it was a mess of display code and XML parsing. With this project, I wanted to stop writing JavaScript as if it was the poor man's ActionScript (even if it is), and really engage on its own peculiar terms: closures, prototypal inheritance, and all.

I also wanted to write an application that would be maintainable and extensible, so at first I gave Backbone.js a shot. Backbone is a Model-View-Controller library of the type that's been all the rage with the startup hipster crowd, particularly those who use obstinately-MVC frameworks like Ruby on Rails. I've always thought that MVC--like most design patterns--feels like a desperate attempt to convert common sense into jargon, but the basic goal of it seemed admirable: to separate display code from internal logic, so that your code remains clean and abstracted from its own presentation.

Long story short, Backbone seems designed to be completely incomprehensible to someone who hasn't been writing formal MVC applications before. The documentation is terrible, there's no error reporting to speak of, and the sample application is next to useless. I tried to figure it out for a couple of hours, then ended up coding my own display/data layer. But it gave me a conceptual model to aim for, and I did use Backbone's underlying collections library, Underscore.js, to handle some of the filtering and sorting duties, so it wasn't a total loss.

One feature I appreciated in Backbone was the templating it inherits from Underscore (and which they got in turn from jQuery's John Resig). It takes advantage of the fact that browsers will ignore the contents of <script> tags with a type set to something other than "text/javascript"--if you set it to, say, "text/html" or "template," you can put arbitrary HTML in there. I created a version with Mustache-style support for replacing tags from an optional hash, and it made populating my panels a lot easier. Instead of manually searching for <span> IDs and replacing them in a JavaScript soup, I could simply pass my data objects to the template and have panels populated automatically. Most of the vote detail display is done this way.
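
A stripped-down version of that kind of template function might look like the sketch below. It is not the code from the project: the function name, tag syntax details, and sample template are assumptions, and it does no HTML escaping. The string is hard-coded so the example runs outside a browser; the comment shows where the template text would normally come from.

```javascript
// Replace {{key}} tags in a template string with values from an optional hash.
// Tags with no matching key are left untouched.
function render(template, data) {
  const values = data || {};
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : match
  );
}

// In a browser, the template would live in a non-JavaScript script tag, e.g.
//   <script type="text/html" id="vote-detail">...</script>
//   const template = document.getElementById("vote-detail").innerHTML;
const voteTemplate = '<h2>{{title}}</h2><span class="count">{{dissents}} dissenting votes</span>';
console.log(render(voteTemplate, { title: "HR 1", dissents: 12 }));
```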

I also wanted to implement some kind of inheritance to simplify my code. After all, each panel in the interactive shares a lot of functionality: they're basically all lists, most of them have a cascading "close" button, and they trigger new panels of information based on interaction. Panels are managed by a (wait for it...) PanelManager singleton that handles adding, removing, and positioning them within the viewport. The panels themselves take care of instantiating and populating their descendants, but in future versions I'd like to move that into the PanelManager as well and trigger it using custom events.
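
One way to arrange that structure with plain prototypes is sketched below. Apart from the PanelManager name, everything here is hypothetical: the constructor names, the method names, and the decision to leave out positioning and the DOM entirely.

```javascript
// Base panel: shared behavior, including a cascading close.
function Panel(title) {
  this.title = title;
  this.child = null;
}
Panel.prototype.open = function (child) {
  this.child = child;
  PanelManager.add(child);
  return child;
};
Panel.prototype.close = function () {
  // Close descendants first so the cascade unwinds from the newest panel.
  if (this.child) this.child.close();
  this.child = null;
  PanelManager.remove(this);
};

// A specialized panel inherits from Panel through the prototype chain.
function MemberPanel(member) {
  Panel.call(this, member);
}
MemberPanel.prototype = Object.create(Panel.prototype);
MemberPanel.prototype.constructor = MemberPanel;

// Singleton: one shared object that tracks which panels are on screen.
const PanelManager = {
  panels: [],
  add(panel) { this.panels.push(panel); },
  remove(panel) { this.panels = this.panels.filter((p) => p !== panel); },
};

const rootPanel = new Panel("Top members");
PanelManager.add(rootPanel);
const memberPanel = rootPanel.open(new MemberPanel("Member X"));
memberPanel.open(new Panel("Votes"));
console.log(PanelManager.panels.length); // three panels open
memberPanel.close();
console.log(PanelManager.panels.length); // only the root panel remains
```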

Unfortunately, out-of-the-box JavaScript inheritance is deeply weird, and it's tangled up in the biggest flaw of the language: terrible variable scoping. I never realized how important scope is until I saw how many frustrations JavaScript's bad implementation creates (no real namespaces! overuse of the "this" keyword! closures over loop values! ARGH IT BURNS).

Scope in JavaScript is eerily like Inception: at every turn, the language drops into a leaky subcontext, except that instead of slow-motion vans and antigravity hotels and Leonardo DiCaprio's dead wife, every level change is a new function scope. With each closure, the meaning of the "this" keyword changes to something different (often to something ridiculous like the Window object), a tendency worsened in a functional library like Underscore. In ActionScript, the use of well-defined Event objects and real namespaces meant I'd never had trouble untangling scope from itself, but in JavaScript it was a major source of bugs. In the end I found it helpful, in any function that uses "this" (read: practically everything you'll write in JavaScript), to immediately cache it in another variable and then only use that variable if possible, so that even inside callbacks and anonymous functions I could still reliably refer to the parent scope.
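
The caching habit looks something like this. The constructor and property names are invented for the example; the point is only that the callback refers to the cached variable instead of "this".

```javascript
function VoteList(votes) {
  this.votes = votes;
  this.label = "dissent";
}

VoteList.prototype.describe = function () {
  // Cache the instance immediately; inside the plain function passed to
  // map() below, `this` would no longer point at the VoteList.
  var self = this;
  return this.votes.map(function (vote) {
    return self.label + ": " + vote;
  });
};

console.log(new VoteList(["HR 1", "S 2"]).describe());
```

Function.prototype.bind, or the arrow functions added to the language later, get at the same problem from a different direction, but the cached variable works everywhere.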

After this experience, I still like JavaScript, but some of the shine has worn off. The language has some incredibly powerful features, particularly its first-class functions, that the community uses to paper over the huge gaps in its design. Like Lisp, it's a small language that everyone can extend--and like Lisp, the downside is that everyone has to do so in order to get anything done. The result is a million non-standard libraries re-implementing basic necessities like classes and dependencies, and no sign that we'll ever get those gaps filled in the language itself. Like it or not, we're largely stuck with JavaScript, and I can't quite be thrilled about that.

Conclusions

This has been a long post, so I'll try to wrap up quickly. I learned a lot creating "Against the Grain," not all of it technical. I'm intrigued by the way these kinds of interactives fit into our wider concept of journalism: by operating less as story presentations and more as tools, do they represent an abandonment of narrative, of expertise, or even a kind of "sponsored" citizen journalism? Is their appearance of transparency and neutrality dangerous or even deceptive? And is that really any less true of traditional journalism, which has seen its fair share of abused "objectivity" over the years?

I don't know the answers to those questions. We're still figuring them out as an industry. I do believe that an important part of data journalism in the future is transparency of methodology, possibly incorporating open source. After all, this style of interactive is (obviously, given the verbosity on display above) increasingly complex and difficult for laymen to understand. Some way for the public to check our math is important, and open source may offer that. At the same time, the role of the journalist is to understand the dataset, including its limitations and possible misuses, and there is no technological fix for that. Yet.

15:31 x permalink
