this space intentionally left blank

April 26, 2011

Filed under: journalism»new_media»data_driven

Structural Adjustment

Here are a few challenges I've started tossing out to prospective new hires, all of which are based on common, real-world multimedia tasks:

  • Pretend you're building a live election graphic. You need to be able to show the new state-by-state rosters, as well as the impact on each committee. Also, you need to be able to show an updated list of current members who have lost their races for reelection. You'll get this data in a series of XML feeds, but you have the ability to dictate their format. How do you want them structured?
  • You have a JSON array of objects detailing state GDP data (nominal, real, and delta) over the last 40 years. Using that data, give me a series of state-by-state lists of years for each state in which they experienced positive GDP growth.
  • The newsroom has produced a spreadsheet of member voting scores. You have a separate XML file of member biographical data--i.e., name, seat, date of birth, party affiliation, etc. How would you transform the spreadsheet into a machine-readable structure that can be matched against the biodata list?

What do these have in common? They're aimed at ferreting out the process by which people deal with datasets, not asking them to demonstrate knowledge of a specific programming language or library. I'm increasingly convinced, as we have tried to hire people to do data journalism at CQ, that the difference between a mediocre coder and a good one is that the good ones start from quality data structures and build their program outward, instead of starting with program flow and tacking data on like decorations on a Christmas tree.
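
To make the second challenge concrete, here is a minimal sketch in Python. The field names ("state", "year", "delta") and the sample figures are assumptions made up for illustration, not a real feed. The point is that once you settle on the output shape (a mapping from each state to a list of years), the code mostly writes itself:

```python
import json
from collections import defaultdict

# Hypothetical input: one object per state per year. The field names and
# the numbers are invented for illustration only.
raw = """[
  {"state": "VA", "year": 2008, "nominal": 400.1, "real": 380.2, "delta": 0.4},
  {"state": "VA", "year": 2009, "nominal": 398.7, "real": 376.0, "delta": -1.1},
  {"state": "VA", "year": 2010, "nominal": 410.3, "real": 384.5, "delta": 2.3},
  {"state": "MD", "year": 2009, "nominal": 290.0, "real": 275.9, "delta": -0.6},
  {"state": "MD", "year": 2010, "nominal": 297.4, "real": 280.1, "delta": 1.5}
]"""


def growth_years(records):
    """Return {state: [years with positive GDP growth]}, years sorted ascending."""
    by_state = defaultdict(list)
    for row in records:
        if row["delta"] > 0:
            by_state[row["state"]].append(row["year"])
    return {state: sorted(years) for state, years in by_state.items()}


result = growth_years(json.loads(raw))
print(result)
```

Note that a state with no growth years simply doesn't appear in the result. Whether that's the right behavior is exactly the kind of data-structure question I'd want a candidate to raise.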

I learned this the hard way over the last four years. When I started working with ActionScript in 2007, it was the first serious programming I'd done since college, not counting some playful Excel macros. Consequently I had a lot of bad habits: I left a lot of variables in the global scope, stored data in ad-hoc parallel arrays, and embedded a lot of "magic number" constants in my code. Some of those are easy to correct, but the shift in thinking from "write a program that does X" to "design data structure Y, then write a program to operate on it" is surprisingly profound. And yet it makes a huge difference: when we created the Economic Indicators project, the most problematic areas in our code were the ones where the underlying data structures were badly-designed (or at least, in the case of the housing statistics, organized in a completely different fashion from the other tables).

Oddly enough, I think what caused the biggest change in my thinking was learning to use jQuery. Much like other query languages, the result of almost any jQuery API call is a collection of zero or more objects. You can iterate over these as if they were arrays, but the language provides a lot of functional constructs (each(), map(), filter(), etc.) that encourage users to think more in terms of generic operations over units of data (the fact that those units are expressed in JavaScript's lovely hashmap-like dynamic objects is just a bonus).

I suspect that data-orientation makes for better programmers in any field (and I'm not alone), but I'm particularly interested in it on my team because what we do is essentially to turn large chunks of data (governmental or otherwise) into stories. From a broad philosophical perspective, I want my team thinking about what can be extracted and explained via data, and not how to optimize their loops. Data first, code second--and if concentrating on the former improves the latter, so much the better.

January 5, 2011

Filed under: journalism»new_media»data_driven

Query, But Verify

About a month back, a prominent inside-the-Beltway political magazine ran a story on Tea Party candidates and earmarks, claiming that anti-earmark candidates were responsible for $1 billion in earmarks over 2010. I had just finished building a comprehensive earmark package based on OMB data, so naturally my editor sent me a link to the story and asked me to double-check their math. At first glance, the numbers generally matched--but on a second examination, the article's total double- and triple-counted earmarks co-sponsored by members of the Tea Party "caucus." Adjusting my query to remove non-distinct earmark IDs knocked about $100 million off the total--not really that much in the big picture (the sum still ran more than $900 million), but enough to fall below the headline-ready "more than $1 billion" mark. It was also enough to make it clear that the authors hadn't really understood what they were writing about.
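
For the curious, the double-counting trap looks something like the following sketch, written in Python against an in-memory SQLite database. The table layout, member names, and dollar amounts are all invented for illustration; this is not the OMB schema or the query I actually ran. The key detail is that each earmark appears once per sponsor, so a plain SUM counts co-sponsored earmarks more than once:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical layout: one row per (earmark, sponsor) pair, with the
    -- earmark's full amount repeated on every sponsor row.
    CREATE TABLE earmark_sponsors (
        earmark_id INTEGER,
        member     TEXT,
        amount     INTEGER
    );
    INSERT INTO earmark_sponsors VALUES
        (1, 'Member A', 500000),
        (1, 'Member B', 500000),
        (2, 'Member A', 250000),
        (3, 'Member A', 100000),
        (3, 'Member B', 100000),
        (3, 'Member C', 100000);
""")

# Naive total: adds the amount once per sponsor row.
naive_total = conn.execute(
    "SELECT SUM(amount) FROM earmark_sponsors"
).fetchone()[0]

# Corrected total: collapse to one row per earmark ID before summing.
distinct_total = conn.execute("""
    SELECT SUM(amount) FROM (
        SELECT earmark_id, MAX(amount) AS amount
        FROM earmark_sponsors
        GROUP BY earmark_id
    ) AS per_earmark
""").fetchone()[0]

print(naive_total, distinct_total)
```

With this toy data the naive query reports 1,550,000 while the de-duplicated one reports 850,000. The gap in the real story was proportionally smaller, but the mechanism is the same.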

In general, I am in favor of journalists learning how to leverage databases for better analysis, but it's an easy technology to misuse, accidentally--or even on purpose. There's a truism that the skills required to interpret statistics go hand in hand with the skills used to misrepresent them, and nowhere is that more pertinent than in the newsroom. Reporters and editors entering the world of data journalism need to hold onto the same critical skills they would use for any other source, not be blinded by the ease with which they can reach a catchy figure.

That said, journalists would do well to learn about these tools, especially in beats like economics and politics, if only to be able to spot their abuses. And there are three strong arguments for using databases (carefully!) for reporting: improving newsroom mathematical literacy, asking questions at modern scale, and making connections easier.

First, it's no secret that journalists and math are often uneasy bedfellows--a recent Washington Post ombudsman piece explored some of the reasons why numerical corrections are so common. In short: we're an industry of English majors whose eyes cross when confronted with simple sums, and so we tend to take numbers at face value even during the regular copy-editing process.

These anxieties are signs of a deeper problem that needs to be addressed, and there's nothing magical about SQL that will fix them overnight. But I think database training serves two purposes. First, it acclimatizes users to dealing with large sets of numbers, like treating nosocomephobia with a nice long hospital stay. Second, it reveals the dirty secret of programming, which is that it involves a lot of math process, but relatively little actual adding or subtracting, especially in query languages. Databases are a good way to get comfortable with numbers without having to actually touch them directly.

Ultimately, journalists need to be comfortable with numbers, because they're becoming an institutional hazard. While the state of government (and private-sector) data may still leave a lot to be desired from a programmer's point of view, it's practically flooded out over the last few years, with machine-readable formats becoming more common. This mass of data is increasingly unmanageable via spreadsheet: there are too many rows, too many edge cases, and too much filtering required. Doing it by hand is a pipe-dream. A database, on the other hand, is designed to handle queries across hundreds of thousands of rows or more. Languages like SQL let us start asking questions at the necessary scale.

Finally, once we've gotten over a fear of numbers and begun to take large data sets for granted, we can start using relational databases to make connections between data sets. This synthesis is a common visualization task that is difficult to do by hand--mapping health spending against immigration patterns, for example--but it's reasonably simple to do with a query in a relational database. The results of these kinds of investigations may not even be publishable, but they are useful--searching for correlation is a great jumping-off point for further reporting. One of the best things I've done for my team lately is set up a spare box running PostgreSQL, which we use for uploading, combining, searching, and then outputting translated versions of data, even in static form.
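
As a rough illustration of what "making connections" means in practice, here is a sketch using Python's built-in sqlite3 module (not our PostgreSQL box). The table names, placeholder state codes, and numbers are made up; the shape of the query is what matters. A LEFT JOIN keeps every row from the first data set, so gaps in the second one show up as empty values instead of silently disappearing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical data sets sharing (state, year) as a key.
    -- 'AA', 'BB', 'CC' and all figures are placeholders, not real data.
    CREATE TABLE health_spending (state TEXT, year INTEGER, per_capita REAL);
    CREATE TABLE immigration (state TEXT, year INTEGER, arrivals INTEGER);
    INSERT INTO health_spending VALUES
        ('AA', 2009, 100.0), ('BB', 2009, 200.0), ('CC', 2009, 300.0);
    INSERT INTO immigration VALUES
        ('AA', 2009, 10), ('BB', 2009, 20);
""")

rows = conn.execute("""
    SELECT h.state, h.year, h.per_capita, i.arrivals
    FROM health_spending AS h
    LEFT JOIN immigration AS i
      ON i.state = h.state AND i.year = h.year
    ORDER BY h.state
""").fetchall()

for row in rows:
    print(row)
```

The last row comes back with None for arrivals, which is a prompt to go find out why that state is missing from the second data set before drawing any conclusions.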

As always when I write these kinds of posts, remember that there is no Product X for saving journalism. Adding a database does not make your newsroom Web 2.0, and (see the example I opened with) it's not a magic bullet for better journalism. But new technology does bring opportunities for our industry, if we can avoid the Product X hype. The web doesn't save newspapers, but it can (and should) make sourcing better. Mobile apps can't save subscription revenues, but they offer better ways to think about presentation. And databases can't replace an informed, experienced editor, but they can give those journalists better tools to interrogate the world.

January 4, 2011

Filed under: journalism»new_media»data_driven

Your Scattered Congress 2010

Once again, I present CQ's annual vote studies in handy visualization form, now updated with the figures for 2010. This version includes some interesting changes from last year:

  • Load times should now be markedly faster. I decoupled the XML parsing pseudo-thread from the framerate by allowing it to run for up to 10ms before yielding back to the VM for rendering. Previously, it processed only a single member and then waited for the next timer tick, which probably meant at least 16ms per member even on machines capable of running much faster.
  • Clicking on a dot for a single member now loads that member's CQ profile page (subscribers only). Clicking on a dot representing multiple members will bring up a table listing all members, and clicking on one of these rows (or a row in the full Data Table view) will open the profile page in a new window.
  • Tooltips now respect the boundaries of the Flash embed, which makes them a lot more readable.
  • Most importantly, the visualization now collects multiple years in a single graphic, allowing you to actually flip between 2009 and 2010 for comparison. We have plans to add data going back to at least 2003 (CQ's vote studies actually go back more than 50 years, but the data isn't always easy to access). When that's done, you'll be able to visually observe shifts in partisanship and party unity over time.

Notably not changed: it's still in Flash. My apologies to the HTML5 crowd, but the idea of rendering and interacting with more than 500 alpha-blended display objects (four-fifths of which may be onscreen at any time), each linked to multiple XML collections, is not something I really consider feasible in cross-browser JavaScript at this time.
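
The load-time change in the first bullet above is a general technique, so here is a minimal sketch of it in Python rather than ActionScript. The function names and the fake "parsing" step are assumptions for illustration, not the code in the visualization. Instead of handling one item per timer tick, each slice keeps working until a time budget (10ms here) is spent, then hands control back:

```python
import time


def parse_members(raw_members):
    """Generator that handles one member per step (stand-in for the XML work)."""
    for raw in raw_members:
        yield {"name": raw.strip().title()}


def run_slice(parser, results, budget_seconds=0.010):
    """Pull items from the parser until the time budget is spent.

    Returns True when the parser is exhausted, False if there is more to do.
    """
    deadline = time.perf_counter() + budget_seconds
    while time.perf_counter() < deadline:
        try:
            results.append(next(parser))
        except StopIteration:
            return True
    return False


raw = ["  alice smith ", "bob jones", " carol wu "] * 1000
parser = parse_members(raw)
parsed = []
slices = 0
done = False
while not done:
    done = run_slice(parser, parsed)
    slices += 1
    # A real application would yield to the renderer here before continuing.

print(len(parsed), "members parsed in", slices, "slice(s)")
```

On a fast machine the whole list finishes in a handful of slices; on a slow one it takes more, but the display never waits longer than roughly one budget for its turn.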

The vote studies are one of those quintessentially CQ products: reliable, wonky, and relentlessly non-partisan. We're still probably not doing justice to it with this visualization, but we'll keep building out until we get there. Take a look, and let me know what you think.

October 18, 2010

Filed under: journalism»new_media

How J-Schools Are Failing New Media 101

If you're interested in working in data-driven journalism, or you know someone who is, my team at CQ is hiring. You can check out the listing at Ars. For additional context, this opening is for the server-side/database role on the team--someone who can set up a database for a reporting project, mine it for relevant data, and then present that information to either the newsroom or the public as a modern, standards-compliant web page.

To be honest, we're having a really difficult time filling this position. It's an odd duck: we need someone who's comfortable with computer science-y stuff like data structures and SQL, but also someone who can apply those skills towards journalism, which has its own distinct character traits: news sense, storytelling, and a peculiar tendency to pull at intellectual loose ends. A tough combination to begin with, even without taking into account the fact that anyone with both aptitudes can probably make a lot more money with the former than with the latter. So let's add a third requirement: they've got to be a true believer about what we do here.

As far as I can tell, the most reliable way to get someone with these three traits is to start with a journalist, then teach them how to code. In theory, that should be exactly what happens in a journalism school's "new media" or "interactive" program. And yet my experience with graduates of these MA programs is that they're woefully unprepared for the job my team is trying to do.

I should note here, I think, that I never attended J-school myself. GMU didn't have a journalism program, and I ended up in a different specialization in the communication department anyway. So it's possible that I'm a little bitter, given that I had to work my way into the news business via extensive freelancing, entry-level web production, and a lot of bloody-minded persistence. But I think my gripes are reasonable, and they're shared with coworkers from more traditional journalistic backgrounds.

Here's the crux of the problem, as I see it: programs in new media journalism are still teaching the Internet in the context of traditional print or television news, which stalls their graduates in two ways. First, it means the programs approach online media as outsiders, teaching classes in "blogging for journalists" or "media website design" as if they were alien artifacts to be unpuzzled instead of the native publishing platform for a whole generation now. It's the web, people: it's not going anywhere, and it's not something you should have to spend a semester introducing to your students. A whole class on blogging isn't education--it's coddling.

Second, these schools seem to be too focused on specific technologies or platforms instead of teaching rudimentary, generalizable computer engineering. There are classes on Flash, or on basic HTML, or using a given blog platform--and those are all good skills to have, but they're not sufficient. What we really need are people who know the general principles behind those skills: how do you structure data effectively for the story? How do you debug something? What's object-oriented design? Technology moves so fast in this business, someone without those fundamentals won't be able to keep up with the pace of change we need to maintain.

Maybe I'm just hardcore, but when I look at something like the Medill Graduate Curriculum (just to pick on someone at random), the interactive track looks lightweight to me. There's a lot of emphasis on industry inside baseball ("How 21st Century Media Works" or "Building Networked Audiences"), and not nearly enough on getting your hands dirty. "Digital Frameworks for Reporting" is only taught in DC? (Are government websites not available in Chicago?) "Database Reporting" is an optional elective? Not a single class taken from the graduate or undergraduate computer science curriculum, like "Fundamentals of Computer Programming I?" It looks to me like a program where you could emerge as a valuable data journalist, but it's just as likely that you'd be another Innovation Editor. And trust me, the world does not need any more of those.

I sympathize with the people who have to design these programs, I really do. The web is a big topic to cover. And worse, it's hard to teach people how to think critically--to understand how they think, instead of just telling them what to think--but good programming has a lot in common with that level of metacognition. For the kind of data journalism we're trying to do at CQ, you've got to at least be able to think a little like a programmer, a little like a journalist, and a little like something new. If you think you can do that, we'd love to hear from you.

October 5, 2010

Filed under: journalism»new_media

CQ Economy Tracker

I don't know how long this'll be available to the general public, so take a look while you can: CQ Economy Tracker (formerly the Economic Indicators project) is now live. It's the product of more than a year of off-and-on development, and I'm thrilled to finally have it out in the wild.

Economy Tracker collects six big economic data sets (GDP, inflation, employment and labor, personal income and savings, home sales and pricing, and foreclosure rates) across the national, regional, and state levels, extended back as far as we could get data--sometimes almost a hundred years. The data is graphed, mapped, available in a sortable table, and also made available as Excel spreadsheets. As far as we're aware, we're the only organization that's collecting all of this information and putting it together in one easy-to-read package. It's a great resource for our own reporters when they go looking for vetted economic data, as well as a handy tool for readers.

But more than that, Economy Tracker has been my team's bid for some fundamental ideas about data journalism. The back end is a fairly simple PHP/PostgreSQL database, with the emphasis on A) making it easy for non-technical reporters to update by accepting Excel spreadsheets in a very tolerant way, and B) returning results in the web-standard JSON format for consumption by either Flash or JavaScript. The current dashboard applet is a full-service showcase for the collection, but using a standards-based API, it should be easy for my team to build new visualizations based on our economic data--including smaller, single-purpose widgets or mash-ups with political or demographic data--or for our customers and readers to do so.
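
To give a flavor of what "tolerant" means, here is a small sketch in Python (our back end is PHP, and this is not its actual importer). It assumes the spreadsheet has already been read into a list of rows, with a header row first; the column names and values are invented. The idea is to forgive stray whitespace, currency symbols, thousands separators, and blank rows, and to turn anything unparseable into null rather than rejecting the upload:

```python
import json


def is_blank(cell):
    """True for None or a cell containing only whitespace."""
    return cell is None or str(cell).strip() == ""


def clean_number(cell):
    """Turn '$1,234.5', ' 12 ', or 7 into a float; blanks and junk become None."""
    if is_blank(cell):
        return None
    text = str(cell).strip().replace("$", "").replace(",", "").replace("%", "")
    try:
        return float(text)
    except ValueError:
        return None


def rows_to_json(rows):
    """Convert header + data rows into a JSON array of objects.

    The first column is treated as a text label; the rest as numbers.
    """
    header = [str(h).strip().lower().replace(" ", "_") for h in rows[0]]
    records = []
    for row in rows[1:]:
        if all(is_blank(cell) for cell in row):
            continue  # skip spacer rows that people leave in spreadsheets
        record = {header[0]: str(row[0]).strip().upper()}
        for name, cell in zip(header[1:], row[1:]):
            record[name] = clean_number(cell)
        records.append(record)
    return json.dumps(records)


# Hypothetical sheet contents, messy in the ways real ones tend to be.
sheet = [
    ["State ", "Year", "Real GDP"],
    [" va", "2009", "$1,234.5"],
    ["", "", ""],
    ["md ", 2009, "n/a"],
]
payload = rows_to_json(sheet)
print(payload)
```

The trade-off is that a tolerant importer can hide genuine data-entry mistakes, so in practice you'd probably want to report which cells were nulled out rather than dropping them silently.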

I think the last few years have shown how this strategy--building a news API for both internal and external use--has had real benefits for the newsrooms that have boldly led the way, like NPR and the New York Times. Not only does it engage the segment of the audience that's willing to dig into their data (free publicity!), but it grants newsroom developers a fleetness of foot that's hard to beat. It's a lot easier, for example, for NPR to turn on a dime and toss off a tablet-optimized website, or create a new native mobile client, because their content is already mostly decoupled from presentation and available in a machine-readable format. That's kind of a big deal, especially as we wait to see how this whole mobile Internet thing is going to shake out.

Whether or not this approach takes off, I'm enormously proud of the work that my team has done on this project. It's been a massive undertaking: building our own custom graphing framework, creating an internal event scheme for coordinating the two panels (pick a year on the National pane and it synchronizes with the Regional/State pane, and vice versa), and figuring out how to remain responsive while still displaying up to 40,000 rows of labor statistics (a combination of caching and delayed processing). Most importantly, the Economy Tracker stands as a monument to a partnership between the multimedia team, researchers, and our economics editor, in the best tradition of CQ journalism.
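
The pane-synchronization idea is simple enough to sketch in a few lines of Python. This is an illustration of the general pattern, not our ActionScript event scheme; the class names and the "year_selected" event name are invented. Each pane publishes its selection to a shared bus and listens for selections made elsewhere:

```python
class EventBus:
    """Minimal publish/subscribe hub shared by all panes."""

    def __init__(self):
        self._listeners = {}

    def subscribe(self, event_name, callback):
        self._listeners.setdefault(event_name, []).append(callback)

    def publish(self, event_name, payload, source=None):
        for callback in self._listeners.get(event_name, []):
            callback(payload, source)


class Pane:
    """A display pane that tracks the currently selected year."""

    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.year = None
        bus.subscribe("year_selected", self.on_year_selected)

    def pick_year(self, year):
        """Called when the user picks a year in this pane."""
        self.year = year
        self.bus.publish("year_selected", year, source=self)

    def on_year_selected(self, year, source):
        # Skip events we published ourselves; this pane is already up to date.
        if source is not self:
            self.year = year


bus = EventBus()
national = Pane("National", bus)
regional = Pane("Regional/State", bus)

national.pick_year(1995)
print(regional.year)
regional.pick_year(2008)
print(national.year)
```

Because neither pane holds a reference to the other, adding a third view later only requires subscribing it to the same bus.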

September 21, 2010

Filed under: journalism»new_media

We Choose Both

So you're a modern digital media company, and you want to present some information online. The fervor around Flash has died down a little bit--it started showing up on phones and somehow that wasn't the end of the world, apparently--but you're still curious about the choice between HTML and Flash. What technology should you use for your slideshow/data visualization/brilliant work of explainer journalism? Here's my take on it: choose both.

You don't hear this kind of thing much from tech pundits, because tech pundits are not actually in the business of effectively communicating, and they would prefer to pit all technologies against each other in some kind of far-fetched, traffic-generating deathmatch. But when it comes to new media, my team's watchword is "pragmatism." We try to pick the best tools for any given project, where "best" is a balance between development speed, compatibility, user experience, and visual richness. While it's true, for example, that you can often create the same kind of experience in HTML5* as in Flash, both have strengths and weaknesses. And lately we've begun to mix the two together within a single package--giving us the best of both worlds. It's just the most efficient way to work, especially on a team where the skillsets aren't identical from person to person.

What follows are some of the criteria that we use to pick our building blocks. None of these are set in stone, but we've found that they offer a good heuristic for creating polished experiences under deadline. And ultimately that--not some kind of ideological browser purity test--is all we care about.

Animation and Graphics

Long story short, if it has an animation more complicated than jQuery.slideDown(), we use Flash. HTML animation has become more and more sophisticated, but it's still not as smooth as Flash's 2D engine. More importantly, performance can vary widely from browser to browser: what runs brilliantly in Chrome is going to chug along in IE or (to a lesser extent) Firefox. One of the big advantages of Flash is that speed is relatively constant between browsers, even on expensive operations like BitmapFilters and alpha transparency.

Likewise, anything that involves generating arbitrary shapes and moving them around a canvas is a strong candidate for Flash. This is especially true for any kind of graphing or for flashy bespoke UIs. It's possible to create some impressive things with CSS and HTML, especially if you throw caution to the wind and use HTML5's canvas tag, but it's slower and requires a lot more developer time to get polished results across browsers. A lot of this comes down to the APIs that ActionScript exposes. Once you've gotten used to having a heavily-optimized 2D display tree and event dispatcher, it's hard to go back--and there's definitely no way I'm going to try to train a team of journalists how to push and pop canvas transformations.

Text

On the other hand, if we're looking for the best text presentation, we go with HTML every time. While it's true that Flash has support for a wider range of embedded fonts, they've been tricky to debug properly, and Flash text handling otherwise has always left a lot to be desired. It's anti-aliased poorly, doesn't wrap or reflow well, and is trapped in the embed window regardless of length. Also, its CSS implementation is weird and frustrating, to say the least. Even if our text is originally loaded in Flash, we increasingly toss it over to HTML via the ExternalInterface for rendering.

Where this really becomes a painful issue is when dealing with tabular data. Flash's DataGrid component is orders of magnitude faster than JavaScript when it comes to sorting, filtering, and updating large datasets, but it comes with a lot of limitations: rows must be uniform in height, formatting is wonky, and nobody's happy with the mousewheel behavior. If you're a genius in one runtime or the other, you can mitigate a lot of its weaknesses with clever hacks, but who has the time? We usually make our choice based on size: anything up to a couple hundred rows goes into HTML, and everything else gets the Flash treatment.

Speed

In some cases, particularly the new JIT-enabled browser VMs, JavaScript may already be faster than ActionScript. But the key is "some cases," since most browsers are not yet running those kinds of souped-up interpreters. In my experience, heavy number-crunching works better in Flash--to the extent that it should be done on the client at all. We try to handle most of our computational work on the server side in PHP and SQL, where the results can be done once and then cached. For something like race ratings, this works pretty well. In the rare cases that we do need to burn a lot of cycles on the client side, Flash is often the best way to get it done without script timeouts in older browsers.

I also think Flash is easier to optimize, but that probably has to do with my level of experience, and we don't usually make decisions based on voodoo optimization techniques. My personal take is that client-side speed is only a priority if it impacts responsiveness, which is primarily a UX problem. We have run into problems with delays in response to user input on both technologies, and the solution is less about raw speed and more about giving good user feedback. We also use strategies like lazy loading and caching no matter where we're coding--they're just good practice.
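
Lazy loading and caching are language-agnostic, so here is a minimal Python sketch of the combination. The class, the loader function, and the data set names are hypothetical stand-ins: nothing is fetched until someone asks for it, and nothing is fetched twice.

```python
class LazyCache:
    """Fetch each item on first request only, then serve it from memory."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable that does the expensive load
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._fetch(key)
        return self._cache[key]


fetch_log = []


def slow_fetch(table_name):
    """Stand-in for a network request or a heavy computation."""
    fetch_log.append(table_name)
    return "rows for " + table_name


tables = LazyCache(slow_fetch)
first = tables.get("employment")
second = tables.get("employment")   # served from the cache, no second fetch
other = tables.get("housing")
print(fetch_log)
```

The same structure works on the server or the client; the only real decision is when (if ever) a cached entry should be thrown away.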

XML and JSON

This is another minor factor, since we're in control (usually) of our own data formats here, but it's worth considering if all else is equal. Flash has excellent native XML support, but its JSON library (from Adobe's core library package) proved slow for us when loading more than a few thousand rows from a database. JavaScript obviously has good JSON support, but I always dread using it for XML. We've gradually started moving to JSON for both, because we're trying to set a good example for web API design at CQ, and it seems like the lesser of two evils.

It should be noted that one of the primary roles of XML and JSON in the browser is in AJAX-style web apps, and Flash does have a real advantage in this area: it can do cross-domain HTTP requests in all browsers, as opposed to JavaScript's heavy-handed sandboxing.

Code Reuse

There are doubtless tools and techniques for building reusable JavaScript components and APIs, but at the end of the day it's just been easier to do for our Flash/Flex projects. The combination of namespaces, traditional object inheritance, and a more consistent API means that it's easier to get my team members up to speed, and we now have a small library of reusable ActionScript components for graphing, slideshows, mapping, and data display. So far, my experience is that when we build a Flash project, if done properly, the code ends up being pretty portable by default. Mastering reusable JavaScript, on the other hand, seems to require deep knowledge of things like closures and scope, and those don't come easy to most journalists-turned-coders.

I really can't overstate how important this is for our team. Like most newsroom multimedia teams, we're understaffed relative to the workload we'd really like to have. We don't really want to sink time into one-off projects, so any time we have a chance to recycle code, we take it. An additional bonus is that we can build these reusable components to fit the CQ look and feel, and it's easier to pitch a presentation to an editor if we can point to something similar we've done in the past.

Video

Video is an interesting case, and one that's representative of a mature approach to new media planning. I would say that we use a lot of JavaScript to place video on the page--but that video is typically a Flash embed from YouTube or a content delivery network. We're a long way away from a world of pure video tags.

In general, my time at B-SPAN taught me this about online video: if you're not a video hosting company, you should be hiring someone else to take care of it for you. Video is too high-bandwidth, too high-maintenance, and too finicky for non-experts to be managing it. And I think the HTML5 transition only proves that to be the case in the browser as well. Vimeo and Brightcove (just to pick two) will earn their money by working out ways for you to upload one file and deliver it via <video> or Flash on a per browser basis, freeing you up to worry about the bigger picture.

Mobile

Mobile is, of course, where this whole controversy got started, but I think most of the debate revolves around a straw man. Current mobile devices restrict your use of hover events (no more tooltips!), they limit the screen to a tiny keyhole view, and they require UI elements to be much larger for finger-friendliness. That's true for HTML and Flash both. The idea that HTML5 interactives can present a great experience on both desktop and mobile browsers without serious alterations is ridiculous--you're going to be doing two versions anyway if you want decent usability. So while it depends on your situation, I don't think of this as a Flash vs. HTML5 question. It's more like a desktop vs. mobile question, and the vast majority of our visitors still come in through a desktop browser, so that's generally what we design for.

That said, here's my prediction: Flash on Android is good enough, and is going to be common enough in a year or two, that I can easily see it being used on mobile sites going forward. Apple probably won't budge on their stance, meaning that Flash won't be quite as ubiquitous as it is on the desktop. But if small teams like mine find ourselves in a situation where Flash is a much better choice for the desktop and a sizeable chunk of smartphones, it won't be unusual--or unreasonable--to make that trade-off.

Powers of Two

But really, why should anyone have to choose either Flash or HTML5? I mean, isn't the ability to mix and match technologies a key part of modern, Web 2.0 design? In a day and age where you've got servers written in LISP, PHP, Ruby, C, and .Net all talking to each other, sometimes on the same machine, doesn't it seem a little old-fashioned to be a purist about the front-end? Whatever happened to "use the right tool for the right job?"

The key is to understand that you can choose both--that ActionScript and HTML actually make a great combination. By passing data across the ExternalInterface bridge, you can integrate Flash interactives directly into your JavaScript. Flash can transfer text out to be displayed via HTML. JavaScript can pass in data to be graphed, or can provide accessible controls for a rich media SWF component. If you code it right, ActionScript even provides a great drop-in patch for HTML5 features like <canvas>, <video>, and <audio> in older browsers.

The mania for "pure HTML" reminds me of the people in the late 90's who had off-grey websites written in Courier New "because styling is irrelevant, the text is the only thing that matters." If Flash has a place on the page, we're going to use it. We'll try to use it in a smart way, mixing it into an HTML-based interactive to leverage its strengths and minimize its weaknesses. But it'd be crazy to make more work for ourselves just because it's not fashionable to code in ActionScript these days. Leave that for the dilettantes--we're working here.

July 29, 2010

Filed under: journalism»new_media

Achievement Unlocked: Dog Bitten

At the DICE 2010 conference, a guy named Jesse Schell gave a speech about bringing reward systems from gaming (achievements, trophies, etc.) into real life as a motivational system. You've probably seen it--if you haven't, you can watch it and read designer David Sirlin's comments here.

Essentially, Schell lays out a future where there's a system that awards "points" for everyday tasks, ranging from the benign (brushing your teeth, using public transit) to the insidious (buying products, taking part in marketing schemes). Sometimes these points mean something (tax breaks, discounts), and sometimes they don't (see also: XBox GamerPoints). You can argue, as Jane McGonigal does, that this can be beneficial, especially if it leads to better personal motivational tools. I tend more towards the Sirlin viewpoint--that it's essentially a dystopia, especially once the Farmville folks latch onto it.

(The reason that I think it's inevitably dystopian, besides the obvious unease around the panopticon model, is that a reward system would inevitably be networked. And if it's networked and exploitable, you'll end up with griefers, of either the corporate spam variety or the regular 4chan kind. It's interesting, with Facebook grafting itself more and more onto the rest of the Internet, that social games have not already started using the tools of alternate reality gaming--ARGs--to pull people in anyway. Their ability to do so was probably delayed by the enormous outcry over debacles like Facebook's Beacon, but it's only a matter of time.)

That said, as an online journalist, I also found the idea a little intriguing (and I'm thinking about adding it to my own sites). Because here's the thing: news websites have a funding problem, and more specifically a motivation problem. As a largely ad-funded industry, we've resorted to all kinds of user-unfriendly strategies in order to increase imperfect (but ad network-endorsed) metrics like pageviews, including artificial pagination and interstitial ads. The common thread through all these measures is that they enforce certain behaviors (view n number of pages per visit, increase time on site by x seconds) via the publishing system, against the reader's will. It feels dishonest--from many points of view, it is dishonest. And journalism, as much as or more than any other profession, can't survive the impression of dishonesty.

An achievement system, while obviously manipulative, is not dishonest. The rules for each achievement are laid out ahead of time--that's what makes them work--as are the rewards that accompany them. It doesn't have to be mandatory: I rarely complete all achievements for an XBox game, although I know people who do. More importantly, an achievement system is a way of suggesting cultural norms or desired behavior: the achievements for Mirror's Edge, for example, reward players for stringing together several of the game's movement combos. Half Life 2 encourages players to use the Gravity Gun and the environment in creative ways. You can beat either one without getting these achievements, but these rewards signal ways that the designers would like you to approach the journey.

And journalism--along with almost all Big Content providers--is struggling with the problems of establishing cultural norms. This includes the debate over allowing comments (with some papers attempting paid, non-anonymous comment sections in order to clean things up), user-generated content (CNN's iReport, various search-driven reporting schemes), and at heart, the position and perception of a newspaper in its community, whatever that might be. It's not your local paper anymore, necessarily. So what is it to you? Why do you care? Why come back?

Achievements might kill multiple birds with one stone. They provide a way to moderate users (similar to Slashdot's karma) and segregate access based on demonstrated good behavior. They create a relationship between readers and their reading material. They link well with social networks like Facebook and Twitter. And most importantly, they give people a reason to spend time on the site--one that's transparently artificial, a little goofy, and can be aligned with the editorial vision of the organization (and not just with the will of advertisers). You'd have several categories of achievements, each intended to drive a particular aspect of site use: social networking, content consumption, community engagement, and random amusements.

Here's a shallow sampling of possible News Achievements I could see (try to imagine the unlocked blip before each one):

  • Renaissance Reader: read three articles from each main section in one day.
  • Soapbox Derby Winner: have ten comments "starred" by the editorial staff.
  • Gone Viral: share four articles via the social networking widget.
  • Fear and Loathing: read ten articles from the Travel section.
  • Capstone of the Inverted Pyramid: read an article from start to finish.
  • News Cycle: accumulate 24 hours of time on the site.
  • Your Civic Duty: take part in one of the site's daily surveys.
  • Stop the Presses!: submit a correction that's subsequently accepted and issued by the editors.
  • Daily Routine: visit the site every workday for a month.
  • Citizen Journalism: have user-generated content published on the front page, or used as the basis of a story.
  • Extra, Extra! #N in an infinite series: each week, a staffer writes a completely arbitrary, random achievement--something weird or funny in the vein of a scavenger hunt, using several stories in the paper as clues.
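
To make the idea a little more concrete, here's a rough Python sketch of how a handful of these might be expressed as data: each achievement is a name, one of the categories above, and a rule checked against a reader's activity. Everything in it (the ReaderActivity counters, the field names, the unlocked function) is hypothetical and invented for illustration, not something any real site exposes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ReaderActivity:
    """Hypothetical per-reader counters a news site might keep."""
    articles_by_section: Dict[str, int] = field(default_factory=dict)
    shares: int = 0
    starred_comments: int = 0
    surveys_taken: int = 0


@dataclass(frozen=True)
class Achievement:
    name: str
    category: str
    # The rule is published up front: it takes the reader's activity
    # and says whether the achievement has been earned.
    rule: Callable[[ReaderActivity], bool]


# A few entries from the list above, with thresholds taken from their descriptions.
ACHIEVEMENTS: List[Achievement] = [
    Achievement("Gone Viral", "social networking",
                lambda a: a.shares >= 4),
    Achievement("Fear and Loathing", "content consumption",
                lambda a: a.articles_by_section.get("Travel", 0) >= 10),
    Achievement("Soapbox Derby Winner", "community engagement",
                lambda a: a.starred_comments >= 10),
    Achievement("Your Civic Duty", "community engagement",
                lambda a: a.surveys_taken >= 1),
]


def unlocked(activity: ReaderActivity) -> List[str]:
    """Return the names of every achievement this reader has earned."""
    return [ach.name for ach in ACHIEVEMENTS if ach.rule(activity)]


reader = ReaderActivity(articles_by_section={"Travel": 10}, shares=4)
print(unlocked(reader))
```

Keeping the rules in a plain list like this is one way to let an editor add the weekly "Extra, Extra!" entry without touching the rest of the system.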

Is this a little ridiculous? Sure. But is it better than a lot of our existing strategies for adapting journalism to the online world? I think it might be. Despite the changes in the news landscape, we still tend to think of our audiences as passive eyeballs that we need to forcibly direct. The most effective Internet media sites, I suspect, will be the ones that treat their audiences as willing participants. And while not everyone has noble intentions, the news is not the worst place to start leveraging the psychological lessons of gaming.

June 9, 2010

Filed under: journalism»new_media

org.npr.android.news

Perhaps this is irony, coming on the heels of the previous post, but I'd like to announce that the first version of the NPR client for Android incorporating my patches has gone live. You can find it in the Market, or see it on App Brain. I get a credit and everything, as the second coder on the project. I'm pretty thrilled.

I got involved because, in keeping with the open-source spirit behind Android itself, NPR has released the source for the client at a Google Code repository under the Apache license. You can download it for yourself, if you'd like (you'll need an API key to compile, though). The NPR team would love to have contributions from other coders, designers, or even just interested listeners. You can hit them up via @nprandroid on Twitter, or send an e-mail to the app's feedback address.

This version mainly splits playback off into a background service with a notification, which is a better user experience and means the stream won't be killed if you leave the application with the Back button. We've got another version in the works that improves this functionality, incorporates some little UI tweaks, and lays the groundwork for home screen widgets. I'd like to thank Corvus for his help in spotting areas where the Android client needs improvement. The NPR design team is also finishing up an overhaul of the look-and-feel of the application, and hopefully we can get that out soon. Along with taking care of bug fixes and project cleanup, that's my priority as soon as existing revisions are cleared.

April 14, 2010

Filed under: journalism»new_media

On Script

If you ask me to describe a reporter's tools, I admit that what leaps to mind is more than a little hackneyed. Pen and pad, maybe? One of those goofy hats with the press pass lodged in the band? Typewriter (seriously, not even a laptop)? Thanks, 1930s screwball newsroom comedies! But in my day job, I can't afford to be a romantic about the newsroom's toolkit--we're far enough behind as it is. And I'd argue that when you think of newsgathering in the near future, there are a few other players you should consider: scripting languages like Perl, Python, Ruby, JavaScript, and Visual Basic.

I'm biased, of course, as someone who's interested in what I call data-driven journalism. But the way I see it, the basic task of journalism is to ask questions, and with more data than ever being made available by governments, non-profits, corporations, and individuals, it becomes difficult to answer those questions--or even to know where to start--unless you can leverage a computer's ability to filter and scale.

For example: our graphics reporter is pulling together some information regarding cloture over the last century. She's got a complete list of all the motions filed since the 66th Congress (Treaty of Versailles in 1919!). Getting a count of motions from the whole set with a given result is easy with Excel's COUNTIF function, but how do we get a count of rejected motions by individual Congress? You could do it by manually filtering the list and noting the results, or you could write a new counting function (which we then extended to check for additional criteria--say, motions which were rejected by the majority party). The latter only takes about 10 lines of code, and it saves a tremendous amount of tedium. More importantly, it lets her immediately figure out which avenues of analysis would be dead ends, and concentrate our editorial efforts elsewhere.
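
A counting function along those lines might look something like this Python sketch. It isn't the original code, and the field names ("congress", "result", "majority_opposed") and the sample rows are made up purely for illustration. The optional extra argument stands in for the "additional criteria" extension.

```python
from collections import Counter

# Made-up sample rows, NOT real cloture data. The keys are hypothetical
# column names standing in for whatever the spreadsheet actually uses.
motions = [
    {"congress": 66, "result": "rejected", "majority_opposed": True},
    {"congress": 66, "result": "agreed to", "majority_opposed": False},
    {"congress": 67, "result": "rejected", "majority_opposed": False},
    {"congress": 67, "result": "rejected", "majority_opposed": True},
    {"congress": 68, "result": "withdrawn", "majority_opposed": False},
]


def count_by_congress(rows, result, extra=None):
    """Count motions with the given result, grouped by Congress.

    `extra` is an optional function applied to each row; when supplied,
    only rows for which it returns True are counted.
    """
    counts = Counter()
    for row in rows:
        if row["result"] == result and (extra is None or extra(row)):
            counts[row["congress"]] += 1
    return counts


rejected = count_by_congress(motions, "rejected")
rejected_by_majority = count_by_congress(
    motions, "rejected", extra=lambda row: row["majority_opposed"]
)
```

The grouping is the part COUNTIF doesn't give you for free: one pass over the list produces a count for every Congress at once.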

We also do a fair amount of page-scraping here--sometimes even for our own data, given that we don't always have an API for a given database field. I'm trying to get more of our economic data loaded this way--right now, one of our researchers has to go out and get updates on the numbers from various sources manually. That's time they can't spend crunching those numbers for trends, or writing up the newest results. It's frustratingly inefficient, and really ought to be automated--this is, after all, exactly what most scripting languages were written to do.
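
As a minimal sketch of what that kind of scraping involves, here's a Python example that pulls the rows out of an HTML table using only the standard library's html.parser module. The sample page, its column names, and its values are placeholders invented for the example. A real scraper would fetch the page first and would need to cope with much messier markup.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows in `html` as a list of lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Placeholder page with invented values, standing in for a real source.
SAMPLE_PAGE = """
<table>
  <tr><th>Indicator</th><th>Value</th></tr>
  <tr><td>Example A</td><td>1.5</td></tr>
  <tr><td>Example B</td><td>2.0</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_PAGE)
```

Once the numbers are in a list of rows, loading them into a spreadsheet or database is the easy part.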

It's true that these are all examples of fairly narrow journalism--business and economic trends, specific political analysis, metatextual reporting. Not every section of the paper will use these tools all the time, and I'm not claiming that old style, call-people-and-harass-them-for-answers reporting will go away any time soon. But I've been thinking lately about the cost of investigative reporting, and the ways that computer automation could make it more profitable. Take ProPublica's nursing board investigation, for example. It's a mix of traditional shoe leather reporting and database pattern-matching, with the latter used to direct the former. Investigative reporting has always been expensive and slow, but could tools like this speed the process up? Could they multiply the effectiveness of investigative reporters? Could they revive the ability of local papers to act as watchdogs for their regional governments and businesses?

Well, maybe. There are a lot of reasons why it wouldn't work right now, not the least of which is the dependence of data-driven journalists on, well, data. It assumes that the people you're investigating are actually putting information somewhere you can get to it, and that the data is good--or that you have the skills and sufficient signal to distinguish between good data and bad. If I imagine trying to do this kind of thing out where my parents live in rural Virginia (a decent acid test for local American news), I'd say it's probably not living up to its potential yet.

But I think that day is coming. And I'm not the only one: Columbia just announced a dual-degree masters program in journalism and computer science (Wired has more, including examples of what the degree hopes to teach). To no small degree, the pitch for developing these skills isn't just a matter of leveraging newsroom time efficiently. It's more that in the future, this is how the world will increasingly work: rich (but disconnected) private databases, electronic governmental records, and interesting stories buried under petabytes of near-random noise. Journalists don't just need to learn their way around basic scripting because it's a faster way to research. They may need it just to keep up.

March 9, 2010

Filed under: journalism»new_media

Forget About It

Via Aleks Krotoski, web developer Jeremy Keith discusses the "truism" that The Internets Never Forget:

We seem to have a collective fundamental attribution error when it comes to the longevity of data on the web. While we are very quick to recall the instances when a resource remains addressable for a long enough time period to cause embarrassment or shame later on, we completely ignore all the link rot and 404s that is the fate of most data on the web.

There is an inverse relationship between the age of a resource and its longevity. You are one hundred times more likely to find an embarrassing picture of you on the web uploaded in the last year than to find an embarrassing picture of you uploaded ten years ago.

From there, Keith muses a bit on domain names, which are rented (through registrars overseen by ICANN) rather than owned: you can own your data (or own your name), but you can't own your domain in perpetuity. We've been dealing with content-management questions a bit at work lately, as any news organization transitioning from print to web must, so this kind of thing has been on my mind anyway. And I've reached the point, personally, where I take a fairly radical stance: not just that the web does lose content over time, but that it should do so. Permanence is unrealistic, if not actively harmful.

Now, I say this as someone who likes URLs, and who believes that basic URL literacy is not too much to expect from people. I also think URLs should be stable for a reasonable period of time--inversely proportional to their depth in the directory tree, for example, so that "www.domain.com/stories" should be much more stable than "www.domain.com/content/stories/about/buildings/and/food.html" or something like that. But the idea that you can have URLs that are stable forever? Or that you should expect all content to be equally preservation-worthy? That's just foolish.
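
For what it's worth, the "inversely proportional to depth" rule of thumb is easy to put into code. The Python sketch below counts the path segments in a URL and turns that into a relative stability weight of 1/depth. The function names and the weighting itself are my own illustration of the idea, not any kind of standard.

```python
from urllib.parse import urlsplit


def path_depth(url):
    """Count the non-empty path segments in a URL."""
    # urlsplit only recognizes the host when a scheme or leading "//" is
    # present, so add one for bare addresses like "www.domain.com/stories".
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


def relative_stability(url):
    """Weight a URL's expected stability as 1/depth (an illustrative rule).

    The site root, with no path segments, is treated as depth 1.
    """
    return 1 / max(path_depth(url), 1)


shallow = "www.domain.com/stories"
deep = "www.domain.com/content/stories/about/buildings/and/food.html"
print(path_depth(shallow), path_depth(deep))
```

By this measure the section front gets six times the weight of the deeply nested story page, which is about the right spirit: promise a lot for the former, and very little for the latter.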

Take a news organization, for example. Your average news site produces a truly galling amount of content every day: not just text stories, but also images, video, audio, slideshows, interactives, and so on. Keeping track of all of this is a monumental task, and the general feeling I get is that these companies are failing miserably at it. I cannot think of a single newspaper website (including CQ, no less) where it is easier for me to find a given item through their own navigation or search than it is to go to Google and type "washington post mccain obama timeline" (to pick a recent example).

And that's not a bad thing. Google spends a lot of time learning how to read your mind (effectively). They (and their competitors at Bing, or wherever) employ a lot of smart people to do nothing but help you find what you're looking for, even if you don't spell it right or if the URL has changed. I say, let them do that. If it were up to me, I'd replace every in-site search engine with a custom Google query and then forget about it: the results would probably be better (they could hardly be worse) and newsroom tech departments could spend their time and money on actual journalism-related activities.

The thing is, the vast majority of content (particularly in journalism) has a set lifespan, and we should respect that. The window of time when stable URLs are crucial is limited to a couple of months or so: enough time for bloggers (micro- or otherwise) and social networkers to discuss those rare few articles that catch on with the Internet audience and have legs. After that, searchability is more important than stability, because people aren't going to dig up the old links. They're going to locate what they want via someone else's search engine. That's if they search for it at all, of course. Because realistically speaking, most news has little in the way of legs, especially on the Internet where readers expect breaking stories to adopt a blog-style hit-and-run update pattern. It's intensely valuable for about a day, and then it's digital hamster-cage lining. Don't throw it away haphazardly--but don't fool yourself about its long-term value, either.

This may sound like I'm saying that we should give up on archiving. I'm not--after all, I'm the world's biggest fan of Lexis-Nexis. I simply propose that fighting linkrot can't be our top priority. When it comes to content management, my question is not "how do we store this at the same location forever?" but "how easy will it be to port this medium-term storage solution into another with a minimum of degradation for the content that actually matters?" It's that content that I really care about, not its address, because Google (or Bing, or whatever) will always be able to find the new location. That makes ease of migration much more important than URL fidelity. If you're thinking about a news CMS timeframe longer than, say, two years, I think you risk losing sight of that fact.

Ultimately, links break. Let them. Attempting to engineer for eternity is a great way to never finish building--or to lock yourself into a poor foundation when the technological ground shifts. And honestly, we're far enough behind as an industry now. We don't need to bury ourselves any more.
