I'm biased, of course, as someone who's interested in what I call data-driven journalism. But the way I see it, the basic task of journalism is to ask questions, and with more data than ever being made available by governments, non-profits, corporations, and individuals, it becomes difficult to answer those questions--or even to know where to start--unless you can leverage a computer's ability to filter and scale.
For example: our graphics reporter is pulling together some information regarding cloture over the last century. She's got a complete list of all the motions filed since the 66th Congress (Treaty of Versailles in 1919!). Getting a count of motions from the whole set with a given result is easy with Excel's COUNTIF function, but how do we get a count of rejected motions for each individual Congress? You could do it by manually filtering the list and noting the results, or you could write a new counting function (which we then extended to check for additional criteria--say, motions which were rejected by the majority party). The latter only takes about 10 lines of code, and it saves a tremendous amount of tedium. More importantly, it let her immediately figure out which avenues of analysis would be dead ends, and concentrate our editorial efforts elsewhere.
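To give a rough idea of what a counting function like that might look like, here is a minimal sketch in Python. It is not the code we actually wrote: the field names (`congress`, `result`, `majority_opposed`) and the sample rows are made up for illustration and are not real cloture data.

```python
from collections import Counter

# Hypothetical sample rows, NOT real cloture records.
motions = [
    {"congress": 66, "result": "rejected", "majority_opposed": True},
    {"congress": 66, "result": "agreed", "majority_opposed": False},
    {"congress": 67, "result": "rejected", "majority_opposed": False},
    {"congress": 67, "result": "rejected", "majority_opposed": True},
    {"congress": 67, "result": "agreed", "majority_opposed": False},
]


def count_by_congress(rows, result, extra=None):
    """Count rows with the given result, grouped by Congress.

    `extra` is an optional function that takes a row and returns True
    if the row should be counted (an additional criterion).
    """
    counts = Counter()
    for row in rows:
        if row["result"] != result:
            continue
        if extra is not None and not extra(row):
            continue
        counts[row["congress"]] += 1
    return counts


# Rejected motions per Congress.
rejected = count_by_congress(motions, "rejected")

# Same count, narrowed to motions the majority party opposed.
rejected_by_majority = count_by_congress(
    motions, "rejected", extra=lambda row: row["majority_opposed"]
)
```

Passing the extra criterion in as a function is one way to keep the counter itself short: each new question becomes a one-line filter rather than a new function.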
We also do a fair amount of page-scraping here--sometimes even for our own data, since we don't always have an API for a given database field. I'm trying to get more of our economic data loaded this way--right now, one of our researchers has to collect updated numbers from various sources by hand. That's time they can't spend crunching those numbers for trends, or writing up the newest results. It's frustratingly inefficient, and really ought to be automated--this is, after all, exactly what most scripting languages were written to do.
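For a sense of scale, a bare-bones scraper can be quite small. The sketch below uses only Python's standard library and pulls the rows out of an HTML table. The page content, indicator names, and numbers are invented placeholders, and a real source would almost certainly need more careful handling (nested tables, footnotes, changing layouts).

```python
from html.parser import HTMLParser

# Invented stand-in for a page a researcher might otherwise copy by hand.
SAMPLE_PAGE = """
<table>
  <tr><th>Indicator</th><th>Value</th></tr>
  <tr><td>Example rate A</td><td>4.2</td></tr>
  <tr><td>Example rate B</td><td>1.7</td></tr>
</table>
"""


class TableCellParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._in_cell = False
            self._row[-1] = self._row[-1].strip()
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def scrape_table(html):
    """Return the table in `html` as a list of rows of cell text."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_PAGE)
header, *data = rows
figures = {name: float(value) for name, value in data}
```

In practice the HTML would come from a fetched page (for instance via `urllib.request.urlopen`) rather than a string in the script, and the job would run on a schedule so that nobody has to go collect the numbers.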
It's true that these are all examples of fairly narrow journalism--business and economic trends, specific political analysis, metatextual reporting. Not every section of the paper will use these tools all the time, and I'm not claiming that old-style, call-people-and-harass-them-for-answers reporting will go away any time soon. But I've been thinking lately about the cost of investigative reporting, and the ways that computer automation could make it more profitable. Take ProPublica's nursing board investigation, for example. It's a mix of traditional shoe-leather reporting and database pattern-matching, with the latter used to direct the former. Investigative reporting has always been expensive and slow, but could tools like this speed the process up? Could they multiply the effectiveness of investigative reporters? Could they revive the ability of local papers to act as watchdogs for their regional governments and businesses?
Well, maybe. There are a lot of reasons why it wouldn't work right now, not the least of which is the dependence of data-driven journalists on, well, data. It assumes that the people you're investigating are actually putting information somewhere you can get to it, and that the data is good--or that you have the skills and sufficient signal to distinguish between good data and bad. If I imagine trying to do this kind of thing out where my parents live in rural Virginia (a decent acid test for local American news), I'd say it's probably not living up to its potential yet.
But I think that day is coming. And I'm not the only one: Columbia just announced a dual-degree master's program in journalism and computer science (Wired has more, including examples of what the degree hopes to teach). To no small degree, the pitch for developing these skills isn't just a matter of leveraging newsroom time efficiently. It's more that in the future, this is how the world will increasingly work: rich (but disconnected) private databases, electronic governmental records, and interesting stories buried under petabytes of near-random noise. Journalists don't just need to learn their way around basic scripting because it's a faster way to research. They may need it just to keep up.