Being the hip young technologist that she is, Belle has one of those Palm Pre phones, which does something very cool: given login information for various social media accounts (Google, Facebook, etc.), it collates and cross-links that information into the device's contact list. So a person's ID picture is the same as their Facebook profile image, and when they update their contact information online, it automatically changes on the phone. Handy--when it works.
My understanding is that most of the time it does, but sometimes Palm's system doesn't quite connect the dots, and then Belle has to go in and tell it that certain entries are, in fact, the same person. Frankly, I'm impressed that it works at all. It's an example of the kind of pattern recognition that people are very good at, and computers typically are not. I personally think we'll always have an edge, which makes me feel absurdly better, as if Skynet's assassin robots will never be able to track down Sarah Connor or something.
In essence, what Palm has done is create a system for linking facts with a confidence threshold. And it's something I've been thinking about in relation to journalism, particularly after watching a presentation by the Sunlight Foundation on their data harvesting efforts during the age of data.gov, not to mention the work I've been doing lately on budget and economic indicators. There's a lot of information floating around (and more every day), but how can we coordinate it with confidence? And is it possible that the truth will get buried under its weight?
Larry Lessig, of all people, pessimistically pitched the latter earlier this month, in a New Republic essay titled "Against Transparency." Lessig ties together the open government movement, free content activists, and privacy advocates into what he calls the "salience" problem: extracting meaning in context from a soup of easily manipulated facts, without swamping the audience in data or misinterpreting it for political gain. It's a familiar problem: I consider myself a journalist, but I spend pretty much my entire workday nowadays chin-deep in databases, figuring out how to present them to both our readers and our own editorial team for use. It is, in other words, the same confidence problem: how do we decide which bits of data are connected, and which are not?
Well, part of the answer is that you need journalists who are good subject experts. All the data in the world is meaningless unless you have someone who can interpret it. In fact, this is one of the main directions I see journalism exploring as newsrooms become more comfortable with technology. Assuming journalists can survive until that point, of course: deep subject expertise is well and good, but it seems to be the first thing that gets cut these days when newsroom profitability drops.
Second, as journalism and crowdsourcing become more comfortable with each other, I think we're going to have to start tagging information with a confidence rating: how sure are we that these bits of information are related? Data that's increasingly pulled from disparate--and unevenly vetted--sources will need to be identified by its reliability. I'd still like to be able to use it, but I should be able to adjust for "truthiness" and alert others about it.
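Here's one way that kind of tagging might look, again as a sketch in Python rather than a description of any real newsroom system. The `TaggedFact` type, the 0-to-1 scale, and the 0.7 default cutoff are all hypothetical names and numbers I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedFact:
    """A piece of information carrying its source and a reliability rating."""
    claim: str
    source: str
    confidence: float  # assumed scale: 0.0 (unvetted) to 1.0 (fully verified)

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


def split_by_confidence(facts, min_confidence=0.7):
    """Separate facts into (trusted, needs_review) without discarding anything."""
    trusted = [f for f in facts if f.confidence >= min_confidence]
    needs_review = [f for f in facts if f.confidence < min_confidence]
    return trusted, needs_review
```

The point of returning both lists is that the shakier data stays available: I can still use it, but it arrives with a warning label that I can pass along to editors and readers.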
But perhaps most importantly, this kind of debate really highlights how the open government movement needs to be not just about the amount of data, but also its degree of interoperability. This has really been driven home to me on the federal budget process: from what I can understand of this fantastically complicated accounting system, you can track funds from the top down (via the subcommittees), or from the bottom up (actual agency funding). But getting the numbers to meet in the middle is incredibly hard, due to the ways that money is tracked. Indeed, you can get the entire federal budget as a spreadsheet (it's something like 30,000 line items), but good luck decoding it into something understandable, much less following funding from year to year.
That's a problem for a journalist, but it's also a problem for a citizen. Without clean data, open government initiatives may be severely weakened. But contra Lessig, I don't think that makes them worthless. I think it creates an interesting problem to solve--one we can't just brute-force with computing power. Open government shouldn't just be about the amount of data, but about its quality. When both are high, I see a lot of great opportunities for future reporting.