A proposal for responsible and ethical publication of personally-identifiable information in data journalism
Thanks to Helga Salinas, Kazi Awal, and Audrey Carlsen for their feedback.
Over the last decade, one of the goals of data journalism has been to increase accountability and transparency through the release of raw data. Admonitions of "show your work" have become common enough that academics judge our work by the datasets we link to. These goals were admirable, and (in the context of legitimizing data teams within legacy organizations) even necessary at the time. But in an age of 8chan, Gamergate, and the rise of violent white nationalism, it may be time to add nuance to our approach.
This document is concerned primarily with the publication of personal data (also known as personally-identifiable information, or PII). In other words, we're talking about names, addresses or contact info, lat/long coordinates and other geodata, ID numbers (including license plates or other government ID), and other data points that can be traced back to a single individual. Much of this is available already under the public record, but that's no excuse: as the NYT Editorial Board wrote in 2018, "just because information is public doesn't mean it has to be so easy for so many people to get." It is irresponsible to amplify information without thinking about what we're amplifying and why.
Moreover, this is not a theoretical discussion: many newsroom projects start with large-scale FOIA dumps or public databases, which may include exactly this personal data. There have been movements in recent years to monetize these databases--creating a queryable database of government salaries, for example, and offering it via a subscription. Even random public records requests may disclose personal data. Intentionally or not, we're swimming in this stuff, and have become jaded as to its prevalence. I simply ask: is it right for us to push it out without re-examining the implications of doing so?
I would stress that I'm not the only person who has thought about these things, and there are a few signs that we as an industry are beginning to formalize our thought process in the same way that we have standards around traditional reporting.
In her landmark 2015 book The Internet of Garbage, Sarah Jeong sets aside an entire chapter just for harassment. And with good reason: the Internet has enabled new innovations for old prejudices, including SWATting, doxing, and targeted threats at a new kind of scale. Writing about Gamergate, she notes that the action of its instigator, Eron Gjoni, "was both complicated and simple, old and new. He had managed to crowdsource domestic abuse."
I choose to talk about harassment here because I think it provides an easy touchstone for the potential dangers of publishing personal information. Since Latanya Sweeney's initial work on de-anonymizing data, an entire industry has grown up around taking disparate pieces of information, both public and private, and matching them against each other to create alarmingly-detailed profiles of individual people. It's the foundation of the business model for Facebook, as well as a broad swathe of other technology companies. This information includes your location over time. And it's available for purchase, relatively cheaply, by anyone who wants to target you or your family. Should we contribute, even in a minor way, to that ecosystem?
These may seem like distant or abstract risks, but that may be because for many of us, this harassment is more distant or abstract than it is for others. A survey of "news nerds" in 2017 found that more than half are male, and three-quarters are white (a demographic that includes myself). As a result of this background, many newsrooms have a serious blind spot when it comes to understanding how their work may be seen by (or used against) underrepresented populations.
As numerous examples have shown, we are very bad as an industry at thinking about how our power to amplify and focus attention is used. Even if harassment is not the ultimate result, publishing personal data may be seen by our audience as creepy or intrusive. At a time when we are concerned with trust in media, and when that trust is under attack from the top levels of government, perhaps we should be more careful in what data we publish, and how.
Finally, I think it is useful to consider our twin relationship to power and shame. Although we don't often think of it this way, the latter is often a powerful tool in our investigative reporting. After all, as the fourth estate, we do not have the power to prosecute or create legislation. What we can do is highlight the contrast between the world as we want it to be and as it actually is, and that gulf is expressed through shame.
The difference between tabloid reporting and "legitimate" journalism is the direction in which shame is aimed. The latter targets its shame toward the powerful, while the former is as likely to shame the powerless. In terms of accountability, it orients our power against the system, not toward individual people. It's the difference between reporting on welfare recipients buying marijuana, as opposed to looking at how marijuana licensing perpetuates historical inequalities from the drug war.
Our audiences may not consciously understand the role that shame plays in our journalism, but they know it's a part of the work. They know we don't do investigations in order to hand out compliments and community service awards. When we choose to put the names of individuals next to our reporting, we may be doing it for a variety of good reasons (perhaps we worked hard for that data, or sued to get it) but we should be aware that it is often seen as an implication of guilt on the part of the people within.
I want to be very clear that I am only talking about the public release of data in this document. I am not arguing that we should not submit FOIA or public records requests for personal data, or that it can't be useful for reporting. I'm also not arguing that we should not distribute this data at all, in aggregated form, on request, or through inter-organizational channels. It is important for us to show our work, and to provide transparency. I'm simply arguing that we don't always need to release raw data containing personal information directly to the public.
In the spirit of Maciej Ceglowski's Haunted by Data, I'd like to propose we think of personal data in three escalating levels of caution:
When creating our own datasets, it may be best to avoid personal data in the first place. Remember, you don't have to think about the implications of the GDPR or data leaks if you never have that information. When designing forms for story call-outs, try to find ways to automatically aggregate or avoid collecting information that you're not going to use during reporting anyway.
If you have the raw data, don't just throw it out into the public eye because you can. In general, we don't work with raw data for reporting anyway: we work with aggregates or subsets, because that's where the best stories live. What's the difference in policy effects between population groups? What department has the widest salary range in a city government? Where did a disaster cause the most damage? Releasing data in an aggregate form still allows end-users to check your work or perform follow-ups. And you can make the full dataset available if people reach out to you specifically over e-mail or secure channels (but you'll be surprised how few actually do).
In cases where distributing individual rows of data is something you're committed to doing, consider ways to protect the people inside the data by anonymizing it, without removing its potential usefulness. For example, one approach that I love from ProPublica Illinois' parking ticket data is the use of one-way hash functions to create consistent (but anonymous) identifiers from license plates: the input always creates the same output, so you can still aggregate by a particular car, but you can't turn that random-looking string of numbers and letters back into an actual license plate. As opposed to "cooking" the data, we can think of this as "seasoning" it, much as we would "salt" a hash function. A similar approach was used in the infosec community in 2016 to identify and confirm sexual abusers in public without actually posting their names (and thus opening the victims up to retaliation).
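To make the hashing idea concrete, here is a minimal sketch in Python. It is not ProPublica's actual implementation: the function names, the sample rows, and the choice of a keyed hash (HMAC-SHA256) are all my own assumptions for illustration. I use a secret key because the set of possible license plates is small enough that a plain, unkeyed hash could be reversed by simply hashing every possible plate and comparing the results.

```python
import hashlib
import hmac
import secrets


def normalize_plate(plate: str) -> str:
    """Uppercase and drop spaces/punctuation so 'abc-1234' and 'ABC 1234' match."""
    return "".join(ch for ch in plate.upper() if ch.isalnum())


def anonymize_plate(plate: str, key: bytes) -> str:
    """Return a stable pseudonym for a plate (hypothetical helper).

    The same plate and key always give the same output, so rows can still be
    grouped by vehicle, but the output cannot be turned back into a plate
    without the key.
    """
    digest = hmac.new(key, normalize_plate(plate).encode("utf-8"), hashlib.sha256)
    # Truncated for readability; keep the full digest if collisions are a concern.
    return digest.hexdigest()[:16]


# Assumption: the key is generated once, stored privately, and never published.
SECRET_KEY = secrets.token_bytes(32)

# Made-up sample rows, standing in for a real ticket dataset.
tickets = [
    {"plate": "ABC 1234", "fine": 60},
    {"plate": "abc-1234", "fine": 100},
    {"plate": "XYZ 9876", "fine": 25},
]

# The published rows carry the pseudonym instead of the plate itself.
published = [
    {"vehicle_id": anonymize_plate(row["plate"], SECRET_KEY), "fine": row["fine"]}
    for row in tickets
]
```

The first two rows end up with the same `vehicle_id`, so a reader can still total the fines per car, while the third row gets a different one. Reusing the same key keeps identifiers consistent across releases; discarding it after publication means not even the newsroom can re-link them.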
Once upon a time, this industry thought of computer-assisted reporting as a new kind of neutral standard: "precision" or "scientific" journalism. Yet as Catherine D'Ignazio and Lauren Klein point out in Data Feminism, CAR is not neutral, and neither is the way that the underlying data is collected, visualized, and distributed. Instead, like all journalism, it is affected by concerns of race, gender, sexual identity, class, and justice.
It's my hope that this proposal can be a small step to raise the profile of these questions, particularly in legacy newsrooms and journalism schools. In working on several projects at The Seattle Times and NPR, I was surprised to find that although there are guidelines on how to ethically source and process data, it was difficult to find formal advice on ethical publishing of that same data. Other journalists have certainly dealt with this, and yet there are relatively few documents that lay out concrete guidelines on the matter. We can, and should, change that.
This post was originally written as a lightning talk for SRCCON:Power. And then I looked at the schedule, and realized they weren't hosting lightning talks, but I'd already written it and I like it. So here it is.
I want to talk to you today about election results and power.
In the last ten years, I've helped cover the results for three newsrooms at very different scales: CQ (high-profile subscribers), Seattle Times (local), and NPR (shout out to Miles and Aly). I say this not because I'm trying to show off or claim some kind of authority. I'm saying it because it means I'm culpable. I have sinned, and I will sin again, may God have mercy on my soul.
I used to enjoy elections a lot more. These days, I don't really look forward to them as a journalist. This is partly because the novelty has worn off. It's partly because I am now old, and 3am is way past my bedtime. But it is also in no small part because I'm really uncomfortable with the work itself.
Just before the midterms this year, Tom Scocca wrote a piece about the rise of tautocracy — meaning, rule by mulish adherence to the rules. Government for its own sake, not for a higher purpose. When a judge in Nebraska rules that disenfranchising Native American voters is clearly illegal, but will be permitted under regulations forbidding last-minute election changes — even though the purpose of that regulation is literally to prevent voter disenfranchisement — that's tautocracy. Having an easy election is more important than a fair one.
For those of you who have worked in diversity and inclusion, this may feel a little like the "civility" debate. That's not a coincidence.
I am concerned that when we cover elections with results pages and breaking alerts, we're more interested in the rules than we are in the intended purpose. It reduces the election to the barest essence — the score, like a football game — divorced from context or meaning. And we spend a tremendous amount of redundant resources across the industry trying to get those scores faster or flashier. We've actually optimized for tautocracy, because that's what we can measure, and you always optimize for your metrics.
But as the old saying goes, elections have consequences. Post-2016, even the most privileged and most jaded of us have to look around at a rising tide of white nationalism and ask, did we do anything to stop this? Worse, did we help? That's an uncomfortable question, particularly for those of us who have long believed (incorrectly, in my opinion) that "we just report the news."
Take another topic, one that you will be able to sell more easily to your mostly white, mostly male senior editors when you get back: Every story you run these days is a climate change story. Immigration, finance, business, politics both international and domestic, health, weather: climate isn't just going to kill us all, it also affects nearly everything we report on. It's not just for the science stories in the B section anymore. Every beat is now the climate beat.
Where was climate in our election dashboard? Did anyone do a "balance of climate?"
Isn't that an election result?
What would it look like if we took the tremendous amount of duplicated effort spent on individual results pages, distributed across data teams and lonely coders around the country, and spent it on those kinds of questions instead?
The nice thing about a lightning talk is that I don't have time to give you any answers. Which is good, because I'm not smart enough to have any. All I know is that the way we're doing it isn't good enough. Let's do better.
[SPARSE, SKEPTICAL APPLAUSE]
There are lies, damn lies, and press releases.
On April 1, the University of Rochester put out a press release about a researcher who has invented a new way to analyze a clarinet recording and turn it into a new kind of MIDI file for a physical modeling synth. The recording and the synth are not groundbreaking, but the analysis is mildly interesting if it can actually pull expression data from a recording.
Unfortunately, the PR flack didn't write that. Instead, he wrote "music file compressed 1,000 times smaller than MP3," and used provocative quotes from the researchers in question to imply that this technology could be the future of music. By the time Ars asked me to write about it, at least one news outlet had screwed up the story based on the release. Even after I interviewed the team leader and put something together, Wired had reproduced the faulty "1000x better than MP3 compression" headline on their Gadget Lab post.
I don't expect Wired to read Ars before posting to get the real story, of course. But the press release reads as instantly fishy even to someone with my limited digital audio education. It would be nice to have some confidence that a news outlet covering audio tech would be able to reach the same conclusions.
The real problem is twofold. One is that the flacks apparently felt comfortable writing a release about technology that they obviously didn't understand, and were willing to take liberties for a bit more controversy. But perhaps the more serious dilemma is that tech writers fell for it.
A couple of weeks ago, there was another article in Wired about the competition between Engadget and Gizmodo. These two gadget blogs are huge moneymakers online, and they're constantly racing to get the scoop on each other. This, it seems, is the lesson that some online news sources have absorbed: go faster, not deeper. But the opposite, I think, is a more valuable use of journalism online. It doesn't take any skill to do coverage fast--just a subscription to a press release service and a quick hand on the copy and paste. But expertise and a reputation for accuracy are what draw eyeballs. A couple of hours extra won't change that.
Ever since I wrote those first couple of articles for Ars about UX Week, and especially after I covered the Future of Music Conference sessions, I've been getting e-mails from PR reps about music, UI, and tech news. I try to turn them down politely, since I don't have any intention of writing about a press release. It seems lazy to me, and I've done my time writing for the Man.
What is interesting is seeing who picks it up out in the more mainstream publications--they're clearly spamming these letters out to everyone who's ever written about a similar topic online. Someone at Wired can usually be expected to pick them up, for example. Ever wonder why Gizmodo and Engadget seem to share 90% of the same items, even if they're in competition? This is why. I don't really know how I feel about that. On the one hand, what's the harm? It's not like these are (usually) topics that lend themselves to investigative reporting ("My god! Deep Throat has led us to... A NEW CHEAP HD-DVD PLAYER!"). But on the other hand, I don't look at these sites quite the same way again.
The Long Tail author Chris Anderson had had enough of PR spam one day, and published the 100 addresses he was adding to his block list. He noted in a followup post that a lot of this kind of thing is done automatically, via huge marketing databases available for hire. The PR reps that take the time to get to know people, or build their own lists, were much less likely to get indignant responses. But do they have as much success? Or, as in regular spam, do the hits outnumber the outrage by sufficient amounts to make the enterprise profitable?
When reading the pundits on the editorial page or watching them on the news channels, do you ever find yourself asking: "Who do these people think they are? What qualifies them to speak in front of all of us?"
I do. All the time. And it's not just the looming prospects of job-hunting behind that thought. About a month ago, The New Republic's Jonathan Chait wrote an article criticizing the liberal blog community for being insufficiently concerned with the truth, to which (of course) every leftist on the Internet responded by asking "so who was pro-Iraq War, again?" Lance Mannion also drew attention today to a few writers who continue to trade in the idea that bloggers and writers online are all just delusional losers in their basements, whose rantings are only marginally more coherent than the average sandwich-board-wearing lunatic.
Who do these people think they are? the writers and pundits ask, not realizing that we've been asking the same thing in return.
But here's the dirty secret for pundits and journalists and movie critics who stand aghast at those angry bloggers: their job is not special. And they know it.
I'm not a blog triumphalist. I don't think wikis will save the world. But the simple fact of the matter is that there's no particular training to become a journalist, or a pundit, or a movie critic. There's no reason to believe that these jobs can't be done as well by anyone--and indeed, once upon a time, they were. There's a reason that movies like His Girl Friday depict journalists as a bunch of slovenly, low-class opportunists: they used to be a bunch of slovenly, low-class opportunists.
Nowadays, if you can find a journalist amid all the cutbacks at major newspapers and media outlets, there seems to be this idea that journalism has become a higher profession. The attitude betrayed by Chait and others is that these writers are better than the public somehow--better informed, better read, and probably better-looking. They're more public than the public, if you believe the hype behind David Broder. It's even infected relatively niche journalism, which is the only way that you could possibly find people like Gregg Easterbrook masquerading as "science writers."
There's nothing wrong with being an unspecialized journalist (Chait's employment history, for example, is a collection of writing credits but no direct political experience). Plenty of people have done it before. Hunter S. Thompson, one of the great heroes to the profession, started working as a news writer because it let him supplement his army wages. The honest truth is that most journalism, for all its mystique and prestige, amounts to picking up the phone and calling people for information. Occasionally, it requires the reporter to get up and actually go somewhere. This is not brain surgery. And obviously, punditry is even less rigorous--got an opinion? You're good to go.
The implication of writers like Chait, or Brian Williams (who commented recently that he didn't like competing against some guy named Vinny in a bathrobe somewhere) is that they've got something we don't. And indeed, they do: you don't get to be a staff writer for TNR or an anchor for a major news network without a lot of connections and a lot of luck. But self-publishing means that now any dog on the Internet could potentially outstrip their audience, while a lot of us have started to think that those tightly-knit political connections are what's wrong with the media in the first place. And frankly, as news has been cut back in the profit-driven environment, I don't think very much of the argument that they have some kind of journalistic integrity that no-one else can claim.
It astonishes me to read pieces by media professionals that trumpet their ignorance of the blog network. They're missing out, and they're missing the point. Bloggers may just be parasites on the journalists who go out and gather the day's events--but there are an awful lot of people who get paid to do the same thing, except in print. An editorial page is just a blog without the links (or in some cases, the readership). In many ways, it's a classic irony of economics--the jobs of the "knowledge workers" can now be outsourced, and they don't even have to leave the country--or get paid.
In other words: Who are these people? And why should we care?
* * *
In the best tradition of a post that quotes from Lance Mannion, a fine writer known for saving his recommended links for the end of his posts, I really do recommend his writings about credentialism and the media.
I know when I leave a job, the last thing I do before waltzing out the door is slander the most competent people I can find. No, wait--actually, that would be incredibly stupid. So why is it that Dan Okrent, the laughably ineffective former Public Editor for the New York Times, decided to do just that? In his last column for the paper, Okrent unleashed a torrent of bitter little jabs at some of the paper's op-ed columnists, namely Frank Rich, Maureen Dowd, and Paul Krugman. The first two, he simply portrayed as partisan bickering. Krugman, a world-renowned and well-regarded economist, was accused of actually cooking the numbers for his columns.
Okrent is best known, previous to his days ruminating on the liberal bias of the NYT crossword puzzle, as the inventor of fantasy baseball. And he has the nerve to tell Krugman that his math might be suspect. Brad DeLong, another skilled and well-written economist, annotates the resulting back-and-forth in the Times Public Editor weblog, and does a professional's job slicing Okrent's few economic points. I am not qualified to comment on that matter, because my own economic background is accumulated and theoretical, doubtless filled with holes. What I can discuss, and with relish, is the incredible audacity of Okrent's original statement--specifically, when Okrent talks about his own role in (not) correcting the mistakes he saw after his daily martini lunch with whiskey chaser:
I'm sorry? For those who are perhaps unaware of how the news business works (a number which apparently includes Dan Okrent), allow me to elaborate. The role of the public editor, also known as the ombudsman, is to be the representative of the paper's readership to its staff. They are responsible for a level of fact-checking, of bringing issues to the management's attention, and explaining to the public why coverage is what it is. If there are gross mistakes and exaggerations in the text of a paper, it is the public editor's job to correct them. He or she is like a high-level copy editor, in a sense.
But that's not how Okrent sees it. To paraphrase that quote, columnists can lie, cheat, and write whatever they want, and he feels no need to address the problem. He never talks to Krugman, Dowd, or Rich about the questions he has. He thinks it's bad for the paper, but Okrent does nothing about it--as if it's not his job!
Okrent's always been bad--he's published a reader's address just to settle a score, and his columns generally read as whining. He lets the conservative pundits, like David Brooks and William Safire, go almost completely unscathed, even though they lie like threadbare bathmats. But the admission by Okrent that he didn't think it was his job to be, in effect, a public editor is the most stunning thing yet. Honestly: what did he do all day? Did he spend his days playing fantasy sports instead of (oh, I don't know) working? And why didn't anyone in the Times upper management notice that they were employing this slacker? It makes me furious that I have to hunt for writing or editing work, but Okrent can sit back and relax, confident that the only thing he's edited in years was his own job description.
Where did they find this guy?
And what, exactly, were they paying him for?