The Big Contract
Original entry posted: Wed Oct 12 14:34:41 2011
@ Fri Oct 14 19:46:37 2011 EST
"We did a second pass to clean up the data a little (correcting misspelled or inconsistent company names, for example)."
I'm curious how you did this. Hundreds of intern-hours? Or did you programmatically compare lots of fields to see ones that were near-but-not-identical? How would you go about doing that?
@ Mon Oct 17 21:40:05 2011 EST
I wish we had hundreds of intern-hours. I don't even have an intern to abuse anymore!
Anyway, yeah, that's an incredibly hard problem to solve, obviously. First, there are a lot of places in the database where you just can't be sure if two companies with similar (but possibly common) names are actually the same company or two different companies. You can't rely on the addresses. And then you run into problems like casing and style, which don't match CQ style for company names. You can fix that once or twice, but it gets old if you regenerate the data repeatedly.
In the end, we didn't try to correct everything. But we did run into cases where the big contractors went by several names: Lockheed, Lockheed Martin, Lockheed Martin Corp., that kind of thing. In a couple of states, the defense contractors take in so much money that even their misspelled or inconsistent names would show up twice in the top five. So for those we did go through, searching the database for anything containing a common pattern and writing a series of update queries to normalize them based on what we found.
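For anyone wanting to try the same search-then-update routine, here is a minimal sketch in Python with SQLite. It is not the code we actually ran: the `contracts` table, its columns, the sample rows and the helper names `find_variants` and `normalize` are all made up for illustration.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (vendor TEXT, state TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?)",
    [
        ("Lockheed", "TX", 100),
        ("Lockheed Martin", "TX", 250),
        ("LOCKHEED MARTIN CORP.", "TX", 400),
        ("Mel's Federal Contracting and Discount Hardware Shack", "TX", 2),
    ],
)


def find_variants(conn, pattern):
    """List distinct vendor names containing the pattern, for manual review.

    SQLite's LIKE is case-insensitive for ASCII letters by default, so
    'lockheed' also matches 'LOCKHEED MARTIN CORP.'.
    """
    rows = conn.execute(
        "SELECT DISTINCT vendor FROM contracts WHERE vendor LIKE ? ORDER BY vendor",
        (f"%{pattern}%",),
    )
    return [row[0] for row in rows]


def normalize(conn, variants, canonical):
    """Rewrite each reviewed variant to the canonical name; return rows changed."""
    if not variants:
        return 0
    placeholders = ",".join("?" for _ in variants)
    cur = conn.execute(
        f"UPDATE contracts SET vendor = ? WHERE vendor IN ({placeholders})",
        [canonical, *variants],
    )
    conn.commit()
    return cur.rowcount


# Step 1: eyeball everything that contains the common pattern.
variants = find_variants(conn, "lockheed")

# Step 2: after checking the list by hand, collapse it to one name.
changed = normalize(conn, variants, "Lockheed Martin Corp.")

# Step 3: totals now group under a single vendor name.
totals = conn.execute(
    "SELECT vendor, SUM(amount) FROM contracts GROUP BY vendor ORDER BY 2 DESC"
).fetchall()
```

The important part is the human check between steps 1 and 2: a pattern search will happily pull in unrelated companies that share a common word, so the list of variants should be reviewed before anything is updated.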
The end result is that if an extra couple of million went to Mel's Federal Contracting and Discount Hardware Shack under duplicate names, our database didn't address that. It wasn't really relevant to our interactive, though. And maybe that's a great reminder of the scale at which these contracts operate: that kind of money is a total drop in the bucket.