The dataset was so large they were probably not opening up the files before sending them over. It was clearly a rookie mistake by someone who had never had to do this stuff before and wouldn't have known what to look for.
Again, the vast majority of the emails were automated alert/spam emails. So it's unclear if a random sample of the data would have turned up anything interesting to look at if you didn't know what you were looking for.
I don't agree. I think opening any of the files would have made it immediately apparent that the first 256 characters of each email were included, if that's indeed what happened as the article says.
>Some things about the dataset: It’s very messy – triple quotes, semicolons, commas, oh my. There are millions of system alerts. For seattle.gov → seattle.gov communication, there are two distinct metadata records
Opening a file of that size in Excel would probably crash the desktop. And this was probably a lowly admin without a lot of other tools.
The first 256 characters of a lot of system emails are going to be just junk HTML and header tags. If you don't realize you're looking at HTML, it's not going to be apparent that you're looking at the body of an email.
It's a rookie mistake to be sure, but the admin was clearly unfamiliar with what was being requested.
Would it really have been an XLS file? I'd expect CSV, probably. In that case, just running 'less' on the file (or the Microsoft equivalent) would be fine, and wouldn't tax anyone's desktop resources.
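Even a few lines of Python will skim the first records of a huge CSV without loading the whole thing into memory. This is just a sketch; the filename and columns are made up, and the `StringIO` stands in for a real multi-gigabyte file:

```python
import csv
import io
import itertools

# Stand-in for the big export; in practice this would be open("export.csv").
# The column layout here is hypothetical.
f = io.StringIO(
    "sender,recipient,preview\n"
    "alerts@seattle.gov,admin@seattle.gov,<html><head>Disk usage warning\n"
    "a@seattle.gov,b@seattle.gov,Hi Bob about that permit\n"
)

# Stream only the first few rows -- islice never reads past them,
# so file size doesn't matter
for row in itertools.islice(csv.reader(f), 3):
    print(row)
```

Seeing a `preview` column full of email text in that output is exactly the kind of thing a ten-second look would have caught.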
That's not a "rookie mistake"; that's gross negligence. Whenever I write some sort of script to produce some sort of data file, I always look to ensure that the data file ends up looking like what I expect.
This isn't even to ensure my data file doesn't include something private, just to ensure that it actually includes what I intended it to, and I didn't do something dumb like put the data in the same field twice, or duplicate the same record over and over, or whatever.
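A minimal sketch of that kind of check (the column names and sample data here are hypothetical, not from the actual Seattle export):

```python
import csv
import io

# Hypothetical export that sneaks in a "preview" column and duplicates a row
export = io.StringIO(
    "sender,recipient,preview\n"
    "a@seattle.gov,b@seattle.gov,Hello Bob\n"
    "a@seattle.gov,b@seattle.gov,Hello Bob\n"
)
rows = list(csv.DictReader(export))

# 1. Do the columns match what we intended to release?
expected = {"sender", "recipient"}
unexpected = set(rows[0]) - expected
print("unexpected columns:", unexpected)  # the preview column jumps out

# 2. Is the same record repeated?
seen, dupes = set(), 0
for r in rows:
    key = tuple(sorted(r.items()))
    if key in seen:
        dupes += 1
    seen.add(key)
print("duplicate rows:", dupes)
```

Two checks like these take minutes to write and would have flagged both the extra column and any duplication before anything left the building.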
We are not talking about one of the geniuses here at HN. The guy answering FOIA requests for the city is the IT equivalent of the counter person at a McDonald's. I don't think it's fair to flame them as idiots when it's clearly a process issue.
That's not how access to email content should work at any organization, and I can personally tell you that responsible government organizations don't give it to entry level employees.
Discovery and FOIA-equivalent requests that I've seen at the SLTT level were handled with the care that is expected for potentially sensitive communications. I'm sure smaller orgs can't do it as well, but Seattle is probably going to have some money for this stuff.
I suppose this is why some gov officials use weird aliases for emails. It’s not on the up and up but it avoids disclosing potentially embarrassing or illegal activities…
- He FOIA'd all metadata of emails to and from the City of Seattle.
- The city IT department pushed back, saying that their policy was to hand-review each email for privacy, and this was 32m emails.
- They later acquiesced and just dumped all of the metadata into files and sent it over.
- They didn't realize the email preview was also metadata, which included the first 256 characters of each email.
- OP informed them that they had now committed a very grievous data leak.
- The city fixed the issue and legally pursued OP to ensure the data was deleted.
All in all, I am glad I am not a civil servant. The job seems awful.