
Coral cache may be helpful: http://www.icij.org.nyud.net/offshore/secret-files-expose-of...

This is really huge. Offshore accounts are thought to hold up to $32 trillion. The Guardian / ICIJ archive accounts for only a portion of this, though I suspect the BVI are attractive to many for their stability (association with Britain has its advantages).

On the technical side, I'd be interested in what tools are being used to analyze the archive. A related Guardian article notes:

"Unlike the smaller cache of US cables and war logs passed in 2010 to WikiLeaks, the offshore data was not structured or clean, but an unsorted collation of internal memos and instructions, official documents, emails, large and small databases and spreadsheets, scanned passports and accounting ledgers.

"Analysing the immense quantity of information required "free text retrieval" software, which can work with huge volumes of unsorted data. Such high-end systems have been sold for more than a decade to intelligence agencies, law firms and commercial corporations. Journalism is just catching up."

http://www.guardian.co.uk/uk/2013/apr/04/offshore-secrets-da...

Anyone have any specifics on this "free text retrieval" software, its capabilities, and how it compares with, say, standard Linux/Unix text search and processing tools?
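For a baseline: with the standard Unix tools you're essentially doing a linear scan, re-reading every file for every query (grep -ril and friends). A minimal sketch of that in Python, just for comparison (the paths and patterns here are made up):

    import os, re

    def naive_search(root: str, pattern: str):
        """Yield paths of files whose text matches pattern (case-insensitive)."""
        rx = re.compile(pattern, re.IGNORECASE)
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, errors="ignore") as fh:
                        if rx.search(fh.read()):
                            yield path
                except OSError:
                    continue

    # e.g. list(naive_search("/data/offshore", r"british virgin islands"))

That works fine for a few gigabytes of plain text, but it falls over once every query means re-reading terabytes of mixed formats.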



They provide good detail about the software they used.

"The major software tools used for the Offshore Project were NUIX of Sydney, Australia, and dtSearch of Bethesda, Md. NUIX Pty Ltd provided ICIJ with a limited number of licenses to use its fully featured high-end e-discovery software, free of charge. The listed cost for the NUIX software was higher than a non-profit organization like the ICIJ could afford, if the software had not been donated."

Securing their communications proved more difficult than free-text search, however:

"The project team’s attempts to use encrypted e-mail systems such as PGP (“Pretty Good Privacy”) were abandoned because of complexity and unreliability that slowed down information sharing."


Ah. My read of the PGP comment was that the offshore-banking individuals / organizations had tried to use PGP but bailed. Seems your interpretation, that it was the investigative team that tried and gave up, is the correct one.

Interesting. Sadly, far too common. I've been using PGP for well over a decade, know only a small handful of people (outside of technical mailing lists) who have keys and can actually access them, and have been chewed out by some of them for sending encrypted mail.


Having seen how journalists will blow hundreds of dollars on proprietary but basic scraping and processing software, I wouldn't be surprised if the software was an adroit use of NLP with OCR, but nothing out-of-this-world fancy. Management of documents and data is important too, so perhaps this software provided a good front end for it?

edit: in response to the downvotes, I'm not saying journalists are dumb. The data/technical problems they face are myriad, and they often don't get the grounding they need to tackle them, so in my opinion they're too quick to pay for commercial software that only solves part of the problem. That's not just opinion; it comes from actual experience with colleagues and from consulting on projects with outside groups. That said, the software they allude to sounds like a system that can mass-process scanned documents, OCR them, and, I'd assume, use NLP to cut down the hand-cleaning. I'm still betting an incredible amount of hand-cleaning had to be done anyway.
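To be concrete, here's the kind of pipeline I have in mind, sketched with open-source pieces (pytesseract for OCR, spaCy for entity extraction). This is purely my speculation about the approach, not the Offshore Project's actual stack, and the paths are invented:

    import glob
    import pytesseract
    from PIL import Image
    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English model, assumed installed

    for path in glob.glob("scans/*.png"):  # hypothetical directory of scanned pages
        text = pytesseract.image_to_string(Image.open(path))
        doc = nlp(text)
        people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
        orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
        print(path, people, orgs)

Useful for triage, but OCR noise on passports and ledgers means a human still has to clean up almost every extracted name.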


Re "free text retrieval": This reminds me of a discussion on the POPCNT or sideways add instruction of early processors: http://cryptome.org/jya/sadd.htm

POPCNT has recently landed as part of SSE4.2 in Intel processors.
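For anyone curious, the "sideways add" that cryptome thread discusses is easy to sketch. Here's the classic branch-free version for a 32-bit word, written in Python as my own illustration:

    def popcount32(x: int) -> int:
        """Count set bits via the classic "sideways add" bit trick."""
        x = x - ((x >> 1) & 0x55555555)                  # 2-bit partial counts
        x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # 4-bit partial counts
        x = (x + (x >> 4)) & 0x0F0F0F0F                  # 8-bit partial counts
        return ((x * 0x01010101) & 0xFFFFFFFF) >> 24     # sum the four byte counts

    assert popcount32(0xDEADBEEF) == bin(0xDEADBEEF).count("1")

The hardware POPCNT instruction does all of that in a single operation.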



