
Coral cache may be helpful: http://www.icij.org.nyud.net/offshore/secret-files-expose-of...

This is really huge. Offshore accounts are thought to hold up to $32 trillion. The Guardian / ICIJ archive accounts for only a portion of this, though I suspect the BVI are attractive to many for their stability (association with Britain has its advantages).

On the technical side, I'd be interested in what tools are being used to analyze the archive. A related Guardian article notes:

"Unlike the smaller cache of US cables and war logs passed in 2010 to WikiLeaks, the offshore data was not structured or clean, but an unsorted collation of internal memos and instructions, official documents, emails, large and small databases and spreadsheets, scanned passports and accounting ledgers.

"Analysing the immense quantity of information required "free text retrieval" software, which can work with huge volumes of unsorted data. Such high-end systems have been sold for more than a decade to intelligence agencies, law firms and commercial corporations. Journalism is just catching up."

http://www.guardian.co.uk/uk/2013/apr/04/offshore-secrets-da...

Anyone have any specifics on this "free text retrieval" software, its capabilities, and how it compares with, say, standard Linux/Unix text search and processing tools?
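For a baseline: with the standard Unix tools you're essentially doing a linear scan, re-reading every file for every query (grep -ril and friends). A minimal sketch of that in Python, just for comparison (the paths and patterns here are made up):

    import os, re

    def naive_search(root: str, pattern: str):
        """Yield paths of files whose text matches pattern (case-insensitive)."""
        rx = re.compile(pattern, re.IGNORECASE)
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, errors="ignore") as fh:
                        if rx.search(fh.read()):
                            yield path
                except OSError:
                    continue

    # e.g. list(naive_search("/data/offshore", r"british virgin islands"))

That works fine for a few gigabytes of plain text, but it falls over once every query means re-reading terabytes of mixed formats.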



They provide good detail about the software they used.

"The major software tools used for the Offshore Project were NUIX of Sydney, Australia, and dtSearch of Bethesda, Md. NUIX Pty Ltd provided ICIJ with a limited number of licenses to use its fully featured high-end e-discovery software, free of charge. The listed cost for the NUIX software was higher than a non-profit organization like the ICIJ could afford, if the software had not been donated."

Securing their communications proved more difficult than free-text search, however:

"The project team’s attempts to use encrypted e-mail systems such as PGP (“Pretty Good Privacy”) were abandoned because of complexity and unreliability that slowed down information sharing."


Ah. My read of the PGP comment was that the offshore-banking individuals / organizations had tried to use PGP but bailed. Seems your interpretation, that it was the investigative team that tried and gave up, is the correct one.

Interesting. Sadly, far too common. I've been using PGP for well over a decade, know only a small handful of people (outside of technical mailing lists) who have keys and can actually access them, and have been chewed out by some of them for sending encrypted mail.


Having seen how journalists will blow hundreds of dollars on proprietary but basic scraping and processing software, I wouldn't be surprised if the software was an adroit use of NLP with OCR, but nothing out-of-this-world fancy. Management of documents and data is important too, so perhaps this software provided a good front end for it?

edit: in response to the downvotes, I'm not saying journalists are dumb. The data/technical problems they face are myriad, and they often don't get the grounding they need to tackle them, so in my opinion they're too quick to pay for commercial software that only solves part of the problem. That's not just opinion; it comes from actual experience with colleagues and from consulting on projects with outside groups. That said, the software they allude to sounds like a system that can mass-process scanned documents, OCR them, and, I'd assume, use NLP to cut down the hand-cleaning. I'm still betting an incredible amount of hand-cleaning had to be done anyway.
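To be concrete, here's the kind of pipeline I have in mind, sketched with open-source pieces (pytesseract for OCR, spaCy for entity extraction). This is purely my speculation about the approach, not the Offshore Project's actual stack, and the paths are invented:

    import glob
    import pytesseract
    from PIL import Image
    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English model, assumed installed

    for path in glob.glob("scans/*.png"):  # hypothetical directory of scanned pages
        text = pytesseract.image_to_string(Image.open(path))
        doc = nlp(text)
        people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
        orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
        print(path, people, orgs)

Useful for triage, but OCR noise on passports and ledgers means a human still has to clean up almost every extracted name.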


Re "free text retrieval": This reminds me of a discussion on the POPCNT or sideways add instruction of early processors: http://cryptome.org/jya/sadd.htm

POPCNT has recently landed as part of SSE4.2 in Intel processors.
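For anyone curious, the "sideways add" that cryptome thread discusses is easy to sketch. Here's the classic branch-free version for a 32-bit word, written in Python as my own illustration:

    def popcount32(x: int) -> int:
        """Count set bits via the classic "sideways add" bit trick."""
        x = x - ((x >> 1) & 0x55555555)                  # 2-bit partial counts
        x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # 4-bit partial counts
        x = (x + (x >> 4)) & 0x0F0F0F0F                  # 8-bit partial counts
        return ((x * 0x01010101) & 0xFFFFFFFF) >> 24     # sum the four byte counts

    assert popcount32(0xDEADBEEF) == bin(0xDEADBEEF).count("1")

The hardware POPCNT instruction does all of that in a single operation.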



