
The amount of available documents has skyrocketed in the recent past, especially for present-day history, and they're not always easy to use. For instance, if you're interested in the Stalin administration, there are millions of orders, notes, studies and transmissions stored in boxes somewhere. A historian working on that period, studying new documents, could only see a very tiny fraction of the existing sources in a lifetime.

Remember those movies where a small-firm lawyer gets buried under boxes of documents during discovery against a big corporation? Well, historians are like that, except they have less money and don't know how many boxes there are. They also have to go find the boxes themselves rather than having them delivered to their office.

In older, well-studied fields there are few boxes, they are already referenced, and historians have a chance to see everything over their career. In more recent, less studied fields, there are countless unopened boxes.



I'm not a big proponent of LLM proliferation, but I was thinking that mass review of tons of scanned documents might be exactly the sort of thing they're really useful for. Given an AI that hasn't been ruthlessly tuned to be as politically neutral as possible, you could have a huge database and query it in plain English like "were there any documents that made overt reference to extremely corrupt behavior?"


People with the knowhow to do this kind of stuff are mostly busy trading eyeballs or stock, and college history departments are not exactly rolling in it.

Still, there is an effort to make these collections more easily available. For instance, in the case of the Soviet archives, [1] describes the work done and the conditions of access. That work is far from exhaustive, though, and a large part of the material still needs to be done the slow way, or requires special requests in order to be accessed.

[1]: https://www.ucl.ac.uk/ceelbas/state-archive-russian-federati...


To answer a query, your LLM needs to "read" the documents first. The context window won't be big enough for this, so you'd have to fine-tune the model.

The problem is, you then need to cross-check against the reference material, since the model is subject to hallucinations.


Oh, I was thinking that the cross-checking is the point. You'd use the LLM as a "hazily thinking search function" to narrow your examination of old documents, not as a replacement for reading the documents.

I don't know what to do about the context window, though.


I don't understand, can't you feed it one page at a time and ask it "is there relevant information here?"
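That page-at-a-time filtering could look something like the sketch below. The `ask_llm` function is a hypothetical stand-in (here a crude keyword check, so the sketch runs without an API); a real pipeline would call an actual model there.

```python
# Page-by-page relevance filter: ask a model, one page at a time,
# whether the page contains information relevant to a question.

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call. Here it is just a
    keyword check so the sketch is self-contained and runnable."""
    page = prompt.split("PAGE:\n", 1)[1]
    keywords = ("bribe", "embezzle", "falsif")
    return "yes" if any(k in page.lower() for k in keywords) else "no"

def relevant_pages(pages: list[str], question: str) -> list[int]:
    """Return the indices of pages the model flags as relevant."""
    hits = []
    for i, page in enumerate(pages):
        prompt = (
            f"Question: {question}\n"
            "Is there relevant information here? Answer yes or no.\n"
            f"PAGE:\n{page}"
        )
        if ask_llm(prompt).strip().lower().startswith("yes"):
            hits.append(i)
    return hits

archive = [
    "Grain quota report for the Kharkov region, 1933.",
    "Memo: funds were embezzled by the district supply office.",
    "Minutes of the planning committee, no irregularities noted.",
]
print(relevant_pages(archive, "overt references to corrupt behavior"))  # -> [1]
```

The point being that the model only ever narrows the pile; a human still reads the flagged pages.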


Or load it all into a RAG system. Give it a few months and it'll be something you can buy off the shelf.
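Off the shelf or not, the retrieval side of such a system is simple in outline: embed every document chunk, embed the query, and hand the nearest chunks to the model as context. A toy sketch, using bag-of-words vectors in place of the learned embeddings a real system would use:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a word-count vector. A real RAG system would
    use a learned embedding model here instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query; these would be
    pasted into the model's context window before it answers."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

chunks = [
    "order concerning grain requisition quotas",
    "note on corrupt payments to the supply office",
    "weather report for the harvest season",
]
print(retrieve("documents about corrupt behavior", chunks, k=1))
```

This is also how RAG sidesteps the context-window problem raised above: only the retrieved chunks, not the whole archive, go into the prompt.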



