
The amount of available documents has skyrocketed in the recent past, especially for present-day history, and they're not always easy to use. For instance, if you're interested in the Stalin administration, there are millions of orders, notes, studies and transmissions stored in boxes somewhere. A historian working on that period, studying new documents, could only see a very tiny fraction of the existing sources in a lifetime.

Remember those movies where a small-firm lawyer gets buried under boxes of documents during discovery against a big corporation? Well, historians are like that, except they have less money and don't know how many boxes there are. They also have to go find the boxes themselves rather than having them delivered to their office.

In older, well-studied fields there are few boxes, they are already referenced, and historians have a chance to see everything over their career. In more recent, less studied fields, there are countless unopened boxes.



I'm not a big proponent of LLM proliferation, but I was thinking that mass review of tons of scanned documents might be exactly the sort of thing they're really useful for. Given an AI that hasn't been ruthlessly tuned to be as politically neutral as possible, you could have a huge database and query it in plain English like "were there any documents that made overt reference to extremely corrupt behavior?"


People with the knowhow to do this kind of stuff are mostly busy trading eyeballs or stock, and college history departments are not exactly rolling in it.

Still, there is an effort to make these collections more easily available. For instance, in the case of the Soviet archives, [1] describes the work done and the conditions of access. That work is far from exhaustive, though, and a large part of the material still needs to be done the slow way, or requires special requests in order to be accessed.

[1]: https://www.ucl.ac.uk/ceelbas/state-archive-russian-federati...


To answer a query, your LLM needs to "read" the documents first. The context window won't be big enough for this, so you'd have to fine-tune the model.

The problem is, you then need to cross-check against the reference material, since the model is subject to hallucinations.


Oh, I was thinking that the cross-checking is the point. You'd use the LLM as a "hazily thinking search function" to narrow your examination of old documents, not as a replacement for reading the documents.

I don't know what to do about the context window, though.


I don't understand, can't you feed it one page at a time and ask it "is there relevant information here?"
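That page-at-a-time filtering could look something like the sketch below. The `ask_llm` function is a hypothetical stand-in (here a crude keyword check, so the sketch runs without an API); a real pipeline would call an actual model there.

```python
# Page-by-page relevance filter: ask a model, one page at a time,
# whether the page contains information relevant to a question.

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call. Here it is just a
    keyword check so the sketch is self-contained and runnable."""
    page = prompt.split("PAGE:\n", 1)[1]
    keywords = ("bribe", "embezzle", "falsif")
    return "yes" if any(k in page.lower() for k in keywords) else "no"

def relevant_pages(pages: list[str], question: str) -> list[int]:
    """Return the indices of pages the model flags as relevant."""
    hits = []
    for i, page in enumerate(pages):
        prompt = (
            f"Question: {question}\n"
            "Is there relevant information here? Answer yes or no.\n"
            f"PAGE:\n{page}"
        )
        if ask_llm(prompt).strip().lower().startswith("yes"):
            hits.append(i)
    return hits

archive = [
    "Grain quota report for the Kharkov region, 1933.",
    "Memo: funds were embezzled by the district supply office.",
    "Minutes of the planning committee, no irregularities noted.",
]
print(relevant_pages(archive, "overt references to corrupt behavior"))  # -> [1]
```

The point being that the model only ever narrows the pile; a human still reads the flagged pages.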


Or load it all into a RAG system. Give it a few months and it'll be something you can buy off the shelf.
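Off the shelf or not, the retrieval side of such a system is simple in outline: embed every document chunk, embed the query, and hand the nearest chunks to the model as context. A toy sketch, using bag-of-words vectors in place of the learned embeddings a real system would use:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a word-count vector. A real RAG system would
    use a learned embedding model here instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query; these would be
    pasted into the model's context window before it answers."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

chunks = [
    "order concerning grain requisition quotas",
    "note on corrupt payments to the supply office",
    "weather report for the harvest season",
]
print(retrieve("documents about corrupt behavior", chunks, k=1))
```

This is also how RAG sidesteps the context-window problem raised above: only the retrieved chunks, not the whole archive, go into the prompt.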



