Whatever makes sense to your users for their search results. Do they want to get...

inertiatic · on Feb 24, 2020

The text you display back to your user doesn't have to (and shouldn't) depend on what you index in your information retrieval system.

At my last job, we indexed large documents in Solr. The largest chunks of them were indexed in non-stored fields, which means they were searchable but you couldn't retrieve the actual text from the index. This drastically cut down on the index size and the resources we needed to support it.

Then, after scoring, you'd return the top hits to the main application as IDs. The application could then retrieve everything it needed (full text included), along with all the other information it kept for the document, and generate the actual output.
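
To make the pattern concrete, here is a minimal sketch in plain Python. It is not Solr code: `SearchIndex`, `DOCUMENT_STORE`, and `render_results` are hypothetical names made up for illustration, and the term-frequency scoring is a stand-in for whatever ranking your search engine actually does. The point is only that the index keeps terms and IDs, never the body text, and the application looks up the full document by ID afterwards.

```python
from collections import defaultdict


class SearchIndex:
    """Toy inverted index: body text is searchable but never stored."""

    def __init__(self):
        # term -> {doc_id: term frequency}; the original text is not kept
        self._postings = defaultdict(dict)

    def add(self, doc_id, body):
        for term in body.lower().split():
            counts = self._postings[term]
            counts[doc_id] = counts.get(doc_id, 0) + 1

    def search(self, query, top_n=10):
        """Return the IDs of the best-scoring documents, nothing else."""
        scores = defaultdict(int)
        for term in query.lower().split():
            for doc_id, tf in self._postings.get(term, {}).items():
                scores[doc_id] += tf
        # Highest score first; ties broken by ID for stable output
        ranked = sorted(scores, key=lambda d: (-scores[d], d))
        return ranked[:top_n]


# Hypothetical primary store (in practice a database or file store)
DOCUMENT_STORE = {
    "doc-1": {"title": "Quarterly report",
              "body": "revenue grew and revenue targets were met"},
    "doc-2": {"title": "Meeting notes",
              "body": "discussed revenue and hiring plans"},
    "doc-3": {"title": "Lunch menu",
              "body": "soup and sandwiches"},
}


def build_index(store):
    index = SearchIndex()
    for doc_id, doc in store.items():
        index.add(doc_id, doc["body"])
    return index


def render_results(index, store, query):
    """Search returns IDs; display data comes from the primary store."""
    return [(doc_id, store[doc_id]["title"]) for doc_id in index.search(query)]


if __name__ == "__main__":
    index = build_index(DOCUMENT_STORE)
    print(render_results(index, DOCUMENT_STORE, "revenue"))
```

What you show the user is decided in `render_results`, so you can change the output without reindexing anything.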

shreyshrey · on Feb 24, 2020

Just the relevant parts. ES says their max size is 100 MB. We have a real-life scenario where we want to index millions of office documents to find PII/PHI.

What is a realistic expectation here? Should we say 50 MB? What does everybody else do?

itronitron · on Feb 24, 2020

Not sure about ES, but Solr removed its max field limit in release 4.0. Text documents tend to be a lot smaller than people expect, both in terms of word count and file size. I think you will be fine with 50 MB if you are using ES.
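
If you want to sanity-check that against your own corpus, a rough sketch like the one below measures the UTF-8 size of the extracted text and compares it with a cap. The 50 MB value is just the number floated in this thread, and `MAX_INDEXABLE_BYTES`, `text_size_bytes`, and `fits_limit` are hypothetical names, not part of ES or Solr.

```python
# Assumed cap, taken from the 50 MB figure discussed above
MAX_INDEXABLE_BYTES = 50 * 1024 * 1024


def text_size_bytes(text):
    """Size of the extracted text once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_limit(text, limit=MAX_INDEXABLE_BYTES):
    """True if the extracted text is at or under the cap."""
    return text_size_bytes(text) <= limit


if __name__ == "__main__":
    # 100,000 five-byte tokens ("word" plus a space) as a stand-in document
    sample = "word " * 100_000
    print(text_size_bytes(sample), fits_limit(sample))
```

Measure the extracted text rather than the original file, since an office document's size on disk includes formatting and embedded media that never reach the index.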