
Whatever makes sense to your users for their search results. Do they want to get back the whole document or just the relevant parts?

If there are separate sections in the office documents that you can pull out and index as separate fields then you should do that. For example, if you were indexing patents, you would want to index abstracts and claims into separate fields.
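
As a rough sketch of what "pull out and index as separate fields" could look like, here is a small Python function that splits a plain-text document into named sections. The heading names and the `split_into_fields` helper are made up for illustration; real office documents would need a proper parser first.

```python
# Assumed section headings; adjust to whatever your documents actually use.
SECTION_HEADINGS = ("abstract", "claims", "description")


def split_into_fields(text):
    """Return {section_name: section_text} for each recognised heading."""
    fields = {}
    current = None
    for line in text.splitlines():
        heading = line.strip().rstrip(":").lower()
        if heading in SECTION_HEADINGS:
            # A heading line starts a new field.
            current = heading
            fields[current] = []
        elif current is not None and line.strip():
            # Non-empty lines belong to the most recent heading.
            fields[current].append(line.strip())
    return {name: " ".join(lines) for name, lines in fields.items()}


patent = """Abstract
A widget that folds.

Claims
1. A widget.
2. The widget of claim 1, folded.
"""

fields = split_into_fields(patent)
```

Each key of `fields` would then go into its own field in the search index, so a query can target the claims alone or weight the abstract differently.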




The text you display back to your user doesn't have to (and shouldn't) depend on what you index in your information retrieval system.

At my last job, we used to index large documents in Solr. The largest chunks of them were indexed in non-stored fields, which means they were searchable but you couldn't retrieve the actual text from the index. This drastically cut down on the index size and the resources we needed to support it.

Then, after scoring, you'd return the top hits to the main application as IDs. The application could use those IDs to retrieve everything it needed (full text included), along with all the other information for the document, and generate the actual output.
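
To make the flow concrete, here is a toy Python sketch of the same idea. It is not Solr: the `SearchIndex` class, the `hydrate` function, and the `DOCUMENT_STORE` dict are all hypothetical stand-ins. The index keeps only terms and document IDs (the analogue of a non-stored field), and the application looks up the full text elsewhere.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for the main application's own storage (e.g. a database).
DOCUMENT_STORE = {
    "doc-1": {"title": "Patent A", "body": "A method for indexing large office documents."},
    "doc-2": {"title": "Memo B", "body": "Quarterly memo. Indexing is not discussed; budgets are."},
    "doc-3": {"title": "Report C", "body": "Large documents, large indexes, large bills."},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class SearchIndex:
    """Maps term -> {doc_id: term frequency}. The original text is never kept."""

    def __init__(self):
        self.postings = defaultdict(Counter)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term][doc_id] += 1

    def search(self, query, limit=10):
        """Return doc IDs only, ranked by summed term frequency (a crude score)."""
        scores = Counter()
        for term in tokenize(query):
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [doc_id for doc_id, _ in ranked[:limit]]


def hydrate(doc_ids):
    """Application side: fetch the full records for the IDs the index returned."""
    return [{"id": doc_id, **DOCUMENT_STORE[doc_id]} for doc_id in doc_ids]


index = SearchIndex()
for doc_id, doc in DOCUMENT_STORE.items():
    index.add(doc_id, doc["body"])

hits = index.search("large documents")   # IDs only
results = hydrate(hits)                  # full text comes from the store
```

The scoring here is just a term-frequency sum, nothing like what a real engine does; the point is only that the search side hands back IDs and the display side owns the text.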


Just the relevant parts. ES says their max size is 100 MB. We have a real-life scenario where we want to index millions of office documents to find PII/PHI.

What is a realistic expectation here? Should we say 50 MB? What does everybody else do?


Not sure about ES, but Solr removed its max field limit in release 4.0. Text documents tend to be a lot smaller than people expect, both in terms of word count and file size. I think you will be fine with 50 MB if you are using ES.



