Can do. A lot of the magic happens in the es/indexer and search lambdas here: https://github.com/quiltdata/quilt/tree/master/lambdas.

The short of what we do: a Lambda listens for bucket notifications, reads the object's metadata, and sends it, along with a snippet of the file contents, to Elasticsearch for indexing. Elasticsearch mappings are a bit of a bear, and we had to lock those down to get them to behave well.
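Roughly, the handler looks something like this (just a sketch, not the actual lambda code; the endpoint, snippet size, and index naming here are made up):

    import boto3
    from elasticsearch import Elasticsearch

    s3 = boto3.client("s3")
    es = Elasticsearch("https://search-example.us-east-1.es.amazonaws.com")  # hypothetical endpoint

    def handler(event, context):
        # One S3 bucket notification can carry several records
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            head = s3.head_object(Bucket=bucket, Key=key)
            # Pull just the first few KB of the object as a content snippet
            obj = s3.get_object(Bucket=bucket, Key=key, Range="bytes=0-4095")
            snippet = obj["Body"].read().decode("utf-8", errors="replace")

            es.index(index=bucket, id=key, body={
                "key": key,
                "size": head["ContentLength"],
                "last_modified": head["LastModified"].isoformat(),
                "user_meta": head.get("Metadata", {}),
                "content": snippet,
            })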

What are the big barriers you're bumping into on the data management and portability side of things?




Seems like it'd be more elegant (and probably cost effective) if you stored the Lucene indexes inside the buckets themselves.


That is an interesting idea. What kind of performance could we expect, especially in the federated case of searching multiple buckets? Elastic has sub-second latency (at the cost of running dedicated containers).


That's a bit of an open question right now, unfortunately. Using S3 to store Lucene indexes is still a roll-your-own thing, last I checked, and the implementation I wrote currently deals with smaller indexes where files can be pulled fully to disk as needed. S3 does support range requests, though, which I'd think would mimic random access well enough.
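The range-request piece, at least, is simple enough; a per-read fetch would look roughly like this (sketch only; the bucket, key, and offsets are invented):

    import boto3

    s3 = boto3.client("s3")

    def read_range(bucket, key, offset, length):
        # The HTTP Range header fetches an arbitrary byte slice of the object,
        # which is more or less what Lucene's random-access reads need
        resp = s3.get_object(
            Bucket=bucket,
            Key=key,
            Range=f"bytes={offset}-{offset + length - 1}",
        )
        return resp["Body"].read()

    # e.g. read a 4 KB block out of the middle of a segment file
    block = read_range("my-bucket", "index/_0.cfs", 1_048_576, 4096)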

Assuming whatever Elasticsearch deployment you're using is backed by SSDs, there'd likely be more latency with S3, but I'd expect it to scale pretty well. Internally, a Lucene index is an array of immutable, self-contained segment files, each holding the complete index data for its subset of documents. Searching multiple indices is pretty much just searching through all their segments, which can be as parallel as you want it to be.
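In Python terms the fan-out is basically this (a sketch; search_segment is a hypothetical stand-in for whatever per-segment search actually runs):

    from concurrent.futures import ThreadPoolExecutor

    def search_segment(segment, query):
        # Hypothetical per-segment search; segments are immutable,
        # so these calls need no coordination between them
        return []  # placeholder for real hits: [{"doc": ..., "score": ...}, ...]

    def search_all(segments, query, top_k=10):
        # Query every segment in parallel, then merge by score
        with ThreadPoolExecutor() as pool:
            per_segment = pool.map(lambda s: search_segment(s, query), segments)
        hits = [hit for hits in per_segment for hit in hits]
        return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]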

To be honest, I'm actually surprised the Elasticsearch company doesn't offer this as an option. Maybe because they sell hardware at markup?



