Hacker News | dtunkelang's comments

In general, the two ways to compute counts are top-down, by making a separate query for each filter, or bottom-up, by scanning the results and aggregating the counts, like a group-by. Top-down is good for a small universe of values, but bottom-up tends to be the scalable approach. And, as has been pointed out, you can produce approximations by aggregating a sample of the results -- as long as it is a representative random sample. Just be mindful of statistics, particularly confidence intervals.
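As a toy sketch of the bottom-up approach and of sampling (the data, function names, and scale factor here are mine, for illustration only):

```python
import random
from collections import Counter

def facet_counts(results, facet):
    """Bottom-up: scan the matching results once and aggregate, like a group-by."""
    return Counter(doc[facet] for doc in results if facet in doc)

def approx_facet_counts(results, facet, sample_size, seed=0):
    """Approximate counts from a uniform random sample, scaled back up.
    Uniformity matters: a biased sample (e.g., only the top-ranked
    results) will skew the estimates."""
    random.seed(seed)
    sample = random.sample(results, min(sample_size, len(results)))
    scale = len(results) / len(sample)
    return {value: round(count * scale)
            for value, count in facet_counts(sample, facet).items()}

docs = [{"brand": "Nike"}] * 600 + [{"brand": "Adidas"}] * 400
print(dict(facet_counts(docs, "brand")))  # {'Nike': 600, 'Adidas': 400}
print(approx_facet_counts(docs, "brand", 200))
```

With a sample of 200 out of 1,000, the estimates land near the true counts, but with sampling error on the order of the binomial standard deviation; that's where the confidence intervals come in.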

A related issue is that counts tend to treat all results as equal. If you retrieve a lot of results but most of them are not relevant -- as can happen with full-text search -- then the counts can be misleading. You may have the converse problem if your retrieval excludes a lot of relevant results. So, if you are implementing a faceted search application where you use and show counts, you should keep in mind that it will only work if your retrieval does a reasonable job of balancing precision and recall.
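A toy illustration of how counts over everything retrieved can diverge from counts over the relevant subset (the scores and the 0.5 cutoff are invented for the example):

```python
from collections import Counter

# Retrieved results with a relevance score; the low-scoring hits are
# only incidental keyword matches.
results = [
    {"brand": "Nike",   "score": 0.9},
    {"brand": "Nike",   "score": 0.8},
    {"brand": "Adidas", "score": 0.2},
    {"brand": "Adidas", "score": 0.1},
    {"brand": "Adidas", "score": 0.1},
]

all_counts = Counter(r["brand"] for r in results)
relevant_counts = Counter(r["brand"] for r in results if r["score"] >= 0.5)

print(dict(all_counts))       # {'Nike': 2, 'Adidas': 3}
print(dict(relevant_counts))  # {'Nike': 2}
```

Counting everything retrieved suggests Adidas dominates; counting only the relevant results tells the opposite story.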

Finally, remember that supply != demand. The distribution of a facet in your index may be different from the distribution of that facet in searcher intent. A bit more on that here: https://dtunkelang.medium.com/search-intent-not-inventory-28...


BTW, I highly recommend the Relevance Slack as a good space to discuss this and search-related topics generally. It's run by OpenSource Connections, but it's free and anyone can join, even folks who compete with them. It's the best community I have found for search developers.

https://opensourceconnections.com/blog/2021/07/06/building-t...


While the acquisition was good for me financially, I agree that Oracle didn't really invest in sustaining Endeca as a product. I am proud of what we achieved at Endeca, but search has come a long way since then.

Coincidentally, I now consult for Algolia. :-)


Cool, I thought I was the only one who has uttered Endeca and Algolia in the same sentence.

I wrote up some tips/impressions on configuring Algolia for ecomm:

https://www.avikaminetzky.dev/posts/algolia-ecommerce-nextjs


I've found that, for mobile screens, the most valuable refinement real estate is often the top row of the results. Since that space is very limited, there's only room for the most useful facets or filters. And you have to decide whether to show keys or values, e.g., "Brand" as a key or "Nike", "Adidas", etc. as values. Showing keys takes up less space and allows you to cover more ground, but showing values may be more useful -- and certainly more discoverable -- to the user, since there's one less step. As with all things, it's a tradeoff, and I don't think there's been that much research on optimizing it.


Thanks for the kind words! I'm not usually on HN, but I just discovered this thread and am happy to contribute if I have anything useful to add. The book is a bit dated, but I have continued to post about search more broadly on Medium, particularly on the topic of query understanding.


As others have pointed out, most search engines don't support natural language search in general, let alone natural language negation in particular.

There are several reasons for this, including the following:

1) Natural language understanding for search has gotten a lot better, but it is still not as robust as keyword matching. The upside of delighting some users with natural language understanding doesn't yet justify the downside of making the experience worse for everyone else.

2) Most users today don't use natural language search queries. That is surely a chicken-and-egg problem: perhaps users would love to use natural language search if it worked as well or better than keyword search. But that's where we are today. So, until there's a breakthrough, most search engine developers see more incremental gain from optimizing some form of keyword search than from trying to support natural language search.

3) Even if the search engine understands the search query perfectly, it still has to match that interpretation against the document representation. In general, it's a lot easier to understand a query like "shirt with stripes" than to reliably know which of the shirts in the catalog do or don't have stripes. No one has perfectly clean, complete, or consistent data. We need not just query understanding, but item understanding too.

4) Negation is especially hard. A search index tends to focus on including accurate content rather than exhaustive content. That makes it impossible to distinguish negation from simply not knowing. It's the classic problem of absence of evidence not being evidence of absence. This is also a problem for keyword and boolean search -- negating a word generally won't negate synonyms or other variations of that word.

5) The people maintaining search indexes and searchers co-evolve to address -- or at least work around -- many of these issues. For example, most shoppers don't search for a "dress without sleeves"; they search for a "sleeveless dress". Everyone is motivated to drive towards a shared vocabulary, and that at least addresses the common cases.
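A minimal sketch of the negation problem in point 4 (the toy index and doc ids are mine): boolean NOT only excludes documents that explicitly contain the negated term, so it can't exclude documents whose data simply never mentions it.

```python
# Toy inverted index: term -> set of doc ids.
index = {
    "dress": {1, 2, 3, 4},
    "sleeveless": {2},
    "sleeves": {3},  # doc 3 explicitly mentions sleeves
    # Doc 4 has sleeves, but its description never says so:
    # absence of the term is not evidence of absence.
}

def boolean_not(index, must, must_not):
    """All docs containing `must` minus docs containing `must_not`."""
    return index.get(must, set()) - index.get(must_not, set())

# "dress -sleeves" keeps doc 2 (good) but also doc 4 (bad), because
# the index never recorded that doc 4 has sleeves.
print(boolean_not(index, "dress", "sleeves"))  # {1, 2, 4}
```

And negating "sleeves" does nothing about a document tagged only with the synonym-like "sleeveless" or "short-sleeved" -- the exclusion is purely lexical.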

None of this is to say that we shouldn't be striving to improve the way people and search engines communicate. But I'm not convinced that an example like this one sheds much light on the problem.

If you're curious to learn more about query understanding, I suggest you check out https://queryunderstanding.com/introduction-c98740502103


Yes, this is real. And she could really use the help. It was very brave of her to come forward this way: she took personal risk, and it's not like she's gaining anything from this personally. On the contrary, she's facing threats of legal action, as well as the risk that future employers might avoid her because of association with this dumpster fire.


> "The idea was started because Daniel Tunkelang of Netflix was very anti RSS."

Um, what? I've never worked at Netflix, and I've never been anti-RSS. And I'm pretty sure there isn't another Daniel Tunkelang. :-)


I'm curious to see whether having the dictionary in a trie allows you to build an implementation that is O(n). I'll add it to the blog post if it works.
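Assuming the puzzle is segmenting a string into dictionary words (this code is my sketch, not the post's): a trie lets the dynamic program abandon each starting position as soon as no dictionary word can extend the current prefix. That gives O(n * L) for maximum word length L -- which is O(n) only if word length is bounded by a constant, not in general.

```python
class TrieNode:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def segment(text, root):
    """Return one segmentation of `text` into dictionary words, or None.
    From each reachable start position, walk the trie and stop as soon
    as no dictionary word can extend the prefix."""
    n = len(text)
    reachable = [False] * (n + 1)
    back = [None] * (n + 1)  # back[i] = start of the last word ending at i
    reachable[0] = True
    for i in range(n):
        if not reachable[i]:
            continue
        node = root
        for j in range(i, n):
            node = node.children.get(text[j])
            if node is None:
                break  # no word starts with text[i:j+1]; prune this start
            if node.is_word and not reachable[j + 1]:
                reachable[j + 1] = True
                back[j + 1] = i
    if not reachable[n]:
        return None
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

trie = build_trie(["in", "sight", "insight", "full"])
print(segment("insightfull", trie))  # ['insight', 'full']
```

The inner loop runs at most L steps per start position (the depth of the trie), which is where the pruning pays off compared to trying every substring against a hash-set dictionary.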


Peter actually emailed me directly, though I was already very familiar with his work on this and similar problems. But yes, I make it very clear in the post (and to candidates) that they should not assume an English -- or even a natural language -- dictionary.

