Building a Simple Search Engine That Works (karboosx.net)
242 points by freediver 16 hours ago | 69 comments




The idea behind search itself is very simple, and it's a fun problem domain that I encourage anyone to explore[1].

The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

A DBMS-backed approach breaks down surprisingly fast. It's probably perfectly fine if you're indexing your own website, but it will likely choke on something the size of English Wikipedia.
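
To make "breaks down" concrete, here is a minimal sketch of the naive DBMS-backed approach, assuming a hypothetical SQLite table of crawled pages (the table and function names are made up for illustration). A LIKE scan is fine for one small site, but it re-reads every stored page on every query:

    import sqlite3

    con = sqlite3.connect("site.db")
    con.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")

    def naive_search(term, limit=10):
        # LIKE with a leading wildcard defeats any index, so this is a
        # full table scan: fine for one site, hopeless at Wikipedia scale.
        rows = con.execute(
            "SELECT url FROM pages WHERE body LIKE ? LIMIT ?",
            ("%" + term + "%", limit))
        return [url for (url,) in rows]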

[1] The SeIRP e-book is a good (free) starting point https://ciir.cs.umass.edu/irbook/


> The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

Large amounts of data seem obviously difficult.

For your second difficulty, "handling underspecified queries": it seems to me that's a subset of the problem of, "given a query, what are the most relevant results?" That problem seems very tricky, partially because there is no exact true answer.

Marginalia Search is great as a contrast to engines like Google, in part because Google chooses to display advertisements as the most relevant results.

Have you found any of the TREC papers helpful?

https://trec.nist.gov/


I think in today's world the harder problem is evading SEO spam. A search engine is at constant war with adversarial players who need you to see their content for revenue, rather than the actual answer.

This necessitates a constant game of cat and mouse, where you adjust your quality metric so SEO shops can't figure it out and capitalise on it.


I feel at this point you'd almost be better off hand-curating a set of domains and only crawling those.

Not sure if this was intentional, but everything old is new again; back to, oh, Yahoo? Or Craigslist?

Not quite, in that you can curate domains but crawl all the URLs on those domains.

I think SEO spam + AI slop is likely to lead us back to human curation.


I wonder how hard it is when mice are not paying the cat to serve ads.

There are more kinds of search engines than just internet search engines. At this point I'm almost certain that the non-internet search engines of the world are much larger than internet search engines.

Edit: And I’m getting downvoted for this. If it’s because I am tangential to the original comment then that’s fair. If it’s because you think I’m wrong, I have worked on the two largest internet search engines in the world and one non-internet search engine that dwarfed both in size (although different in complexity).


What is the order of magnitude of the largest document store that you can practically serve from SQLite on a single thousand-dollar server, run by some text-heavy business process? For text search, roughly how big a corpus can we practically search if we're allowing... let's say five seconds per query, twelve queries per minute?

If you held a gun to my head and forced me to make a guess I'd say you could push that approach to order of 100K, maybe 1M documents.

If SQLite had a generic "strictly ascending sequence of integers" type[1] and optimized around it, you could probably push it farther in terms of implementing efficient inverted indexes.

[1] primary key tables aren't really useful here.
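
For illustration, here's roughly what a hand-rolled inverted index in SQLite could look like (a hypothetical schema; SQLite's built-in FTS5 module does a proper job of this for you). Each (term, doc_id) pair is one B-tree row, which is exactly the overhead a compressed ascending-integer type would avoid:

    import sqlite3

    con = sqlite3.connect("index.db")
    con.executescript("""
        CREATE TABLE IF NOT EXISTS docs (doc_id INTEGER PRIMARY KEY, url TEXT);
        CREATE TABLE IF NOT EXISTS postings (
            term   TEXT,
            doc_id INTEGER,
            freq   INTEGER,
            PRIMARY KEY (term, doc_id)
        ) WITHOUT ROWID;
    """)

    def lookup(term):
        # One B-tree range scan per term; doc ids come back sorted,
        # ready for merge-style intersection with other terms.
        rows = con.execute(
            "SELECT doc_id FROM postings WHERE term = ? ORDER BY doc_id",
            (term,))
        return [d for (d,) in rows]

Since each posting is a full row rather than an entry in a delta-compressed integer list, the index bloats quickly, which is the limitation the footnote is getting at.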



"Building A Complex Search Engine That Works Sometimes"

15% of the time it works every time.

> The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

I would expect the difficulty to be deciding which item to return when there are multiple that contain the search term. Is Wikipedia's article on Gilligan's Island better than some guy's blog post? Or is that guy a fanatic who has spent his entire life pondering whether Wrongway Feldman was malicious or how Irving met Bingo Bango and Bongo?

Add in rank hacking, keyword stuffing, etc. and it seems like a very hard problem, while scaling... is scaling? ¯\_(ツ)_/¯


That would be the "handling underspecified queries" thing I mentioned.

Elastic and many others fail to solve this problem too. There are many different strategies and many of them require ingenuity and development.

It's not like Elasticsearch lacks ranking algorithms and control thereof. But it can require tuning and adjustment for various domains. Relevancy is, after all, subjective.

Thank you very much for the recommendation. I am in the process of building knowledge base bots, and am confronted with the task of creating various crawlers for the different sources the company has. And this book comes in very handy.

Searching in general is difficult. It really is a hard problem.

If you haven't felt it, look at companies like Apple, Microsoft, or "the most important AI research lab in the world", OpenAI: their products have terrible search features even though their resources, money, and technology are top-notch.


I think the reason most companies can't implement a working search box is that the sort of work needed to make it perform adequately clashes catastrophically with the software development culture that has emerged in the corporate world (anything to do with sprints, Jira, and daily standups).

Getting search to work well requires a lot of fiddling with ranking parameters, work that is difficult bordering on impossible to plan or track. The work requires a degree of trust that developers are rarely afforded these days.


idk if that argument really makes sense. A lot of AI chatbot companies have terrible or broken webapps and backend servers because it's not what they really care about. They put billions into their AI models, not their search features. I think the shittiness of their search features is symptomatic of the company's incentives, not necessarily the difficulty of the problem.

A long time ago, I really enjoyed a course by David Evans from the University of Virginia about building a search engine and concepts of computer science.

Building a "classic" search engine is a very fun project to go through.

https://www.cs.virginia.edu/~evans/courses/cs101/

List: https://www.youtube.com/watch?v=9nkR2LLPiYo&list=PLAwxTw4SYa...

Dave's profile: https://www.cs.virginia.edu/~evans/


About a decade ago, I was working with a guy who was getting a PhD in search engine design, which I knew/know nothing about.

It was actually a lot of fun to chat with him, because he was so enthusiastic about how searching works and how it can integrate with databases, and he was eager to explain this all to anyone who would listen. I learned a fair amount from him, though admittedly I still don't know much about the intricacies of how search engines work.

Some day, I am going to really go through the guts of Apache Solr and Lucene to understand the internals (like I did for Kafka a few years ago), and maybe I'll finally be competent with it.


People who work on really obscure things love to talk about their work; heck, if someone would listen to me, I could talk for hours about what I do.

Unfortunately, very few people care about the minutiae of making a behemoth system work.


As I have gotten older, I have grown immense respect for older people who can geek out over stuff.

It’s so easy to be cynical and not care about anything, I am certainly guilty of that. Older people who have found things that they can truly geek out about for hours are relatively rare and some of my favorite people as a result (and part of the reason that I like going to conferences).

I like my coworkers and they’re certainly not anti-intellectual or anything, but there’s only so long I can ramble on about TLA+ or Isabelle or Alloy before they lose interest. It’s not a fault on them at all, there are plenty of topics I am not interested in.


It seems a common problem in our profession that you can’t really talk to anybody about what you are doing. My friends have a vague idea but that’s it.

I would be more than interested to listen to you and what you do. Do not hesitate to share (blog post, AskHN, ShowHN, ...)

I would. Heck, I bet half of HN would be interested in what kind of insanity lies under those behemoths.

I work in music streaming; it is mostly just a lot of really banal business rules that become an entangled web of convoluted if statements. Whether to show a single button might mean hitting 5 different microservices and checking 10 different booleans.

It's an interesting exercise. Having built searches before easily available OSS products existed, and when even the commercial offerings sucked: do not ever build your own a) database or b) search engine, unless you can clearly state the reason for doing so.

Entire cubicle farms of people have been devoted to this problem for years, and if you dare to do this for work because "I think I can", you will find yourself in an ocean of hurt.

"Hey, so it won't be so hard to add 'did you mean' functionality, right? And we were thinking of adding a taxonomy next year for easy navigation..."

Check. Mate.


My pet peeve with the search engines on content I use is that they regularly ignore 2-letter and 3-letter "words" or acronyms. If all I need is a search for "mp3", then stripping exactly that is not useful ;) (that was just the first file extension that came to mind, but "PHP" works just as well).
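
The culprit is usually a minimum-token-length (or stopword) filter in the analyzer. A toy tokenizer shows the failure mode (hypothetical code, not taken from any particular engine):

    import re

    def tokenize(text, min_len=1):
        return [t for t in re.findall(r"[a-z0-9]+", text.lower())
                if len(t) >= min_len]

    tokenize("convert wav to mp3 in php")             # keeps 'mp3' and 'php'
    tokenize("convert wav to mp3 in php", min_len=4)  # only 'convert' survives,
                                                      # so a query for "mp3" can never match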

Reminds me of reading Programming Collective Intelligence by Toby Segaran, which inspired me with a range of things, like building search, recommenders, classifiers etc.

I loved that book also, but I saw him a few years later saying in some YouTube video "don't use that book" because it is obsolete in his opinion.

That was a great book, I wonder what the 2025 equivalent of it is...

Prompting Inferred Intelligence: from ChatGPT to Claude

I wonder how well it would scale. Elasticsearch's performance is impressive even at an unrecommended scale.

Great read. It makes you wonder how heavily optimised the tokenizers used by popular search engines truly are.

Why isn't there a place, one that doesn't require auth, to post something where someone else will find it when searching? I get the logistics of what I'm asking, but I really think we need a global index.

Incredible article. Does what it claims in the title, is written well, and follows a linear chain of reasoning with a minimum of surprises.

Building a simple text search engine isn't that hard. People show them off on HN on a fairly regular basis. Most of those are fairly primitive. Unfortunately, building a good search engine isn't that straightforward. There's more to it than just implementing BM25 (the go-to ranking algorithm), which you can vibe code in a few minutes these days. The reason this is easy is that it's nineties-era research that is well publicized and documented, and not all that hard once you figure it out.
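
For reference, the whole of BM25 fits in a screenful. A from-scratch sketch (toy code, unoptimized, with the corpus represented as a list of token lists):

    import math
    from collections import Counter

    def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
        # Score one document against a query; `corpus` is a list of token lists.
        N = len(corpus)
        avgdl = sum(len(d) for d in corpus) / N
        tf = Counter(doc_tokens)
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in corpus if term in d)  # document frequency
            if df == 0 or tf[term] == 0:
                continue
            idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc_tokens) / avgdl))
            score += idf * norm
        return score

The hard part, as the rest of this comment argues, is everything around it: analysis, typo tolerance, field weighting, and scale.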

Building your own search engine is a nice exercise for understanding how search works. It gets you to the same level as a very long tail of "Elasticsearch alternatives" that really aren't coming even close to implementing a tiny percentage of its feature set. That can be useful as long as you are aware of what you are missing out on.

I've been consulting for companies for a few years on going from in-house coded solutions to something proper (typically Opensearch/Elasticsearch). Usually people paint themselves into a corner where their in-house solution starts simple and then grows more complicated as they inevitably deal with ranking problems their users encounter. Usual symptoms: "it's slow" (they are doing silly shit with multiple queries against postgres or whatever), "it's returning the wrong things" (it turns out that trigrams aren't a one-size-fits-all solution and return false positives), etc. Add aggregations and other things to the mix and you basically have a perfect use case for Elasticsearch about 10 years ago, before they started making it faster, smarter, and better.

The usual arguments against Elasticsearch & Opensearch:

"Elasticsearch/Opensearch are hard to run". Reality, there isn't a whole lot to configure these days. Yes you might want to take care of monitoring, backups, and a few other things. As you would with any server product. But it self configures mostly. Particularly, you shouldn't have to fiddle with heap settings, garbage collection, etc. The out of the box defaults work fine. Get a managed setup if all this scares you; those run with the same defaults typically. Honestly, running postgres is harder. There's way more to configure for that. Especially for high availability setups. The hardest part is sizing your vms correctly and making sure you don't blow through your limits by indexing too much data. Most of your optimizations are going to be at the index mapping level, not in the configuration.

"It's slow". That depends what you do and how you use it. Most of the simple alternatives have some hard limitations. If you under engineer your search (poor ranking, lots of false positives) it's probably going to be faster. That's what happens if you skip all the fancy algorithmic stuff that could make your search better. I've seen all the rookie mistakes that people make with Elasticsearch that impact performance. They are usually fairly easy to fix. (e.g. let's turn off dynamic mapping and not index all those text fields you never query on that fill up your disk and memory and bloat your indexing performance ...).

"I don't need all that fancy stuff". Yes you do. You just don't know it yet because you haven't figured out what's actually needed. Look, if your search isn't great and it doesn't matter, it's all fine. But if search quality matters and you lose user's interest when they fail to find stuff in your app/website it quickly can become an existential problem. Especially if you have competitors that do much better. That fancy stuff is what you would need to build to solve that.

Unless you employ some hard core search ranking experts, your internally crafted thing is probably not going to be great. If you can afford to run at ~2005 era state of the art (Lucene existed, SOLR & Elasticsearch did not, Lucene was fairly limited in scope), then go for it. But it's going to be quite limited when you need those extra features after all.

There are some nice search products out there other than Elasticsearch & Opensearch that I would consider fit for purpose; especially if you want to do vector search. And in fairness, using a search engine properly still requires a bit of skill. But that isn't any different if you do things yourself. Except it involves a lot less wheel reinvention.

There just is a bit of necessary complexity to building a good search product.


Seems like good advice; search has been built quite a few times now :-) I've defaulted to Elasticsearch myself.

However, have you tried running any of the "up and coming" alternatives that keep showing up here? In particular, https://github.com/SeekStorm/SeekStorm seems very interesting, though I haven't heard from anyone using it in prod.


A red flag for me is that it lists stopword lists as a feature. Those went out of fashion in Lucene/Elasticsearch because of some non-trivial but very effective caching and other optimizations around version 5.

Stopwords are an old-school optimization to deal with the problem of high-frequency tokens when calculating rankings. Basically, that means dealing with long lists of document ids for terms like "to", which potentially has a lot of overhead. The solution is to eliminate the need to do so by ranking clauses with low-frequency terms first and caching the results for those terms. You can eliminate a lot of documents by doing that. This gets rid of most of the overhead for high-frequency terms without resorting to simply filtering them out.
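
The intuition, in toy form (Lucene's actual machinery, e.g. block-max WAND, is far more sophisticated, so treat this as a sketch only): intersect posting lists starting from the rarest term, so the huge "to" list is only ever checked against the few documents that survived the rare terms.

    def candidates(query_terms, postings):
        # postings: term -> sorted list of doc ids
        terms = sorted(query_terms, key=lambda t: len(postings.get(t, ())))
        if not terms:
            return set()
        surviving = set(postings.get(terms[0], ()))  # rarest term first
        for t in terms[1:]:
            surviving &= set(postings.get(t, ()))    # each step only shrinks the set
            if not surviving:
                break
        return surviving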

The key test here is queries that consist of stop words, like "to be or not to be" to find documents about Hamlet. If you filter out all the stop words, you are not going to get the right results on top.

Just an example of where SeekStorm can probably do better. I have no direct experience with it though. So, maybe they do have a solution for that.

But you should treat the need for stopword lists as a red flag for probably fairly immature takes on this problem space. Elasticsearch doesn't need them anymore. Also, what do stopword lists look like if you need multilingual support? Who maintains these lists for different languages? Do you have language experts on your search team doing that for all the languages you need to support? People always forget about operational overhead like that. Stopword lists are fairly ineffective if you don't have people curating them, and they create obvious issues with certain queries.


The stopword list in SeekStorm is purely optional; by default it is empty.

The query "to be or not to be" that you mentioned, consisting solely of stopwords, returns complete results and perform quite well in the benchmark: https://github.com/SeekStorm/SeekStorm?tab=readme-ov-file#be...

Both Lucene and Elastic still offer stopword filters: https://lucene.apache.org/core/10_3_2/analysis/common/org/ap... https://www.elastic.co/docs/reference/text-analysis/analysis...


Thanks for correcting me and clarifying this.

> "I don't need all that fancy stuff". Yes you do.

> let's turn off dynamic mapping and not index all those text fields you never query on


What do you think about ManticoreSearch? It has been around longer than Lucene.

I have no experience with ManticoreSearch, but they've been around for a while. I think it might be a Sphinx spinoff; Sphinx was a long-abandoned Solr-like search engine written in C++ that they seem to have forked (correct me if I'm wrong). It's mainly popular for some e-commerce use cases (as is the case with Solr). Looking at their front page, I don't see any compelling reason to switch, and I see a couple of things that I don't like:

- GPLv3: better than the AGPLv3 of Elasticsearch but less permissive than the Apache 2.0 of Opensearch.

- They seem to emphasize being a drop-in replacement a lot. Which raises the question: why not just stick with Opensearch?

- I'm very skeptical of benchmarks in this space. Mostly they are apples-and-oranges comparisons. As I argued earlier, it mainly raises the question of what they are not doing or skipping. Barring major algorithmic improvements, which Lucene developers could just copy if they're valid, I don't see how they could be better/faster. And Lucene is of course well known to be heavily optimized and still squeezing out a lot of performance from release to release. Progress has been pretty substantial in v8 and v9 in recent years.

Other than that they seem to know what they are doing is the best I can say about it.


Yeah, it forked from Sphinx when Sphinx died or rug-pulled or something like that. It seems to be very actively maintained.

I think one of the main draws is that it is a single binary rather than the complexity of ES/OS, the JVM, etc. And if you have a MySQL/MariaDB database, it just connects and automatically ingests extremely quickly. They also use Galera for replication, but I also think it's not as explicitly sharded, which simplifies things.

Yeah, their benchmarks are astounding, so much so that it is hard to believe. Yet I have seen them be quite open to feedback, collaboration, etc., so...

Anyway, thanks for your thoughts and insights!


Good. Now please someone replace Google's search engine.

I am always annoyed when using it by how bad it is these days. Then I try alternatives such as DuckDuckGo, and they manage to be even worse.

Qwant is semi-OK, but it also omits tons of things that Google Search finds (and is also slower, for some weird reason).

Google's UI nerf is also annoying - so much useless stuff. In the past I could disable that via uBlock Origin, but Google killed that for Chrome.

We need to do something against this Evil that Google brought into this world.


Not quite independent as it's a meta-search, but I developed a subscription-based one at search.waterfox.net. It pays for the infrastructure costs and remains ad/tracking-free.

Nice! I couldn't see the list of search engines that are included in your meta-search; the FAQ currently seems to imply that it only serves Google results?

If you give users the option to include / not include certain search engines in their results, so their money never goes to those particular engine companies, that could be of interest to some Kagi refugees.

I ended up vibe coding my own meta-search engine (augmented with a local SQLite database of hand-picked sites) so that I could escape Kagi, but I'm excited if Waterfox Search is an alternative I can recommend to others!


Currently only Google, but Brave and Mojeek are going to be made available as well very soon.

Try kagi.com. I tried and stayed. It’s paid though.

I also used Kagi, but decided to cancel my subscription last year when it was revealed that they pay Yandex, a Russian company that ultimately fuels the Russian war on Ukraine, for their search results.

Once Kagi stops transferring money to Russia, I'd be happy to re-subscribe.


I have the feeling that, if you look a little closer, a lot of the products you are using are supporting atrocities somewhere, directly or indirectly.

Do you have a source on how funding Yandex funds the war? Yandex is a great search engine, so I would hate to find out that this is true.

  https://en.wikipedia.org/wiki/Yandex#Legal_issues_in_Ukraine
  https://www.zois-berlin.de/en/publications/zois-spotlight/the-sad-fate-of-yandex-from-independent-tech-startup-to-kremlin-propaganda-tool

It's based in Russia so it presumably pays taxes and salaries in Russia.

All American companies pay taxes to America, which is basically always committing atrocities, so I don't think that's a strong enough reason on its own.

- How many people use only Google's search engine nowadays? More and more people use chatbots alongside Google search.

- Google search also does not provide good results for finding stuff in walled gardens, so we also use niche search engines for individual platforms. I am not sure it finds good results for posts on Facebook and x.com.

- I also use my own index of pages, YouTube channels, and GitHub pages. It contains tags, a page-scoring system, related links, social information like number of followers, etc.

https://github.com/rumca-js/Internet-Places-Database

So in a way, it is being replaced. It just takes some time for people to switch.


I completely agree with the insight that full-text search has been complexified. People seem to want to jump straight to clustering or other enterprise-level things.

I also appreciate the moxie of getting in there and building it yourself.

Myself, I reach for Lucene. Then you don't need to build all this yourself if you don't want to. It lives in a dir on disk. True, it's a separate database, but one optimized for this problem.


This was the solution I was thinking about, but I thought, well, that's the way someone would have done it 20 years ago.

Alright, but why do we not have more search engines that are actually good?

I'd love to cut myself off from Google, including Google Search, but the alternatives manage to be even worse. Consistently so. It's as if Google won the war by being just permanently slightly better - while everyone else is actually really crap. That wasn't the case, say, 10 years ago or so.


Because it's not a simple problem space. Lucene has gone through about three decades of heavy optimization, feature development, and performance tuning. A lot of brain power goes into that.

Google bootstrapped the AI revolution as a side effect of figuring out how to do search better. They started by hiring a lot of expert researchers that then got busy iterating on interesting search engine adjacent problems (like figuring out synonyms, translations, etc.). In the process they got into running neural networks at scale, figuring out how to leverage GPUs and eventually building their own TPUs.

The Acquired podcast recently did a great job of outlining the history of Google & Alphabet.

Doing search properly at scale mainly requires a lot of infrastructure. And that's Google's real moat. They get to pay for all of it with an advertising money-printing machine. Which, BTW, leverages a lot of search algorithms: matching advertisements to content is a search problem, and Google just got really good at that. That's what finances all the innovation in this space, from deep learning to TPUs. Being able to throw a few hundred million at running some experiments is what makes the difference here.


I use the '4get' proxy search engine, which lets you use pretty much every search engine under the sun, for both websites and images. It's really useful because it is faster than Google, and if you need to find some pages you can quickly switch search engines.

It is open source and there are many instances available, I use '4get.bloat.cat' or '4get.lunar.iu'

It is a better alternative to SearX for sure


I checked the about page on 4get.bloat.cat, and within the first paragraph of the "what is this" section, it used the phrase "globohomo bullshit". I don't think these are people I want to support.

Not all search is web-wide search. The best-known example of this is probably Amazon's search bar. No one really wants to search Amazon via Google. They have staffers contributing heavily to Lucene.

But also there are all kinds of other applications. Let's say you run a reviews site; you can build a bespoke power search form allowing people to sort on things like price and date of review, set a minimum star threshold, etc. You can also weigh product names or review titles more heavily in the index scoring (a review /of/ the Pixel 10 should rank higher than a review that merely mentions the Pixel 10 prominently).

Even being able to sort results of searching blog posts or other dated content by date is powerful - Google can only guess at the actual dates of those posts. You can search with required tags, or weigh tags more heavily in result scoring. You can put your finger on the scale and say, effectively, post A should always rank more highly than post B for term X.
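
In toy form, field weighting is just a weighted sum over per-field scores (hypothetical field names and weights; engines like Lucene express the same idea as per-field boosts at query time):

    FIELD_WEIGHTS = {"title": 3.0, "tags": 2.0, "body": 1.0}

    def weighted_score(per_field_scores):
        # per_field_scores: field name -> relevance score for this document
        return sum(FIELD_WEIGHTS.get(f, 1.0) * s
                   for f, s in per_field_scores.items())

    weighted_score({"title": 2.4, "body": 0.7})  # a title hit dominates a body hit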

Also, site operators know traffic/popularity, which internet search engines can only sort of guess at, and can use this to score/sort. Amazon clearly does this.

For some reason a lot of web devs seem to think search is this really hard problem. But once you learn the basics of how it works, and if you use a library like Lucene, it does not need to be hard at all. Mostly you just have to be strategic and consistent about where and when you index and deindex content; it usually happens alongside your db persistence calls. Once it's running, you optimize by sprinkling a minimum amount of magic on your scoring setup to make it worthwhile/differentiated from Google.


These days I default to DDG. Not because it's improved but because Google's results are just that bad. Even a couple years ago I was reaching for Google with a lot more frequency.

Love the style, colors, and the cookie popup on https://karboosx.net/. Does anyone know if it's an open-source framework/style/tool being used here, or is it just the superb web dev skills of the author?


