Does anyone here have experience trying to build a search engine, even a "boutique" one, for the web? Is it something an individual could conceivably operate on their own?
> Does anyone here have experience trying to build a search engine, even a "boutique" one, for the web? Is it something an individual could conceivably operate on their own?
Yes, I've built a curated boutique search engine like the one described in the article. The short summary is that it is possible for one person to build a useful search engine nowadays, if they don't attempt to index the whole of the internet (the niche mine covers is personal and independent websites). I've written plenty of details about building it on the blog at https://blog.searchmysite.net/ , but key points you might find useful:
- It is now indexing around 6.5M pages, which is around a quarter of what Google indexed when it launched in 1998.
- Estimated running costs now look set to exceed US$1000 a year. (I could change hosting provider to reduce this.)
- In January 2021 I estimated I'd spent around 350 hours building it (evenings and weekends over the preceding year). I haven't estimated how long I have spent this year, but it won't be quite as much as last year.
Such an interesting idea! I tried it just once, with 'startup ideas', and I can see how it gives more useful results than Google for this simple phrase.
Thanks for your feedback. I was hoping that the paid listings for the search-as-a-service would cover the running costs so it could be self-sustaining, although to be honest it is still a long way off doing that.
I've built https://search.marginalia.nu/ from scratch, as a solo hobby project. It's literally just a computer in my living room.
Hardware investment is about $3-4k as a one-time cost, and then I estimate I'll need a new 1 TB SSD every couple of years, as the server does kind of chew through them with great appetite.
My monthly operational costs are $15 in power and $20 for Cloudflare, because I kept getting DDoSed by botnets.
As for development time, I dunno. I've been working on it in my spare time since sometime this spring; generously estimated at 30h/week x 30 weeks, the upper bound may be 900 hours, but it's probably closer to something like 600 hours, as I have other projects as well and I'm not always feeling it.
I don't think off-the-shelf search solutions or databases are viable: they are too flexible, which means they can't be fast and space-efficient enough to keep costs down. They're meant to run in a data center, not on a single computer, and that means your operational costs will be prohibitive.
It's required a lot of old-fashioned wizardry to build, though: bit-twiddling and demoscene-esque hacks to coax a lot of data into a minimal amount of space, the kind of micro-optimization that is usually a waste of time, except that here the data set is so large that saving single bytes in object encoding often translates to saving multiple gigabytes. If you aren't at least fairly comfortable building custom compression algorithms, memory-mapped hash tables, things like that, it's gonna be a rough project. If I didn't have a background in low-level programming, this would have been nearly impossible.
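To give a flavor of the byte-level tricks meant here, below is a minimal sketch, in Python for brevity, of delta-plus-varint compression for a sorted posting list. It illustrates the general technique, not marginalia.nu's actual code:

    # Sketch of delta + varint compression for a sorted posting list.
    # Illustrative only; not the actual marginalia.nu implementation.

    def encode_postings(doc_ids: list[int]) -> bytes:
        out = bytearray()
        prev = 0
        for doc_id in doc_ids:          # doc_ids must be sorted ascending
            gap = doc_id - prev         # store the gap, not the absolute ID
            prev = doc_id
            while gap >= 0x80:          # varint: 7 payload bits per byte,
                out.append((gap & 0x7F) | 0x80)  # high bit = "more bytes follow"
                gap >>= 7
            out.append(gap)
        return bytes(out)

    def decode_postings(buf: bytes) -> list[int]:
        doc_ids, prev, gap, shift = [], 0, 0, 0
        for b in buf:
            gap |= (b & 0x7F) << shift
            if b & 0x80:
                shift += 7
            else:
                prev += gap
                doc_ids.append(prev)
                gap, shift = 0, 0
        return doc_ids

Since posting lists are sorted, most gaps are small, so an ID that would take 8 bytes raw usually shrinks to one or two bytes; multiplied over hundreds of millions of postings, that is exactly the "single bytes become gigabytes" effect described above.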
Beyond that, most of this stuff you can pick up along the way. I didn't really know shit about building search engines before I started. I just threw together a design that made sense and built... something, and iterated upon that. With every iteration it's gotten faster, smaller, better, smarter. I think the upcoming release is gonna be yet another huge improvement.
Yes. Gigablast is run by one person. Assuming you roll your own hardware and colocate it, you can get by with a fairly minimal setup. The catch is that once you hit scale, your servers get hammered. That's one reason I have been investigating using AWS Lambda to hold and search the index: it solves the initial scaling problem.
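For what it's worth, a Lambda-backed search could look roughly like the sketch below: the caller fans a query out to one invocation per index shard, and each invocation pulls its shard from S3 and intersects posting lists. The bucket name, shard layout, and JSON index format are all hypothetical, not Gigablast's actual design:

    # Hypothetical sketch: one Lambda invocation per index shard.
    import json
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        shard_key = event["shard"]   # e.g. "index/shard-0042.json" (made-up layout)
        terms = event["terms"]       # pre-tokenized query terms

        # Pull this shard's posting lists from S3.
        obj = s3.get_object(Bucket="example-search-index", Key=shard_key)
        postings = json.loads(obj["Body"].read())   # {term: [doc_id, ...]}

        # AND-query: intersect the posting lists of all terms.
        result = None
        for term in terms:
            docs = set(postings.get(term, []))
            result = docs if result is None else result & docs

        return {"docs": sorted(result or [])}

The appeal would be near-zero idle cost and shards that scale out with query load, at the price of cold starts and per-request S3 reads.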
Honestly, though, the big issue is crawling. Not only is it a bandwidth monster, but many sites are hostile to non-Google bots, as are CDNs and Cloudflare.
> Honestly, though, the big issue is crawling. Not only is it a bandwidth monster, but many sites are hostile to non-Google bots, as are CDNs and Cloudflare.
I do my own crawling and don't agree with any of these statements. Bandwidth is not a bottleneck, and blocking is mostly only a problem if your bot is too aggressive.
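For a rough idea of what "not too aggressive" means in practice, a polite fetcher mostly comes down to honoring robots.txt and rate-limiting per host. A minimal Python sketch, where the bot name and the 10-second delay are placeholder assumptions, not what marginalia.nu actually uses:

    # Minimal sketch of a polite fetch loop; real crawlers also honor
    # Crawl-delay and back off on 429/503 responses.
    import time
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urlparse

    USER_AGENT = "MyHobbyBot/0.1 (+https://example.com/bot)"  # placeholder name
    PER_HOST_DELAY = 10.0  # seconds between requests to the same host

    robots_cache = {}
    last_fetch = {}

    def allowed(url: str) -> bool:
        host = urlparse(url).netloc
        rp = robots_cache.get(host)
        if rp is None:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                pass  # unreachable robots.txt: can_fetch() then stays False
            robots_cache[host] = rp
        return rp.can_fetch(USER_AGENT, url)

    def fetch(url: str) -> bytes | None:
        host = urlparse(url).netloc
        if not allowed(url):
            return None
        wait = PER_HOST_DELAY - (time.time() - last_fetch.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)  # throttle per host so no single site gets hammered
        last_fetch[host] = time.time()
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read()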
Could be due to the long-form content you index. I found those sorts of sites tend to be less reluctant about third parties. It's also possible you are more talented at crawler writing than I am.
Crawling is a challenge, but not as much as mentioned. We have run a respectful bot since 2004, so the problems for us are more about awareness than hostility.
Indexing is the bigger challenge. An efficient, fast service over an index of billions of pages is a different level of problem to one over millions of pages.
I have always found crawling harder, personally. At least with indexing you can switch to batch processing, or split the index into real-time and stale portions to give that fresh feel. Just my personal opinion though; I have never had the chance to index billions of pages. Several hundred million is about as far as I have gone.
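To make that real-time/stale split concrete, here is a hedged sketch in which a small in-memory tier of freshly crawled pages is merged at query time with a large batch-built tier; all names are illustrative:

    from collections import defaultdict

    # Sketch of a two-tier index: a big "stale" tier rebuilt in batch
    # (e.g. nightly) plus a small "fresh" tier updated as pages are crawled.
    class TieredIndex:
        def __init__(self, stale: dict[str, set[int]]):
            self.stale = stale              # batch-built: term -> doc IDs
            self.fresh = defaultdict(set)   # incremental: term -> doc IDs

        def add_fresh(self, doc_id: int, terms: list[str]) -> None:
            for term in terms:
                self.fresh[term].add(doc_id)

        def search(self, term: str) -> set[int]:
            # Union of both tiers; the fresh tier gets folded into the
            # stale tier at the next batch rebuild and then cleared.
            return self.stale.get(term, set()) | self.fresh.get(term, set())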
A major cost for a web search engine is that you have to index the web, and it's pretty big. For even a modest commercial enterprise this isn't insurmountable, but it's a lot to ask for a hobbyist.
Attic.city was developed and is run with less than one FTE (two part time founders). Definitely boutique — home and fashion products from indie stores in the US — and thereby manageable both in terms of labor and the stack. We’re growing incrementally. Lately the focus has been on internal tooling and progressively automated health/status metrics.