
"Kagi is building a novel ad-free, paid search engine and a powerful web browser as a part of our mission to humanize the web."

With 670K?

Cuil went through about $30 million to develop a standalone search engine. And that was fifteen years ago, when search was simpler.

There may be a market for a search engine company that profitably runs a low-cost operation with very few ads and makes real efforts to keep out spam. Everybody is fed up with Google and Bing. There's a great opportunity here to disrupt the industry and destroy a trillion dollars in market cap.

But not for $670K.



I think a small operation is exactly the sort of outfit to do it. Makes you focus on what is important.

The absolute worst way you can arrange a search engine project is as some sort of Manhattan Project with a humongous budget and an army of professors and experts.

History is littered with bold and ambitious Google killers that went nowhere.

You can throw almost any amount of money at an operation, and it will gobble it up. I think search in particular is very prone to bloated R&D budgets and various forms of mission creep.

Yet the underlying reality is that software development scales very poorly with organization size, and the larger your organization is, the harder it is to steer and the more difficult it is to make the right calls. A squad of 3-4 motivated and talented guys is absolute peak get-shit-done.


>I think a small operation is exactly the sort of outfit to do it.

A small operation in terms of number of developers yes. But not being able to subsidise user growth means they will never be able to build their own full web index as the fixed cost of doing that is too high for a small number of users.

As a consequence, they will always be at the mercy of Google/Bing. The range of things they can innovate on will always be limited. And the situation can only get worse as people start to expect more AI functionality.

I doubt that you can build a sustainable niche product if the effort you have to put into it is just as big as if you're building for billions of users. Having few users is not what defines a niche. Niches are defined by specialisation.


//As a consequence, they will always be at the mercy of Google/Bing. The range of things they can innovate on will always be limited.

Sure. But Google/Bing have shown a lack of interest in innovating for power searchers, and I believe that will continue.

And even so, this space is big; I'm sure there are many niches that would fit a scrappy, subscription-based player.


> A small operation in terms of number of developers yes. But not being able to subsidise user growth means they will never be able to build their own full web index as the fixed cost of doing that is too high for a small number of users.

What do you reckon the cost of doing this would be?

Just doing the napkin math for, say, a Mojeek-sized index (a couple of billion docs) doesn't seem to justify a particularly astronomical budget.


>What do you reckon the cost of doing this would be?

That is indeed the key question. I tried to find out before commenting but the information I found is very vague. Internet Archive spends millions per year, but their index is updated far too slowly for a search engine. I have no idea what it costs to create a Google-size index.

Do you think the Mojeek index is good enough to compete with Google?


Google started becoming an answer engine from ~2015 with the introduction of "People Also Ask", and arguably earlier than that [0]. Mojeek.com is an (information retrieval) search engine, and we resist the temptation to also become an answer engine. So you might say we do not compete with Google. After all, we have a very different business model and proposition.

As for the index: it underpins mojeek.com and our API, which customers use for search and/or AI. Common Crawl is ~3.5 billion pages and underpins LLMs. Our index is ~7 billion. Who knows what (else) Google and Bing do with their indexes ;)?

[0] https://blog.mojeek.com/2023/05/generative-ai-threatens-dive...


>Google started becoming an answer engine from ~2015 with the introduction of "People Also Ask", and arguably earlier than that [0]. Mojeek.com is an (information retrieval) search engine, and we resist the temptation to also become an answer engine. So you might say we do not compete with Google.

I'm sure you know your users well after so many years in the search engine business, but having read your article I must say I find your approach risky. You seem to be betting on search engines and answer engines continuing to be complementary rather than substitutes.

But we are not the ones making this decision. Users will be making the decision in light of the newly available AI capabilities, and they will be making it with complete disregard for the health of the web, as is their nature :)

The "funny" thing is that big publishers are as happy right now as I haven't seen them in the past 25 years, because it is so completely obvious that chat AIs will destroy the web unless big tech starts making big payments to big publishers. As you rightly say, small businesses and publishers will be collateral damage.

But how do you make sure you're not collateral damage as well?


It's basically the safest position you could be in.

An LLM digesting the results of a classic search index is greater than the sum of its parts. An LLM that is not permitted to brush up on the relevant literature before answering a question generally doesn't produce very good answers, is prone to hallucinations, etc. A pure LLM design isn't even a serious contender in the answer engine space.


That would mean it's safe if "you" are Google or Microsoft+OpenAI as no one else has both a search index and an LLM.


You don't need to have both to sell search index access to anyone with an LLM, which seems like just about anyone these days.


Why would publishers allow you to crawl their sites if you're not sending them any traffic?

The big publishers certainly won't let you do that as they are selling their data to Google, Microsoft, Facebook and whoever else has the money to train a fully fledged LLM, which is certainly not everyone.


Because it lets them sell data to other parties than Google and Facebook? That's actually pretty great. Only having 1 or 2 customers kinda sucks.

A search engine partnering with an answer engine may not send traffic, but the answer engine is a potential customer for the websites the search engine directs it to.


>Because it lets them sell data to other parties than Google and Facebook?

Only indirectly by charging search engines for access to content. It would be an entirely different business model that requires a complex set of agreements between publishers, search engines and LLM providers.

Granted it's not impossible and certainly worth considering if you have search engine expertise but no money to train an LLM.


> Do you think the Mojeek index is good enough to compete with Google?

As a back-end for something like Kagi, it's sure getting there. Most of what sets Google apart is their exceptional level of user profiling. The actual indexing technology is likely on par with most of their competition.

Of course very little of that enters into API queries.


The Internet Archive snapshots the whole page while Google only indexes the first few KB of text data on every page (I forget the number; was it 100 KB max per page?), so maybe it'll cost less to build a search engine index than an internet preservation project?

Assuming they'll need to index 820 billion pages (the number of pages preserved in the Internet Archive) at 100 KB each, and assuming they use a database with 0.3x text data compression efficiency, they'll need at least 24,600 TB to store that text data. Assuming $300 per 16 TB disk, that's roughly 1,540 disks, or about $460,000 for disks alone. That's a fair amount of money just for storage, and we haven't included stuff like replication and backup, indexing metadata overhead, etc.
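
For anyone who wants to poke at the assumptions, here's the same napkin math as a tiny TypeScript sketch (every input is the figure assumed above, not a real measurement):

    // Napkin math for raw index storage, using the figures assumed above.
    const pages = 820e9;           // pages preserved in the Internet Archive
    const bytesPerPage = 100e3;    // ~100 KB of text per page
    const compression = 0.3;       // compressed size is ~0.3x the raw text
    const diskCapacityTB = 16;     // TB per disk
    const diskPriceUSD = 300;      // USD per 16 TB disk

    const storageTB = (pages * bytesPerPage * compression) / 1e12; // ~24,600 TB
    const disks = Math.ceil(storageTB / diskCapacityTB);           // ~1,540 disks
    const diskCostUSD = disks * diskPriceUSD;                      // ~$460,000

    console.log({ storageTB, disks, diskCostUSD });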


This seems like a good use case for some sort of spam filtering. Maybe the web is that big, but what percentage of it is data someone will ever care about? I wonder if you could make a good-enough search with aggressive filtering of pages before they enter the index.


A search engine doesn't index the HTML code. You're looking at a few KB per document. You also don't need the multiple historical snapshots of each document that the Wayback Machine retains.

So you're looking at maybe 20 bn docs at 4 KB each: around 100 TB before compression.


True, but then Google doesn't just download the page source and index that. They run JavaScript in some cases to get to the actual content. This must come at a significant cost. Their index is enormous as well:

"The Google Search index contains hundreds of billions of web pages and is well over 100,000,000 gigabytes in size."

https://www.google.com/intl/en_uk/search/howsearchworks/how-...

Doesn't mean you have to be as big as Google to do something useful of course.


Sure, download and run the JavaScript, but then you can snapshot the DOM, grab the text, and discard all the rest. The HTML and JS are of little practical value for the index after that point.

Google's index is likely very large because they don't have any real economic incentive to keep it small.


>... but then you can snapshot the DOM, grab the text, and discard all the rest

Yes, absolutely, I didn't mean to imply otherwise. But first you have to figure out what you can discard beyond the HTML tags themselves to avoid indexing all the garbage that is on each and every page.

When I tried to do this I came to the conclusion that I needed to actually render the page to find out where on the page a particular piece of text was, what font size it had, if it was even visible, etc. And then there's JavaScript of course.

So what I'm saying is that storing a couple of kilobytes is probably not the most costly part of indexing a page.


> When I tried to do this I came to the conclusion that I needed to actually render the page to find out where on the page a particular piece of text was, what font size it had, if it was even visible, etc. And then there's JavaScript of course.

Are there open source projects devoted to this functionality? It's becoming more and more of a sticking point for working with LLMs: grabbing the text without navigation and other crap, while maintaining formatting, links, etc.


Good question (meaning I don't know :)

For my specific purposes it has always been good enough to apply some simple heuristics. But that wouldn't have been possible without access to post-rendering information, which only a real browser (https://pptr.dev) can reliably produce.
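
Something like this Puppeteer sketch captures the idea; the selectors and visibility checks are purely illustrative assumptions, not anyone's actual pipeline:

    // Minimal sketch: render a page, then keep only visible text blocks
    // along with rough layout signals (font size, vertical position).
    import puppeteer from "puppeteer";

    async function extractVisibleText(url: string) {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: "networkidle2" }); // let JS-rendered content settle

      // Runs in the page context, after rendering.
      const blocks = await page.evaluate(() => {
        const out: { text: string; fontSize: number; top: number }[] = [];
        for (const el of Array.from(document.querySelectorAll("p, h1, h2, h3, li, td"))) {
          const style = window.getComputedStyle(el);
          const rect = el.getBoundingClientRect();
          const text = (el.textContent || "").trim();
          if (!text) continue;
          if (style.display === "none" || style.visibility === "hidden" || rect.width === 0) continue;
          out.push({ text, fontSize: parseFloat(style.fontSize), top: rect.top });
        }
        return out;
      });

      await browser.close();
      return blocks;
    }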


There are many software libraries that can output just the text from HTML or run JS. For C# there's HTML Agility Pack and PuppeteerSharp, for example. I have used them for web scraping.


You don't need to store it indefinitely though, and there's not much point in crawling faster than you can process the data.

The couple of kilobytes per document is the actual storage footprint. Sure, you need to massage the data, but that's almost entirely CPU bound. You also need a lot of RAM for keeping the hot parts of the index.


I've been told, by a very credible source that would know, that the top level index is (only) 10 billion web pages.


You don't index the HTML code, but you have to process the HTML and sometimes run JavaScript to get the text content. Then you have to compute the word frequencies. And that means you have to use more compute power.


>100kb max per page?

They changed that limit somewhere in the mid-2000s. Just as well; there are some CMSes out there with several hundred kilobytes of inline JS and CSS before any body text.


> The Internet Archive snapshots the whole page while Google only indexes the first few KB of text data

Is that a good way to build an index?


>so maybe it'll cost less to build a search engine index than an internet preservation project?

An index is just a hashmap from words to lists of URLs. So you have to parse the page and add the URLs and word frequencies to the list.

In terms of storage it's cheaper; in terms of computing power it's more expensive.
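
As a toy sketch of that mental model (term -> list of URLs with frequencies), and nothing more than the mental model:

    // Toy inverted index: term -> postings (URL + term frequency).
    // Purely illustrative; not a production structure (see the reply below).
    type Posting = { url: string; freq: number };

    const index = new Map<string, Posting[]>();

    function addDocument(url: string, text: string): void {
      const counts = new Map<string, number>();
      for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
        counts.set(word, (counts.get(word) ?? 0) + 1);
      }
      for (const [word, freq] of counts) {
        const postings = index.get(word) ?? [];
        postings.push({ url, freq });
        index.set(word, postings);
      }
    }

    function search(word: string): Posting[] {
      // Naive ranking: sort postings for the term by raw frequency.
      return (index.get(word.toLowerCase()) ?? []).slice().sort((a, b) => b.freq - a.freq);
    }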


A hash table is not a good backing structure for a search engine.

Hash tables almost guarantee worst-case disk read patterns. You use something like a skip list or a b-tree, since that makes much better use of the hardware and on top of that allows you to do incredibly fast joins.


That's true, but I was referring to a hash table as a mere mental model, not an actual implementation. A better name would be a dictionary instead of a hash table.


Perl culture here.


7 billion+ actually, and our investment level is mentioned here: https://news.ycombinator.com/item?id=36517505


Shit, man. That number is larger every time I check :D


> not being able to subsidise user growth

You're ignoring that Kagi runs on subscriptions, so this funding is not all there is. If the subscription covers basic user costs (which it should, because they just adjusted the pricing), or at least covers most of them, then the 670K can be used for growth.


> I think a small operation is exactly the sort of outfit to do it. Makes you focus on what is important.

They're focusing on 2 things: a search engine AND a Mac-only web browser.


> Mac-only web browser

Such a waste. macOS users might be open to paying (for Kagi in general) because they're used to paying for a bunch of things other OSes get included or as freeware, but still, the market share is small (10-30%, depending on the source). And even if many of those would enjoy a Mac-native app, there are at least some, like myself, who refuse to use single-platform tools. I have a bunch of devices on a bunch of different OSes; I'm not going to use a very special browser on one of them, losing sync, history, and muscle memory when switching. That's the reason I can't stay on Arc, even if I quite like some of its goals and structure.


> Such a waste.

You always need to start somewhere; Kagi is now (just) a 15-person team. There are enough Macs (200M+) and iPhones (1B+) in use to justify selecting this as the first ecosystem to target for the browser. Really it comes down to resource management and allocation in the early stages.


A nearly impossible project and a nearly pointless project make a perfect pairing.


> History is littered with bold and ambitious Google killers that went nowhere.

What are some examples of those?


Well, Cuil has already been mentioned; there's also A9.com and Quaero, and on the open source side there's mozDex. Notably, Jimbo Wales failed twice, with Wikia Search and then Knowledge Engine.


Yeah, there's a lot of truth in this, and it matches my experience, with the caveat that the 3-4 guys need to stay very clear-eyed and pragmatic with their technology choices. Kind of covered by the "talented" part, but I've seen so many otherwise talented people fall into holes because of this, and I'd guess it applies doubly for building a search engine.


Same, perhaps especially for a search engine, though my previous experience in several ambitious startups is not so different. In our case it was 1 guy for a long time: https://blog.mojeek.com/2021/03/to-track-or-not-to-track.htm...


Also, search is now kinda at a strategic inflection point where Google has to change, which will create opportunities for smaller players.


There are also interesting things happening in the server space that don't get talked about enough. Not only are the latest generation of Epycs very good, the price of RAM and especially SSDs has absolutely plummeted.

If ever there was a moment where horizontal scaling looked promising, this is it.


Recently saw a talk from one of the id Software people. Really supports and illustrates your claim.


Perhaps the business is already sustainable and that $670k will just let them hire an additional person to speed things up a bit?

There's a lot of wasted effort in this industry. Slack has 1500+ employees just to make a chat app. Granted, it's a damn good chat app, but Mozilla is maintaining a browser with half as many people, with not everyone focused on Firefox at that.

My friend is now in a project where he billed for three weeks until all the issues with his dev account were sorted out. It remains unclear when he will be able to start actually working.

I spent two years building a web app + gRPC server that perhaps could have been just the latter.

I could go on. Point is, you can blow through $30mln easily but that doesn't mean you have to.


Have you seen what they've already built? The product is out there and very usable: https://kagi.com/


Per TFA, this money was mostly raised from Kagi's existing customers.

They aren't building a web crawler. Kagi is a search "client" (it is one way to build on a shoestring budget, alright), augmented by an in-house small-scale just-in-time crawler.

> From here, we take your query and use it to aggregate data from multiple other sources, including but not limited to Google, Bing, and Wikipedia, and other internal data sources in order to procure your search results. https://help.kagi.com/kagi/privacy/privacy-protection.html

> We also have our own non-commercial index (Teclis), news index (TinyGem), and an AI for instant answers. https://help.kagi.com/kagi/search-details/search-sources.htm...

I like Kagi's general vision for LLMs + Search. There's a real chance small competitors can compete with Google with clever use of LLMs' zero-shot summarization, categorization, intent recognition, and answering abilities.


> competitors can compete with Google

Especially when Google continues to get worse year after year. Just search for my damn keywords, like you did 10 years ago, especially when I have already put quotes around them, because you're useless!


> But not for $670K.

As someone who has already replaced Google with Kagi, as far as I'm concerned they've already done it.

I've never paid Google for anything (apart from with my data) but I've been paying Kagi for almost a year now. The model is different. From what I've seen so far, it's better.

The amount of money you burn through is not a good indicator of whether you'll end up with a good product or a viable business at the end, IMHO.


I've been using Kagi since I could first get onto the product. As soon as it came out of Beta, I subscribed.

I had no idea how intrusive Google was. I love Google services. Google Cloud is far better than the others. That said, the search is just poison.

I now spend time looking at results, not searching for results.

I have to admit though, ChatGPT did make me consider dropping the subscription. Mr Chat has weakened the value of Kagi to me.

They have their own AI summarizer engine though: https://kagi.com/summarizer/index.html


> Google Cloud is far better than the others.

Per their own documentation, Kagi servers are hosted on Google Cloud.


That in and of itself isn't such a scary thing.

There are real restrictions, contractually, on Google accessing any data that a customer generates, and those carry hefty fees for Google.

This is one of the reasons why some Google products never launch with support for Workspace accounts. There's just too much red tape that a team doesn't want to deal with.

Source and Disclaimer: I was a Technical Solutions Engineer for Google Cloud.


> I have to admit though, ChatGPT did make me consider dropping the subscription. Mr Chat has weakened the value of Kagi to me.

I felt the same, but I do still need web search, and I do still want it to be as good as possible, so it's still worth it to me.

And actually, as I type this I'm less sure about how much ChatGPT has reduced the value of web search for me. I start with ChatGPT for almost everything that's not "news". But I still frequently do web searches branching out from what I learned via ChatGPT. It's fewer total searches than I did before, but each one might be more valuable. And since I'm flowing from one tool to another, not having to wade through ads and bad UI and SEO spam also becomes even more valuable.


As a subscriber to both ChatGPT and Kagi, agreed. You may be interested to know that Kagi also has https://labs.kagi.com/fastgpt. It's been pretty good in my tests so far, nothing magical just fast and similar to a search input field.


Congratulations Kagi.

We beg to differ on the need to raise large sums. Like Kagi, we are on a marathon, not a sprint. We have built a no-tracking and completely independent crawler-based search engine and infrastructure from the ground up, having raised £3m from angels only. Cuil, Blekko, Quaero and Neeva, who raised tens or hundreds of millions of dollars, may have come and gone; meanwhile we have been slowly building since 2004, with a user and API customer base that is also growing healthily.


Do you have an option to filter out or downrank porn when I search for a slightly obscure initialism?


We have an Adult Search option available for API customers. This will be rolled out on mojeek.com very soon.


Thanks Colin!


Stop drinking the VC kool-aid. It's possible to build almost any software business without huge upfront investments.


Have you tried Kagi? Because I have, and it is pretty good. I was sold when they summarized several "listicles" (i.e. your typical Top 10 blog spam) into one concise listing: "Yes, these results would pop up for your search ... but you probably don't want to look at them."


At their current pricing, about 10k paying users would get them 600K/year in revenue. Which is what they just raised. Enough for a small team and modest infrastructure. You don't need a lot to serve 10K users.

All they need to do is nail enough value add for those users. The wider goal of disrupting Google/Bing is a different game. But getting a small company to 10K paying users might be doable given enough of a value add.

Of course, they are based in the Bay Area, so this kind of money doesn't have a huge runway there.


Kagi has been around since 2018. They're privately bootstrapped (not by a VC firm). 670K is not the only money that's gone into the business. 670K is how a bootstrapped company responsibly raises money.


I use it, and it's better than Google and DuckDuckGo. What features are they missing?


I'm happily paying $10/mo for Kagi, I find it genuinely better* than Google already. And yes, it's genuinely amazing that it's a pipsqueak startup that's pulling this off.

* For core search, that is; obviously there are vast slabs of services like Maps, Translate, etc. where they're not even trying to compete.


This demonstrates to me how distorted searching has become. In the early days of the web, the standard thing was to show someone Yahoo, which was a human-curated list of sites. The other site was a search engine from one of the colleges; I think it was Lycos.

This changed with ads. Suddenly search engines were used to find people searching.


There's a huge difference between typical VC funding and what this is. Kagi is already a self-sustaining business with a working and effective search engine. This is extra funding on top that they have available for improving Kagi. That's great. You don't need 30 million for that.


That $30 million might not have yielded a great search engine but it did produce Cuil Theory, a branch of mathematics with the potential to change the world:

http://cuiltheory.wikidot.com/what-is-cuil-theory/


This is a joke about how bad Cuil's search results were, right?


Yes, but very amusing regardless of how it came to be.


I have no idea how you could justify a $30 million budget. $670k seems much more reasonable to me. I can imagine where this budget goes; the numbers make sense.


Someone below linked to the fact that they do leverage other search engines for their tech, so that can explain the lower cost: https://kagi.com/faq#Where-are-your-results-coming-from

It's more about curation (and I guess some machine learning) than completely new tech.


The less money they raise the more hope I have in them.

So long story short, my opinion is the exact polar opposite of what you said.


They have already done it; I'm a paying subscriber.



