
For

> Ask HN: Is there a search engine which excludes the world's biggest websites?

> Discovering unknown paths of the web seems almost impossible with google et al..

> Are there any search engines which exclude or at least penalize results from, say, top 500 websites?

Let's back up a little and then try for an answer:

Some points:

(1) Qualitatively, there is a LOT of content on the Internet.

(2) In principle, and no doubt to a significant extent in practice, there are a LOT of searches people want to do. The search in the OP is an example.

(3) Much like in an old library card catalog subject index, the most popular search engines are based heavily on key words and then whatever else, e.g., page rank, date, etc.

So: (1) -- (3) represent some challenges so far not very well met: In particular, we can't expect that the key words, etc. of (3) will do very well on all or nearly all the searches in (2) for much of the content in (1).

And the search in the OP is an example of a challenge so far not well met.

Moreover, the search in the OP is no doubt just one of many searches with challenges so far not well met.

Long ago, Dad had a friend who worked at Battelle, and IIRC they did a review of information retrieval that concluded that keyword search covers only a fraction, maybe ballpark only 1/3rd, of the need for effective searching. And the search in the OP is an example of what is not covered because the library card catalog did not index size of the book or Web site! :-)!

Seeing this situation, my rough, ballpark estimate has been that the currently popular Internet search engines do well on only about 1/3rd of the content on the Internet, searches people want to do, and results they want to find.

So, I decided to see what could be done for the other 2/3rds.

I started with some not very well known or appreciated advanced pure math; it looks like useless, generalized abstract nonsense, but if you calm down, stare at it, think about it, ..., you can see a path to a solution. Although I never thought about the search in the OP until now, in principle the solution should work also for that search. Or, the math is a bit abstract and general, which can translate in practice to doing well on something as varied as the 2/3rds.

Then for the computing, I did some original applied math research.

Using TeX, I wrote it all up with theorems and proofs.

So, the project is to be a Web site. While in my career I've been programming for decades, this was my first Web site. I selected Windows and .NET, and typed in 100,000 lines of text with 24,000 statements in Visual Basic .NET (apparently equivalent in semantics to C# but with syntactic sugar I prefer).

The software appears to run as intended and well enough for significant production.

I was slowed down by one interruption after another, none related to the work.

But, roughly, ballpark, the Web site should be good, or by a lot the best so far, for the 2/3rds and in particular for the search in the OP.

So, for

> Ask HN: Is there a search engine which excludes the world's biggest websites?

there's one coded and running and on the way to going live!

I intend to announce an alpha test here at HN.



Can you talk at a high level about the problems of keyword search, or is that part of your secret sauce? Off the top of my head I can think of two, which are intent and encoding.

Before you can even do a keyword search, you obviously need an intent to do so. But that means keyword search is pretty useless when you don't know what you don't know.

Encoding that intent...maybe doesn't matter for common searches, but everyone has heard of the concept of "Google-Fu". English text is a pretty lossy medium compared to the thoughts in people's heads...Shannon calculated 2.62 bits per English letter, so the space of possibly-relevant sites for almost any keyword is absolutely enormous (e.g. there are about 330,000 7-letter English keyword searches...distributed across how many trillions of pages, not even counting "deep web" dynamically generated ones?). So we punt on that and use the concept of relevance for sorting results, and in practice no one looks beyond the first 10. I don't know what an alternate encoding might look like, though.
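As a rough check on that arithmetic, here is a minimal Python sketch. It assumes the 330,000 figure comes from raising 2 to the total entropy of a 7-letter string at 2.62 bits per letter; the function names are made up for illustration, and the 26-letter alphabet used for comparison is also an assumption.

```python
# Assumptions taken from the comment: 2.62 bits of entropy per English
# letter (the Shannon figure quoted above) and a 7-letter keyword.
BITS_PER_LETTER = 2.62
KEYWORD_LENGTH = 7


def effective_keyword_count(length: int, bits_per_letter: float = BITS_PER_LETTER) -> float:
    """Number of equally likely strings carrying the same total entropy."""
    return 2 ** (length * bits_per_letter)


def raw_string_count(length: int, alphabet_size: int = 26) -> int:
    """Number of arbitrary strings of this length over a plain alphabet."""
    return alphabet_size ** length


effective = effective_keyword_count(KEYWORD_LENGTH)
raw = raw_string_count(KEYWORD_LENGTH)

print(f"total entropy: {KEYWORD_LENGTH * BITS_PER_LETTER:.2f} bits")
print(f"effective 7-letter keywords: about {effective:,.0f}")
print(f"raw 7-letter strings over 26 letters: {raw:,}")
```

Under those assumptions the effective count lands a little above 330,000, against roughly 8 billion raw 7-letter strings, which is the sense in which a short keyword carries little information to spread over a very large number of pages.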


Good questions.

For

> Before you can even do a keyword search, you obviously need an intent to do so. But that means keyword search is pretty useless when you don't know what you don't know.

Right: The way I have put it at times in the past is that, ballpark and oversimplifying some, keyword search requires the user to know what content they want, know that it exists, and have keywords/phrases that accurately characterize that content. For some searches, e.g., the famous movie line

"I don't have to show you no stinking badges",

https://www.youtube.com/watch?v=VqomZQMZQCQ

that is fine; otherwise it asks too much of the user.

For "encoding", my work does not use keywords or any natural language for anything.

My work does get some new data for each search for each user. But privacy is relatively good because for the results I use only what the user gives for that search; in particular, two users giving the same inputs at essentially the same time will get the same results. Thus, search results are independent of the user's IP address or browser user-agent string. Moreover, the site makes no use of cookies.

The role of the advanced pure math is to say that the data I get and the processing I do with that data and what is in the database should yield good results for the 2/3rds. The role of my original applied math is to make the computations many times faster -- they would be too slow otherwise.

When keywords work well, and they work well enough to be revolutionary for the world, my work is, except for some small fraction of cases, not better. So, there is ballpark the 1/3rd where keywords work well. Then there is the ballpark, guesstimate, 2/3rds I'm going for.

My work is not as easy for the users as picking a great, very accurate result from the top dozen presented by a keyword search engine, e.g., the movie line example. But it is much easier to use than flipping through 50 pages of search results, and it is intended usually to give good results that are unreasonable to get from a keyword search: without "characterizing" keywords, such a search yields, say, millions of results and would require a user to flip through dozens of pages.

Ads are off on the right side and not embedded in the search results. The SEO (search engine optimization) people will have a tough time influencing the search results!

We will see how well users like it. If people like it, then it will be good to make progress on the huge, usually neglected, content of the 2/3rds.


Looking forward to it. You got a name?


I have a list of 100+ candidates but (i) I have not selected one yet, (ii) don't really need one before I register it, and (iii) don't want to start paying for it before I need it.

But thanks for your interest.



