I have been thinking about the same problem for a few weeks.
The real problem with search engines is that so many websites have hacked SEO that there is no meritocracy left. Results are not sorted by relevance or quality but by how well SEO experts have tilted the rankings in their own favor. I can't find anything deep enough about any topic by searching on Google anymore. It's just surface-level knowledge that I get from competing websites that just want to make money off pageviews.
It kills my curiosity and intent with fake knowledge and bad experience. I need something better.
However, it will be interesting to figure out the heuristics that could deliver better-quality search results today. When Google started, it had a breakthrough algorithm: rank pages based on the number of pages linking to them. That is completely meritocratic as long as people don't game it for higher rankings.
A new breakthrough heuristic today will look totally different, but it should be just as meritocratic and, ideally, resistant to gaming.
The real reason why search is so bad is that Google is downranking the internet.
I should know - I blew the whistle on the whole censorship regime and walked 950 pages to the DOJ and media outlets.
--> zachvorhies.com <--
What did I disclose? That Google was using a project called "Machine Learning Fairness" to rerank the entire internet.
Part of this beast has to do with a secret Page Rank score that Google's army of workers assigns to many of the web pages on the internet.
If wikipedia contains cherry picked slander against a person, topic or website then the raters are instructed to provide a low page rank score. This isn't some conspiracy but something openly admitted by Google itself:
See section 3.2 for the "Expertise, Authoritativeness and Trustworthiness" score.
Despite the fact that I've had around 50 interviews and countless articles written about my disclosure, my website zachvorhies.com doesn't show up on Google's search index, even when using the exact url as a query! Yet bing and duckduckgo return my URL just fine.
Don't listen to the people who say that it's some emergent behavior from bad SEO. This is deliberate sabotage of Google's own search engine in order to achieve the political agenda of its controllers. The stockholders of Google should band together in a class-action lawsuit and sue the C-level executives for negligence.
If you want your internet search to be better then stop using Google search. Other search engines don't have this problem: I'm looking at qwant, swisscows, duckduckgo, bing and others.
Google's search rankings are based on opinions held by other credible sources. This isn't really blowing the whistle when, as you admitted, Google admits this openly.
And maybe your site doesn't get ranked well because it's directly tied to Project Veritas. I don't like being too political, especially on HN and on an account tied to my real identity, but Project Veritas and its associates exhibit appalling behavior in duplicity and misdirection. I would hope that trash like this does get pushed to the bottom.
In a political context, "credible" is often a synonym for "agrees with me". Anyone ranking "page quality" should be conscious of and try to avoid that, and yet the word "bias" doesn't even appear in the linked guidelines for Search Quality Raters.
Of course Google's own bias (and involvement in particular political campaigns) is well known, and opposed to Project Veritas, so it's quite possible that you are right and Google is downranking PV.
Would that be good? Well, that's an opinion that depends mostly on the bias of the commentator.
Two wrongs don't make a right – but more importantly, Project Veritas deliberately uses lies and deception to obtain its footage, which it then edits to remove context. That crosses a line far beyond run-of-the-mill reporting bias.
The point of that story isn't "two wrongs make a right", but that the left wing still accepts CNN just as the right wing still accepts PV.
"Gotchas" and "gaffes" are taken out of context all the time. And when reporters lie to go undercover and obtain a story, they're usually hailed as heroes.
These objections are only used to "discredit" opposing viewpoints. People don't object when sources they agree with use the same tactics.
Why the assumption of left or right, when at least in the US most people identify as independents?
And to be even more fair, you aren’t talking about “people” here but the OP who can clearly state their own personal preferences, without the need for you to construct a theory of hidden bias.
> Why the assumption of left or right, when at least in the US most people identify as independents?
Most people don’t identify as independents, and, anyhow, studies of voting behavior show that most independents have a clear party leaning between the two major parties and are just as consistently attached to the party they lean toward as people who identify with the party.
So not only is it the case that most people don't identify as independents, most of those who do aren't distinguishable from self-identified partisans when it comes to voting behavior.
On the contrary, according to relatively recent Pew numbers from 2018 [1], around 38% of Americans identify as independents, with ~31% Democrat and ~26% Republican. To be fair, many of them lean towards one party or another (your last point acknowledged), but independents are much more important in politics than they are usually given credit for.
> On the contrary, according to relatively recent Pew numbers from 2018 [1], around 38% of Americans identify as independents, with ~31% Democrat and ~26% Republican.
That's not “on the contrary”; 38% < 50%; most Americans don't identify as independents.
Heck, even 43% (the number that don't identify as either Republicans or Democrats) is less than 50%; most Americans specifically identify as either Democrats or Republicans.
> but independents are much more important in politics than they are usually given credit for.
On the contrary, independents are given outsize importance by the major media, because they are treated as just as large of a group as they actually are, but are treated as if their voting behavior was much more different from that of self-identified partisans than it actually is.
You got me in a technically correct way, which is, as they say, the best kind of correct. I think the crux is the keyword "majority" being defined as more than 50%, so I should have worded my statement better with "most", "plurality", or "relative majority" instead.
> most Americans specifically identify as either Democrats or Republicans
Now you just did the same thing I did but in reverse! See my comments about more specific words for plurality such as "most".
> independents are given outsize importance by the major media
I don't think independents are given much importance at all by the major media, but I have increasingly disconnected from that circus too so I might not be a good judge of it.
> Now you just did the same thing I did but in reverse!
No, I didn't.
> See my comments about more specific words for plurality such as "most".
“Most” (as an adjective applied to a group) is for a majority, not a plurality, but that's okay, because 57% is an absolute majority, anyway, which is why my reference, which did use “most” for a majority, was not what you did.
Maybe not the best source, but quite a few sites returned similar verbiage about most usually but not always referring to a plurality. I'm open to correction on this point, and am genuinely curious about this meta argument now. Having a hard time finding a statistical dictionary that references the words we are using here (majority, most, etc).
Are you asking rhetorically in a facetious way or a real question? If the former am I missing something (like maybe I said something dumb, it's happened before)?
> Why the assumption of left or right, when at least in the US most people identify as independents?
Do you wish to cite a source for this or do you wish to gaslight here?
> Voters declaring themselves as independent of a major political party rose nationally by 6 percentage points to 28 percent of the U.S. electorate between 2000 and last year's congressional midterm contest in states with party registration, analyst Rhodes Cook wrote in Sabato's Crystal Ball political newsletter at the University of Virginia Center for Politics.
> But as few as 19 percent of Americans identifying as independent truly have no leaning to a party, according to the Pew Research Center. [1]
[1] 'Difference-maker' independent voters in U.S. presidential election crosshairs
The gist of it is that for any given party, more people are not members of it (e.g. 70%+ of Americans are not Democrats), so you are not safe in guessing someone's political affiliation. Stats-wise, you will likely be wrong.
Secondarily, saying that someone who leans to a party is basically in that party is wrong.
For example, I am an independent. I lean Republican at the local level, but in federal elections I lean Democrat. If you polled me, I would say I lean Republican, since that's usually what's on my mind.
You're free to select the occasional time when a news agency gets it wrong (and apologizes and issues a correction)...but to try and compare it to an intentionally biased organization that always, intentionally gets it wrong (by willfully setting up people, then editing their responses, to paint things in a particular light), and never backs down, never apologizes, well...at best, your bias is showing.
Actually I'm not a fan of PV. Gotcha journalism doesn't appeal to me. But I see that low quality "journalism" every day, so I don't see PV as particularly noteworthy. Certainly not worth altering search results for.
And I'd say you are mistaken when you call changing "take that [violence] to the suburbs" into "a call for peace" merely getting it wrong. What CNN did could hardly be anything other than intentional deception from an organization with a strong left-wing bias[1].
PV isn't "gotcha journalism". They've repeatedly committed felonies and completely fabricated things (ex. when they tried to trick WaPo into pushing a fake #metoo story about Roy Moore) in attempts to create their content.
O'Keefe pleaded out to a misdemeanor and served probation + a fine. But the crime he committed was a felony.
> In January 2012, O'Keefe released a video of associates obtaining a number of ballots for the New Hampshire Primary by using the names of recently deceased voters.
That's probably a felony in most parts of the US, although O'Keefe claims it wasn't since he didn't actually vote.
I agree. I watched one of his newer docs and anytime I heard something that just seemed too crazy to be true, I researched it. And in all cases, I found that important context was missing and in some cases, the evidence for the point being made was almost completely fabricated.
Meta-comment, but this is partly why search and the internet is so bad now; there are a large number of political disinformation campaigns which are getting increasingly blatant, and getting better at finding believers on the internet.
People have a vested interest in destroying the idea that anything can be a non-partisan "fact". Anything can become a smear. Only the most absolutely egregious ones can be reined in by legal action (e.g. Alex Jones libelling the Sandy Hook parents).
(This is not just internet, of course; the British press mendacity towards the Royal family is playing out at the moment.)
His website contains gems like "Things got political in June 2017 when Google deleted "covfefe" out of it's arabic translation dictionary in order to make a Trump tweet become nonsense." (No, covfefe doesn't mean anything in Arabic.)
Here is someone who believes that a private company's open attempts to rank websites by quality amounts to "seditious behaviour" deserving of criminal prosecution, and the only people willing to pay attention were Project Veritas. Google has plenty of ethics issues, but this guy's claims are absurd.
> Despite the fact that I've had around 50 interviews and countless articles written about my disclosure, my website zachvorhies.com doesn't show up on Google's search index, even when using the exact url as a query!
I just tried it, it's just showing results for "zach vorhies" instead, which it thinks you meant. I just tried a few other random "people's names as URL" websites I could find, sometimes it does this, sometimes it doesn't.
Furthermore, the results that do appear are hardly unsympathetic to you. If google is censoring you/your opinions, they're doing a very poor job of it.
Why do your colleagues get to decide what is fact and what is fiction? It's our right, as humans, to be able to make that decision on our own after we encounter information. If Wikipedia gains a reputation for libel, then the onus should be on the public to stop trusting them.
Google does not have the moral authority to censor the internet, and it's absolutely wrong for them to attempt this. Information should be free, and you don't have the right to get in the way of that.
They don't get to decide any such thing, and in fact, can't. Google (fortunately for all of us) doesn't run the Internet.
They do run a popular search page, and have to decide what to do with a search like "Is the Earth flat?".
Personally, I would prefer they prominently display a "no". Others would disagree, but a search engine is curation by definition, that's what makes it useful.
You would, and fortunately Google agrees with you, but imagine for a moment that they didn't. 90% of the internet would suddenly see a 'yes' to that question, even if 99% of websites disagree.
They get to decide because it is their algorithm, and the whole point of a search function is to discriminate among inputs to surface what is relevant. They aren't "getting in the way" - they are using it as they please.
What you ask for isn't freedom but control over everyone else - there is nothing stopping you from running your own spiders, search engines, and rankings.
Who said anything about censorship? The topic of discussion is what order results are in. Are you saying Google would be more useful if it returned the 100 million results randomly and left you to sort them out?
They get to decide what they display as results. What do you suggest they do instead? Display all of the internet and let the user filter things for themselves?
The fun thing about facts is that nobody needs to decide whether or not they are true. Perhaps the fact that you can honestly claim to think otherwise means you need to take a step back and examine your reasoning.
> Perhaps the fact that you can honestly claim to think otherwise
This is an example of a "fact" that I'm talking about. It's not a fact, it's an opinion being presented as fact. I guess if you present yourself this way online you have no problem with Google controlling what "facts" are found when you use their search engine.
I guess I'll just have to wait until they start peddling a perspective you disagree with.
> If wikipedia contains cherry picked slander against a person, topic or website
Just remember that this is the comment we're discussing... how does one determine if a statement is slander? Are you telling me Google has teams of investigative journalists following up on each of their search results? Or did someone at Google form an opinion, then decide their opinion is the one that should be presented as "fact" on the most popular search engine in the world?
> Despite the fact that I've had around 50 interviews and countless articles written about my disclosure, my website zachvorhies.com doesn't show up on Google's search index, even when using the exact url as a query!
I am not sure what is happening but I directly searched for your website : zachvorhies.com on Google (in Australia, if that matters), which returned the website as the first result:
I would think the quotes would make a difference. I was not using quotes, does using quotes in the US still return the website? I would use a VPN and test it out but I am at work right now.
Not using quotes, the first result I get is his twitter account, and the second is a link to a Project Veritas piece about this document leak he describes. Hardly seems like he's getting buried.
I'm in the US and I get the same search results. Although if I put in your name, I get your Twitter instead. Not sure why anyone would be searching for a dot com instead of going to it.
Same for me in Australia: searching for the name results in the Twitter handle showing up, and then a WikiSpooks website and so forth. The website isn't even on the first page. I think it's rather concerning that searching for a person's name won't return their website but rather a Twitter feed and other websites that have possibly optimized for SEO.
Ahh, Zachary Vorhies. I remember your bizarre internal posts literally claiming that Obama's birth certificate was fake. I wasn't surprised at all when you leaked a pile of confidential information to a far right conspiracy theorist group (Project Veritas), and were subsequently fired.
I wouldn't trust a single word that comes out of this guy's mouth.
For what it's worth, when I search for your site it's the first result. You have to click past the "did you mean", which searches on your name originally, but then it's there.
Highly polarized views, like the one you hold, are a result of a multiplicity of communications between entities on the Internet. Those humans who are more prone to spontaneous visualizations or audio tend to follow patterns which use biased arguments over reasoned arguments. That nets you comments like:
> Don't listen to the people who say
> this beast has to do with a secret Page Rank score that Google's army of workers
Anyone who tells you not to listen to others intends to tell you gossip about why you shouldn't gather data that conflicts with their own views. They'll SAY all sorts of things to try to make you "see" it their way. Rational people DO things to prove or disprove a given belief. (Just to note, SAYING a bunch of stuff does not equal DOING a bunch of stuff.)
For anyone rational and interested, Google "this video will make you angry" and bump the speed to 75%. The idea is that memes compete for "compute" space both in people's minds and in the space they occupy on the Internet. Those who get "infected" with a given meme will go to all sorts of lengths to rationalize why that meme is true, even though the meme they're arguing against is just as irrational as their own.
> I can't find anything deep enough about any topic by searching on Google anymore.
> It kills my curiosity and intent with fake knowledge and bad experience. I need something better.
It's hard for me to take this seriously when Wikipedia exists and almost always ranks very highly in search results for "knowledge topic" queries. Between Wikipedia and the sources cited on Wikipedia, I find the depth of almost everything worth learning about to be far greater than I can remember in, say, the early 2000s, which seems like the "peak" of Google before SEO became so influential.
In general, I think there are a lot of people wearing rose tinted glasses looking at the "old internet" in this thread. The only thing that has maybe gotten worse is "commercial" queries like those for products and services. Everything else is leaps and bounds better.
There is a lot of stuff you won't find on wikipedia that is now buried, one example being old forum threads containing sage wisdom from fellow enthusiasts on any given topic. You search for an interest and a half dozen relevant forums used to come up on page 1.
These days I rarely see a forum result appear unless I know the specific name of the forum to begin with and utilize the search by site domain operator.
Another problem these days, unrelated to search but dooming it in the process, is all these companies naming themselves or their products after irrelevant existing english words, rather than making up something unique. It's usually fine with major companies, but I think a lot of smaller companies/products shoot themselves in the foot with this and don't realize it. I was once looking for some good discussion and comparison on a bibliography tool called Papers, and that was just a pit of suffering getting anything relevant at all with a name like that.
Add inurl:forum to the query. Google used to have a filter "discussions", but they removed it for some reason. Nowadays I usually start with https://hn.algolia.com/ and site:reddit.com when I want to find a discussion.
The fact that Wikipedia exists, is frequently (though not always) quite good, has citations and references, and ranks highly or is used directly for "instant answers" ...
... still does nothing to answer the point that Web search itself is unambiguously and consistently poorer now than it was 5-10 years ago.
Yes, I find myself relying far more on specific domain searches, either in the Internet sense, or by searching for books / articles on topics rather than Web pages. Because so much more traditionally-published information is online, this actually means the net of online-based search has improved, but not for the most part because of improved Web-oriented resources (Webpages, discussions, etc.), but because old-school media is now Web-accessible.
This. More and more, I have been finding that good books provide better learning than the internet.
You search for quality books online, mostly through discussion forums as search fails here, or through following references of books and articles. Then spend time digesting them.
Wikipedia is surface-level knowledge. Using Wikipedia, what is the AM4 socket's pinout? Do a Google search and you find several people asking the question but no answers. On the other hand, you can easily find that for an old 8086 CPU.
What’s sad is Google has generally indexed the pages I want, it’s just getting harder to actually find them.
Did you ever find the pinout manual for AM4? Your comment sent me down a Google hole...
Clearly, they don’t want it available because their tech docs they host stop at AM3b. I was hoping an X470 (or other flavor) motherboard manufacturer would have something floating around...
Basically, I think you can divide search into commercial-interest searches and non-commercial-interest searches. I can find deep discussions of algorithms curated quite nicely. But information on, say, curtains will be as bad as the OP says.
> A new breakthrough heuristic today will look totally different, but it should be just as meritocratic and, ideally, resistant to gaming.
I wonder how much of this could be obtained back by penalizing:
1. The number of javascript dependencies
2. The number of ads on the page, or the depth of the ad network
This might start a virtuous circle, but in the end this is just a game of cat-and-mouse, and websites might optimize for this as well.
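As a rough sketch of what such a penalty might look like (the weights, the base relevance score, and the counts are all invented for illustration, not any real engine's formula):

    # Hypothetical penalty-based re-scoring; nothing here reflects a real
    # search engine's internals.
    def penalized_score(base_relevance, js_dependencies, ad_count,
                        js_weight=0.05, ad_weight=0.1):
        """Subtract a penalty per JS dependency and per detected ad."""
        penalty = js_weight * js_dependencies + ad_weight * ad_count
        return max(0.0, base_relevance - penalty)

    # A relevant but heavy, ad-laden page vs. a slightly less relevant light one.
    heavy = penalized_score(base_relevance=0.9, js_dependencies=40, ad_count=12)
    light = penalized_score(base_relevance=0.8, js_dependencies=3, ad_count=1)
    print(heavy, light)  # 0.0 vs 0.55: the lighter page now wins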
What we might need to break this is a variety of search engines that uses different criteria to rank pages. I suspect it would be pretty hard, if not impossible, to optimize for all of them.
And in any case, frequently change the ranking algorithms to combat over-optimization by the websites (as that's classically done against ossification for protocols, or any overfitting to outside forces in a competitive system).
You could even have all this under one roof: one common search spider that feeds this ensemble of different ranking algorithms to produce a set of indices, and then a search engine front end that round-robins queries out between the different indices. (Don’t like your query? Spin the algorithm wheel! “I’m Feeling Lucky” indeed.)
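A toy version of that front end, assuming the ensemble of rankers has already produced its per-algorithm indices (the index names and contents are made up):

    from itertools import cycle

    # Each "index" stands in for the output of one ranking algorithm over the
    # same common crawl; here they are just dicts of query -> ranked URLs.
    indices = {
        "link_based":    {"berlin": ["a.example", "b.example"]},
        "freshness":     {"berlin": ["c.example", "a.example"]},
        "low_ad_weight": {"berlin": ["b.example", "d.example"]},
    }
    rotation = cycle(indices)  # round-robin over the available rankers

    def search(query):
        """Answer each query from the next index in the rotation."""
        name = next(rotation)
        return name, indices[name].get(query, [])

    for _ in range(4):
        print(search("berlin"))  # the same query cycles through the rankers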
The Common Crawl is a thing already. Unfortunately, a "full" text crawl of the internets is a YUUUGE amount of data to manage, and I can't think of anything that could change that in the foreseeable future. That's why I think providing a federated Web directory standard, ala ODP/DMOZ except not limited to a single source, would be a far more impactful development.
Unfortunately, a "full" text crawl of the internets is a YUUUGE amount of data to manage
Maybe instead of a problem, there is an opportunity here.
Back before Google ate the intarwebs, there used to be niche search engines. Perhaps that is an idea whose time has come again.
For example, if I want information from a government source, I use a search engine that specializes in crawling only government web sites.
If I want information about Berlin, I use a search engine that only crawls web sites with information about Berlin, or that are located in Berlin.
If I want information about health, I use a search engine that only crawls medical web sites.
Each topic is still a wealth of information, but siloed enough that the amount of data could be manageable for a small or medium-sized company. And the market would keep the niches from getting so small that they stop being useful. A search engine dedicated to Hello Kitty lanyards isn't going to monetize.
Incorporating https://curlie.org/ and Wikipedia and something like Yelp/Yellow Pages embedded in OpenStreetMap for businesses and points of interest, with a no-frills interface showing the history (via a time slider?) of edits.
That's the problem that web directories solve. It's not that you're wrong, it's just largely orthogonal to the problem that you'd need a large crawl of the internets for, i.e. spotting sites about X niche that you wouldn't find even from other directly-related sites, and that are too obscure, new, etc. to be linked in any web directory.
Not really. A web directory is a directory of web sites. I can't search a web directory for content within the web sites, which is what a niche search engine would do.
You don’t really need to store a full text crawl if you’re going to be penalizing or blacklisting all of the ad-filled SEO junk sites. If your algorithm scores the site below a certain threshold then flag it as junk and store only a hash of the page.
Another potentially useful approach is to construct a graph database of all these sites, with links as edges. If one page gets flagged as junk then you can lower the scores of all other pages within its clique [1]. This could potentially cause a cascade of junk-flagging, cleaning large swathes of these undesirable sites from the index.
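As a minimal sketch of that cascade, using a plain adjacency list in place of a real graph database (the graph, threshold, and decay factor below are invented for the example):

    from collections import deque

    # Toy link graph: page -> pages it links to.
    links = {
        "seo-farm.example": ["mirror1.example", "mirror2.example"],
        "mirror1.example":  ["seo-farm.example"],
        "mirror2.example":  ["seo-farm.example", "legit.example"],
        "legit.example":    [],
    }
    scores = {"seo-farm.example": 0.1, "mirror1.example": 0.5,
              "mirror2.example": 0.6, "legit.example": 0.9}

    JUNK_THRESHOLD = 0.3   # below this, a page is flagged as junk
    NEIGHBOR_DECAY = 0.5   # penalty for being linked with a junk page

    def cascade_junk_flags(start):
        """Flag a page as junk and push the penalty out through its links."""
        junk = set()
        queue = deque([start])
        while queue:
            page = queue.popleft()
            if page in junk:
                continue
            junk.add(page)
            for neighbor in links.get(page, []):
                scores[neighbor] *= NEIGHBOR_DECAY
                if scores[neighbor] < JUNK_THRESHOLD and neighbor not in junk:
                    queue.append(neighbor)
        return junk

    print(cascade_junk_flags("seo-farm.example"))  # the farm and one mirror
    print(scores)  # legit.example keeps its score; the mirrors are dragged down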
What if SEO consultants aren't gaming the system, but search and the web are being optimized for "measurable immediate economic impact", which at this moment means ad revenue, because the web itself is otherwise unmonetizable and unable to generate value?
I don't like the whole concept of SEO, and I don't like the way the web is today, but I think we should stop and think before resorting to the “an immoral few are destroying things; we unfuck it and reclaim what we deserve” type of simplification.
Merging js deps into one big resource isn't difficult. The number of ads point is interesting though. How would one determine what is an ad and what is an image? I have my ideas, but optimizing on this boundary sounds like it would lead to weird outcomes.
Adblockers have to solve that problem already. And it's actually really easy because "ads" aren't just ads unfortunately, they're also third-party code that's trying to track you as you browse the site. So it's reasonably easy to spot them and filter them out.
Back in the early days of banner ads, a CSS-based approach to blocking was to target images by size. Since advertising revolved around specific standards of advertising "units" (effectively: sizes of images), those could be identified and blocked. That worked well, for a time.
This is ultimately whack-a-mole. For the past decade or so, point-of-origin based blockers have worked effectively, because that's how advertising networks have operated. If the ad targets start getting unified, we may have to switch to other signatures:
- Again, sizes of images or DOM elements.
- Content matching known hash signatures, or constant across multiple requests to a site (other than known branding elements / graphics).
- "Things that behave like ads behave" as defined by AI encoded into ad blockers.
- CSS / page elements. Perhaps applying whitelist rather than blacklist policies.
- User-defined element subtraction.
There's little in the history of online advertising that suggests users will simply give up.
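To make the first of those signatures concrete, here is a crude size-plus-origin check; the element representation is invented, and the dimensions are just common historical banner sizes:

    # Common historical banner dimensions (width, height); matching exact
    # element sizes is the crude signature described above.
    AD_UNIT_SIZES = {(300, 250), (728, 90), (160, 600), (320, 50), (970, 250)}

    def looks_like_ad(width, height, src_domain, page_domain):
        """Very rough heuristic: standard ad dimensions or third-party origin."""
        if (width, height) in AD_UNIT_SIZES:
            return True
        return src_domain != page_domain  # third-party embeds are suspect

    print(looks_like_ad(300, 250, "ads.example", "blog.example"))   # True
    print(looks_like_ad(640, 480, "blog.example", "blog.example"))  # False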
Some of those techniques will make the whole experience slow compared to the current network request filters and dns blockers.
And that will probably be blocked or severely locked down by your most popular browser, chrome.
I don't need to give advertisers data myself when someone else I know can. I really doubt it is easy to throw off chrome monopoly at this stage. I presume we will see a chilling effect before anything moves like IE.
I don't think DMOZ had ranking per se? They could mark "preferred" sites for any given category, but only a handful of them at most, and with very high standards, i.e. it needed to be the official site or "THE" definitive resource about X.
You are correct, the sites weren't "ranked" the same way that Google ranks sites now. But there were preferred sites, and each site had a description written by an editor who could be fairly unpleasant if they wanted to.
I had a site that appeared in DMOZ, and the description was written in such a way that nobody would want to visit it. But it was one of only a few sites on the internet at the time with its information, so it was included.
Google has taken on so many markets that I don't think they can do anything reasonably well (or disruptive) without conflicting interests. A breakup is overdue: if they didn't control both search and ads, the web would be a lot better nowadays. If they didn't control web browsers as well, standards would be much more important.
Create a core protocol at the same level as DNS etc., that web servers can use to offer an index of everything they serve/relay. A multitude of user-side apps may then query that protocol, with each app using different algorithms, heuristics and offering different options.
IF we had a distributable search protocol, index, and infrastructure ... the entire online landscape might look rather different.
Note that you'd likely need some level of client support for this. And the world's leading client developer has a strongly-motivated incentive to NOT provide this functionality integrally.
A distributed self-provided search would also have numerous issues -- false or misleading results (keyword stuffing, etc.) would be harder to vet than the present situation. Which suggests that some form of vetting / verifying provided indices would be required.
Even a provided-index model would still require a reputational (ranking) mechanism. Arguably, Google's biggest innovation wasn't spidering, but ranking. The problem now is that Google's ranking ... both doesn't work, and incentivises behaviours strongly opposed to user interests. Penalising abusive practices has to be built into the system, with those penalties being rapid, effective, and for repeat offenders, highly durable.
The problem of potential for third-party malfeasance -- e.g., engaging in behaviours appearing to favour one site, but performed to harm that site's reputation through black-hat SEO penalties, also has to be considered.
As a user, the one thing I'd most like to be able to do is specify blacklists of sites / domains I never want to have appear in my search results. Without having to log in to a search provider and leave a "personalised" record of what those sites are.
(Some form of truly anonymised aggregation of such blocklists would, of course, be of some use, and facilitating this is an interesting challenge.)
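The purely local version of that is easy to sketch: keep the blocklist on your own machine and filter whatever the engine returns, so the provider never sees it (the domains and results below are made up):

    from urllib.parse import urlparse

    # Personal blocklist kept on the user's machine; never sent anywhere.
    BLOCKED_DOMAINS = {"pinterest.com", "content-farm.example"}

    def filter_results(results):
        """Drop any result whose host is on, or under, a blocked domain."""
        kept = []
        for url in results:
            host = urlparse(url).netloc.lower()
            if not any(host == d or host.endswith("." + d)
                       for d in BLOCKED_DOMAINS):
                kept.append(url)
        return kept

    results = ["https://www.pinterest.com/pin/123",
               "https://forum.example/thread/42",
               "https://content-farm.example/top-10-things"]
    print(filter_results(results))  # only the forum thread survives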
I too have been thinking about these things for a long time, and I also believe a better future is going to include "aggregation of such blocklists would, of course, be of some use, and facilitating this is an interesting challenge."
I decided it is time for us to have a bouncer-bots portal (or multiple) - this would help not only with search results, but also could help people when using twitter or similar - good for the decentralized and centralized web.
My initial thinking was these would be 'pull' bots, but I think they would be just as useful, and more used, if they were perhaps active browser extensions..
This way people can choose which type of censoring they want, rather than relying on a few others to choose.
I believe creating some portals for these, similar to ad-block lists - people can choose to use Pete'sTooManyAds bouncer, and or SamsItsTooSexyfor work bouncer..
ultimately I think the better bots will have switches where you can turn on and off certain aspects of them and re-search.. or pull latest twitter/mastodon things.
I can think of many types of blockers that people would want, and some that people would want part of - so either varying degrees of blocking sexual things, or varying bots for varying types of things.. maybe some have sliders instead of switches..
make them easy to form and comment on and provide that info to the world.
I'd really like to get this project started, not sure what the tooling should be - and what the backup would be if it started out as a browser extension but then got booted from the chrome store or whatever.
Should this / could this be a good browser extension? What language / skills required for making this? It's on my definite to do future list.
There are some ... "interesting" ... edge cases around shared blocklists, most especially where those:
1. Become large.
2. Are shared.
3. And not particularly closely scrutinised by users.
4. Via very highly followed / celebrity accounts.
There are some vaguely similar cases of this occurring on Twitter, though some mechanics differ. Celebs / high-profile users attract a lot of flack, and take to using shared blocklists. Those get shared not only among celeb accounts but their followers, though, because celebs themselves are a major amplifying factor on the platform, being listed effectively means disappearing from the platform. Particularly critical for those who depend on Twitter reach (some artists, small businesses, and others).
Names may be added to lists in error or malice.
This blew up summer of 2018 and carried over to other networks.
Some of the mechanics differ, but a similar situation playing out over informally shared Web / search-engine blocklists could have similar effects.
A sitemap simply tells you what pages exist, not what's on those pages.
Systems such as lunr.js are closer in spirit to a site-oriented search index, though that's not how they're presently positioned, but instead offer JS-based, client-implemented site search for otherwise static websites.
The basic principle of auditing is to randomly sample results. BlackHat SEO tends to rely on volume in ways that would be very difficult to hide from even modest sampling sizes.
If a good site is on shared hosting will it always be dismissed because of the signal of the other [bad] sites on that same host? (you did say at DNS level, not domain level)
In the specific case of date-based searches, they are pretty difficult because of how pages are ranked. For a long time (and still to a large extent) Google has ranked 'newer' pages higher than 'relevant' pages. At Blekko[1] there was a lot of code that tried to figure out the actual date of a document (be it a forum post, news article, or blog post). That date would often be months or years earlier than the 'last change' information would have you think.
Sometimes it's pretty innocuous: a CMS updates every page with a refreshed copyright notice at the start of each year. Other times it's less innocuous, where the page simply updates the "related links" or sidebar material and refreshes the content.
It is still an unsolved ranking-relevance problem that a student-written, 3-month-old description of how AM modulation works ranks higher than a 12-year-old, professor-written description. There isn't a ranking signal for 'author authority'. I believe it is possible to build such a system, but doing so doesn't align well with the advertising goals of a search engine these days.
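Purely as an illustration (not Blekko's actual code), a crude sketch of that kind of date reconciliation, preferring the oldest plausible date found in a page over the 'last changed' signal; the HTML snippet and the regex coverage are simplified assumptions:

    import re
    from datetime import date

    # Toy example: a forum page whose footer copyright was refreshed recently,
    # but whose post header still carries the original date.
    html = """
    <div class="post-date">Posted on 2008-03-14</div>
    <footer>Copyright 2008-2024 Example Forums</footer>
    <meta property="article:modified_time" content="2024-01-01">
    """

    def estimate_original_date(text):
        """Collect every ISO-style date and return the earliest plausible one."""
        candidates = []
        for y, m, d in re.findall(r"(\d{4})-(\d{2})-(\d{2})", text):
            try:
                candidates.append(date(int(y), int(m), int(d)))
            except ValueError:
                continue  # skip impossible dates like 2024-13-40
        return min(candidates) if candidates else None

    print(estimate_original_date(html))  # 2008-03-14, not the 2024 refresh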
“I can't find anything deep enough about any topic by searching on Google anymore. It's just surface-level knowledge that I get from competing websites that just want to make money off pageviews.”
Is it possible that there is no site providing non-fluffy content on your query? For a lot of niche subjects, there really are very few sites, if any, with substantial content on the topic.
> Is it possible that there is no site providing non-fluffy content on your query? For a lot of niche subjects, there really are very few sites, if any, with substantial content on the topic.
“Very few, if any”
The problem is that google won’t even show me the very few anymore. It’s just fluff upon fluff and depth (or real insight at least) is buried in twitter threads and reddit/hn comments, and github issue discussion.
I fear the seo problem has not only killed knowledge propagation, but also thoroughly disincentivized smart people from even trying. And that makes me sad.
I mirror your sentiment. It used to be that you could use your google fu and you'd be able to find a dozen relevant forum posts or mail chains in plain text. It's much, much, harder to get the same standard of results. "Pasting stack traces and error messages. Needle in a haystack phrases from an article or book. None of it works anymore."
Yeah, if I know where I'm looking (the sites) then google is useful since I can narrow it to that domain. But if I don't know where to look then I'm SOL.
The serendipity of good results on Google.com is no longer there. And given the talent at google you have to wonder why.
This is such an interesting and common anti-pattern on the internet, though:
1. Something is or provides access to good quality content.
2. Because of this quality, it gets more and more popular.
3. As popularity grows, and commercialization takes over, the incentive becomes to make things "more accessible" or "appealing" to the "average" user. More users is always better right!?
4. This works, and quality plummets.
5. The thing begins to lose popularity. Sometimes it collapses into total unprofitability. Sometimes it remains but the core users that built the quality content move somewhere else, and then that new thing starts to offer tremendous value in comparison to the now low quality thing.
It is only solvable for a short period of time. Then, when whatever replaces the current search is successful enough there will be an incentive to game the new system. So the only way to really solve this is by radical fragmentation of the search market or by randomizing algorithms.
Someone in a previous thread that I, unfortunately can’t remember, suggested that it might not just be the SEO but the internet that changed. Google used to be really good at ascertaining meaning from a panoply of random sources, but those sites are all gone now. The Wild West of random blogs and independent websites are basically dead in favor of content farms and larger scale media companies.
I’ve found more and more I have reverted to finding books instead of searching to find deeper knowledge. The only issue is it is easy to publish low quality books now. Depending on the topic you are looking into often if a book stands the test of time it is a worthwhile read. With tech books you have to focus on the authors credentials.
> However, it will be interesting to figure out the heuristics that could deliver better-quality search results today.
If only there were some kind of analog for effective ways to locate information. Like if everything were written on paper, bound into collections, and then tossed into a large holding room.
I guess it's past the Internet's event horizon now, but crawler-primary searching wasn't the only evolutionary path to search.
Prior to Google (technically: AdWords revenue funding Google) seizing the market, human-curated directories were dominant [1, Virtual Library, 1991] [2, Yahoo Directory, 1994] [3, DMOZ, 1998].
Their weakness was always cost of maintenance (link rot), scaling with exponential web growth, and initial indexing.
Their strength was deep domain expertise.
Google's initial success was fusing crawling (discovery) with PageRank (ranking), where the latter served as an automated "close enough" approximation of human directory building.
Unfortunately, in the decades since, we seem to have forgotten how useful hand-curated directories were, in our haste to build more sophisticated algorithms.
Add to that that the very structure of the web has changed. When PageRank first debuted, people were still manually adding links to their friends' sites and other useful sites on their own pages. Does that sound like the link structure we have on the web now?
Small surprise results are getting worse and worse.
IMHO, we'd get a lot of traction out of creating a symbiotic ecosystem whereby crawlers cooperate with human curators, both of whose enriched output is then fed through machine learning algorithms. Aka a move back to supervised web-search learning, vs. the currently dominant unsupervised.
Mixing human curation with crawlers is probably something that'd help with search results quality, but the issue comes in trying to get it to scale properly. Directories like the Open Directory Project/DMOZ and Yahoo's directory had a reputation for being slow to update, which left them miles behind Google and its ilk when it came to indexing new sites and information.
This is problematic when entire categories of sites were basically left out of the running, since the directory had no way to categorise them. I had that problem with a site about a video game system the directory hadn't added yet, and I suspect others would have it for say, a site about a newer TV show/film or a new JavaScript framework.
You've also got the increase in resources needed (you need tons of staff for effective curation), and the issues with potential corruption to deal with (another thing which significantly affected the ODP's usefulness in its later years).
Federation would help with both breadth and potential corruption, compared to what we had with ODP/DMOZ. A federated Web directory (with common naming/categorization standards, but very little beyond that) would probably have been infeasible back then simply because the Internet was so much smaller and fewer people were involved (and DMOZ itself partially made up for that lack by linking to "awesome"-like link lists where applicable) - but I'm quite sure that it could work today, particularly in the "commercial-ish" domain where corruption worries are most relevant.
The results are human curated as much as google would like to publicly pretend otherwise.
I think a more fundamental problem is a large portion of content production is now either unindexable or difficult to index - Facebook, Instagram, Discord, and YouTube to name a few. Pre-Facebook the bulk of new content was indexable.
YouTube is relatively open, but the content and context of what is being produced is difficult to extract, if only because people talk differently than they write. That doesn't mean, in my opinion, that the quality of a YouTube video is lower than what would have been written in a blog post 15 years ago, but it makes it much more difficult to extract snippets of knowledge.
Ad monetization has created a lot of noise too, but I’m not sure without it, there would be less noise. Rather it’s a profit motive issue. Many, many searches I just go straight to Wikipedia and wouldn’t for a moment consider using Google for.
Frankly I think the discussion here is way better than the pretty mediocre to terrible “case study” that was posted.
Immediately before Google were search engines like AltaVista https://en.wikipedia.org/wiki/AltaVista (1995) and Lycos https://en.wikipedia.org/wiki/Lycos (1994) which were not directories like Yahoo. Google won by not being cluttered with non-search web portal clutter, and by the effectiveness of PageRank, and because by the late 1990s the web was too big to be indexed by a manually curated directory.
Perhaps it’s not always a new heuristic that is needed, but a better way to manage the externalities around current/preceding heuristics.
From a “knowledge-searching” perspective, at a very rudimentary level, it makes sense to look to sites/pages that are often cited (linked to) by others as better sources to rank higher up in the search results. It’s a similar concept to looking at how often academic papers are cited to judge how “prominent” of a resource they are on a particular topic.
However, as with academia, even though this system could work pretty well for a long time at its best (science has come a long way over hundreds of years of publications), that doesn’t mean it’s perfect. There’s interference that could be done to skew results in one’s favor, there’s funneling money into pseudoscience to turn into citable sources, there’s leveraging connections and credibility for individual gain, - the list goes on.
The heuristic itself is not innately the problem. The incentive system that exists for people to use the heuristic in their favor creates the issue. Because even if a new heuristic emerges, as long as the incentive system exists, people will just alter course to try to be a forerunner in the “new” system to grab as big a slice of the pie while they can.
That’s a tough nut for google (or anyone) to crack. As a company, how could they actually curate, maintain, and evaluate the entire internet on a personal level while pursuing profitability? That seems near impossible. Wikipedia does a pretty damn good job at managing their knowledge base as a nonprofit, but even then they are always capped by amount of donations.
It’s hard to keep the “shit-level” low on search results when pretty much anyone, anywhere, anytime could be adding in more information to the corpus and influencing the algorithms to alter the outcomes. It gets to a point where getting what you need is like finding a needle in a haystack that the farmer got paid for putting there.
> I have been thinking about the same problem for a few weeks. The real problem with search engines is that so many websites have hacked SEO that there is no meritocracy left.
That's not actually the problem described here. His problem is actually a bit more deeply rooted, since he specified the exact parameters of what he wants to see but got terrible results. He specified a search with "site:reddit.com", but the results he got were irrelevant and worse than the results he would have gotten by searching Reddit directly.
I'm not saying that SEO, sites that copy content and only want to generate clicks, and large sites that aggregate everything are bad for the internet of today, but the level of results we get out of search engines today is, in one word, abysmal.
Wrong. The site query worked. The issue is that there is no clear way to determine information date, as pages themselves change. Since more recent results are favored, SEO strategy of freshness throws off date queries.
https://www.searchenginejournal.com/google-algorithm-history...
> As you can see, I didn’t even bother clicking on all of them now, but I can tell you the first result is deeply irrelevant and the second one leads to a 4 year old thread.
He also wrote
> At this point I visibly checked on reddit if there’ve been posts about buying a phone from the last month and there are.
Duckduckgo even recognized the date to be 4 years old and reddit doesn't hide the age of posts. There are newer more fitting posts, but they aren't shown. And again a quote
> Why are the results reported as recent when they are from years ago, I don’t know - those are archived post so no changes have been made.
So your argument (though it really is a problem) is in this case a red herring. The problem lies deeper, since Google seems to be unable to do something as simple as extracting the right date, and DDG ignores it. Also, since all the results are years old, it adds to the confusion why the results don't match the query. (He also wrote that the better matches were indeed indexed, but not found.)
You said this wrong thing:
He specified a search for "site:reddit.com" and claimed it was irrelevant. THAT IS NOT A RELEVANCY TERM. It correctly scopes searches to the specified site.
The entirety of the problem is the date query is not working, because of SEO for freshness. You also said this other wrong thing: "That's not actually the problem described here" . That is the problem here. The page shows up as being updated because of SEO.
The date in a search engine is the date the page was last judged to be updated, not one of the many dates that may be present on the page. When was the last reddit design update? Do you think that didn't cause the pages to change? Wake up.
Not to totally detract from your point, but my previous experience with SEO people showed that some SEO strategies not only improve page ranking but also actual usability.
The first, and the most important perhaps was page load speed. We adopted a slightly more complicated pipeline on the server side, reduced the amount of JS required by the page, and made page loading faster. That improved both the ranking and actual usability.
The second was that the SEO people told us our homepage contained too many graphics and too little text, so search engines didn't extract as much content from our pages. We responded by adding more text in addition to the fancy eye-catching graphics. That improved both the ranking and the actual accessibility of the site.
I have noticed most HN comments with SEO in them take it as being bad bad bad and the reason for the death of good search, the need for powerful whatever..
I really wish everyone would qualify, and not just black-hat seo / whitehat - there are many types of SEO, often with different intentions.
I understand there has been a lot of google koolaid (and others) about how seo is evil it's poisoning the web, etc...
But now (or has it been a couple of years now?) Google had a video come out saying an SEO firm is okay if they tell you it takes 6 months... they have upgraded their PageSpeed tool, which helps with SEO, and they were quite public about how they wanted SSL/HTTPS on everything and that it would help with Google SEO..
So there are different levels of SEO. Someone mentioned an SEO plugin I was using on a site as being a negative indicator, and I chuckled - the only thing that plugin does is try to fix some of the inherent, obvious screwups of WordPress out of the box... things like missing meta descriptions, which Google flags in Webmaster Tools as multiple identical meta descriptions.. it also tries to alleviate duplicate-content penalties by no-indexing archives and categories and whatever.
So there is SEO that is trying to work with google, and then there is SEO where someone goes out and puts comments on 10,000 web sites only for the reason of ranking higher.. to me that is kind of grey hat if it was a few comments, but shady if it's hundreds and especially if it's automated..
but real blackhat stuff - hacking other web sites and adding links.. or having a site that is selling breast enlarging pills and trying to get people who type in a keyword for britney spears.. that is trying to fool people.
I have built sites with good info and had to do things to make them prettier for the ranking bot, but they are giving the surfer what they are looking for when they type 'whatever phrase'... I have also made web sites better when trying to also get them to show up in top results.
So it's not always seo=bad; sometimes seo=good, for the algorithm and the users.
and sometimes it's madness - like extra fluff to keep a visitor on page longer to keep google happy like recipes - haha - many different flavors of it - and different intentions.
I've often thought one approach, though one I wouldn't necessarily want to be the standard paradigm, would be exclusivity based on usefulness.
So for example duckduckgo is still trying to use various providers to emulate essentially "early google without the modern google privacy violations", but when I start to think about many of the most successful company-netizens, one thing that stands out is early day exclusivity has a major appeal.
So I imagine a search engine that is only crawling the most useful websites and blogs and works on a whitelist basis. Instead of trying to order search results to push bad results down, just don't include them at all or give them a chance to taint the results. It would have more overhead, and would take a certain amount of time to make sure it was catching non-major websites that are still full of good info ... but once that was done it would probably be one of the best search panes in existence. I have also thought to streamline this, and I know it's cliche, but surely there could be some ml analysis applied to figuring out which sites are SEO gaming or click-baiting regurgitators and weed them out.
Just something I've been mulling over for a while now.
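For what it's worth, the whitelist gate itself is the easy part to sketch; the domains below are placeholders, and the hard part of building and maintaining the list is exactly what's being hand-waved here:

    from urllib.parse import urlparse

    # Hand-picked domains; anything else is simply never crawled,
    # rather than crawled and then pushed down the rankings.
    WHITELIST = {"en.wikipedia.org", "lwn.net", "arstechnica.com"}

    def should_crawl(url):
        """Gate the crawl frontier: only whitelisted hosts get fetched."""
        return urlparse(url).netloc.lower() in WHITELIST

    frontier = ["https://en.wikipedia.org/wiki/Search_engine",
                "https://spammy-aggregator.example/top-10-best-anything"]
    print([u for u in frontier if should_crawl(u)])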
And how do you determine which websites are good other than checking if they are doing seo?
Is reddit.com good or bad? If a good site that does seo should it be taken out?
And what if what you're searching for exists only on a non-good website? Isn't it better to show a result from a non-good website than to show nothing?
I too am asking people to start making lists of lame query returns. I have taken screen shots of some, even made a video about one... but a solid list in a spreadsheet perhaps would be helpful.. of course with results varying for different people / locations and month to month / year to year, having some screen shots would be helpful too.
Not sure if there is a good program for snapping some screen shots and pulling some key phrases and putting it all together well...
Many websites do have "hacked" (blackhat/shady) SEO, but these websites do not last long, and are entirely wiped out (see: de-ranked) every major algorithm update.
The major players you see on the top rankings today do utilize some blackhat SEO, but it's not at a level that significantly impacts their rankings. Blackhat SEO is inherently dangerous, because Google's algorithm will penalize you at best when it finds out -- and it always does -- and at worst completely unlist your domain from search results, giving it a scarlet letter until it cools off.
However, the bulk of all major websites primarily utilize whitehat SEO, i.e. "non-hacked," i.e. "Google-approved" SEO to maintain their rankings. They have to, else their entire brand and business would collapse, either from being out-ranked or by being blacklisted for shady practices.
Additionally, Google's algorithm hasn't changed much at all from PageRank, in the grand scheme of things. If you can read between their lines, the biggest SEO factor is: how many backlinks from reputable domains do you have pointing at your website? Everything else, including blackhat SEO, is small optimizations for breaking ties. Sort of like PED usage in competitive sports; when you're at the elite level, every little bit extra can make a difference.
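For reference, the backlink idea behind PageRank fits in a few lines; this toy power iteration over a made-up four-page graph only shows the principle, not Google's actual ranking:

    # A page's weight comes from the weight of the pages linking to it.
    # The tiny link graph and damping factor here are illustrative only.
    links = {
        "a": ["b", "c"],   # page "a" links out to "b" and "c"
        "b": ["c"],
        "c": ["a"],
        "d": ["c"],        # "c" earns rank from several linkers
    }
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    damping = 0.85

    for _ in range(50):  # iterate until the ranks roughly converge
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank

    print(sorted(rank, key=rank.get, reverse=True))  # "c" ends up on top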
Google's algorithm works for its intended purposes, which is to serve pages that will benefit the highest amount of people searching for a specific term. If you are more than 1 SD from the "norm" searching for a specific term, it will likely not return a page that suits you best.
Google's search engine is based on virality and pre-approval. "Is this page ranked highly by other highly ranked pages, and does this page serve the most amount of people?" It is not based on accuracy or informational integrity -- as many would believe from the latest Medic update -- but simply "does this conform to normal human biases the most?"
If you have a problem with Google's results, then you need to point the finger at yourself or at Google. SEO experts, website operators, etc. are all playing a game that's set on Google's terms. They would not serve such shit content if Google did not: allow it, encourage it, and greatly reward it.
Google will never change the algorithm to suit outliers, the return profile is too poor. So, the next person to point a finger at is you: the user. Let me reiterate, Google's search engine is not designed for you; it is designed for the masses. So there is no logical reason for you to continue using it the way you do.
If you wish to find "deep enough" sources, that task is on you, because it cannot be readily or easily monetized; thus, the task will not be fulfilled for free by any business. So, you must look at where "deep enough" sources lay: books, journals, and experts.
Books are available from libraries, and a large assortment of them are cataloged online for free at Library Genesis. For any topic you can think of, there is likely to be a book that goes into excruciating detail that satisfies your thirst for "deep enough."
Journals, similarly. Library Genesis or any other online publisher, e.g. the NIH, will do.
Experts are even better. You can pick their brains and get even more leads to go down. Simply, find an author on the subject -- Google makes this very easy -- and contact them.
I'm out of steam, but I really felt the need to debunk this myth that Google is a poor, abused victim, and not an uncaring tyrant that approves of the status quo.
> Google's algorithm works for its intended purposes, which is to serve pages that will benefit the highest amount of people searching for a specific term.
Does it? So for any product search, thrown-together comparison sites without actual substance but lots of affiliate links are really among the best results? Or maybe they are the most profitable result, and thus the one most able to invest in optimizing for ranking? Similarly, do we really expect results on (to a human) clearly hacked domains to be the best for anything, but Google will still put them in the top 20 for some queries? "Normal people want this crap" is a questionable starting point in many cases.
Over the long term, Google's algorithm will connect the average person to the page most likely to benefit them more often than not.
There is no "best result."
Any page falling under "thrown-together comparison sites without actual substance but lots of affiliate links" is a temporary inefficiency that gets removed after each major update.
Will more pop up? Yes, and they will take advantage of any inefficiency or edge case in the algorithm to boost their rankings to #1.
Will they stay there for more than a few months? No. They will be squashed out, and legitimate players will over time win out.
This is the dichotomy between "churn and burn" businesses and "long term" businesses. You will make a very lucrative, and quick, buck going full blackhat, but your business won't last and you will constantly need to adapt to each successive algo update. Long-standing "legit" businesses, meanwhile, only need to maintain market dominance -- something much easier to do than breaking into the market from ground zero, which churn-and-burners will have to do in perpetuity until they burn out themselves.
If you want to test this, go and find 10 websites you think are shady but have top-5 rankings for a certain search phrase. Mark down the sites, keywords, and exact pages linked. Now, wait a few months. Search again using those exact phrases. More likely than not, i.e. more than 5 out of 10 will no longer be in the top 5 for their respective phrases, and a couple of the domains will have been shuttered. I should note that "not deep info" is not "shady," because the results are for the average person. For example, WebMD is not deep, but it's not shady either.
I implore people to try to get a site ranked with blackhat tricks and lots of starting capital, and see just how hard it is to keep it ranked consistently using said tricks. It's easy to speculate and make logical statements, but they don't hold much weight without first-hand experience and observation.
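For anyone who wants to run that experiment, a minimal way to do the bookkeeping is sketched below: hand-record (keyword, domain, rank) into one CSV snapshot now and another a few months later, then diff the two. The file names and column names are assumptions, not any standard format.

```python
# Sketch of the suggested experiment's bookkeeping: hand-record
# (keyword, domain, rank) into one CSV snapshot now and another a few months
# later, then diff them.
import csv

def load_snapshot(path):
    """CSV columns: keyword, domain, rank."""
    with open(path, newline="") as f:
        return {(row["keyword"], row["domain"]): int(row["rank"])
                for row in csv.DictReader(f)}

before = load_snapshot("serp_snapshot_jan.csv")
after = load_snapshot("serp_snapshot_jun.csv")

for (keyword, domain), old_rank in sorted(before.items()):
    new_rank = after.get((keyword, domain))
    if new_rank is None:
        print(f"{domain} no longer ranks for {keyword!r} (was #{old_rank})")
    elif new_rank != old_rank:
        print(f"{domain} moved #{old_rank} -> #{new_rank} for {keyword!r}")
```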
>Will they stay there for more than a few months? No. They will be squashed out, and legitimate players will over time win out.
This isn't true at all in my experience. As a quick test I tried searching for "best cordless iron"; on the first page there is an article from 2018 that leads to a very broken page with filler content and affiliate links. [1] There are a couple of other articles, with basically the same content rewritten in various ways, also on the first page.
A quick SERP history check confirms that this page has returned in the top 10 results for various keywords since late 2018.
>It's easy to speculate and make logical statements, but they don't hold much weight without first-hand experience and observation.
This statement is a bit ironic given that it took me 1 keyword and 5 seconds of digging to find this one example.
Here's an xkcd-inspired [0] idea. We have several search engines, each with some level of bias. We're not looking to crawl the whole internet, because we can't compete with their crawlers. However, we could build a crawler that crawls their results and re-ranks the top N from each engine according to our own metric. Maybe even expose the parameters in an "advanced" search. I'm assuming this would violate some sort of EULA, though. Any idea if someone has tried this approach?
Edit: thinking more about this post's specific issue, I'm not sure what to do if all the crawlers fail. Could always hook into the search APIs for GitHub, Reddit, Stack Overflow, Wikipedia, etc. Full shotgun approach.
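A sketch of the re-ranking core under those assumptions: reciprocal rank fusion is a standard way to merge several ranked lists, and the per-engine weights are exactly the kind of knob an "advanced" search page could expose. The fetch_top_results() function is a placeholder; a real version would call each engine's API or parse its HTML, which their terms of service may restrict.

```python
# Sketch of the meta-search idea: pull the top N from several engines and
# merge them with reciprocal rank fusion. fetch_top_results() is a
# placeholder, not a real API client.
from collections import defaultdict

def fetch_top_results(engine, query, n=20):
    """Placeholder: return the top-n result URLs from one engine."""
    raise NotImplementedError

def rerank(query, engines, weights=None, n=20, k=60):
    weights = weights or {}
    scores = defaultdict(float)
    for engine in engines:
        w = weights.get(engine, 1.0)
        for rank, url in enumerate(fetch_top_results(engine, query, n), start=1):
            # Earlier positions and agreement across engines both raise a
            # URL's combined score; k damps the influence of the very top slot.
            scores[url] += w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example call (once fetch_top_results is implemented):
# rerank("best cordless iron", ["bing", "duckduckgo", "qwant"],
#        weights={"duckduckgo": 1.5})
```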
> The real problem with search engines is the fact that so many websites have hacked SEO that there is no meritocracy left.
I intend to announce the alpha test of my search engine here on HN. My search engine is immune to all SEO efforts.
> I can possibly not find anything deep enough about any topic by searching on Google anymore.
In simple terms my search engine gives users content with the meaning they want and in particular stands to be very good, by far the best, at delivering content with "deep" meaning.
> I need something better.
Coming up.
> However, it will be interesting to figure the heuristics to deliver better quality search results today.
Uh, sorry, it's not fair to say that my search engine is based on "heuristics". I'm betting on my search engine being successful and would have no confidence in heuristics. Instead of heuristics I took some new approaches:
(1) I get some crucial, powerful new data.
(2) I manipulate the data to get the desired results, i.e., the meaning.
(3) The search engine likely has by far the best protections of user privacy. E.g., search results are the same for any two users doing the same query at essentially the same time and, thus, in particular, independent of any user history.
(4) The search engine is fully intended to be safe for work, families, and children.
For those data manipulations, I regarded the challenge as a math problem and took a math approach complete with theorems and proofs. The theorems and proofs are from some advanced, not widely known, pure math with some original applied math I derived. Basically the manipulations are as specified in math theorems with proofs.
> A new breakthrough heuristic today will look something totally different, just as meritocratic and possibly resistant to gaming.
My search engine is "something totally different".
My search engine is my startup. I'm a sole, solo founder and have done all the work. In particular I designed and wrote the code: it's 100,000 lines of typing using Microsoft's .NET. The typing was without an IDE (integrated development environment) and, instead, was just into my favorite general purpose text editor KEdit.
It's my first Web site: I got a good start on Microsoft's .NET and ASP.NET (for the Web pages) from Jim Buyens, Web Database Development, Step by Step, .NET Edition, ISBN 0-7356-1637-X, Microsoft Press.
The code seems to run as intended. The code is not supposed to be just a "minimum viable product" but is intended for first production to peak usage of about 100 users a second; after that I'll have to do some extensions for more capacity. I wrote no prototype code. The code needs no refactoring and has no technical debt.
While users won't be aware of anything mathematical, I regard the effort as a math project. The crucial part is the core math that lets me give the results. I believe that that math will be difficult to duplicate or equal. After the math and the code for the math, the rest has been routine.
Ah, venture capital and YC were not interested in it! So I'm like the story "The Little Red Hen" that found a grain of wheat, couldn't get any help, then alone grew that grain into a successful bakery. But I'm able to fund the work just from my checkbook.
The project does seem to respond to your concerns. I hope you and others like it.
This sounds extremely implausible; especially claiming "immune to SEO" is like declaring encryption "unbreakable". A lot of human effort would be devoted to it if your engine became popular.
For a short answer, SEO has to do with keywords. My startup has nothing to do with keywords or even the English language at all. In particular, I'm not parsing the English language or any natural language; I'm not using natural language understanding techniques. In particular, my Web pages are so dirt simple to use (user interface and user experience) that a child of 8 or so who knows no English should be able to learn to use the site in about 15 minutes of experimenting and about three minutes of watching someone use the site, e.g., via a YouTube video clip of screen captures. E.g., lots of kids of about that age get good with some video games without much or any use of English.
E.g., how are you and your spouse going to use keywords to look for an art print to hang on the wall over your living room?
Keyword search essentially assumes (1) you know what you want, (2) know that it exists, and (3) have keywords that accurately characterize it. That's the case for a lot of search, and it commonly works great -- enough for Google, Bing, and, IIRC, more engines in Russia and China. It also long worked for the subject index of an old library card catalog.
But as in the post I replied to, it doesn't work very well when trying to go "deep".
Really, what people want is content with some meaning they have at least roughly in mind, e.g., a print that fits their artistic taste, sense of style, etc., for over the living room sofa. Well, it turns out there's some advanced pure math, not widely known, and still less widely really understood, for that.
Yes I encountered a LOT of obstacles since I wrote that post. The work is just fine; the obstacles were elsewhere. E.g., most recently I moved. But I'm getting the obstacles out of the way and getting back to the real work.
> Really, what people want is content with some meaning they have at least roughly in mind
Yes, but capturing meaning mathematically is somewhat of an unsolved problem in mathematics, linguistics, and semiotics. Your post claims you have some mathematics but (obviously, as it's a secret) doesn't explain what.
SEO currently relies on keywords, but SEO as a practice is humans learning. There is a feedback loop between "write page", "user types string into search engine" and "page appears at certain rank in search listing". Humans are going to iteratively mutate their content and see where it appears in the listing. That will produce a set of techniques that are observed to increase the ranking.
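That feedback loop can be made concrete with a toy simulation, sketched below: treat the page as a bag of feature values, the engine's ranking as a hidden scoring function, and let the "SEO" side hill-climb by mutating the page and keeping whatever scores better. The feature names and weights are invented for illustration; no real engine scores pages this way.

```python
# Toy simulation of the SEO feedback loop: the optimizer never sees the
# engine's scoring function, only whether a change ranked better or worse,
# yet it still converges on the engine's preferences.
import random

FEATURES = ["keyword_density", "backlinks", "page_speed", "thin_content"]
HIDDEN_WEIGHTS = {"keyword_density": 1.0, "backlinks": 3.0,
                  "page_speed": 1.5, "thin_content": -2.0}

def engine_score(page):
    """Stand-in for the search engine; invisible to the optimizer."""
    return sum(HIDDEN_WEIGHTS[f] * v for f, v in page.items())

page = {f: random.random() for f in FEATURES}
best = engine_score(page)

for _ in range(1000):
    candidate = dict(page)
    f = random.choice(FEATURES)
    candidate[f] = min(1.0, max(0.0, candidate[f] + random.uniform(-0.1, 0.1)))
    if engine_score(candidate) > best:   # "publish" whichever change ranked better
        page, best = candidate, engine_score(candidate)

print(page)  # the optimizer learns the engine's preferences without ever seeing them
```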
> Yes, but capturing meaning mathematically is somewhat of an unsolved problem in mathematics, linguistics, and semiotics.
I've been successful via my search math. As for your claim, as far as I know you are correct, but that does not make my search engine and its math impossible.
> That will produce a set of techniques that are observed to increase the ranking.
Ranking? I can't resist, to borrow from one of the most famous scenes in all of movies: "Ranking? What ranking? We don't need no stink'n ranking".
Nowhere in my search engine is anything like a ranking.
So, do you only ever display one single result? Or do you display multiple results? Because if you display multiple results, they will be in a textual order, whether that's top to bottom or left to right, and that is a ranking.
People pay tens or even hundreds of thousands of dollars to move their result from #2 to #1 in the list of Google results.
My user interface is very different from Google's, so different there's no real issue of #1 or #2.
Actually that #1 or #2, etc. is apparently such a big issue for Google, SEO, etc. that it suggests a weakness in Google, one that my work avoids.
You will see when you play a few minutes with my site after I announce the alpha test.
Google often works well; when Google works well, my site is not better. But the post I responded to mentions some ways Google doesn't work well, and for those and some others my site stands to work much better. I'm not really in direct competition with Google.
Stop vaguebooking and post it up on HN. If you're comfortable with where the product is at the moment, then share it. It will never be finished, so share it today.
His post was interesting to me since it mentioned some of the problems that I saw and that got me to work on my startup. And my post might have been interesting to him since it confirms that (i) someone else also sees the same problems and (ii) has a solution on the way.
As for explaining my work fully, maybe even going open source: lots of people would say that I shouldn't do that. Indeed, that anyone would do a startup in Internet search seems just wacko, since they would be competing with Google and Bing, some of the most valuable efforts on the planet.
So that my efforts are not just wacko: (i) I'm going for a part of search, e.g., solving the problems of rahulchhabra07, not currently solved well; (ii) my work does not really replace Google or Bing when they work well, and they do, what, some billions of times a second or some such?; (iii) my user interface is so different from that of Google and Bing that, at least at first cut, my work would be like combining a raccoon and a beaver into a racobeaver or a beavorac; and (iv) at least to get started I need the protection of trade secret internals.
Or, uh, although I only just now thought of this, maybe Google would like my work because it might provide some evidence that they are not all of search and don't really have a monopoly, an issue in recent news.
Thanks. I intend to announce the alpha test here at HN, and I will have an e-mail address for feedback (already do -- at least I got that little item off my TODO list, although it took 36 hours of on-the-phone mud wrestling with my ISP to set it up).
Ouch, please don't be a jerk on HN. I know it's common on the rest of the internet, but it will get an account banned on Hacker News. Would you mind reviewing the site guidelines and taking the spirit of this site to heart? We'd be grateful. https://news.ycombinator.com/newsguidelines.html
graycat is an elder here. He has an interesting history in technology, I dare say more interesting than most of us ever will. He deserves better than your reply (as would any other HN user). Check out these posts:
From my end, it looks like Google search is very strongly prioritising paid clients and excluding references to everything else. Try a search, or a view from Maps: it shows a world that only includes Google ad purchasers.
Google has become a not-very-useful search engine -- certainly not the first place I go when looking for anything except purchases. They've broken their "core business".