One of the most important components of Pre-2010 Google's search system was its synonym discovery mechanism. Simply put, queries would be "expanded" with synonyms. Google automatically generated synonym choices that took into account the context of surrounding words, with the understanding that synonyms are highly context dependent. Steven Baker, John Lamping, and a couple of others were key engineers of the system.
Does anyone with an NLP background care to take some guesses about how the synonym extraction methodology worked? My only piece of information is that it likely used the query log itself to do so.
I was on the team too, with less impact than the names you mentioned. The team filed a number of patents that describe parts of how it worked. You can query a patent search engine with terms like [baker synonyms]. Looking now I think Steve was on most of the patents and you can also gather adjacent coauthor names from there.
[I am not a fan of patents, but to the extent they have any positives they in principle serve to share knowledge about how inventions work. Also I am not a lawyer but I think patents last 20 years from filing date and these were filed ~20 years ago maybe?]
I got a couple patents while at Google. I sent a nice readable 4 page design doc that I wrote to a patent lawyer, and I got back 40 pages of nonsense that I basically didn't understand.
I wish there were some kind of readability requirement for patents, if they are to continue to exist.
Ha, I had the exact same experience. The lawyers sent me what they wrote for me to verify that they didn't mess anything up, and it almost seemed like they had based the patent on an entirely different document than the one I wrote up for the invention disclosure—not because they actually messed it up, but because patentese is almost as far from regular English as any foreign language.
Same here - I had a very trivial idea and talked to a few lawyers who turned it into something far, far more complicated (and clever; in fact I wondered why they didn't just write patents on their own all day long). At the time, each submitted patent earned you $1K, and if it was accepted (I don't recall the actual terms) another $5K. Easy money! Hopefully my ex-employer won't abuse this patent....
Very cool that you worked on it! I've found most, I think, of the patents. They are... as has been hinted at in this thread, very difficult to parse and (imo) don't actually reveal much, though I may just lack the expertise! That's why I was hoping to get some NLP folks to speculate!
Your point on dates is something I did want to call out - I wouldn't be asking this if it wasn't ancient history. I have no interest in doing anything sinister. Just trying to explore a fun part of Internet history. Any shot I could shoot you an email to chat?
I just realised that this technique is absent from local/desktop search. Meaning that in most systems you’re expected to recall how something was phrased, if you want to have a chance of finding it.
I know “Google Desktop” used to be a product years ago. What’s the state of that space today?
We experimented with it some time ago for our https://curiosity.ai app. Initial training on your data was a bit heavy (at the time; probably fine by today's standards), but it gave nice results if you had enough files. It needs to be done with care for small datasets, as there's not enough information for a model to learn and you end up introducing more noise than anything.
> in principle serve to share knowledge about how inventions work
Emphasis on "in principle". Most patents - especially software patents - are completely unintelligible. They also tend to describe the system just enough that the holder can sue people who do the same thing, but nowhere near enough that you could actually implement it based on the patent.
Also an underrated feature of 2010-era search was Matt Cutts, the author of the article. He was an outlier at Google in that he did real community engagement as well as anti-spam work, which is a huge contrast to today's Google and how the internet has reacted to present-day SEO.
While the Matt Cutts-era search tech is interesting, it's crucial to keep in mind that the dataset was very different then too, partly as a result of Matt Cutts' own attitude towards spam and SEO.
Back in 2010, LDA was big, and Google had used probabilistic networks as models, e.g. Rephil / large noisy-OR networks.
Would the same things work today given how SEO spam and Google ads work? The same models are probably useful but it’s the noise and the long tail of the data that makes the problem hard.
I was a fan of LDA but would not agree that it is 'probably useful' today. It's an unsupervised clustering algorithm, typically fit with Gibbs sampling. Like k-means, it's gonna return a few buckets that will have to be reviewed by a human for data exploration. In this case, instead of neatly labeled buckets, you get unlabeled distributions of distributions (lists of single-word tokens). If you do some kind of multiword tokenization preprocessing, it'll return a few lists of words and multiword tokens for each document. How is this useful to an end user? Even internally, they're not useful embeddings/vectorizations. Would love to hear some contrary opinions.
In many applications, especially Google's display ad targeting market, the "accuracy" of the clusters isn't so important as the lift in key metrics (e.g. click rates or revenue) and the overall efficiency of the method. Indeed, the clustering algo might get things "dead wrong" but somehow surface something that causes clicks and revenue to increase. LDA offered much improvement over e.g. TF-IDF models, just as t-SNE improved on LDA, and now LLM embeddings are on average better and potentially cheap to compute.
LDA could be useful if your success metric is perplexity; k-means is useful if vector distance is very meaningful for your problem. Also well-studied algorithms are generally useful for initial studies in a new, unknown dataset. As always with ML, the dataset and setting are just as important as the model and algorithm.
If you have access to the query log (aka "who makes which query in what context"), you can see which queries are "close" to others in context.
For example, with session data, you can detect manual query rewriting and use this as a signal for which queries are close to each other in time. You can do various fancy things from just that.
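A toy version of that rewrite signal might look like the following. The session data is invented, and a real pipeline would need normalization, spelling correction, and far more volume before any pair counted as evidence; this only shows the shape of the idea:

```python
# Hypothetical sketch: mine consecutive query pairs within a session as
# candidate term rewrites. Sessions here are made up for illustration.
from collections import Counter

sessions = [
    ["cheap flights nyc", "cheap airfare nyc", "cheap airfare new york"],
    ["java string split", "java string tokenize"],
    ["cheap flights nyc", "cheap airfare nyc"],
]

rewrites = Counter()
for queries in sessions:
    for prev, cur in zip(queries, queries[1:]):
        # Terms the user swapped between consecutive queries.
        dropped = set(prev.split()) - set(cur.split())
        added = set(cur.split()) - set(prev.split())
        if len(dropped) == 1 and len(added) == 1:
            rewrites[(dropped.pop(), added.pop())] += 1

# Frequently repeated single-term swaps become candidate synonym pairs,
# e.g. ('flights', 'airfare') here.
print(rewrites.most_common(3))
```

Keeping the surrounding terms of the query around (here just implicitly, via which queries the swap occurred in) is what would make the synonyms context-dependent rather than global.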
Nowadays, a simple way to start would be to use SOTA LLMs to generate synonyms offline, and use those for query expansion at query time. At least in a context where queries are small, that should give decent results. However, this has diminishing returns because of cost (the more synonyms, the more expensive querying the index becomes), and you also lose precision as you push for more recall.
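The query-time half of that is cheap; a minimal sketch, where the synonym table stands in for whatever was generated offline (the table contents and function name are invented):

```python
# Minimal query-expansion sketch. The synonym dict is a stand-in for an
# offline-generated table (e.g. from an LLM); entries are hypothetical.
synonyms = {
    "nyc": ["new york"],
    "photos": ["pictures", "images"],
}

def expand(query: str) -> str:
    # Each term becomes an OR group of itself plus its synonyms. More
    # alternatives means more recall, but a more expensive index lookup.
    parts = []
    for term in query.split():
        alts = [term] + synonyms.get(term, [])
        parts.append("(" + " OR ".join(alts) + ")" if len(alts) > 1 else term)
    return " ".join(parts)

print(expand("nyc photos"))
# (nyc OR new york) (photos OR pictures OR images)
```

The cost concern above is visible directly: every added synonym widens an OR group, and the index has to evaluate every alternative.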
Of course, for complex search like Google's, I am sure it is much more complicated.
Re: LLMs, I was trying to better understand how pre-LLM search worked, hence the interest in the topic.
Any chance you have any open source links that discuss how you practically operate a system based on the concept you describe (manual query rewrite w/i a session as your data set)? Perhaps it's obvious to an NLP person how to reduce that "idea" to practice, but it is not to me!
You're definitely right about the idea though - a former Search engineer obliquely mentioned that this sort of session based manual query rewriting was very core to how the synonym system worked.
It is hard to find modern references on this. When I led a search group, coming from an ML but non-search background, I found the following most useful
Kind of like a distance/similarity metric for words, but not an embedding - it's based on joint probability. I once used it to automatically group words into multiword tokens. The wiki page is informative. But I wouldn't know what they were using.
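Assuming the metric being described is pointwise mutual information (PMI), a common joint-probability score for exactly this multiword-token use case, a minimal sketch (the corpus is made up):

```python
# PMI sketch (an assumption about which metric is meant): score a bigram
# by how much more often its words co-occur than chance would predict.
import math
from collections import Counter

tokens = "new york new york new city old city new york old town".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def pmi(w1: str, w2: str) -> float:
    # log P(w1, w2) / (P(w1) * P(w2)); positive means the pair co-occurs
    # more than independence would predict.
    p_joint = bigrams[(w1, w2)] / n_bi
    p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log(p_joint / (p1 * p2))

# High-PMI bigrams are candidates for multiword tokens like "new_york".
print(pmi("new", "york"), pmi("new", "city"))
```

One known caveat: raw PMI over-rewards rare pairs, so practical groupers usually add a minimum-count threshold before promoting a bigram to a token.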