I have never had coffee. I drink Diet Coke and I enjoy tea (I have some once every week or so). I have been writing code for over 25 years now, and I have never felt that I was missing out -- but then again, I'm not sure I'd know either way.
> One trend I’m most curious about right now is the rise of no or low alcohol drinks. I noticed this in Europe years ago — that nearly everywhere you went they not only offered, but touted non-alcoholic beverages in bars — highlighting the opposite approach the US was taking: pushing more extreme alcoholic drinks like Four Loko and the like, with non-alcoholic drinks being almost a taboo here. That is changing, quite quickly, it seems.
Our (https://bestprice.gr/) services/“programs” generate a few different types of events, including:
- Short events (no longer than ~1k in size) where the cost to generate and transmit is very low (they are sent to a dedicated service via UDP). We can generate dozens of them before we need to care about the cost of doing so. They are binary-encoded. The service that receives those datagrams generates JSON representations of the events and forwards them to all connected clients (firehose), and also publishes them to a special TANK partition. A minimal sketch of emitting such an event follows this list.
- Timing samples. We capture timing traces for requests that take longer than expected to process, and random samples from various requests for other reasons. They capture the full context of a request (complete with annotations, hierarchies, etc.). Those are also persisted to TANK topics.
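For illustration, here is a minimal Python sketch of what emitting one of those short, binary-encoded events over UDP might look like. The header layout, event code, and collector endpoint are assumptions made up for this sketch, not the actual wire format or service.

```python
# Hypothetical sketch of emitting a short, binary-encoded event over UDP.
# The header layout, event codes, and collector endpoint are assumptions
# for illustration only, not the actual wire format used by the service.
import socket
import struct
import time

COLLECTOR = ("127.0.0.1", 9999)  # assumed address of the event-collector service

def emit_event(event_type: int, payload: bytes) -> None:
    """Fire-and-forget: pack a tiny header plus payload and send one datagram."""
    assert len(payload) <= 1024, "short events are expected to stay around ~1k"
    # assumed header: u8 event type, u32 unix timestamp, u16 payload length
    header = struct.pack("!BIH", event_type, int(time.time()), len(payload))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(header + payload, COLLECTOR)  # no connection, no retries

emit_event(1, b"GET /search q=ifone status=200 took_ms=12")
```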
So, effectively, everything’s available on TANK. This has all sorts of benefits. Some of them include:
- We can, and do, have all sorts of consumers that process those generated events, looking for interesting/unexpected events and reacting to them (e.g. notifying whoever needs to know about them, etc.)
- It’s trivial to query those TANK topics using `tank-cli`, like so:
```
tank-cli -b <tank-endpoint> -t apache/0 get -T T-2h+15m -f "GET /search" | grep -a Googlebot
```
This will fetch all events starting 2 hours ago, for up to 15 minutes after that, that include “GET /search”.
All told, we are very happy with our setup, and if we were to start over, we’d do it the same way again.
I like how the blog post title is likely based on the seminal paper by Google's Page and Brin, "The Anatomy of a Large-Scale Hypertextual Web Search Engine". :)
This blog post, and others in the series, mention that RocksDB is used for the index, but it's not explicitly described how and to what end. I'd love to know the details.
SymSpell is great for tiny datasets (where "tiny" depends on the edit distance/operations and the dictionary size).
It’s fast, but the generated index will likely be very large; furthermore, you can’t practically encode rules and transition costs (e.g. making the cost from “f” to “ph” zero, to get better results for [ifone]).
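To make the size concern concrete, here is a small sketch of the delete-only candidate generation that SymSpell is built around. The function is illustrative, not the library's API; it just shows how the number of precomputed variants grows with term length and maximum edit distance.

```python
# Sketch of SymSpell's core idea: precompute all "delete" variants of each
# dictionary term up to max_distance, so that lookups only need to generate
# deletes of the input (no inserts/substitutions at query time).
# Function and variable names here are illustrative, not the library's API.

def deletes(term: str, max_distance: int) -> set:
    """All strings reachable from `term` by deleting up to max_distance characters."""
    results, frontier = {term}, {term}
    for _ in range(max_distance):
        frontier = {t[:i] + t[i + 1:] for t in frontier for i in range(len(t))}
        results |= frontier
    return results

# The variant count grows quickly with term length and edit distance, which is
# why long queries (e.g. ~30 characters) at distance 3 blow up the index:
for word, distance in [("phone", 2), ("searchengineoptimization", 3)]:
    print(word, distance, len(deletes(word, distance)))
```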
An FST is the better choice. We (https://www.bestprice.gr/) used SymSpell in the past, but have since switched to an FST-based design, where the FST is used to match prefixes. We also store all queries encoded in the FST, together with their frequencies, in a list sorted by query.
For each distinct corrected prefix, we determine the range in that (query, frequency) list using binary search, and then just iterate over that [first, last] range, computing a score based on a language model and the edit cost (prefix to corrected prefix), i.e. a noisy channel model.
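A simplified sketch of that lookup, where a plain sorted Python list stands in for the FST output and a log-frequency-minus-edit-cost score stands in for the real language model / noisy-channel score:

```python
# Simplified sketch of the lookup described above: queries are kept sorted, so
# all completions of a corrected prefix form one contiguous range found by
# binary search. The scoring (log frequency minus edit cost) is only a
# stand-in for the real language model / noisy-channel score.
import bisect
import math

# (query, frequency), sorted by query
QUERIES = sorted([
    ("iphone 11", 900), ("iphone case", 400),
    ("ipad pro", 300), ("ifone", 5),
])
KEYS = [q for q, _ in QUERIES]

def completions(corrected_prefix: str, edit_cost: float, k: int = 10):
    first = bisect.bisect_left(KEYS, corrected_prefix)
    last = bisect.bisect_right(KEYS, corrected_prefix + "\uffff")
    scored = [
        (math.log(freq + 1) - edit_cost, query)
        for query, freq in QUERIES[first:last]
    ]
    return sorted(scored, reverse=True)[:k]

# e.g. corrected prefix "iphone" for the typed prefix "ifone", with edit cost 1.0
print(completions("iphone", edit_cost=1.0))
```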
We encode many different rules in the FST, including language-domain ones, and building the index takes a few seconds for millions of queries. The language model is based on 5-grams and Stupid Backoff.
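For reference, Stupid Backoff itself is simple enough to sketch in a few lines. Only the 0.4 backoff factor comes from the original paper (Brants et al., 2007); the toy count tables and helper names below are my own assumptions.

```python
# Minimal Stupid Backoff: back off to shorter n-grams with a fixed
# multiplicative penalty (0.4 in the paper) instead of proper smoothing.
# The count tables here are toy stand-ins for illustration.
from collections import Counter

NGRAM_COUNTS = Counter()   # maps tuples of words -> count, for n = 1..5
TOTAL_UNIGRAMS = 0

def train(sentences, max_n=5):
    global TOTAL_UNIGRAMS
    for words in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                NGRAM_COUNTS[tuple(words[i:i + n])] += 1
        TOTAL_UNIGRAMS += len(words)

def stupid_backoff(word, context, alpha=0.4):
    """Score of `word` given up to 4 preceding words (not a true probability)."""
    for start in range(len(context) + 1):          # try the longest context first
        ctx = tuple(context[start:])
        ngram = ctx + (word,)
        if NGRAM_COUNTS[ngram] > 0:
            denominator = NGRAM_COUNTS[ctx] if ctx else TOTAL_UNIGRAMS
            return (alpha ** start) * NGRAM_COUNTS[ngram] / denominator
    return 0.0

train([["iphone", "11", "case"], ["iphone", "11", "pro"], ["ipad", "pro"]])
print(stupid_backoff("case", ["iphone", "11"]))    # uses the full trigram
print(stupid_backoff("pro", ["samsung", "ipad"]))  # backs off to the bigram
```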
[Disclaimer: I'm the author of SymSpell]
>>"SymSpell is great for tiny datasets".
Dictionaries with one million word entries are no problem, even for a maximum edit distance of 4. For comparison, the Oxford English Dictionary contains 301,100 main entries and 616,500 word-forms in total.
https://towardsdatascience.com/symspell-vs-bk-tree-100x-fast...
SymSpell can be augmented with a Weighted Edit distance giving a higher priority to pairs that are close to each other on the keyboard layout or which sound similar (e.g. Soundex or other phonetic algorithms which identify different spellings of the same sound).
There are two SymSpell implementations with a Weighted Edit distance available:
- https://github.com/MighTguY/customized-symspell
- https://github.com/searchhub/preDict
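As a rough illustration of the weighted edit distance idea (not the code in either repository above), substitutions between keyboard-adjacent keys can simply be given a lower cost in the standard dynamic-programming recurrence. The adjacency table and cost values below are made up, and real implementations also handle transpositions and phonetic similarity.

```python
# Sketch of a weighted edit distance: substitutions between keyboard-adjacent
# keys cost less than arbitrary substitutions. The adjacency table and costs
# are illustrative only.

KEYBOARD_NEIGHBORS = {
    "q": "wa", "w": "qes", "e": "wrd", "r": "etf", "t": "ryg",
    "a": "qsz", "s": "awedxz", "d": "serfcx", "f": "drtgvc",
    # remaining keys omitted for brevity
}

def substitution_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return 0.5 if b in KEYBOARD_NEIGHBORS.get(a, "") else 1.0

def weighted_edit_distance(source: str, target: str) -> float:
    m, n = len(source), len(target)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = float(i)
    for j in range(1, n + 1):
        dist[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,      # deletion
                dist[i][j - 1] + 1.0,      # insertion
                dist[i - 1][j - 1] + substitution_cost(source[i - 1], target[j - 1]),
            )
    return dist[m][n]

print(weighted_edit_distance("test", "rest"))  # 0.5: 'r' is adjacent to 't'
print(weighted_edit_distance("test", "best"))  # 1.0: 'b' is not
```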
For the few million queries we needed to index, many of them close to 30 characters in length (some even longer than that), the generated index for max edit distance 3 was really large.
So we used 3 indices (one each for unigrams, bigrams, and trigrams) -- and during query processing we would segment the input query. For each segment we’d consult whichever of those 3 indices made sense and keep the top-K matches; then we’d proceed to the next segment, consider the n-gram sequence formed by the “suffix” and “prefix” matches between the “carried” top-K from the previous segment and the current/next segment, and so on, until we had exhausted the query. Segmentation was particularly hard to get right.
It was a fairly involved process, but it was what worked for us -- again, SymSpell is _very_ fast for short tokens, but we had to execute maybe thousands of such SymSpell queries when processing an input query, and it adds up quickly. (We will probably open-source our SymSpell implementation soon.)
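A heavily simplified sketch of the "carry the top-K across segments" part, with placeholder functions standing in for the per-segment SymSpell lookups and the language model (neither is the actual implementation):

```python
# Heavily simplified sketch of keeping a small beam of partial hypotheses per
# segment and re-scoring across segment boundaries with an n-gram score.
# candidates() and bigram_score() are placeholders, not the real lookups.

def candidates(segment):
    """Placeholder for the SymSpell lookup against the unigram/bigram/trigram indices."""
    return {"ifone": [("iphone", -1.0), ("ifone", -2.0)],
            "case": [("case", 0.0)]}.get(segment, [(segment, 0.0)])

def bigram_score(previous_word, word):
    """Placeholder for the language-model score across the segment boundary."""
    return 0.5 if (previous_word, word) == ("iphone", "case") else 0.0

def correct(segments, k=3):
    beam = [((), 0.0)]                     # (words so far, accumulated score)
    for segment in segments:
        expanded = []
        for words, score in beam:
            for cand, cand_score in candidates(segment):
                bonus = bigram_score(words[-1], cand) if words else 0.0
                expanded.append((words + (cand,), score + cand_score + bonus))
        beam = sorted(expanded, key=lambda h: h[1], reverse=True)[:k]
    return beam

print(correct(["ifone", "case"]))   # best hypothesis: ('iphone', 'case')
```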
[Disclaimer: work at Cliqz]
Both techniques have their share of upsides and downsides; in fact, we also use an FST-based model to perform splits, i.e. donaldtrump ---> donald trump. The problem with the FST approach is when the prefix has a typo. I tried the spell correct on the website you linked to, and it seems to fail when the first character is incorrect. Thanks for the information though, it is always interesting to read and learn from other approaches :)
As for prefixes and typos, in our case, the FST is used for both matching a prefix and spelling correction of the prefix.
The first character is treated specially; the chance of getting the first character wrong is very low (according to our findings), so only completions/corrections that score very high get matched.
Just to provide some additional information, our library Keyvi (https://github.com/KeyviDev/Keyvi) has a very fast implementation of FST-based spell correction.
What does each of those 200 dimensions represent/encode?
This is a very interesting and pragmatic design. I tried a couple of queries and the results were great for most of them, but not so much for esoteric ones, or for queries involving names, which makes sense according to the implementation specifics outlined in their blog posts.
I had never heard of this company before, and I am very interested in IR technologies (https://github.com/phaistos-networks/Trinity). I am glad they are trying, and it looks like they have a chance at something great there.
Ahrefs.com is also working on a large scale web search engine, but other than their CEO announcing it on Twitter, there hasn’t been any other update since.
By default, there's no attempt to tie a particular semantic meaning to a particular dimension. It's worth noting that all the popular methods of calculating word vectors can/will give results that differ by an arbitrary linear transformation, so even if they do capture a very particular "semantic bit", it's still likely to be "smeared" across all 200 dimensions. You could apply a linear (unambiguous, reversible) transformation to a different 200-dimensional space where that particular factor is isolated in a separate dimension, but you would have to explicitly try to do that.
So the default situation is that each individual dimension means "nothing and everything". If you have some specific factors that you know beforehand and want to isolate, you can calculate a transformation that projects all the vectors into a different vector space where dimension #1 means thing A, #2 means thing B, etc. For example, there are some nice experiments with 'face' vectors that separate out age/masculinity/hair length/happiness/etc. from the initial vectors coming out of an image-analysis neural network, where no individual dimension has a clear meaning.
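A toy numpy sketch of that change-of-basis idea, with random vectors standing in for real embeddings and a factor direction estimated from two hypothetical groups of examples:

```python
# Toy sketch of the "change of basis" idea: estimate a direction for a known
# factor from labeled example vectors, then build an orthonormal basis whose
# first axis is that direction. In the new coordinates, dimension 0 carries
# that factor; the rest stays "smeared". Random vectors stand in for real
# embeddings here.
import numpy as np

rng = np.random.default_rng(0)
dim = 200
vectors = rng.normal(size=(1000, dim))            # pretend word embeddings

# direction for the factor of interest, e.g. mean(group A) - mean(group B)
factor_direction = vectors[:10].mean(axis=0) - vectors[10:20].mean(axis=0)
factor_direction /= np.linalg.norm(factor_direction)

# complete it to an orthonormal basis (QR keeps the first column's direction)
basis, _ = np.linalg.qr(
    np.column_stack([factor_direction, rng.normal(size=(dim, dim - 1))])
)
transformed = vectors @ basis                     # reversible change of basis

# transformed[:, 0] now measures the chosen factor, and nothing is lost:
print(np.allclose(transformed @ basis.T, vectors))
```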
This is the idea behind non-negative matrix factorization (NMF). As the name implies, it forces the entries of the embedding matrices (for both the reduced document and term matrices) to be nonnegative, which results in a more interpretable “sum of parts” representation. You can really see the difference (compared to LSA/SVD/PCA, which do not have this constraint) when it’s applied to images of faces. Also, NMF has been shown to be equivalent to word2vec. The classic paper is here: http://www.cs.columbia.edu/~blei/fogm/2019F/readings/LeeSeun...
PS—There should be a negative sign on the (2,2) entry of the first matrix.
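For anyone who wants to see the “sum of parts” behavior directly, here is a small example using scikit-learn's NMF on a made-up term-document count matrix (the data and the number of components are assumptions for illustration):

```python
# NMF's nonnegative factors on a toy term-document count matrix. Compare with
# TruncatedSVD/LSA, whose factors can be negative and are harder to interpret.
import numpy as np
from sklearn.decomposition import NMF

# rows = documents, columns = counts of [iphone, case, charger, pasta, recipe]
counts = np.array([
    [3, 2, 1, 0, 0],
    [2, 1, 2, 0, 0],
    [0, 0, 0, 4, 3],
    [0, 0, 1, 2, 2],
])

model = NMF(n_components=2, init="nndsvda", random_state=0)
doc_topics = model.fit_transform(counts)   # W: nonnegative document x topic
topic_terms = model.components_            # H: nonnegative topic x term

print(np.round(doc_topics, 2))             # each document is a sum of topics
print(np.round(topic_terms, 2))            # each topic is a sum of terms
```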
Is this similar in intent to what a 'variational autoencoder' does?
Also, is it possible in non-variational implementations (like this one) that some of the dimensions represent multiple groups? For example, not just 0.5 and -0.5 groups, but also a 0.0 group in the middle. Then your rotation wouldn't be sufficient, you would need to increase the dimensionality to cleanly separate the groups.
In general, the individual coordinates of a word embedding (and hence a sentence embedding) have no semantic meaning, at least not in any "normal" sense.
The n-dimensional space in which the embeddings are represented is mainly just a space where similarity between two vectors/embeddings has some semantic meaning, e.g. query or word similarity.
The math behind Word2Vec and GloVe is different. Most word embedding models have been shown to be equivalent to some version of matrix factorization, though, and, at least if you're comfortable with linear algebra, the SVD formulation makes it relatively easy to get an intuitive grasp on what the dimensions in an embedding really "mean".
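A compact sketch of that matrix-factorization view: build a (toy) positive PMI matrix from co-occurrence counts and take a truncated SVD, using the rows of U·sqrt(S) as word vectors. This is the SVD formulation, not word2vec itself, and the corpus below is a made-up stand-in.

```python
# Word vectors from SVD of a positive PMI co-occurrence matrix (toy data).
import numpy as np
from itertools import combinations

corpus = [["iphone", "case", "charger"],
          ["iphone", "charger", "cable"],
          ["pasta", "recipe", "sauce"],
          ["pasta", "sauce", "tomato"]]

vocab = sorted({w for sentence in corpus for w in sentence})
index = {w: i for i, w in enumerate(vocab)}

# symmetric co-occurrence counts within each toy "sentence"
C = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for a, b in combinations(sentence, 2):
        C[index[a], index[b]] += 1
        C[index[b], index[a]] += 1

# positive pointwise mutual information
total = C.sum()
pmi = np.log(C * total / (C.sum(1, keepdims=True) * C.sum(0, keepdims=True)) + 1e-12)
ppmi = np.maximum(pmi, 0)

U, S, _ = np.linalg.svd(ppmi)
embeddings = U[:, :2] * np.sqrt(S[:2])     # 2-dimensional word vectors

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embeddings[index["iphone"]], embeddings[index["charger"]]))  # words that co-occur
print(cosine(embeddings[index["iphone"]], embeddings[index["pasta"]]))    # words that never co-occur
```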
Word embeddings are unsupervised learning, so the features are not chosen, only the number of features. The model then learns the scalars for each feature, as a single vector per term, depending on the algorithm/architecture.
When using CBOW, for instance, with a set window size N, the features learned for a single term are based on the surrounding terms within that window.
This will result in similar vectors for terms appearing in the same contexts. It has its pros and cons, though -- a great example being “the reservation is confirmed” vs “the reservation is cancelled”, where confirmed/cancelled will end up with similar features.
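For concreteness, this is roughly what training CBOW vectors looks like with gensim 4.x (sg=0 selects CBOW; sg=1 would be skip-gram). The toy corpus is far too small to learn anything meaningful; it only shows where the window size and the confirmed/cancelled example fit in.

```python
# Training CBOW word vectors with gensim (toy corpus, purely illustrative).
from gensim.models import Word2Vec

sentences = [
    ["the", "reservation", "is", "confirmed"],
    ["the", "reservation", "is", "cancelled"],
    ["the", "booking", "is", "confirmed"],
    ["the", "booking", "is", "cancelled"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

# "confirmed" and "cancelled" appear in identical contexts, so their vectors
# tend to end up similar even though their meanings are opposite
print(model.wv.similarity("confirmed", "cancelled"))
```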
It seems odd to me that they ended up computing the average of the word vectors, which is way too simple to represent meaning. A similarity metric over such a semantic vector is not accurate.
- This is just one of the techniques used for recall. The aim at this point is to be super fast and scalable, and not super accurate. Once we reach the precision stage of the ranking (as defined in https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...), we can afford to do fancier matching.
Weighted based on what? Do you keep IDF values of some sort?
Even then, it's hard to imagine how this performs well if these are plain word2vec vectors. Saying it's just the recall step is a bit hand-wavy, as these will actually be selecting the documents you will perform additional scoring on, and may very well end up excluding a multitude of great results.
In any case, once more these are very interesting to read, and as a search nerd I can't help but wonder about all the alternatives considered.
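To make the two variants being discussed concrete, here is a toy comparison of a plain average of word vectors with an IDF-weighted average under cosine similarity. The vectors and IDF values are placeholders; whether the weighting is actually IDF-based is exactly the open question above.

```python
# Plain vs. IDF-weighted averaging of word vectors for a query embedding.
# The 4-dimensional "embeddings" and IDF values are made-up placeholders.
import numpy as np

word_vectors = {
    "cheap":  np.array([0.9, 0.1, 0.0, 0.0]),
    "iphone": np.array([0.1, 0.9, 0.2, 0.0]),
    "case":   np.array([0.0, 0.3, 0.9, 0.1]),
}
idf = {"cheap": 1.2, "iphone": 2.5, "case": 1.8}   # placeholder IDF values

def embed(query, weights=None):
    vectors = [word_vectors[w] * (weights[w] if weights else 1.0)
               for w in query.split() if w in word_vectors]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

q1, q2 = "cheap iphone case", "iphone case"
print(cosine(embed(q1), embed(q2)))                # plain averaging
print(cosine(embed(q1, idf), embed(q2, idf)))      # IDF-weighted averaging
```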
Was Gumbo used for parsing crawled pages for indexing purposes? How was it used internally? I presume they use some headless-Chrome-like technology for most of that nowadays.
No, the indexer used a custom HTML parser that was ancient (IIRC, when I was there it wasn't even fully HTML5-compliant -- it doesn't have to be, because if you're tokenizing pages for indexing the only really important parts are text, anchors, meta tags, formatting, and headers, and as long as you basically get those right it's not important that you have the exact parse tree a browser would produce). It was very tightly optimized for speed: e.g. it was SAX-style, with callbacks instead of heap allocations; the data it returned to the caller was indexes into the original source data rather than case-normalized, copied buffers; and the liberties it took with the HTML spec were largely so that it could minimize branching and state during the parse.
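A toy illustration (not the indexer's parser) of those two properties, i.e. a streaming, SAX-style scan that reports index spans into the original buffer instead of allocating and copying normalized strings:

```python
# Streaming scan that yields (kind, start, end) spans into the source buffer;
# callers slice the original bytes themselves, so nothing is copied or
# normalized here. This ignores almost all of HTML and is only a sketch.
def scan(html: bytes):
    i, n = 0, len(html)
    while i < n:
        if html[i:i + 1] == b"<":
            end = html.find(b">", i)
            if end == -1:
                break
            yield ("tag", i + 1, end)          # span covers the tag contents
            i = end + 1
        else:
            end = html.find(b"<", i)
            end = n if end == -1 else end
            yield ("text", i, end)
            i = end

source = b"<h1>Hello</h1><p>anchors, meta tags and <a href=/x>text</a></p>"
for kind, start, end in scan(source):
    print(kind, source[start:end])
```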
Headless Chrome existed within the indexing system while I was there but it was turned on for only a very small section of the index for CPU reasons. Their public statements since have indicated it's used for everything now.
Gumbo grew out of a templating language that I was working on with a larger team within Search Infrastructure and Maps. AFAIU the templating language is still being used, though I think Gumbo isn't being used for it anymore (its syntax diverged from pure HTML5 at some point). It was really intended for tools like refactoring & static analysis: that's why it's slower than the indexing HTML parser; why it was done in C rather than C++ (easier binding to scripting languages); why the API is intended to be round-trippable, with access to the original text that data structures refer to; and why it put a premium on generating accurate parses over speed.