// keywords is a sorted array of keyword strings; tokens holds the
// corresponding token values at the same indices.
let token_raw = "FROM";
let index = keywords.binary_search(&token_raw).unwrap();
let token = tokens[index];
As a binary search, that's O(log N) string comparisons. In the worst case it might compare against INSERT, then AS, then FROM before deciding the index is 1, and with more keywords you get more comparisons.
With perfect hashing, you devise a single hash function that hashes all possible values uniquely. Now we have something like this:
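(A minimal sketch in C of what that table can look like, with a made-up hash() and made-up values, purely for illustration:)

/* Hypothetical: every keyword's hash value is unique and known at
 * compile time, so it can index a dense table directly. */
enum { TOKEN_INSERT, TOKEN_AS, TOKEN_FROM, TOKEN_SELECT };

static const char *const keywords[4] = {
    "INSERT",   /* hash("INSERT") == 0 */
    "AS",       /* hash("AS")     == 1 */
    "FROM",     /* hash("FROM")   == 2 */
    "SELECT",   /* hash("SELECT") == 3 */
};

static const int tokens[4] = { TOKEN_INSERT, TOKEN_AS, TOKEN_FROM, TOKEN_SELECT };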
… with all the numbers calculated by hashing the strings at compile time. Then, instead of O(log N) string comparisons, you only need one hash and one hash table lookup:
let token_raw = "FROM";
let token = tokens[hash(token_raw)];
The lexer knows it has a token, but it still has to match the token identifier to the corresponding keyword. Whether this happens at the lexing or parsing stage doesn't matter much; it's still expensive when strcmp is called a bunch of times for every keyword token.
Lexers don't do strcmps; they use a finite state machine to determine tokens. For instance, if we have the tokens ALTER, AND, ANY, SELECT, and FROM, that compiles into a state machine like this:
switch (str[i++]) {
case 'A':
    switch (str[i++]) {
    case 'L': ...
    case 'N':
        switch (str[i++]) {
        case 'D':
            /* found token AND! */
        case 'Y':
            /* found token ANY! */
        default:
            /* syntax error! */
        }
    ...
    }
case 'S': ...
case 'F': ...
...
}
Lexers already do this, and at each place where a token is found, the lexer knows which token it was from the DFA state it's in.
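(A rough sketch of that idea with a table-driven DFA, using made-up state names: each accepting state maps directly to its token, so the lexeme never has to be compared against keyword strings again.)

enum token { TOK_NONE, TOK_ALTER, TOK_AND, TOK_ANY, TOK_SELECT, TOK_FROM };
enum state { S_START, S_A, S_AN, S_AND, S_ANY, /* ... */ NUM_STATES };

/* Which token, if any, each DFA state accepts: ending the scan in S_AND
 * means the input spelled "AND", so accept_token[state] is the whole lookup. */
static const enum token accept_token[NUM_STATES] = {
    [S_AND] = TOK_AND,
    [S_ANY] = TOK_ANY,
    /* non-accepting states default to TOK_NONE (0) */
};

When the scan stops, accept_token[current_state] gives the keyword, or TOK_NONE for anything else.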
If you have a lot of states (PostgreSQL has > 400 keywords to identify), the assembly encoding the switch will be several KB in size, so you'll get cache misses.
Perfect hashing trades more computation for fewer cache misses.
[edit] Also, you shouldn't forget that a sparse switch...case isn't O(1).
That's the advantage of the state machine, but in such a table the keyword lengths are already known at compile time, and therefore the final check can be optimized to a word-wise aligned memcmp, which beats the branchy lexer code by miles.
I don't understand what you mean, can you elaborate? Are you talking about perfect hashing? Or are you talking about doing multiple strcmps? Or about doing the state machine until you reach a unique alternative, and match the remainder using strcmp?
What I was trying to say is that the lexer is already doing a state machine, and the keyword can be determined from the state of the state machine, rather than re-determining it from the token substring itself.
A perfect hash is one in which there are no collisions for a given data set, which guarantees O(1), or constant, lookup time. This patch introduces a Perl script which computes a perfect hash function at build time given the current set of SQL keywords.
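(To make that concrete, here's a hand-rolled miniature in C, not the actual generated PostgreSQL code: a toy hash that happens to be perfect for four keywords, plus the single comparison that confirms the candidate really is that keyword rather than an ordinary identifier.)

#include <string.h>

/* Toy perfect hash: (strlen(s) + s[1]) % 4 gives INSERT, AS, FROM and
 * SELECT four distinct slots; a generator finds such a function for the
 * real keyword list automatically. */
enum token { TOK_NONE = -1, TOK_INSERT, TOK_AS, TOK_FROM, TOK_SELECT };

static const char *const kw_string[4] = { "INSERT", "AS", "FROM", "SELECT" };
static const enum token  kw_token[4]  = { TOK_INSERT, TOK_AS, TOK_FROM, TOK_SELECT };

static unsigned kw_hash(const char *s, size_t len)
{
    return (unsigned) (len + (unsigned char) s[1]) % 4;
}

/* Returns the keyword's token, or TOK_NONE for an ordinary identifier. */
static enum token keyword_lookup(const char *s)
{
    size_t len = strlen(s);

    if (len < 2)                     /* the toy hash reads s[1] */
        return TOK_NONE;
    unsigned h = kw_hash(s, len);
    return strcmp(s, kw_string[h]) == 0 ? kw_token[h] : TOK_NONE;
}

An identifier like "foo" still hashes to some slot, but it fails that one comparison and comes back as TOK_NONE.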
The old keyword lookup method uses binary search, which compares at most O(log n) entries before finding a match.
Making 1 keyword comparison instead of log n keyword comparisons yields the 20% speedup.
> Making 1 keyword comparison instead of log n keyword comparisons yields the 20% speedup.
Worthwhile to note that that speedup is for the raw parsing, without parse-analysis (i.e. looking up the tables referenced and such), planning (figuring out how to execute the query) and execution. So end-user impact will be much smaller than 20%. But for some cases, e.g. when many columns are referenced, it'll be measurable.
Using a perfect hash table is definitely cool, but it's hard to imagine a query with so many keywords that this would have a measurable impact on total query execution time.
Due to the desire to have keywords that aren't fully reserved (we have keyword classes that are fully reserved, can appear as type names, can appear as column names, or can appear nearly everywhere), we/PG need to look up whether a specific string is a keyword or an identifier in a number of places in the scanner/parser.
Addendum:
And that indeed means that column references, unless quoted with "", used to go through the binary search and now go through the perfect hash.