// keywords is a sorted array of keyword strings; tokens holds the
// corresponding token values at the same indices.
let token_raw = "FROM";
let index = keywords.binary_search(&token_raw).unwrap();
let token = tokens[index];
As a binary search, that's O(log N) string comparisons. In the worst case it might compare against INSERT, then AS, then FROM before deciding the index is 1, and with more keywords you get more comparisons.
With perfect hashing, you devise a single hash function that hashes all possible values uniquely. Now we have something like this:
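(A minimal sketch in C of what that table can look like, with a made-up hash() and made-up values, purely for illustration:)

/* Hypothetical: every keyword's hash value is unique and known at
 * compile time, so it can index a dense table directly. */
enum { TOKEN_INSERT, TOKEN_AS, TOKEN_FROM, TOKEN_SELECT };

static const char *const keywords[4] = {
    "INSERT",   /* hash("INSERT") == 0 */
    "AS",       /* hash("AS")     == 1 */
    "FROM",     /* hash("FROM")   == 2 */
    "SELECT",   /* hash("SELECT") == 3 */
};

static const int tokens[4] = { TOKEN_INSERT, TOKEN_AS, TOKEN_FROM, TOKEN_SELECT };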
… with all the numbers calculated by hashing the strings at compile time. Then, instead of O(log N) string comparisons, you only need one hash and one hash table lookup:
let token_raw = "FROM";
let token = tokens[hash(token_raw)];
The lexer knows it has a token, but it still has to match the token identifier to the corresponding keyword. Whether this happens at the lexing or parsing stage doesn't matter much; it's still expensive when strcmp is called a bunch of times for every keyword token.
Lexers don't do strcmps; they use a finite state machine to determine tokens. For instance, if we have the tokens ALTER, AND, ANY, SELECT, and FROM, that compiles into a state machine like this:
switch (str[i++]) {
case 'A':
    switch (str[i++]) {
    case 'L': ...
    case 'N':
        switch (str[i++]) {
        case 'D':
            /* found token AND! */
        case 'Y':
            /* found token ANY! */
        default:
            /* syntax error! */
        }
    ...
    }
case 'S': ...
case 'F': ...
...
}
Lexers already do this, and at each place where a token is found, the lexer knows which token it was from the DFA state it's in.
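(A rough sketch of that idea with a table-driven DFA, using made-up state names: each accepting state maps directly to its token, so the lexeme never has to be compared against keyword strings again.)

enum token { TOK_NONE, TOK_ALTER, TOK_AND, TOK_ANY, TOK_SELECT, TOK_FROM };
enum state { S_START, S_A, S_AN, S_AND, S_ANY, /* ... */ NUM_STATES };

/* Which token, if any, each DFA state accepts: ending the scan in S_AND
 * means the input spelled "AND", so accept_token[state] is the whole lookup. */
static const enum token accept_token[NUM_STATES] = {
    [S_AND] = TOK_AND,
    [S_ANY] = TOK_ANY,
    /* non-accepting states default to TOK_NONE (0) */
};

When the scan stops, accept_token[current_state] gives the keyword, or TOK_NONE for anything else.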
If you have a lot of states (PostgreSQL has > 400 keywords to identify), the assembly encoding the switch will be several KB in size, so you'll get cache misses.
Perfect hashing trades more computation for fewer cache misses.
[edit] Also, you shouldn't forget that a sparse switch...case isn't O(1).
That's the advantage of the state machine, but in such a table the keyword lengths are already known at compile time, and therefore the final check can be optimized to a word-wise aligned memcmp, which beats the branchy lexer code by miles.
I don't understand what you mean, can you elaborate? Are you talking about perfect hashing? Or are you talking about doing multiple strcmps? Or about doing the state machine until you reach a unique alternative, and match the remainder using strcmp?
What I was trying to say is that the lexer is already doing a state machine, and the keyword can be determined from the state of the state machine, rather than re-determining it from the token substring itself.
A perfect hash is one in which there are no collisions for a given data set, which guarantees O(1), or constant, lookup time. This patch introduces a Perl script which computes a perfect hash function at build time given the current set of SQL keywords.
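(To make that concrete, here's a hand-rolled miniature in C, not the actual generated PostgreSQL code: a toy hash that happens to be perfect for four keywords, plus the single comparison that confirms the candidate really is that keyword rather than an ordinary identifier.)

#include <string.h>

/* Toy perfect hash: (strlen(s) + s[1]) % 4 gives INSERT, AS, FROM and
 * SELECT four distinct slots; a generator finds such a function for the
 * real keyword list automatically. */
enum token { TOK_NONE = -1, TOK_INSERT, TOK_AS, TOK_FROM, TOK_SELECT };

static const char *const kw_string[4] = { "INSERT", "AS", "FROM", "SELECT" };
static const enum token  kw_token[4]  = { TOK_INSERT, TOK_AS, TOK_FROM, TOK_SELECT };

static unsigned kw_hash(const char *s, size_t len)
{
    return (unsigned) (len + (unsigned char) s[1]) % 4;
}

/* Returns the keyword's token, or TOK_NONE for an ordinary identifier. */
static enum token keyword_lookup(const char *s)
{
    size_t len = strlen(s);

    if (len < 2)                     /* the toy hash reads s[1] */
        return TOK_NONE;
    unsigned h = kw_hash(s, len);
    return strcmp(s, kw_string[h]) == 0 ? kw_token[h] : TOK_NONE;
}

An identifier like "foo" still hashes to some slot, but it fails that one comparison and comes back as TOK_NONE.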
The old keyword lookup method uses binary search, which compares at most O(log n) entries before finding a match.
Making 1 keyword comparison instead of log n keyword comparisons yields the 20% speedup.
> Making 1 keyword comparison instead of log n keyword comparisons yields the 20% speedup.
Worthwhile to note that that speedup is for the raw parsing, without parse-analysis (i.e. looking up the tables referenced and such), planning (figuring out how to execute the query) and execution. So end-user impact will be much smaller than 20%. But for some cases, e.g. when many columns are referenced, it'll be measurable.
Using a perfect hash table is definitely cool, but it's hard to imagine a query with so many keywords that this would have a measurable impact on total query execution time.
Due to the desire to have keywords that aren't fully reserved (we have keyword classes that are fully reserved, can appear as type names, can appear as column names, or can appear nearly everywhere), we/PG need to look up whether a specific string is a keyword or an identifier in a number of places in the scanner/parser.
Addendum:
And that indeed means that column references, unless quoted with "", used to go through the binary search and now go through the perfect hash.