The lexer knows it has a token, but it still has to match the token identifier t...

jules · on Jan 11, 2019

Lexers don't do strcmps, they use a finite state machine to determine tokens. For instance if we have the tokens ALTER, AND, ANY, SELECT, FROM it will compile into a state machine like this:

  switch(str[i++]){
    case 'A': 
      switch(str[i++]){
        case 'L': ...
        case 'N': 
          switch(str[i++]){
            case 'D': 
              found token!
            case 'Y:
              found token!
            default:
              syntax error!
          }
       ...
      }
    case 'S': ...
    case 'F': ...
    ...
  }

Lexers already do this, and in each place where a token is found, it can know which token it was based on the DFA state that it's in.

Jyaif · on Jan 11, 2019

If you have a lot of states (postgresql has > 400 keywords to identify), the assembly encoding the switch will be several kb big, so you'll have cache misses.

Perfect hashing trades more computing for less cache misses.

[edit] Also, you shouldn't forget that sparse switch...case aren't O(1).

rurban · on Jan 11, 2019

That's the advantage of the state machine, but in such a table the lengths are already known at compile-time, and therefore the final check will be optimized to use a word wise aligned memcmp, which beats the branchy lexer code by miles.

jules · on Jan 11, 2019

I don't understand what you mean, can you elaborate? Are you talking about perfect hashing? Or are you talking about doing multiple strcmps? Or about doing the state machine until you reach a unique alternative, and match the remainder using strcmp?

What I was trying to say is that the lexer is already doing a state machine, and the keyword can be determined from the state of the state machine, rather than re-determining it from the token substring itself.