Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The lexer knows it has a token, but it still has to match the token identifier to the corresponding keyword. Whether this happens at the lexing or parsing stage doesn't matter much, it's still expensive when strcmp is called a bunch of times for every keyword token.


Lexers don't do strcmps, they use a finite state machine to determine tokens. For instance if we have the tokens ALTER, AND, ANY, SELECT, FROM it will compile into a state machine like this:

  switch(str[i++]){
    case 'A': 
      switch(str[i++]){
        case 'L': ...
        case 'N': 
          switch(str[i++]){
            case 'D': 
              found token!
            case 'Y:
              found token!
            default:
              syntax error!
          }
       ...
      }
    case 'S': ...
    case 'F': ...
    ...
  }
Lexers already do this, and in each place where a token is found, it can know which token it was based on the DFA state that it's in.


If you have a lot of states (postgresql has > 400 keywords to identify), the assembly encoding the switch will be several kb big, so you'll have cache misses.

Perfect hashing trades more computing for less cache misses.

[edit] Also, you shouldn't forget that sparse switch...case aren't O(1).


That's the advantage of the state machine, but in such a table the lengths are already known at compile-time, and therefore the final check will be optimized to use a word wise aligned memcmp, which beats the branchy lexer code by miles.


I don't understand what you mean, can you elaborate? Are you talking about perfect hashing? Or are you talking about doing multiple strcmps? Or about doing the state machine until you reach a unique alternative, and match the remainder using strcmp?

What I was trying to say is that the lexer is already doing a state machine, and the keyword can be determined from the state of the state machine, rather than re-determining it from the token substring itself.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: