OK, these are the basics. But don't stop reading here if you want to write a parser; there are more modern tools to look at (e.g., ANTLR).
Warning 1: parsing Unicode streams well is awkward with flex -- it's from an age when ASCII ruled. Handling multiple input encodings may get weird. If it's only UTF-8, it can work, because that's essentially bytes. But I find a hand-written scanner more convenient (the grammar is seldom too complex for that). And regexps based on General_Category or ID_Start etc.? Difficult...
Warning 2: for various reasons, usually flexibility, conflict resolution, error reporting, and/or error recovery, many projects move from bison to something else, even a hand-written recursive descent parser. It's longer, but not that difficult.
- You don't have UTF-8 everywhere in a language. You might have it only in comments and string literals, in which case you can just be 8-bit clean and not bother.
- If you have UTF-8 in identifiers, you can recognize it with a few easy byte-range patterns; see the sketch after this list.
- The tokenizer doesn't have to fully validate the UTF-8; it can defer that to other routines. That is, it only has to recognize a potentially valid UTF-8 sequence and capture it in yytext[]; then a proper UTF-8 recognizer can be invoked on yytext to do things like reject overlong forms. You can further sanitize it to reject characters that you don't want in an identifier, like various kinds of Unicode spaces, or whatever your rules are.
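Something along these lines (a rough flex sketch; utf8_validate and the IDENT token are placeholders for whatever your validator and grammar actually provide):

    %{
    #include <stdio.h>
    /* Placeholder: rejects overlong forms, stray continuation bytes,
       characters you don't allow in identifiers, etc. */
    int utf8_validate(const char *s, int len);
    #define IDENT 258   /* normally comes from bison's generated header */
    %}

    %option noyywrap

    /* Loose byte-range patterns: a plausible lead byte plus continuation
       bytes. These over-accept on purpose; real validation comes later. */
    UTF8CONT  [\x80-\xBF]
    UTF8SEQ   [\xC2-\xDF]{UTF8CONT}|[\xE0-\xEF]{UTF8CONT}{2}|[\xF0-\xF4]{UTF8CONT}{3}
    IDSTART   [A-Za-z_]|{UTF8SEQ}
    IDCONT    [A-Za-z0-9_]|{UTF8SEQ}

    %%

    {IDSTART}{IDCONT}*   {
                           if (!utf8_validate(yytext, yyleng))
                               fprintf(stderr, "invalid UTF-8 in identifier\n");
                           return IDENT;
                         }

    %%

The patterns deliberately let overlong forms, surrogates and the like through; the deferred validator, not the lexer, is where those get rejected.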
Multiple encodings though? Hmph. You can write a custom YY_INPUT macro in Lex/Flex in which you convert any input encoding, normalizing it into UTF-8. Or read through a stream-filtering abstraction which accepts various encodings on its input end and pumps out UTF-8.
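Roughly like this (YY_INPUT and YY_NULL are flex's; read_and_transcode is a made-up helper standing in for your conversion layer):

    %{
    #include <stddef.h>

    /* Hypothetical helper: reads raw bytes from the real input in
       whatever encoding was detected or declared, converts them, writes
       up to 'max' bytes of UTF-8 into 'buf', and returns the byte count
       (0 at end of input). */
    size_t read_and_transcode(char *buf, size_t max);

    /* Feed the scanner UTF-8 regardless of the external encoding. */
    #define YY_INPUT(buf, result, max_size)                    \
        do {                                                   \
            size_t n = read_and_transcode((buf), (max_size));  \
            (result) = (n == 0) ? YY_NULL : (int) n;           \
        } while (0)
    %}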
In TXR Lisp, when you do (read "(1 2 3)") what happens is that the string is made of wide characters: code points. But the parser takes UTF-8. So, a UTF-8 stream is created which scans the wide string as UTF-8.
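Conceptually it boils down to something like this (an illustrative sketch, not TXR's actual code; it assumes wchar_t is wide enough to hold a full code point):

    #include <wchar.h>

    /* A tiny "stream" that walks a wide string and hands out UTF-8 bytes
       one at a time, so the lexer's input machinery can consume the
       string as a byte stream. Initialize with pos = len = 0. */
    typedef struct {
        const wchar_t *str;     /* remaining code points */
        unsigned char buf[4];   /* UTF-8 bytes of the current code point */
        int pos, len;           /* read position / byte count in buf */
    } wstr_utf8_stream;

    static int encode_utf8(unsigned long ch, unsigned char *out)
    {
        if (ch < 0x80) {
            out[0] = (unsigned char) ch;
            return 1;
        } else if (ch < 0x800) {
            out[0] = (unsigned char) (0xC0 | (ch >> 6));
            out[1] = (unsigned char) (0x80 | (ch & 0x3F));
            return 2;
        } else if (ch < 0x10000) {
            out[0] = (unsigned char) (0xE0 | (ch >> 12));
            out[1] = (unsigned char) (0x80 | ((ch >> 6) & 0x3F));
            out[2] = (unsigned char) (0x80 | (ch & 0x3F));
            return 3;
        } else {
            out[0] = (unsigned char) (0xF0 | (ch >> 18));
            out[1] = (unsigned char) (0x80 | ((ch >> 12) & 0x3F));
            out[2] = (unsigned char) (0x80 | ((ch >> 6) & 0x3F));
            out[3] = (unsigned char) (0x80 | (ch & 0x3F));
            return 4;
        }
    }

    /* Next UTF-8 byte of the stream, or -1 at the end of the string. */
    static int wstr_utf8_getbyte(wstr_utf8_stream *s)
    {
        if (s->pos == s->len) {
            if (*s->str == L'\0')
                return -1;
            s->len = encode_utf8((unsigned long) *s->str++, s->buf);
            s->pos = 0;
        }
        return s->buf[s->pos++];
    }

The scanner then pulls bytes from something like wstr_utf8_getbyte() just as it would from a file.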