OK, these are the basics. But don't stop reading here if you want to write a parser; there are more modern tools to look at (e.g., ANTLR).
Warning 1: parsing Unicode streams well is awkward with flex -- it's from an age when ASCII ruled. Handling multiple input encodings may get weird. If it's only UTF-8, it can work, because that's essentially bytes. But I find a hand-written scanner more convenient (the grammar is seldom too complex for that). And regexps based on General_Category or ID_Start etc.? Difficult...
Warning 2: for various reasons, usually flexibility, conflict resolution, error reporting, and/or error recovery, many projects move from bison to something else, even a hand-written recursive descent parser. It's longer, but not that difficult.
- You don't have UTF-8 everywhere in a language. You might have it only in comments and string literals, in which case you can just be 8-bit clean and not bother.
- If you have UTF-8 in identifiers, you can recognize it with a few easy byte-range patterns; see the sketch after this list.
- The tokenizer doesn't have to fully validate the UTF-8; it can defer that to other routines. That is, it only has to recognize a potentially valid UTF-8 sequence and capture it in yytext[]; then a proper UTF-8 recognizer can be invoked on yytext to do things like reject overlong forms. You can further sanitize it to reject characters that you don't want in an identifier, like various kinds of Unicode spaces, or whatever your rules are.
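Something along these lines (a rough flex sketch; utf8_validate and the IDENT token are placeholders for whatever your validator and grammar actually provide):

    %{
    #include <stdio.h>
    /* Placeholder: rejects overlong forms, stray continuation bytes,
       characters you don't allow in identifiers, etc. */
    int utf8_validate(const char *s, int len);
    #define IDENT 258   /* normally comes from bison's generated header */
    %}

    %option noyywrap

    /* Loose byte-range patterns: a plausible lead byte plus continuation
       bytes. These over-accept on purpose; real validation comes later. */
    UTF8CONT  [\x80-\xBF]
    UTF8SEQ   [\xC2-\xDF]{UTF8CONT}|[\xE0-\xEF]{UTF8CONT}{2}|[\xF0-\xF4]{UTF8CONT}{3}
    IDSTART   [A-Za-z_]|{UTF8SEQ}
    IDCONT    [A-Za-z0-9_]|{UTF8SEQ}

    %%

    {IDSTART}{IDCONT}*   {
                           if (!utf8_validate(yytext, yyleng))
                               fprintf(stderr, "invalid UTF-8 in identifier\n");
                           return IDENT;
                         }

    %%

The patterns deliberately let overlong forms, surrogates and the like through; the deferred validator, not the lexer, is where those get rejected.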
Multiple encodings though? Hmph. You can write a custom YY_INPUT macro in Lex/Flex in which you convert any input encoding, normalizing it into UTF-8. Or read through a stream-filtering abstraction which accepts various encodings on its input end and pumps out UTF-8.
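Roughly like this (YY_INPUT and YY_NULL are flex's; read_and_transcode is a made-up helper standing in for your conversion layer):

    %{
    #include <stddef.h>

    /* Hypothetical helper: reads raw bytes from the real input in
       whatever encoding was detected or declared, converts them, writes
       up to 'max' bytes of UTF-8 into 'buf', and returns the byte count
       (0 at end of input). */
    size_t read_and_transcode(char *buf, size_t max);

    /* Feed the scanner UTF-8 regardless of the external encoding. */
    #define YY_INPUT(buf, result, max_size)                    \
        do {                                                   \
            size_t n = read_and_transcode((buf), (max_size));  \
            (result) = (n == 0) ? YY_NULL : (int) n;           \
        } while (0)
    %}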
In TXR Lisp, when you do (read "(1 2 3)") what happens is that the string is made of wide characters: code points. But the parser takes UTF-8. So, a UTF-8 stream is created which scans the wide string as UTF-8.
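Conceptually it boils down to something like this (an illustrative sketch, not TXR's actual code; it assumes wchar_t is wide enough to hold a full code point):

    #include <wchar.h>

    /* A tiny "stream" that walks a wide string and hands out UTF-8 bytes
       one at a time, so the lexer's input machinery can consume the
       string as a byte stream. Initialize with pos = len = 0. */
    typedef struct {
        const wchar_t *str;     /* remaining code points */
        unsigned char buf[4];   /* UTF-8 bytes of the current code point */
        int pos, len;           /* read position / byte count in buf */
    } wstr_utf8_stream;

    static int encode_utf8(unsigned long ch, unsigned char *out)
    {
        if (ch < 0x80) {
            out[0] = (unsigned char) ch;
            return 1;
        } else if (ch < 0x800) {
            out[0] = (unsigned char) (0xC0 | (ch >> 6));
            out[1] = (unsigned char) (0x80 | (ch & 0x3F));
            return 2;
        } else if (ch < 0x10000) {
            out[0] = (unsigned char) (0xE0 | (ch >> 12));
            out[1] = (unsigned char) (0x80 | ((ch >> 6) & 0x3F));
            out[2] = (unsigned char) (0x80 | (ch & 0x3F));
            return 3;
        } else {
            out[0] = (unsigned char) (0xF0 | (ch >> 18));
            out[1] = (unsigned char) (0x80 | ((ch >> 12) & 0x3F));
            out[2] = (unsigned char) (0x80 | ((ch >> 6) & 0x3F));
            out[3] = (unsigned char) (0x80 | (ch & 0x3F));
            return 4;
        }
    }

    /* Next UTF-8 byte of the stream, or -1 at the end of the string. */
    static int wstr_utf8_getbyte(wstr_utf8_stream *s)
    {
        if (s->pos == s->len) {
            if (*s->str == L'\0')
                return -1;
            s->len = encode_utf8((unsigned long) *s->str++, s->buf);
            s->pos = 0;
        }
        return s->buf[s->pos++];
    }

The scanner then pulls bytes from something like wstr_utf8_getbyte() just as it would from a file.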