I worked on a project where lexing speed was a concern. Because it was C++, I didn't have the easy option of reaching for regexes, so I instead wrote a hand-rolled lexer as suggested in the post.
But later I found a tool, re2c, where you write regular expressions in your source code and it expands them out to a fast parser using every last trick (gotos and lookup tables, see an example of the generated code here[1]). The result had the delightful and rare property where the final code was both higher-level (regexes are a succinct way of expressing lexing) and faster than the lower-level hand-rolled code.
I guess I should have been clearer in my advice about not using regular expressions: if you have a tool that allows gluing multiple regular expressions together and erasing the cost of the abstraction (e.g. no allocation of temporary match result objects, substrings, etc) - then certainly that would be a preferred way to code the lexer. Just don't sprinkle a bunch of unconnected `new RegExp()` / `re.exec` around the code and expect that to reach peak lexing speed.
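A minimal sketch of the "glue the regexes together" idea, in Python: compile all token patterns into a single alternation with named groups, then walk the input with one scanner, so there is one compiled machine and no stray per-token regex objects. The token names and patterns here are illustrative, not from the original post.

```python
import re

# Illustrative token patterns; a real lexer would have many more.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),          # whitespace, discarded below
]

# One combined pattern: each sub-regex becomes a named alternative.
LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    # finditer reuses one compiled automaton; m.lastgroup tells us
    # which alternative matched, so no per-pattern re-scanning occurs.
    for m in LEXER.finditer(text):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()
```

For example, `list(tokenize("x = 41 + 1"))` yields the ident, operator, and number tokens in order. (A production lexer would also detect input that matches no pattern; `finditer` silently skips it here.)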
Basically JS engines do string concatenation lazily: they create a rope, and then the rope is flattened when needed [0]. When appending/concatenating one character at a time and flattening after each concatenation, we avoid quadratic behavior (copying the same chars over and over again while flattening) by overallocating so we can simply append while flattening. The source code comment has the full details.
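To illustrate the overallocation trick described above, here is a toy Python sketch (not the actual engine code): a flat buffer that grows geometrically, so appending one character at a time copies existing characters only O(log n) times in total rather than on every append. The class and method names are made up for this example.

```python
class OverallocatingString:
    """Toy flat string with spare capacity, mimicking a flattened rope
    that was overallocated so later appends can reuse the slack."""

    def __init__(self):
        self.buf = []        # backing storage, used as a char buffer
        self.length = 0      # logical length (chars actually stored)
        self.capacity = 0    # allocated capacity (len of self.buf)

    def append_char(self, ch):
        if self.length == self.capacity:
            # Grow geometrically: total copying over n appends is O(n)
            # amortized, instead of O(n^2) when reallocating every time.
            self.capacity = max(8, self.capacity * 2)
            new_buf = self.buf[: self.length]            # copy chars once
            new_buf.extend([None] * (self.capacity - len(new_buf)))
            self.buf = new_buf
        self.buf[self.length] = ch
        self.length += 1

    def value(self):
        return "".join(self.buf[: self.length])
```

The same shape appears in dynamic arrays everywhere (std::string, std::vector); the engine-specific twist is applying it to the buffer produced by rope flattening.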
I just don't understand the obsession with configuration-oriented programming these days. Sure, it looks nice for common paths, but as soon as you're doing anything they don't have an option for, you have to write code just like before. But now the code you have to write is some awful hack bolted on to the side of your configuration. It's a mess.
I can recommend Mastering Regular Expressions by Jeffrey Friedl. He goes into great detail on regexp performance and the reason various expressions are slow.
I picked up an older revision from AbeBooks or eBay. The Python 2.7 online docs recommend revision 1 and state that later revisions don't cover Python.
https://github.com/ninja-build/ninja/blob/3082aa69b7be2a2a06...