Adventures in the land of substrings and RegExps

evmar · on Dec 3, 2016

I worked on a project where lexing speed was of concern. Because it was C++ I didn't have the easy option of reaching for regexes so I instead wrote a hand-rolled lexer as suggested in the post.

But later I found a tool, re2c, where you write regular expressions in your source code and it expands them out to a fast parser using every last trick (gotos and lookup tables, see an example of the generated code here[1]). The result had the delightful and rare property where the final code was both higher-level (regexes are a succinct way of expressing lexing) and faster than the lower-level hand-rolled code.

https://github.com/ninja-build/ninja/blob/3082aa69b7be2a2a06...

mraleph · on Dec 4, 2016

I guess I should have been more clear in my advice about not using regular expressions: if you have a tool that allows glueing multiple regular expressions together and erasing the cost of the abstraction (e.g. no allocation of temporary match result objects, substrings, etc) - then certainly that would be a preferred way to code the lexer. Just don't sprinkle a bunch of unconnected `new RegExp()` / `re.exec` around the code and expect that to be reach peak lexing speed.

aisofteng · on Dec 3, 2016

I don't understand how anyone could implement an O(n^2) substring implementation and not think, "this seems really wrong".

taeric · on Dec 3, 2016

I'm not sure I understand your post. There are tradeoffs to different ways of implementing substring. They were fairly well covered in this page.

Are there points you disagree with?

magicalist · on Dec 3, 2016

The substring implementation is O(n), it's the loop popping off a letter at a time that ends up O(n^2).

mrkgnao · on Dec 3, 2016

Now I'm really interested in the SpiderMonkey/V8 thing.

Also,

> Solution’s simplicity and elegance might provide a welcomed retreat from fighting Webpack configs.

I chuckled.

jandem · on Dec 4, 2016

> Now I'm really interested in the SpiderMonkey/V8 thing.

I think SpiderMonkey is 60x faster than V8 there because of an optimization we have to avoid quadratic behavior while flattening ropes, it's described here: http://searchfox.org/mozilla-central/rev/0f92c398ce929a3f3d9...

Basically JS engines do string concatenation lazily: they create a rope, and then the rope is flattened when needed [0]. When appending/concatenating one character at a time and flattening after each concatenation, we avoid quadratic behavior (copying the same chars over and over again while flattening) by overallocating so we can simply append while flattening. The source code comment has the full details.

[0] https://en.wikipedia.org/wiki/Rope_(data_structure)

imagist · on Dec 3, 2016

Yeah, I did too.

I just don't understand the obsession with configuration oriented programming these days. Sure, it looks nice for common paths, but as soon as you're doing anything they don't have an option for, you have to write code just like before. But now the choice toy have to write is some awful hack bolted on to the side of your configuration. It's a mess.

draw_down · on Dec 3, 2016

I laughed at this part:

> However when I did necessary changes to the less_dart code I discovered that it actually became several times slower. Hmm.

It's almost like those kuh-razy V8 people knew what they were doing and wrote all that code for a reason.

__derek__ · on Dec 3, 2016

The author is a compiler engineer on the V8 team.

mraleph · on Dec 4, 2016

> It's almost like those kuh-razy V8 people

I don't get what's the connection between V8, less_dart and naive implementation that `RegExp.matchAsPrefix` had in Dart VM.

OneOneOneOne · on Dec 3, 2016

I can recommend Mastering Regular Expressions by Jeffrey Friedl. He goes into great detail on regexp performance and the reason various expressions are slow.

I picked up an older revision from abebooks or ebay. The Python 2.7 online docs recommend revision 1 and states later revs don't cover Python.