> To be fair, the regex crate can’t match canonically equivalent Unicode strings...

hayley-patton · on Aug 13, 2022

> And yes, derivatives build full DFAs. Those are a no-go in a general purpose regex engine.

Lazily compute derivatives? From memory Rust regex computes and caches a DFA lazily from the NFA, for comparison.

burntsushi · on Aug 13, 2022

Feel free to show me an implementation that does it with an alphabet of bytes, not codepoints, while maintaining support for Unicode classes.

It doesn't strike me as something amenable to lazy compilation. But maybe I'm wrong.
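For readers unfamiliar with the idea being debated, here is a minimal sketch of lazily computed derivatives: each state is a regex, and a transition is only computed (and cached) when the haystack actually needs it. It steps over bytes, but it only supports literal bytes, concatenation, alternation and star, so it does not address the Unicode class problem raised above. The tuple encoding and all function names are made up for this illustration and say nothing about how any real engine works.

```python
from functools import lru_cache

# Regexes are hashable tuples (a hypothetical encoding for this sketch).
EMPTY = ("empty",)  # matches nothing
EPS = ("eps",)      # matches only the empty string

def lit(b):
    return ("lit", b)

def cat(r, s):
    # Smart constructor: simplify so equivalent states compare equal.
    if r == EMPTY or s == EMPTY:
        return EMPTY
    if r == EPS:
        return s
    if s == EPS:
        return r
    return ("cat", r, s)

def alt(r, s):
    if r == EMPTY:
        return s
    if s == EMPTY or r == s:
        return r
    return ("alt", r, s)

def star(r):
    if r in (EMPTY, EPS):
        return EPS
    return r if r[0] == "star" else ("star", r)

def nullable(r):
    """True if r matches the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "lit"):
        return False
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    return nullable(r[1]) or nullable(r[2])  # alt

@lru_cache(maxsize=None)  # the cache plays the role of a lazily built DFA
def deriv(r, b):
    """Brzozowski derivative of r with respect to the byte b."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "lit":
        return EPS if r[1] == b else EMPTY
    if tag == "alt":
        return alt(deriv(r[1], b), deriv(r[2], b))
    if tag == "star":
        return cat(deriv(r[1], b), r)
    left = cat(deriv(r[1], b), r[2])  # cat
    return alt(left, deriv(r[2], b)) if nullable(r[1]) else left

def matches(r, haystack: bytes) -> bool:
    for b in haystack:  # iterating bytes yields ints 0..255
        r = deriv(r, b)
        if r == EMPTY:
            return False
    return nullable(r)

# (ab)*c over a byte alphabet
AB_STAR_C = cat(star(cat(lit(ord("a")), lit(ord("b")))), lit(ord("c")))
```

Calling `matches(AB_STAR_C, b"ababc")` only ever computes the transitions that this particular haystack visits; bytes that never occur are never derived.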

mananaysiempre · on Aug 12, 2022

No, I don’t think anybody can do it.

Unless I can’t read, ICU4C does not even do streaming normalization, and the buffer-at-a-time one runs at around 100-200 MB/s or so[1], which looks ... mildly amusing next to your engine :) My own attempt at streaming normalization is currently about two times slower, because it has deeper (and smaller) tables and is just generally dumb, but then ICU4C is not particularly smart either. I expect that normalization at 2x ICU speed or more is possible, but that still won’t make normalize-then-match competitive with no normalization. And encoding all canonical equivalents into the DFA sounds downright hellish (also, strictly speaking, impossible, given that normalization requires unbounded buffer space, but that part is probably solvable).
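To make the "unbounded buffer space" remark concrete, here is a small sketch using Python's standard `unicodedata` module (not ICU4C). A combining mark with a lower canonical combining class has to be moved ahead of every higher-class mark before it, so a normalizer cannot emit anything after the base letter until the whole run of marks has been seen. The helper name is made up for this example.

```python
import unicodedata

ACUTE = "\u0301"    # COMBINING ACUTE ACCENT, combining class 230
CEDILLA = "\u0327"  # COMBINING CEDILLA, combining class 202

def nfd_with_trailing_cedilla(n: int) -> str:
    """NFD of 'a' followed by n acutes and then a single cedilla."""
    return unicodedata.normalize("NFD", "a" + ACUTE * n + CEDILLA)

out = nfd_with_trailing_cedilla(1000)
# Canonical ordering sorts the cedilla (202) ahead of all the acutes (230),
# so the last input code point ends up second in the output.
print(out[1] == CEDILLA, len(out))
```

The same holds for any `n`, which is why a strictly streaming normalizer needs either an unbounded buffer or a cap on how long a run of combining marks it is willing to reorder.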

UTS #18 level 2 (which I’m aware you explicitly did not attempt) kind of says some things about canonical equivalence and then goes “nah, normalization is difficult, let’s go match grapheme clusters, also maybe just prenormalize to NFD”[2]. Which just looks weird, but apparently it means they did try to impose a requirement then gave up more than a decade ago[3]. Does anything even fully implement level 2, I wonder?

So, meh, but not really directed at you—regex is supernaturally fast as far as I’m concerned. Perhaps I’m annoyed at Unicode normalization for being (not really complicated but) so punishingly difficult to do quickly it isn’t really clear how you could wire it into a regex matcher.

There is also the question of how normalization-invariant regexes should even work: should “<C-caron>.” match “<C-cedilla><caron>”? UTS #18 (with invariance requirement restored) seems to imply yes, but it’s hard to imagine when anyone would actually want that. (Note that both pattern and haystack are normalized—the corresponding regex for NFD input would be something like “(C<caron>[[:ccc=0:][:ccc>=230:]]|C[[:ccc>0:]&[:ccc<230:]]<caron>)”, but nothing implements that syntax.) So while UTS #18 gives a possible interpretation of how Unicode regexes should work, I’m not entirely convinced it’s the right one, or even right enough to go on the crusade of implementing a normalization-invariant regex engine.
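The example in the paragraph above can be reproduced with Python's standard `re` and `unicodedata` modules, as a rough sketch. The two haystacks are canonically equivalent, yet the plain pattern matches only one of them, and normalizing both pattern and haystack to NFD matches neither. Normalizing the pattern as a plain string is an assumption that only works here because the pattern has no metacharacters affected by normalization; variable names are made up for illustration.

```python
import re
import unicodedata

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

PATTERN = "\u010C."              # <C-caron> followed by any one character
CARON_FIRST = "\u010C\u0327"     # <C-caron><cedilla>
CEDILLA_FIRST = "\u00C7\u030C"   # <C-cedilla><caron>

# Both haystacks decompose to C, cedilla (class 202), caron (class 230).
equivalent = nfd(CARON_FIRST) == nfd(CEDILLA_FIRST)

# Without normalization, the result depends on which spelling was used.
raw_caron_first = re.fullmatch(PATTERN, CARON_FIRST) is not None
raw_cedilla_first = re.fullmatch(PATTERN, CEDILLA_FIRST) is not None

# With both sides in NFD, the pattern is C, caron, "." and the haystack is
# C, cedilla, caron, so the literal caron lands on the cedilla and fails.
nfd_both = re.fullmatch(nfd(PATTERN), nfd(CEDILLA_FIRST)) is not None

print(equivalent, raw_caron_first, raw_cedilla_first, nfd_both)
```

So prenormalizing to NFD makes the answer consistent across equivalent inputs, but it is consistently "no match" unless the pattern is rewritten along the lines of the alternation given above.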

[1] Or so I thought, but my tests used UTF-32 while apparently UTF-16 is much faster: https://tzlaine.github.io/text/doc/html/boost_text__proposed.... Ugh.

[2] https://www.unicode.org/reports/tr18/#Canonical_Equivalents

[3] https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_...

burntsushi · on Aug 12, 2022

> Does anything even fully implement level 2, I wonder?

Not sure about backtrackers, but I know ICgrep attempted quite a bit of level 2: http://international-characters.com/icgrep

But I'm not sure how far they got. And most of the links on that page are dead. :-(

And yeah, UTS#18 actually used to have a level 3 (custom tailoring), but they removed it.

I'm content with level 1 support. The regex crate is just about there and that actually makes it have better Unicode support than the vast majority of regex engines. :-)

Level 2 is indeed hard.