> To be fair, the regex crate can’t match canonically equivalent Unicode strings, which I also think is pretty meh
Yeah not "meh" at all. :-) I don't think any regex engine can do it? Does icu's regex engine even do it? Certainly no fsm regex engine does.
UTS#18 doesn't even ask regex engines to do it. If you need it, it just suggests canonicalizing both the regex and the haystack first.
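A minimal sketch of that "canonicalize both sides first" suggestion, using Python's standard `re` and `unicodedata` modules rather than the regex crate. The helper name `search_normalized` is made up for illustration, and normalizing the pattern source text wholesale is an assumption that only holds for simple patterns (it can change the meaning of character classes, for instance).

```python
import re
import unicodedata


def search_normalized(pattern, haystack, form="NFD"):
    """Normalize both the pattern text and the haystack, then search.

    Hypothetical helper: match offsets refer to the *normalized*
    haystack, not the original input.
    """
    norm_pattern = unicodedata.normalize(form, pattern)
    norm_haystack = unicodedata.normalize(form, haystack)
    return re.search(norm_pattern, norm_haystack)


precomposed = "caf\u00e9"   # 'é' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# Without normalization the two spellings do not match each other.
plain = re.search(precomposed, decomposed)

# After normalizing both sides they do.
normalized = search_normalized(precomposed, decomposed)
```

The cost is a full normalization pass over the haystack before matching starts, which is the throughput concern raised further down the thread.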
And yes, derivatives build full DFAs. Those are a no-go in a general purpose regex engine.
There's also the issue of complement and intersection being somewhat difficult to reason about.
I'll also note that I have never given any serious effort to actually trying to implement complement or intersection. I just don't see any viable implementation path to it in the first place. I don't know where I would start. Everything I can think of involves extreme compile-time costs that would just be totally unacceptable in a general purpose regex engine.
But, I am not really an "ideas" person. I see myself more as an engineer. So my inability should not really be taken for gospel. Be skeptical of expertise. That's the foundation of science. :)
Unless I can’t read, ICU4C does not even do streaming normalization, and the buffer-at-a-time one is around 100–200 MB/s or so[1], which looks ... mildly amusing next to your engine :) My own attempt at streaming normalization is currently about two times slower, because it has deeper (and smaller) tables and is just generally dumb, but then ICU4C is not particularly smart either. I expect that normalization at 2x ICU speed or more is possible, but that still won’t make normalize-then-match competitive with no normalization, while encoding all canonical equivalents into the DFA sounds downright hellish (also strictly speaking impossible, given normalization requires unbounded buffer space, but that part is probably solvable).
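To illustrate the "unbounded buffer space" remark, here is a small sketch using Python's `unicodedata`. It assumes nothing about ICU4C internals; it only shows that a combining mark with a lower canonical combining class arriving after an arbitrarily long run of higher-class marks has to move in front of all of them, so a streaming normalizer cannot emit any of the run until it sees the next starter.

```python
import unicodedata

CARON = "\u030C"    # combining caron, canonical combining class 230
CEDILLA = "\u0327"  # combining cedilla, canonical combining class 202


def nfd(s):
    return unicodedata.normalize("NFD", s)


def marks_then_cedilla(n):
    """'a', then n carons, then a single cedilla at the very end."""
    return "a" + CARON * n + CEDILLA


# Canonical ordering sorts marks by combining class, so the trailing
# cedilla ends up before every caron, however long the run is.
results = {n: nfd(marks_then_cedilla(n)) for n in (1, 10, 1000)}
```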
UTS #18 level 2 (which I’m aware you explicitly did not attempt) kind of says some things about canonical equivalence and then goes “nah, normalization is difficult, let’s go match grapheme clusters, also maybe just prenormalize to NFD”[2]. Which just looks weird, but apparently it means they did try to impose a requirement then gave up more than a decade ago[3]. Does anything even fully implement level 2, I wonder?
So, meh, but not really directed at you—regex is supernaturally fast as far as I’m concerned. Perhaps I’m annoyed at Unicode normalization for being (not really complicated but) so punishingly difficult to do quickly it isn’t really clear how you could wire it into a regex matcher.
There is also the question of how normalization-invariant regexes should even work: should “<C-caron>.” match “<C-cedilla><caron>”? UTS #18 (with invariance requirement restored) seems to imply yes, but it’s hard to imagine when anyone would actually want that. (Note that both pattern and haystack are normalized—the corresponding regex for NFD input would be something like “(C<caron>[[:ccc=0:][:ccc>=230:]]|C[[:ccc>0:]&[:ccc<230:]]<caron>)”, but nothing implements that syntax.) So while UTS #18 gives a possible interpretation of how Unicode regexes should work, I’m not entirely convinced it’s the right one, or even right enough to go on the crusade of implementing a normalization-invariant regex engine.
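A rough sketch of that example in Python, since nothing implements the `[:ccc=…:]` syntax. The two-branch pattern is hand-rolled with `unicodedata.combining()`; the function name `matches_c_caron_dot` is hypothetical, and this is one reading of the pattern above, not a general normalization-invariant matcher.

```python
import re
import unicodedata

CARON = "\u030C"


def nfd(s):
    return unicodedata.normalize("NFD", s)


def matches_c_caron_dot(haystack):
    """Search NFD(haystack) for either branch of the pattern:
    C <caron> [ccc = 0 or ccc >= 230]   or   C [0 < ccc < 230] <caron>
    """
    text = nfd(haystack)
    for i in range(len(text) - 2):
        a, b, c = text[i], text[i + 1], text[i + 2]
        if a != "C":
            continue
        ccc_b = unicodedata.combining(b)
        ccc_c = unicodedata.combining(c)
        if b == CARON and (ccc_c == 0 or ccc_c >= 230):
            return True
        if 0 < ccc_b < 230 and c == CARON:
            return True
    return False


c_cedilla_caron = "\u00C7\u030C"  # <C-cedilla><caron>
c_caron_cedilla = "\u010C\u0327"  # <C-caron><cedilla>

# The two spellings are canonically equivalent: same NFD form.
equivalent = nfd(c_cedilla_caron) == nfd(c_caron_cedilla)

# Plain normalize-then-match misses it: in NFD the cedilla sits
# between 'C' and the caron.
naive = re.search(nfd("\u010C."), nfd(c_cedilla_caron))

# The two-branch pattern accepts it.
invariant = matches_c_caron_dot(c_cedilla_caron)
```

So under this reading "<C-caron>." does match "<C-cedilla><caron>", with the `.` consuming the cedilla, which is exactly the behavior questioned above.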
But I'm not sure how far they got. And most of the links on that page are dead. :-(
And yeah, UTS#18 actually used to have a level 3 (custom tailoring), but they removed it.
I'm content with level 1 support. The regex crate is just about there and that actually makes it have better Unicode support than the vast majority of regex engines. :-)