Anyone know how it handles ligatures? Depending on font and tooling the word "fi...

kranner · 2024-09-18T07:43:35 1726645415

Seems to work well when it's searching the PDF text layer, since ligatures are a font rendering effect. You're right that ligatures are not as common in modern books.

Might be iffier in OCR mode: it seems to use Tesseract, which is known to have issues recognising ligatured text.

shellac · 2024-09-18T11:48:43 1726660123

The (standard) ripgrep regex engine has full Unicode support. My reading of that is that it should handle equivalences like this, such as matching the decomposed version.

burntsushi · 2024-09-18T11:59:46 1726660786

It does not. Almost no regex engine does that.

To add more color to this, the precise details of what "Unicode support" means are documented here: https://github.com/rust-lang/regex/blob/master/UNICODE.md

In effect, all of UTS#18 Level 1 is covered with a couple of caveats. This is already a far cry better than most regex engines, like PCRE2, which has limited support for properties and no way to do subtraction or intersection of character classes. Other regex engines, like JavaScript's, are catching up. While UTS#18 Level 1 makes ripgrep's Unicode support better than most, it does not make it the best. The third-party Python `regex` library, for example, has very good support, although it is not especially fast[1].

Short of building UTS#18 2.1[2] support into the regex engine (unlikely to ever happen), it's likely ripgrep could offer some sort of escape hatch. Perhaps, for example, an option to normalize all text searched to whatever form you want (nfc, nfd, nfkc or nfkd). The onus would still be on you to write the corresponding regex pattern though. You can technically do this today with ripgrep's `--pre` flag, but having something built-in might be nice. Indeed, if you read UTS#18 2.1, you'll note that it is self-aware about how difficult matching canonical equivalents is, and essentially suggests this exact work-around instead. The problem is that it would need to be opt-in and the user would need to be aware of the problem in the first place. That's... a stretch, but probably better than nothing.

[1]: https://github.com/BurntSushi/rebar?tab=readme-ov-file#summa...

[2]: https://unicode.org/reports/tr18/#Canonical_Equivalents
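To make the `--pre` idea above concrete, here is a minimal sketch of what such a preprocessor might look like. The script name `normalize.py`, the choice of NFKC as the default, and the assumption that the file is UTF-8 are all illustrative, not anything ripgrep ships or prescribes. ripgrep passes the file path as the first argument to the `--pre` command and searches whatever the command writes to stdout.

```python
#!/usr/bin/env python3
"""Hypothetical ripgrep preprocessor: emit a file's text in one normalization form."""
import sys
import unicodedata

# Assumption: NFKC by default, because the compatibility forms also fold
# ligature code points such as U+FB03 into plain "ffi".
DEFAULT_FORM = "NFKC"


def normalize_text(text: str, form: str = DEFAULT_FORM) -> str:
    """Return text converted to the given form (NFC, NFD, NFKC or NFKD)."""
    return unicodedata.normalize(form, text)


def main(argv: list) -> int:
    # ripgrep invokes the --pre command with the file path as its first argument.
    if len(argv) != 2:
        print("usage: normalize.py PATH", file=sys.stderr)
        return 2
    # Assumption: input is UTF-8; undecodable bytes become U+FFFD.
    with open(argv[1], encoding="utf-8", errors="replace") as handle:
        sys.stdout.write(normalize_text(handle.read()))
    return 0


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv))
```

With the script marked executable, something like `rg --pre ./normalize.py 'office' docs/` would then search the normalized text. The pattern still has to be written in the same form, and spawning a process per file is not cheap.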

shellac · 2024-09-18T17:27:28 1726680448

Thanks very much for clarifying that. It did seem unlikely: I remember NSString (ask your parents...) supported this level of Unicode equivalence, and it was quite a burden. Normalising does feel like the only tractable method here, and if you have an extraction pipeline anyway (in rga) maybe it's not so bad.

burntsushi · 2024-09-18T21:34:35 1726695275

Yes, rga could support this in a more streamlined manner than rg, since rga has that extraction pipeline with caching. ripgrep just has a hook for an extraction pipeline.

gcr · 2024-09-18T13:06:44 1726664804

For the purpose of searching, wouldn’t it be sufficient to do NFC normalization for text? Could hide that behind a command line flag even…

burntsushi · 2024-09-18T13:26:05 1726665965

Can you say how that differs from what I suggested in my last paragraph? I legitimately can't tell if you're trying to suggest something different or not.

As UTS#18 2.1 says, it isn't sufficient to just normalize the text you're searching. It also means the user has to craft their regex appropriately. If you normalize to NFC but your regex uses NFD, oops. So it's probably best to expose a flag that lets you pick the normalization form.

And yes, it would have to be behind a CLI flag. Always doing normalization would likely make ripgrep slower than a naive grep written in Python. Yes. That bad. Yes. Really. And it might now become clear why a lot of tools don't do this.
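A small illustration of the "NFC text, NFD regex, oops" case, using Python's standard `re` module as a stand-in for a regex engine that compares code points without canonical equivalence. The helper `search_normalized` is hypothetical, and it only makes sense for simple literal-style patterns, since normalizing arbitrary regex syntax could change its meaning.

```python
import re
import unicodedata


def search_normalized(pattern: str, text: str, form: str = "NFC"):
    """Normalize both the pattern and the haystack to one form, then search.

    Assumption: the pattern is a simple literal-style regex.
    """
    norm_pattern = unicodedata.normalize(form, pattern)
    norm_text = unicodedata.normalize(form, text)
    return re.search(norm_pattern, norm_text)


nfc_text = "caf\u00e9"      # "é" stored as one precomposed code point
nfd_pattern = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The two strings render identically but differ code point by code point,
# so a plain search with mismatched forms finds nothing.
mismatch = re.search(nfd_pattern, nfc_text)

# Normalizing only the text would not help here: it is already NFC.
# Bringing both sides to the same form lets the match succeed.
match = search_normalized(nfd_pattern, nfc_text)
```

Here `mismatch` is `None` while `match` is a match object, which is why a flag that picks the form for the searched text still leaves the pattern as the user's responsibility.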

virtualritz · 2024-09-18T10:38:27 1726655907

> I notice that the word "Office" in the title of this article is not rendered with a ligature by Chrome

Chrome mobile on Android does render Office with what looks like at least an fi ligature for me (it should use an ffi one but still).

Maybe it depends on the font?

miki123211 · 2024-09-18T12:03:12 1726660992

> I don't have any sense of how common ligature usage is anymore

It's much more common in PDFs than it is on the web, at least where the underlying plaintext is concerned.
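For anyone who wants to check their own extracted text, a rough sketch of how ligature code points in a text layer could be spotted. It only looks at the Latin ligatures U+FB00 through U+FB06 in the Alphabetic Presentation Forms block; the function name and the sample string are made up for illustration.

```python
import unicodedata

# U+FB00..U+FB06 are the Latin ligatures (ff, fi, fl, ffi, ffl, long s-t, st).
LATIN_LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}


def find_ligatures(text: str) -> list:
    """Return (offset, ligature, NFKC expansion) for each ligature code point."""
    return [
        (offset, ch, unicodedata.normalize("NFKC", ch))
        for offset, ch in enumerate(text)
        if ch in LATIN_LIGATURES
    ]


# Hypothetical text-layer sample where "ffi" was stored as the single code point U+FB03.
sample = "Post O\ufb03ce"
hits = find_ligatures(sample)
```

A literal search for "Office" would miss `sample` as stored, because the text contains one code point where the query has three letters.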