
> It couldn’t be further from the truth. Unix paths don’t need to be valid UTF-8

Yes, but most programs expect to be able to print file paths at least under some circumstances, like printing error messages. Even if a program is fully correct and doesn't assume an encoding in normal operation, it still has to assume one for printing. File paths that aren't UTF-8 lead to a bunch of ����� in your output (at best). So I think it's fair to say that Unix paths are assumed to be UTF-8 by almost all programs, even if being invalid UTF-8 doesn't actually cause a correct program to crash.
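To make the ����� point concrete, here's a small Python sketch (the filename is a made-up example of bytes a Latin-1 locale might produce):

```python
# Hypothetical path written under a Latin-1 locale: "café.txt",
# where "é" is the single byte 0xE9. This is not valid UTF-8.
path = b"caf\xe9.txt"

# A program that assumes UTF-8 and prints with replacement characters:
print(path.decode("utf-8", errors="replace"))  # caf�.txt
```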



In the Rust std one can easily use the lossless presentation with file APIs, and print a lossy version in error messages. I find this to be good enough.
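Python's standard library has a rough analogue of that split: `os.fsdecode`/`os.fsencode` use the surrogateescape error handler on POSIX, so undecodable bytes survive a round trip, while a lossy decode is reserved for display. A sketch (the mechanism is analogous to, not identical to, Rust's `OsStr`):

```python
import os

raw = b"caf\xe9.txt"  # not valid UTF-8

# Lossless: surrogateescape keeps every byte recoverable,
# so the result can be passed back to file APIs unchanged.
lossless = os.fsdecode(raw)
assert os.fsencode(lossless) == raw

# Lossy: for human-readable error messages only.
lossy = raw.decode("utf-8", errors="replace")
print(f"could not open {lossy!r}")
```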


I mean, it doesn't have to assume an encoding for printing; it just has to have a sane way of turning the path into something human-readable.

Look, you're right that this ship has sailed, but ideally we would have decided on a way to display and encode arbitrary bytes in file paths.


I dunno. That sounds like proposing to render "foo.txt" as "Zm9vLnR4dA==" or "[102, 111, 111, 46, 116, 120, 116]" or something. I think you probably meant something like "print the regular characters if the string is UTF-8, or a lossless fallback representation of the bytes otherwise." That's a good idea, and I think a lot of programs do that, but at the same time "if the string is UTF-8" is problematic. There's no reliable way for us to know what strings are or are not intended to be decoded as UTF-8, because non-UTF-8 encodings can coincidentally produce valid UTF-8 bytes. For example, the two characters "&!" are the same bytes in UTF-8 as the character "Ω" is in UTF-16. This works in Python:

    assert "&!".encode("UTF-8").decode("UTF-16le") == "Ω"
So I think I want to claim something a bit stronger:

1) Users demand, quite rightly, to be able to read paths as text.

2) There is no reliable way to determine the encoding of a string, just by looking at its bytes. And Unix doesn't provide any other metadata.

3) Therefore, useful Unix programs must assume that any path that could be UTF-8, is UTF-8, for the purpose of displaying it to the user.
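Point 3 can be sketched in a few lines of Python (the function name is illustrative, not from any standard library; the fallback representation is one arbitrary choice among many):

```python
def display_path(raw: bytes) -> str:
    """If the bytes could be UTF-8, show them as UTF-8;
    otherwise fall back to a lossless escaped form."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Lossless fallback: undecodable bytes become \xNN escapes.
        return raw.decode("utf-8", errors="backslashreplace")

print(display_path(b"foo.txt"))      # foo.txt
print(display_path(b"caf\xe9.txt"))  # caf\xe9.txt
```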

Maybe in an alternate reality, the system locale could've been the reliable source of truth for string encodings? But of course if we were starting from scratch today, we'd just mandate UTF-8 and be done with it :)


> 2) There is no reliable way to determine the encoding of a string, just by looking at its bytes. And Unix doesn't provide any other metadata. 3) Therefore, useful Unix programs must assume that any path that could be UTF-8, is UTF-8, for the purpose of displaying it to the user.

No, there are locale settings (in env vars), and software should assume path encoding based on the locale's encoding.

It is true that today the locale setting is usually UTF-8 based, but if I use a non-UTF-8 locale then tools should not assume paths are in UTF-8, and should recode them accordingly.
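A POSIX-only sketch of honoring the locale rather than hard-coding UTF-8 (what `nl_langinfo` reports depends on which locales are installed, so the codeset shown in the comment is just an example):

```python
import locale

# Respect the user's locale (from LANG / LC_* env vars)
# instead of hard-coding UTF-8.
locale.setlocale(locale.LC_ALL, "")
enc = locale.nl_langinfo(locale.CODESET)  # e.g. "UTF-8" or "ISO-8859-1"

raw = b"caf\xe9.txt"
print(raw.decode(enc, errors="replace"))
```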


No, the proposal is not for crazy encoding schemes like the ones used for domain names; that's up to the presentation layer. The need is to follow the Unicode security guidelines for identifiers. A path is an identifier, not a binary chunk, so it needs to follow some rules. Lately some filesystem drivers have agreed, but it's still totally insecure all over.
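One of the rules those guidelines build on is normalization: byte-distinct names can render identically. A minimal Python illustration using NFC (this shows only normalization, not the full confusable-detection machinery):

```python
import unicodedata

# Two byte-distinct filenames that render identically:
precomposed = "caf\u00e9.txt"   # é as a single code point
decomposed  = "cafe\u0301.txt"  # e + combining acute accent

assert precomposed != decomposed
# NFC normalization makes them compare equal:
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", decomposed)
```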



