
> It couldn’t be further from the truth. Unix paths don’t need to be valid UTF-8

Yes, but most programs expect to be able to print file paths at least under some circumstances, like printing error messages. Even if a program is fully correct and doesn't assume an encoding in normal operation, it still has to assume one for printing. File paths that aren't UTF-8 lead to a bunch of ����� in your output (at best). So I think it's fair to say that Unix paths are assumed to be UTF-8 by almost all programs, even if being invalid UTF-8 doesn't actually cause a correct program to crash.
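To make the ����� point concrete, here's a small Python sketch (the filename is a made-up example of bytes a Latin-1 locale might produce):

```python
# Hypothetical path written under a Latin-1 locale: "café.txt",
# where "é" is the single byte 0xE9. This is not valid UTF-8.
path = b"caf\xe9.txt"

# A program that assumes UTF-8 and prints with replacement characters:
print(path.decode("utf-8", errors="replace"))  # caf�.txt
```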



In the Rust std one can easily use the lossless presentation with file APIs, and print a lossy version in error messages. I find this to be good enough.
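Python's standard library has a rough analogue of that split: `os.fsdecode`/`os.fsencode` use the surrogateescape error handler on POSIX, so undecodable bytes survive a round trip, while a lossy decode is reserved for display. A sketch (the mechanism is analogous to, not identical to, Rust's `OsStr`):

```python
import os

raw = b"caf\xe9.txt"  # not valid UTF-8

# Lossless: surrogateescape keeps every byte recoverable,
# so the result can be passed back to file APIs unchanged.
lossless = os.fsdecode(raw)
assert os.fsencode(lossless) == raw

# Lossy: for human-readable error messages only.
lossy = raw.decode("utf-8", errors="replace")
print(f"could not open {lossy!r}")
```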


I mean, it doesn't have to assume an encoding for printing; it just has to have a sane way of turning the path into something human-readable.

Look, you're right that this ship has sailed, but ideally we would have decided on a way to display and encode arbitrary bytes in file paths.


I dunno. That sounds like proposing to render "foo.txt" as "Zm9vLnR4dA==" or "[102, 111, 111, 46, 116, 120, 116]" or something. I think you probably meant something like "print the regular characters if the string is UTF-8, or a lossless fallback representation of the bytes otherwise." That's a good idea, and I think a lot of programs do that, but at the same time "if the string is UTF-8" is problematic. There's no reliable way for us to know what strings are or are not intended to be decoded as UTF-8, because non-UTF-8 encodings can coincidentally produce valid UTF-8 bytes. For example, the two characters "&!" are the same bytes in UTF-8 as the character "Ω" is in UTF-16. This works in Python:

    assert "&!".encode("UTF-8").decode("UTF-16le") == "Ω"
So I think I want to claim something a bit stronger:

1) Users demand, quite rightly, to be able to read paths as text.

2) There is no reliable way to determine the encoding of a string, just by looking at its bytes. And Unix doesn't provide any other metadata.

3) Therefore, useful Unix programs must assume that any path that could be UTF-8, is UTF-8, for the purpose of displaying it to the user.
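Point 3 can be sketched in a few lines of Python (the function name is illustrative, not from any standard library; the fallback representation is one arbitrary choice among many):

```python
def display_path(raw: bytes) -> str:
    """If the bytes could be UTF-8, show them as UTF-8;
    otherwise fall back to a lossless escaped form."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Lossless fallback: undecodable bytes become \xNN escapes.
        return raw.decode("utf-8", errors="backslashreplace")

print(display_path(b"foo.txt"))      # foo.txt
print(display_path(b"caf\xe9.txt"))  # caf\xe9.txt
```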

Maybe in an alternate reality, the system locale could've been the reliable source of truth for string encodings? But of course if we were starting from scratch today, we'd just mandate UTF-8 and be done with it :)


> 2) There is no reliable way to determine the encoding of a string, just by looking at its bytes. And Unix doesn't provide any other metadata. 3) Therefore, useful Unix programs must assume that any path that could be UTF-8, is UTF-8, for the purpose of displaying it to the user.

No, there are locale settings (in env vars), and software should assume path encoding based on the locale's encoding.

It is true that today the locale setting is usually UTF-8 based, but if I use a non-UTF-8 locale then tools should not assume paths are in UTF-8, and should recode them accordingly.
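A POSIX-only sketch of honoring the locale rather than hard-coding UTF-8 (what `nl_langinfo` reports depends on which locales are installed, so the codeset shown in the comment is just an example):

```python
import locale

# Respect the user's locale (from LANG / LC_* env vars)
# instead of hard-coding UTF-8.
locale.setlocale(locale.LC_ALL, "")
enc = locale.nl_langinfo(locale.CODESET)  # e.g. "UTF-8" or "ISO-8859-1"

raw = b"caf\xe9.txt"
print(raw.decode(enc, errors="replace"))
```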


No, the proposal is not for crazy encoding schemes like the ones used for domain names; that's up to the presentation layer. The need is to follow the Unicode security guidelines for identifiers. A path is an identifier, not a binary chunk, so it needs to follow some rules. Lately some filesystem drivers have agreed, but it's still totally insecure all over.
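One of the rules those guidelines build on is normalization: byte-distinct names can render identically. A minimal Python illustration using NFC (this shows only normalization, not the full confusable-detection machinery):

```python
import unicodedata

# Two byte-distinct filenames that render identically:
precomposed = "caf\u00e9.txt"   # é as a single code point
decomposed  = "cafe\u0301.txt"  # e + combining acute accent

assert precomposed != decomposed
# NFC normalization makes them compare equal:
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", decomposed)
```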



