
Still doesn't solve the fact that filesystems across different OSes allow invalid UTF-8 sequences in filenames.

Maybe 99% of apps do not care, but even a simple "cp" tool should. Filenames (and maybe other named resources) should be treated completely differently, and not blindly assumed to be valid UTF-8.
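
To make that concrete, here is a minimal Rust sketch (Unix-only; the filename and the "copy-of-weird-name" target are made up for the example) of why a copy tool can't just treat paths as text: the kernel happily accepts a name containing the byte 0xFF, which is never valid UTF-8.

    use std::ffi::OsString;
    use std::fs::File;
    use std::os::unix::ffi::OsStringExt;

    fn main() -> std::io::Result<()> {
        // "report" + raw byte 0xFF + ".txt": accepted by the kernel,
        // but not decodable as UTF-8.
        let name = OsString::from_vec(b"report\xFF.txt".to_vec());
        File::create(&name)?;

        // Treating the name as text fails; a copy tool that converts
        // paths to String at this point has already lost information.
        assert!(name.to_str().is_none());

        // Staying at the OsStr/Path level keeps the bytes intact.
        std::fs::copy(&name, "copy-of-weird-name")?;
        Ok(())
    }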



Are you saying that operating systems (i.e. the kernel) should check and enforce encodings in filenames?

1) Why?

2) Bye bye backward compatibility and interoperability


> 2) Bye bye backward compatibility and interoperability

It's already not really a thing.

Traditional unices allow arbitrary bytes with the exception of 00 and 2f, NTFS allows arbitrary utf-16 code units (including unpaired surrogates) with the exception of 0000 and 002f, and I think HFS+ requires valid UTF-16 and allows everything (including NUL).

The OS then adds its own limitations, e.g. Win32 forbids \, :, *, ", ?, <, >, | (as well as a few special names, I think), and OSX forbids 0000 and 003a (":"), the latter of which gets converted to and from "/" (and similarly forbidden) by the POSIX compatibility layer.

The latter is really weird to see in action, if you have access to an OSX machine: open a terminal, try to create a file called "/" and it'll fail. Now create one called ":". Switch over to the Finder, and you'll see that that file is now called "/" (and creating a file called ":" fails).

Oh yeah and ZFS doesn't really care but can require that all paths be valid UTF8 (by setting the utf8only flag).
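
If you want to see the "anything except 00 and 2f" rule for yourself on Linux, a throwaway probe like this (Rust, run in an empty scratch directory) makes it concrete:

    use std::ffi::OsString;
    use std::fs::File;
    use std::os::unix::ffi::OsStringExt;

    fn main() {
        let mut rejected = Vec::new();
        for b in 0u8..=255 {
            let name = OsString::from_vec(vec![b]);
            match File::create(&name) {
                Ok(_) => { let _ = std::fs::remove_file(&name); }
                Err(_) => rejected.push(b),
            }
        }
        // Expected on ext4 and friends: 0x00 (NUL can't even reach the
        // kernel), 0x2f ('/' always parses as a path separator), and 0x2e
        // ('.' fails only because that name already means the current dir).
        println!("rejected bytes: {:02x?}", rejected);
    }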


> Traditional unices allow arbitrary bytes with the exception of 00 and 2f, NTFS allows arbitrary utf-16 code units (including unpaired surrogates) with the exception of 0000 and 002f.

For just Windows -> Linux you can represent everything by mapping WTF-16 to WTF-8.
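
For anyone curious what that mapping looks like, here's a rough sketch of the WTF-16 -> WTF-8 direction (illustrative code, not any particular library's implementation): well-formed surrogate pairs combine into supplementary code points as in normal UTF-16, and lone surrogates are kept and encoded with the generalized 3-byte UTF-8 layout.

    fn wtf16_to_wtf8(units: &[u16]) -> Vec<u8> {
        let mut out = Vec::new();
        let mut i = 0;
        while i < units.len() {
            let u = units[i] as u32;
            // Combine a valid surrogate pair into a supplementary code point;
            // otherwise keep the unit (including a lone surrogate) as-is.
            let cp = if (0xD800..0xDC00).contains(&u)
                && i + 1 < units.len()
                && (0xDC00..0xE000).contains(&(units[i + 1] as u32))
            {
                let lo = units[i + 1] as u32;
                i += 2;
                0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
            } else {
                i += 1;
                u
            };
            // Generalized UTF-8 byte layout; lone surrogates take the 3-byte form.
            if cp < 0x80 {
                out.push(cp as u8);
            } else if cp < 0x800 {
                out.push(0xC0 | (cp >> 6) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
            } else if cp < 0x10000 {
                out.push(0xE0 | (cp >> 12) as u8);
                out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
            } else {
                out.push(0xF0 | (cp >> 18) as u8);
                out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
                out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
            }
        }
        out
    }

    fn main() {
        // "A" followed by a lone high surrogate U+D800 (ill-formed UTF-16).
        let name = [0x0041u16, 0xD800];
        // Prints [41, ed, a0, 80]: the surrogate survives as the 3-byte
        // sequence ED A0 80 instead of being replaced or rejected.
        println!("{:02x?}", wtf16_to_wtf8(&name));
    }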


It sounds like they're saying the opposite. All programs dealing with filenames need to be able to support an arbitrary stream of bytes; they can't just assume UTF-8.


1) Nope. 2) Yes, we need to keep backward compatibility.

What I'm saying is that promoting UTF-8 everywhere, without specifically stressing the fact that filesystems (in general) do not observe UTF-8, leads to API/library designs that lack good support there.

Path/filename/dirname/whatever should be a different kind of "string".
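
Rust's standard library is one existing design along these lines: Path/OsStr are separate types from str, so path-taking APIs never have to assume valid UTF-8, and turning a path into displayable text is an explicit (lossy or fallible) step. A small illustration; the open_log function is made up for the example:

    use std::path::{Path, PathBuf};

    // Accepts anything path-like (&str, String, OsString, PathBuf, ...)
    // but works with it as a Path, never assuming UTF-8.
    fn open_log<P: AsRef<Path>>(dir: P) -> PathBuf {
        dir.as_ref().join("app.log")
    }

    fn main() {
        let p = open_log("/var/log");        // text is fine as input...
        let text = p.to_string_lossy();      // ...but text output is an explicit, lossy step
        println!("{}", text);

        // This would not compile: a Path is not a str, so string APIs
        // can't be applied to it by accident.
        // let upper = p.to_uppercase();
    }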


Backward compatibility is a laudable goal and is not to be broken lightly. But sometimes, things are so fundamentally broken that we would be far better off with a clean break.

Interoperability is quite possibly a good argument for coming up with some reasonable restrictions on filenames. Today you could easily create a ZIP file or similar (case-sensitive names, special characters, etc.) that cannot be successfully extracted on one platform or another.

In an excellent article, David A. Wheeler [1] lays out a compelling case against the status quo. TL;DR: bad filenames are too hard to handle correctly. Programs, standards, and operating systems already assume there are no bad filenames. Your programs will fail in numerous ways when they encounter bad filenames. Some of these failures are security problems.

He concludes: "In sum: It’d be far better if filenames were more limited so that they would be safer and easier to use. This would eliminate a whole class of errors and vulnerabilities in programs that “look correct” but subtly fail when unusual filenames are created (possibly by attackers)." He goes on to consider many ideas towards getting to this goal.

[1] https://dwheeler.com/essays/fixing-unix-linux-filenames.html


To me, that's a design flaw. Would we really be any worse off if we simply declared filenames must be UTF-8?

That seems to be the only case where a user-visible and user-editable field is allowed to be an arbitrary byte sequence, and its primary purpose seems to be allowing this argument to pop up on HN every month.

I've never seen any non-malicious use of it. All popular filesystems already disallow specific sets of ASCII characters in names. Any database which needs to save data in files by number has no problem using safe hex filenames.


Sure we could declare that but then what? Non-unicode filenames won't suddenly disappear. Operating systems won't suddenly enforce unicode. Filesystems will still allow non-unicode names.

Simply declaring it doesn't help anybody. In the meantime your application still needs to handle non-unicode filenames; otherwise those malicious ones are free to be malicious.


I'd assume that the proper place for defining what's a valid filename would be the filesystem level: a filesystem following standard ABC v123 would not allow non-unicode names, so non-unicode filenames would either be refused or modified when copied/written to that filesystem.

This is not new; it would match the current behavior of the OS/filesystem enforcing other character restrictions, such as when writing (for example) a file name with an asterisk or colon to a FAT32 USB flash drive.
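
As a sketch of the kind of check such a filesystem (or an interop-minded archiver) could apply at write time, here's an illustrative Rust rule set combining "must be valid UTF-8" with the Win32-style forbidden characters mentioned upthread; the exact rules are made up for the example, not taken from any real standard:

    // Illustrative only: a validity check a filesystem or archiving tool
    // could enforce on names before writing them.
    fn is_portable_name(raw: &[u8]) -> bool {
        // Must decode as UTF-8 at all.
        let name = match std::str::from_utf8(raw) {
            Ok(s) => s,
            Err(_) => return false,
        };
        // No path separators, no Win32-reserved punctuation, no control chars.
        const FORBIDDEN: &[char] = &['/', '\\', ':', '*', '"', '?', '<', '>', '|'];
        !name.is_empty()
            && name.chars().all(|c| !c.is_control() && !FORBIDDEN.contains(&c))
    }

    fn main() {
        assert!(is_portable_name("notes.txt".as_bytes()));
        assert!(!is_portable_name(b"backup:2024"));    // ':' is reserved on Win32/FAT
        assert!(!is_portable_name(b"report\xFF.txt")); // not valid UTF-8
    }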


If Unicode had a set of "explicitly this byte" codepoints, it would be simple to deal with: just pass the invalid bytes of the filename through that way.
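
Something close to this already exists as an application-level convention: Python's surrogateescape error handler (PEP 383) smuggles each undecodable byte through as the lone surrogate U+DC00 + byte, which WTF-8-style encodings can then carry. A rough Rust sketch of that byte <-> code point mapping (plain u32 values are used because lone surrogates aren't valid Rust chars):

    // Only bytes 0x80..=0xFF ever need escaping; 0x00..0x7F decode as ASCII.
    fn escape_byte(b: u8) -> u32 {
        debug_assert!(b >= 0x80, "ASCII bytes decode normally and are never escaped");
        0xDC00 + b as u32 // e.g. 0xFF -> U+DCFF
    }

    // Only the U+DC80..U+DCFF range round-trips back to a raw byte.
    fn unescape(cp: u32) -> Option<u8> {
        if (0xDC80..=0xDCFF).contains(&cp) {
            Some((cp - 0xDC00) as u8)
        } else {
            None
        }
    }

    fn main() {
        let b = 0xFFu8;
        let cp = escape_byte(b);
        assert_eq!(cp, 0xDCFF);
        assert_eq!(unescape(cp), Some(b));
    }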


Unicode deals with text, so such a set of codepoints is a non-starter, anyway.


Once you lose the expectation of being able to work with non-unicode filenames, those files will quickly get renamed and cease to be a problem.


How can you rename them if you can only use unicode paths?


You would need to use some special utility created just for that purpose.


As long as the tool for renaming files handles non-utf8 filenames you'd be fine.



