
The history of Git and Subversion handling filenames makes me think the opposite is true: a VCS that doesn't handle arbitrary byte strings will have weird edge cases that prevent users from adding or accessing files, possibly even “losing” data in a local checkout. This is especially tedious because it'll appear to work for a while, until someone first tries to commit an unusual file or checks it out with a previously-unused client.
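One such edge case can be reproduced in a few lines. This is a minimal sketch, assuming a Linux box with a UTF-8 locale: the filesystem happily stores a non-UTF-8 name, and any tool that assumes UTF-8 filenames can't round-trip it.

```python
import os
import tempfile

# On Linux, filenames are raw bytes (only '/' and NUL are forbidden).
d = tempfile.mkdtemp()
raw = b"caf\xe9.txt"  # Latin-1 for 'café.txt'; NOT valid UTF-8

with open(os.path.join(os.fsencode(d), raw), "wb") as f:
    f.write(b"hello")

# A tool that assumes UTF-8 can't decode this name cleanly; Python
# surfaces the undecodable byte as a lone surrogate instead of losing it.
name = os.listdir(d)[0]
print(repr(name))                 # 'caf\udce9.txt' under a UTF-8 locale
print(os.fsencode(name) == raw)   # True: the original bytes survive
```

A VCS that stores names as text rather than bytes has to pick some policy for names like this one: reject, mangle, or escape.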


My understanding is that you can't treat the filename as an arbitrary byte string, since you have to transcode it across platforms, otherwise the filename won't show up properly everywhere. E.g. if I create a file named "test" on Unix, the name will be UTF-8 (assuming a sane Unix). If on Windows I write that same UTF-8 byte string as a filename, it will show up as worthless garbage in explorer.exe, since Windows will decode it as UTF-16.
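The garbage the parent describes is easy to reproduce: take the UTF-8 bytes of a name and decode them as UTF-16-LE, the way a Windows consumer would. (A minimal sketch; real Win32 path handling is more involved.)

```python
# The same four bytes read under two different encodings.
raw = "test".encode("utf-8")        # b'test' -> 74 65 73 74
as_utf16 = raw.decode("utf-16-le")  # what a UTF-16 consumer sees
print(as_utf16)                     # two CJK characters, not 'test'
assert as_utf16 != "test"
```

Each pair of ASCII bytes is swallowed into one 16-bit code unit, so even a plain ASCII name becomes unrecognizable.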

So a VCS needs to know the filename encoding in order to work properly.
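A sketch of the transcoding step this implies, assuming the VCS knows (or has recorded) that the source name is UTF-8: decode to abstract text, then re-encode for the target platform.

```python
# Hypothetical round-trip: Linux stores bytes, Windows wants UTF-16.
unix_name = "café.txt".encode("utf-8")   # bytes as stored on a Linux disk
text = unix_name.decode("utf-8")         # only possible if the encoding is known
windows_name = text.encode("utf-16-le")  # the form Win32/NTFS APIs expect

# The logical name survives the trip in both directions.
assert windows_name.decode("utf-16-le") == text
assert text.encode("utf-8") == unix_name
```

The `decode` step is exactly where an "arbitrary byte string" model breaks down: without knowing the encoding, there is no correct way to perform it.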


The actual text isn't an arbitrary byte string. There is logical data and then there is its representation. char, short, int, and string can all logically refer to the number 0, but the representation is completely different. With char it is even possible to represent the same number in two ways: as a binary 0 or as the character code for '0'.

Allowing byte strings as the physical representation is not a bad idea for future-proofing, but you will have to provide additional information by storing the character encoding that was used to create the arbitrary byte string. If you fail to do that, then this information will have to be provided through convention, and that's how we get "stuck" with UTF-8. Although I like UTF-8, this doesn't feel like the right solution. If everyone agrees to use UTF-8, then we should stop pretending that something is just an arbitrary byte string and formalize UTF-8.

The idea of an arbitrary byte string is fooling people into believing something that is not true. Developers falsely think their software can handle any character encoding. However, once you decide to support only a single character encoding you will notice that if something better comes along you need a way to differentiate the old and new codec. Then you decide to add a field that declares the character encoding type and suddenly it's obvious that your arbitrary byte string is a bad way of dealing with the problem. That byte string has meaning. Don't throw that meaning away.
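One way to sketch "don't throw that meaning away" is to pair the raw bytes with an encoding tag, so a future codec only needs a new tag value. `TaggedName` here is a hypothetical type for illustration, not something from any real VCS.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedName:
    encoding: str   # e.g. "utf-8"; a newer codec just gets a new tag
    raw: bytes      # the on-disk byte string, stored verbatim

    def display(self) -> str:
        # Decode for UIs; the raw bytes stay authoritative for the filesystem.
        return self.raw.decode(self.encoding)

name = TaggedName("utf-8", "café.txt".encode("utf-8"))
print(name.display())  # café.txt
```

The point of the tag is that the bytes never lose their meaning: old repositories keep their old tag, and tools can always recover the logical name.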



