So there is this character, the BOM, which is explicitly allowed by the UTF-8 standard. And while it is neither necessary nor recommended, you are still allowed to put it in your UTF-8 text files to signal that the file is encoded in UTF-8.
Then there are all of these *nix programs, which don't know how to deal with these BOMs, simply because the first character in the file they're reading isn't the shebang or <?php or something like that, but is instead this completely allowable BOM. And everybody believes MS should be fixing their product? Am I missing something?
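To make the shebang complaint concrete, here is a rough sketch in Python. It assumes the loader only looks at the very first two bytes of the file for "#!", which is how Linux treats interpreter scripts; the helper name starts_with_shebang is made up for illustration.

```python
import codecs

def starts_with_shebang(data: bytes) -> bool:
    """Return True only if the very first two bytes are '#!'."""
    return data[:2] == b"#!"

script = b"#!/bin/sh\necho hello\n"
# The same script as an editor would save it with a UTF-8 signature in front.
script_with_bom = codecs.BOM_UTF8 + script

print(starts_with_shebang(script))           # True
print(starts_with_shebang(script_with_bom))  # False: EF BB BF comes first
```

The text is identical in both cases; only the three leading bytes differ, and that is enough to hide the "#!" from anything that checks a fixed offset.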
I think the author simply believes people should not be using Windows Notepad to edit scripts intended to be run on *nix. I can't think of many situations in which it would make sense to do that anyway.
"In UTF-8, the BOM corresponds to the byte sequence <EF₁₆ BB₁₆ BF₁₆>. Although there
are never any questions of byte order with UTF-8 text, this sequence can serve as signature
for UTF-8 encoded text where the character set is unmarked."
This is like one of those moments where you are quickly brought back to a painful memory from a more troubled time in your life, so distant in mindset, if not time itself, that it has faded to be less a memory of life itself than that of a bad dream.
You shake it off, and it recedes back into the cobwebs, not to be jarred again until the next time something flies by in your RSS feed about a new Microsoft product or feature that is a poorly conceived sugar-coated re-implementation of cron, grep, sed, bash, vim, awk, ssh, unix pipes, or lisp.
Ok, but everything handles the UTF-8 "BOM" just fine. Adding it only causes problems when you run a ./file.sh that has a BOM and haven't set up your binfmt_misc magic properly. Fix that, and it works fine.
Ok, but everything handles the UTF-8 "BOM" just fine.
For small values of "everything". I can't even count the number of bizarre errors I encounter, only to discover that one of the files I'm working on has been corrupted by somebody carelessly using Notepad. Usually it's not obvious what's going on -- the error will be something like "Error parsing file: illegal byte before start of message body", which provides not much help when it's 23:30 and I'm staring at a text editor wondering what the hell byte it's talking about.
And therein lies the insanity that is a BOM: it's a character that's meant to be invisible, even to your UTF-8 capable text editor. With this in mind, it's not clear what is supposed to be able to view or edit this character, short of a hex editor. Somebody else mentioned even cat ignores it; how are you supposed to easily tell that it's there or not?
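For what it's worth, you don't strictly need a hex editor to tell. A minimal sketch in Python, assuming all you want is a yes/no answer for the three signature bytes EF BB BF; the function names here are invented for the example.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

def file_has_utf8_bom(path: str) -> bool:
    """Read only the first three bytes of a file, in binary mode, and check them."""
    with open(path, "rb") as f:
        return has_utf8_bom(f.read(3))

print(has_utf8_bom(b"\xef\xbb\xbfhello"))  # True
print(has_utf8_bom(b"hello"))              # False
```

Opening the file in binary mode matters here: a text-mode read may decode the signature into an invisible U+FEFF character, or drop it entirely, depending on the codec.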
IMHO a BOM is completely ridiculous to have on a UTF-8 file, because UTF-8 has no endianness ambiguity for a BOM to resolve, and it breaks the principle that UTF-8 can serve as ASCII when all the code points are at or below 0x7F.
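The ASCII-compatibility point can be checked directly. A small sketch in Python, using the standard "utf-8-sig" codec to stand in for an editor that writes the signature; is_valid_ascii is a hypothetical helper, not anything from the article.

```python
def is_valid_ascii(data: bytes) -> bool:
    """Return True if every byte decodes as 7-bit ASCII."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True

text = "plain old text"
without_bom = text.encode("utf-8")      # byte-for-byte the same as ASCII
with_bom = text.encode("utf-8-sig")     # same text, EF BB BF in front

print(without_bom == text.encode("ascii"))  # True
print(is_valid_ascii(without_bom))          # True
print(is_valid_ascii(with_bom))             # False
```

So a file containing nothing but ASCII characters stops being a valid ASCII file the moment the signature is added, which is the incompatibility being complained about.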
Is there any point to using it in UTF-8 besides pain, suffering, and "well UTF-16 and UTF-32 have it"?
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."
That's right, the UTF-8 BOM is neither required nor recommended. I rest my case.
That isn't the case - I have an example right here of a large expensive application that uses an old Java XML parser (some ancient version of Xerces I think) that gets very upset if you feed it an XML document that has a BOM.
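When you can't fix the consumer, one workaround is to drop the signature before handing the bytes over. A minimal sketch in Python, assuming you control the step that reads the file; strip_utf8_bom is a made-up helper and has nothing to do with Xerces itself.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 signature, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

doc = codecs.BOM_UTF8 + b'<?xml version="1.0"?><root/>'

clean = strip_utf8_bom(doc)      # bytes, signature removed
text = doc.decode("utf-8-sig")   # str, signature skipped while decoding

print(clean)
print(text)
```

The "utf-8-sig" codec accepts input with or without the signature, so it is a reasonable default when you don't know which kind of file you were given.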
Because they just created a plain vanilla Windows install on a machine not connected to the net? And various pathological OEM apps still use something like a shell script?
(the reverse of this is kind of the only situation in which I use vi, which nonetheless I can use...)
Shit happens...
I mean, the degree to which users wind up wedded to the default applications in a given OS should not be underestimated or chalked up to "morons"...
I leave it to you, critics, to select something, anything, that is a fundamental tool available on every *nix machine, and just stop using it.
Hint: nobody but a purist is going to stop using `cat` or `cd` because of a minor incompatibility with a small percentage of computers.
I love Linux, I love Unix, but I also love that Windows had and still has Notepad. One of Microsoft's most successful apps in my eyes, that honestly works. Hell, in any case if it didn't have at least one flaw I'd start getting suspicious wondering if M$ really did write it after all.
Ok, it wasn't perfect, but the only time CR-LF was an issue was when moving files to Unix machines, so I made a habit of running dos2unix/unix2dos or Perl one-liners. I suppose I should retract my use of just works. Notice you capitalize it and I don't; I think that explains everything.
Just Works: No user intervention required whatsoever, could be used by a 2nd grader with no prior knowledge
just works: does its job with perhaps a bug or two that can be worked around
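For anyone without dos2unix/unix2dos at hand, the line-ending workaround mentioned above comes down to a byte replacement. A rough sketch in Python; the function names only borrow the tool names, this is not how the real utilities are implemented, and lone "\r" characters are deliberately left alone.

```python
def dos2unix(data: bytes) -> bytes:
    """Convert CRLF line endings to LF."""
    return data.replace(b"\r\n", b"\n")

def unix2dos(data: bytes) -> bytes:
    """Convert LF line endings to CRLF.

    The input is normalized to LF first so that existing CRLF pairs
    are not turned into CR CR LF.
    """
    return dos2unix(data).replace(b"\n", b"\r\n")

print(dos2unix(b"one\r\ntwo\r\n"))  # b'one\ntwo\n'
print(unix2dos(b"one\ntwo\r\n"))    # b'one\r\ntwo\r\n'
```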
I'd agree with you, if Notepad did in fact work. MS changed Paint for Windows 7, they could have more easily fixed Notepad. (Have you ever tried to open a file without \r's?)
They upgraded WordPad. I get the feeling Notepad was left alone because they did not feel it needed upgrading or debugging. WordPad's goal always was to be a document editor, while Notepad is a text editor, and it certainly edits text.
Seems like this could well be a classic case of one person saying 'This is a game-ending bug!' and the other saying, 'Wait, what? That 'bug' is in the spec!'
A long time ago, someone decided that if your file was 100% ASCII and you chose to save it as UTF-8 and you opened the file up again and added some >0x007f character and later saved again that you should not be prompted
The author makes this sound like an unreasonable request, but I don't see the problem. All ASCII text is also UTF-8; if a file is valid ASCII, it should be opened and saved in UTF-8 mode.
P.S. Isn't there some tool on UNIX that does this correctly?
I must confess I never faced this problem. What annoys me no end is Windows CRLF line endings.
That insanity must end.
Oh, and a simple solution to this problem is to ban Windows. It also solves a whole lot of other problems and creates an incentive for keeping sane corporate networks (with interoperable applications and so on).
I wonder if a crippled, totally brain-dead, but formally correct implementation is not Microsoft's way to discourage usage of a standard while adhering to it in order to get government business...
That dialog box is a linguistic nightmare. What's with the OK and Cancel buttons? I thought Windows was changing to "Save" and "Don't Save" like they should have done all along.
I think that dialog box is from his personal, private build that he mentions in the footnotes. Therefore, he probably doesn't really care how user friendly it is. I'm actually surprised it's as well written as it is.