Stop using Windows Notepad (msdn.com)
51 points by nreece on April 1, 2010 | 34 comments


So there is this character, the BOM, which is explicitly allowed by the UTF-8 standard. And while it is neither necessary nor recommended, you are still allowed to put it in your UTF-8 text files to signal that the file is encoded in UTF-8. Then there are all of these *nix programs that don't know how to deal with BOMs, simply because the first character in the file they're reading isn't the shebang or <?php or something like that, but is instead this completely allowable BOM. And everybody believes MS should be fixing their product? Am I missing something?


I think the author simply believes people should not be using Windows notepad to edit scripts intended to be run on *nix. I can't think of many situations in which it would make sense to do that anyway.


http://plan9.bell-labs.com/sys/doc/utf.html

Doesn't say anything about BOM.


"In UTF-8, the BOM corresponds to the byte sequence <EF₁₆ BB₁₆ BF₁₆>. Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked."

http://www.unicode.org/versions/Unicode5.2.0/ch16.pdf
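For anyone who wants to check a file for that signature, here is a minimal Python sketch. The helper names `has_utf8_bom` and `strip_utf8_bom` are made up for illustration; `codecs.BOM_UTF8` is the standard library constant for the bytes EF BB BF.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 signature if present; otherwise return data unchanged."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


# A shell script saved with a signature: the first two bytes are no longer "#!".
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"
print(has_utf8_bom(script))
print(strip_utf8_bom(script)[:2])
```

Working on bytes rather than decoded text keeps the check independent of whatever the rest of the file contains.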


You can always trust a standards body to screw something up.


This is like one of those moments where you are quickly brought back to a painful memory from a more troubled time in your life, so distant in mindset, if not time itself, that it has faded to be less a memory of life itself than that of a bad dream.

You shake it off, and it recedes back into the cobwebs, not to be jarred again until the next time something flies by in your RSS feed about a new Microsoft product or feature that is a poorly conceived sugar-coated re-implementation of cron, grep, sed, bash, vim, awk, ssh, unix pipes, or lisp.

Yes, I too was once a full-time Windows user.


Ok, but everything handles the UTF-8 "BOM" just fine. Adding it only causes problems when you ./file.sh that has a BOM, and haven't set up your binfmt_misc magic properly. Fix that, and it works fine.


  Ok, but everything handles the UTF-8 "BOM" just fine.
For small values of "everything". I can't even count the number of bizarre errors I encounter, only to discover that one of the files I'm working on has been corrupted by somebody carelessly using Notepad. Usually it's not obvious what's going on -- the error will be something like "Error parsing file: illegal byte before start of message body", which provides not much help when it's 23:30 and I'm staring at a text editor wondering what the hell byte it's talking about.


And therein lies the insanity that is a BOM: it's a character that's meant to be invisible, even to your UTF-8 capable text editor. With this in mind, it's not clear what is supposed to be able to view or edit this character, short of a hex editor. Somebody else mentioned even cat ignores it; how are you supposed to easily tell that it's there or not?

IMHO a BOM is completely ridiculous to have on a UTF-8 file, because UTF-8 has no endianness ambiguity for a BOM to resolve, and it breaks the principle that UTF-8 can serve as ASCII when all the code points are at or below 0x7F.

Is there any point to using it in UTF-8 besides pain, suffering, and "well UTF-16 and UTF-32 have it"?

The answer, straight from the horse's mouth.

http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf , pg. 36

"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."

That's right, the UTF-8 BOM is neither required nor recommended. I rest my case.
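The ASCII-compatibility point can be checked in a few lines of Python. This is only an illustrative sketch: `is_ascii_compatible` is a hypothetical helper, and Python's `utf-8-sig` codec stands in for an editor that writes the signature.

```python
def is_ascii_compatible(data: bytes) -> bool:
    """Return True if every byte decodes as plain ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


text = "echo hello\n"
plain = text.encode("utf-8")       # no signature
signed = text.encode("utf-8-sig")  # signature prepended

print(plain == text.encode("ascii"))   # same bytes as ASCII
print(is_ascii_compatible(plain))
print(is_ascii_compatible(signed))     # the three signature bytes are not ASCII
```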


"cat -v" saved me a lot of time in cases like that.

I had a similar problem with nbsp, until I moved nbsp from the second level (shift+space) to the third level in the XKB configuration on my Gnome/Fedora/Linux system.


It would be nice if everything did, but some major pieces of software don't. For example, the Java I/O library:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058

This is also one of those problems that won't be fixed, in order to preserve (bug) compatibility.
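The same kind of behaviour is easy to reproduce outside Java. The sketch below is Python, not the Java library discussed above, and the sample bytes are invented: Python's plain `utf-8` codec keeps the signature as a leading U+FEFF character, while the separate `utf-8-sig` codec drops it.

```python
raw = b'\xef\xbb\xbf<?xml version="1.0"?><doc/>'

naive = raw.decode("utf-8")      # signature survives as a U+FEFF character
aware = raw.decode("utf-8-sig")  # signature is consumed by the codec

print(naive.startswith("\ufeff"))
print(aware.startswith("<?xml"))

# utf-8-sig also accepts input without a signature, so it is safe as a default reader.
print(b"<doc/>".decode("utf-8-sig"))
```

A parser that insists on seeing `<?xml` as the very first characters would reject the naive result, which is roughly the failure mode described in the bug report.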


OK, but most of the UNIX toolchain handles it just fine. Even cat knows not to output the BOM to the terminal :)

Emacs and Perl are also fine. What more do you need?


Does cat really? The BOM character just happens to also be the "zero-width non-breaking space", so it should be invisible when printed.
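A quick way to confirm what the character is, and to make it visible, is sketched below in Python. `reveal` is a hypothetical helper that plays a role loosely similar to `cat -v`; it is not a reimplementation of it.

```python
import unicodedata

BOM = "\ufeff"


def reveal(text: str) -> str:
    """Return the text with every non-ASCII character shown as an escape sequence."""
    return ascii(text)


print(unicodedata.name(BOM))  # the character's official Unicode name
print(len(BOM + "hi"))        # it counts as a character even though it renders as nothing
print(reveal(BOM + "hi"))     # the escape makes it visible
```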


Despite running under a form of Unix, my iPod is crippled in that I cannot run a "cat nasty_notepad_corrupted_file.txt | od" to check.

Should be simple.


That isn't the case - I have an example right here of a large expensive application that uses an old Java XML parser (some ancient version of Xerces I think) that gets very upset if you feed it an XML document that has a BOM.


It's always easy to upset an XML parser...


Hmm. I am as bewildered by this as the author. Why would anyone edit shell scripts with notepad?


A terrifying fear of vim modes, maybe. I guess people stick with what they're used to. I don't know, I'm trying here, I really am.


joe and nano may be your friends.

If one is afraid of Vim, I would not recommend Emacs.


Because they just created a plain vanilla Windows install in a machine not connected to the net? And various pathological EOM apps still use something like a shell script?

(the reverse of this is kind of the only situation in which I use vi, which nonetheless I can use...)

Shit happens...

I mean, the degree to which users wind up wedded to the default applications in a given OS should not be underestimated or chalked up to "morons"...


The only excusable reason I can think of is a "one-off" edit on a foreign computer that doesn't already have your editor of choice installed.


The real question is why would anyone edit shell scripts under Windows?


I do it sometimes, when my job doesn't allow other OSes. But I install GVim. :p


I agree with the author. Right or wrong, Notepad:

1: is on every Windows machine ever

2: just works

I leave it to you, critics, to select something, anything, that is a fundamental tool available on every *nix machine, and just stop using it.

Hint: nobody but a purist is going to stop using `cat` or `cd` because of a minor incompatibility with a small percentage of computers.

I love Linux, I love Unix, but I also love that Windows had and still has Notepad. One of Microsoft's most successful apps in my eyes, that honestly works. Hell, in any case if it didn't have at least one flaw I'd start getting suspicious wondering if M$ really did write it after all.


Does Notepad support large files nowadays, and line endings other than CR-LF? Because for a long time Notepad definitely did not Just Work.


Ok, it wasn't perfect, but the only time CR-LF was an issue was when moving files to Unix machines, so I made a habit of running dos2unix/unix2dos or Perl one-liners. I suppose I should retract my use of "just works". Notice you capitalize it and I don't; I think that explains everything.

Just Works: No user intervention required whatsoever, could be used by a 2nd grader with no prior knowledge

just works: does its job with perhaps a bug or two that can be worked around
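For reference, the line-ending round trip mentioned above can be sketched in a few lines of Python. The function names `to_unix` and `to_dos` are made up, and this only covers the basic CR-LF/LF swap, not everything the real dos2unix and unix2dos tools do.

```python
def to_unix(data: bytes) -> bytes:
    """Convert CR-LF line endings to LF."""
    return data.replace(b"\r\n", b"\n")


def to_dos(data: bytes) -> bytes:
    """Convert LF line endings to CR-LF without doubling existing CR-LF pairs."""
    return to_unix(data).replace(b"\n", b"\r\n")


sample = b"first line\r\nsecond line\r\n"
print(to_unix(sample))
print(to_dos(to_unix(sample)) == sample)
```

Normalizing to LF first inside `to_dos` keeps the conversion safe on files with mixed line endings.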


I'd agree with you, if Notepad did in fact work. MS changed Paint for Windows 7, they could have more easily fixed Notepad. (Have you ever tried to open a file without \r's?)


They upgraded Wordpad. I get the feeling Notepad was left alone because they do not feel it needed upgrading or debugging. Wordpad's goal always was to be a document editor, while Notepad is a text editor, and it certainly edits text.

Seems like this could well be a classic case of one person saying 'This is a game-ending bug!' and the other saying, 'Wait, what? That "bug" is in the spec!'


  A long time ago, someone decided that if your file was 100% ASCII and you chose to save it as UTF-8 and you opened the file up again and added some >0x007f character and later saved again that you should not be prompted
The author makes this sound like an unreasonable request, but I don't see the problem. All ASCII text is also UTF-8; if a file is valid ASCII, it should be opened and saved in UTF-8 mode.

  P.S. Isn't there some tool on UNIX that does this correctly?
Yes, nearly all of them.


I haven't been to his blog in a while, but Raymond Chen has made various elucidating comments pertaining to Notepad, over the years.

http://www.google.com/#hl=en&source=hp&q=site%3Ablog...


I must confess I never faced this problem. What annoys me no end is Windows' CRLF line endings.

That insanity must end.

Oh.. And a simple solution to this problem is simply to ban Windows. It also solves a whole lot of other problems and creates an incentive for keeping sane corporate networks (with interoperable applications and so on).


I wonder if a crippled, totally brain-dead, but formally correct implementation is not Microsoft's way to discourage usage of a standard while adhering to it in order to get government business...

POSIX subsystem, anyone?


That dialog box is a linguistic nightmare. What's with the OK and Cancel buttons? I thought Windows was changing to "Save" and "Don't Save" like they should have done all along.


I think that dialog box is from his personal, private build that he mentions in the footnotes. Therefore, he probably doesn't really care how user friendly it is. I'm actually surprised it's as well written as it is.



