What is the unit of a text column number? (foonathan.net)
59 points by amar-laksh on Oct 20, 2021 | 43 comments


> How many spaces is a tab? GCC seems to say “8”, for some reason

On old mechanical typewriters without adjustable tab stops (which were on the carriage, not the part you typed BTW, which skeuomorphically survived in Word’s tab stop interface), the tab key (those that had tab keys) slid the carriage left 8 spaces. You could often push it back a bit if you wanted. This carried over to TTYs which were grossly electromechanical devices and from there to glass teletypes and terminals.

So it’s the proper default.

Typewriters were pretty direct. Often they omitted the 1 and 0 digits (just use l and O). Of course there was not a VT or LF: you just stretched out your right hand and turned the platen. For FF you gripped the paper and pulled. On some models, to backspace you just pushed the carriage. And to delete (called rubout on some old terminals) you painted or XXXed out the offending text, or used a pen. Even on legal documents (which is why they still both write out and use digits to specify numbers).


I recollect a typing class (which did exist) where we were admonished never to just grip the paper and yank, but to pull the release lever and ease the sheet out. But people never did that in the movies, so it was a hard sell.

Which is to say there were movies where writers worked typewriters, generally just yanking a sheet, balling it up, and throwing it toward a wastebasket, and missing. Actual writing was too boring, even then, to film.


I wonder how many HN readers have no direct experience with a typewriter? Probably quite a few. This is a good description.

On at least some models (Royals, I think) there was a lever on the right that you would operate for CR (carriage return) and line feed. Pull it a little bit to the left and it would feed a single line; pull it the rest of the way and it would return the carriage. So you could do a couple of short pulls and advance a couple of lines more precisely than by yanking the paper.


Yanking was just to get the whole sheet out so you could ball it up and hurl it. The Carriage Return lever was there from earliest times.

It was decades later that they got a RETURN key (what they call Enter nowadays) that would do the LF/CR for you. Eventually there was a tape with white ink so when you backspaced, you could white out what you had mis-typed by mistyping it again! Before that, you would overtype a "/" and bull ahead. Or stop and (literally!) paint it over with a little brush.

The inventor of the little brush and the fast-drying paint became one of the first female business tycoons. (There were tycoons then, too. And impresarios. Dunno what happened to those.)


> The inventor of the little brush and the fast-drying paint became one of the first female business tycoons.

Bette Nesmith Graham indeed became a (minor) tycoon from Liquid Paper but died in her 50s.

Her son Michael Nesmith was a member of the Monkees and produced Repo Man. I've met him; he's a nice guy.


I’m 32 years old and have never used a typewriter in my life. So I’d imagine the majority of HN users haven’t either.


On the other hand, HN is full of people who a) type all day for a living and b) enjoy tech history and quirky things.

I'm 31 and have used at least 2 typewriters in my life - obviously at no point did I need to for either school or work, but there's something romantic about them and I often fantasise about getting one to write with... though I'm fairly sure it would end up gathering dust as a (possibly pretentious-seeming) ornament in the corner of the room, so I avoid the impulse to buy one.


You could just keep a nice portable one in its case and bring it out for special occasions.

Just the idea of a portable. Almost but not quite like a portable piano in my mind.


That would cover the "pretentious seeming" aspect, but my main reason is that I already have too many things like that which I don't really need, and I don't live in a huge mansion to keep accruing random stuff like that!



This is also where Windows CRLF for a new line comes from.

CR is Carriage Return, which on a typewriter you could do by moving the carriage back to the start of the line. Often there was a lever for this.

LF is Line Feed, accomplished by turning the platen. The horizontal position on the page stayed the same, just one line down vertically.

Having these separate made it possible to easily fill in tabular data and printed forms of various sorts. It also made it possible to type some (rather messy) accented characters by using CR and hitting space until you reached the character to accent, then typing an appropriate punctuation mark. A number of modern Compose Key sequences follow the same patterns, eg o" for ö.

Later typewriters added ways to do this automatically; eg the IBM Selectric had a "RETURN" key. This both returned the carriage to the start of the line and fed down one line. It was still possible to perform either action separately to fill in tables and other forms.


Windows did it because DOS did it, and DOS did it because almost everybody else did. IIRC Multics was the first to have just a newline character (^j) and Unix adopted it, along with case sensitivity etc etc. The Mac was the only system I ever saw to use ^M as newline.

Multics is also why @ is erase line and # erase character in the Unix tty driver. Oho, you’ve never seen that? It’s because disabling that became the default in the early 90s or perhaps even late 80s, but I’m pretty sure it’s still there. Probably made it into POSIX for back compatibility with some geriatric system.


^M for new line on the Mac was a holdover from the Apple ][, so you can blame Woz for that one. I don’t know about other contemporaneous 8-bit systems like Commodore or Tandy.
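For anyone who has to read files from all three camps (CRLF, bare LF, bare ^M), a minimal sketch of normalising them before splitting might look like this. The function name `split_lines` is made up for illustration.

```python
def split_lines(text: str) -> list:
    """Split text on CRLF (DOS/Windows), LF (Unix) or CR (classic Mac, Apple ][)."""
    # Replace CRLF first so it is not counted as two separate line breaks.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalised.split("\n")


print(split_lines("dos\r\nunix\nmac\rend"))  # ['dos', 'unix', 'mac', 'end']
```

The order of the two replacements matters: handling a bare CR first would turn every CRLF into two line breaks.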


I hate to say it, but Windows has the line breaks right. Why did Unix see fit to just type further and further to the right on a never ending paper?


If you want to get a newline without moving all the way to the left, you can use the vertical tab character.

No reason to complicate line endings for a rare circumstance like this.


That's not the Unix way. The Unix way is to compose crappy, half-assed solutions like CRLF from parts, rather than create dedicated solutions like VT.


Do one thing and do it well! There's no reason a line feed should also need to return the carriage. What is this, systemd-lf? What if I want to swap the carriage return implementation for the BSD one?


> Later typewriters added ways to do this automatically

All in at least the last 70 years did (I had a Remington that would be at least 70 years old now). There is a lever on manual typewriters on the right-hand side that you yank leftwards that does a CR and then a LF (or vice-versa, can't remember).


Edit: Thinking about it, the Remington would be about 90 years old now - it was ancient when I got it, and I'm quite ancient now. And mechanically, it must have been LF first.


Yep, and often that lever had two positions, so you yank it partway for a LF, and the rest of the way for a CR.

But the really early manual typewriters didn't have this. Typewriters are old: the first commercial one was invented in 1865 and manufactured in 1870, and the first commercially successful typewriter was manufactured in 1874. Neither had such a lever.


Also, some typewriters did not have !: you had to type ' BACKSPACE .

Vaguely related to people typing O for 0 and l for 1: UK number plates do not distinguish between 0 and O and between 1 and I so software that handles registration numbers must take account of that.
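A rough sketch of what "taking account of that" could look like: fold the look-alike characters into one canonical form before comparing. The helper name `plate_key` and the choice to fold letters into digits (rather than the other way round) are assumptions for illustration, not how any real registration system is known to work.

```python
# Map the letters that are indistinguishable on a plate onto their digit twins.
_LOOKALIKES = str.maketrans({"O": "0", "I": "1"})


def plate_key(plate: str) -> str:
    """Return a comparison key that ignores case, spaces and 0/O, 1/I differences."""
    return plate.upper().replace(" ", "").translate(_LOOKALIKES)


print(plate_key("ab1O cde") == plate_key("AB10 CDE"))  # True
```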


Also typewriters had a thing called 'pitch', which refers to the number of characters per horizontal inch. There were 10-pitch and 12-pitch typewriters, although both had the same height, 6 lines per inch vertically.


> Dare I say and ask: How many spaces is a tab?

I just checked (because I had a hazy memory of this doing something weird): tab characters printed raw to a PTY and looked at in a terminal emulator/tmux/etc., aren't any fixed number of spaces wide.

Rather, a tab, at least when rendered onto a PTY, advances the cursor to the next tabstop — i.e. to the next virtual column that is a multiple of 8.
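A minimal sketch of that rule, assuming fixed tab stops every 8 columns and one column per non-tab character (so no wide characters or escape codes). The name `visual_width` is just for illustration.

```python
def visual_width(line: str, tab_width: int = 8) -> int:
    """Return the 0-based column the cursor ends on after printing `line`."""
    col = 0
    for ch in line:
        if ch == "\t":
            # Jump to the next multiple of tab_width; a tab always moves
            # the cursor by at least one column.
            col += tab_width - (col % tab_width)
        else:
            col += 1
    return col


print(visual_width("\tx"))     # 9
print(visual_width("abc\tx"))  # 9: the tab is only 5 columns wide here
```

Under the same assumptions this agrees with `len(line.expandtabs(8))` from the standard library.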

Which has the dreadful implication that, to handle column prediction when there could ever be tabs in the input, you actually need to fully model the rendering behavior of a PTY.

And don't get me started on predicting the width of text printed raw containing ANSI escape codes!

The fact that libncurses works at all, while dealing with all of this, is a never-ending source of amazement to me.


> you actually need to fully model the rendering behavior of a PTY

I think 'input contains control sequences' can be filed away along with 'input contains vertical tab' as 'you're on your own, kid'.

Just handling newlines and tabs is relatively trivial. I think it's even a k&r exercise.

> libncurses works at all, while dealing with all of this

ncurses filters out escape sequences. You can, too (which is probably a nicer approach than attempting to model them).
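A hedged sketch of the filtering approach: the regex below only covers CSI sequences (ESC `[` ... final byte), which includes SGR colour codes; OSC and other escape families are not handled. The names `CSI_RE` and `strip_csi` are made up here and have nothing to do with ncurses internals.

```python
import re

# ESC [ then parameter bytes (0x30-0x3F), intermediate bytes (0x20-0x2F),
# and one final byte (0x40-0x7E).
CSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_csi(text: str) -> str:
    """Remove CSI escape sequences so only the printable text remains."""
    return CSI_RE.sub("", text)


print(repr(strip_csi("a\x1b[1;31mb\x1b[0mc")))  # 'abc'
```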


It filters out escape sequences on input, yes. But ncurses does generate its own escape sequences (obviously, that's how it does what it does); and so it still has to have a full internal PTY <-> what-a-terminal-emulator-would-do mapping model for what it itself has emitted to the screen, in order to be able to do "damage region" repainting correctly without e.g. suddenly making some text lose color because it jumped to a "display cell" that actually corresponds in the PTY stream to both the text and some SGR escape codes.


> suddenly making some text lose color because it jumped to a "display cell" that actually corresponds in the PTY stream to both the text and some SGR escape codes

Not how the terminal works. Try this:

  printf " \e[34mhiii\r\e[0mxx\n"
This:

- Prints a space

- Switches to blue

- Prints hiii

- Moves to the left of the screen

- Turns off all attributes

- Prints 'xx' (such that the cell where the colour started has definitely been overwritten)

This should result in a default-coloured 'xx' followed by blue 'iii'.

You do not need a complete model of the terminal. All you need is calculate dirty regions -> redraw them line-by-line in the naive manner.

SGR escapes aren't associated with specific cells; you can think of them as commands to the terminal. 'Set your default foreground colour to ...' They could as easily go along a separate channel (barring synchronization overhead, etc.).
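To make the "commands, not cell properties" point concrete, here is a toy one-line terminal model. It is only a sketch: it understands carriage return, newline and the basic SGR foreground codes (0 and 30-37), and everything else is treated as a printable character. `render_line` is a hypothetical name.

```python
import re

# Alternatives: an SGR sequence, a carriage return, a newline, any other char.
TOKEN_RE = re.compile(r"\x1b\[([0-9;]*)m|(\r)|(\n)|(.)", re.DOTALL)


def render_line(data: str) -> list:
    """Return the first screen row as a list of (char, foreground) cells."""
    cells = []  # contents of the row
    col = 0     # cursor column
    fg = None   # current foreground: terminal state, not stored per cell
    for match in TOKEN_RE.finditer(data):
        sgr, cr, lf, ch = match.groups()
        if sgr is not None:
            for param in (sgr or "0").split(";"):
                code = int(param or 0)
                if code == 0:
                    fg = None          # reset all attributes
                elif 30 <= code <= 37:
                    fg = code          # basic foreground colour
        elif cr:
            col = 0                    # back to the left edge, same row
        elif lf:
            break                      # only the first row is modelled
        else:
            while len(cells) <= col:
                cells.append((" ", None))
            cells[col] = (ch, fg)      # the cell takes the colour active now
            col += 1
    return cells


print(render_line(" \x1b[34mhiii\r\x1b[0mxx\n"))
# [('x', None), ('x', None), ('i', 34), ('i', 34), ('i', 34)]
```

The cell where the colour was switched on gets overwritten, yet the three remaining cells stay blue, because each cell recorded the colour at the moment it was written.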


> Rather, a tab, at least when rendered onto a PTY, advances the cursor to the next tabstop — i.e. to the next virtual column that is a multiple of 8.

Multiple of 8 is just convention, though. AFAIU, it's up to the display software--editor, terminal emulator, etc--where to place tab stops. The kernel TTY and PTY layers don't know anything about tab stops, AFAIU. In the case of terminals there are control sequences available for programs to query and update the tab stops; but all the logic is still at the ends. Contrast that with line buffering and certain escape sequences, where character sequences are captured and interpreted by the terminal device driver.

Also note that it's tab stops, plural. The default tab stops are multiples of 8, but you can tell a terminal (similar to other display software, like MS Word) to place tab stops at discrete positions that aren't multiples of anything, e.g. 10, 12, 18, 27, etc.
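A small sketch of tabbing with explicit stops. One assumption to flag loudly: when there is no stop left, this moves the cursor to the right margin (`last_col`), which is my understanding of VT100-style behaviour but may differ between emulators. `next_tab_stop` is a made-up name.

```python
from bisect import bisect_right
from typing import Sequence


def next_tab_stop(col: int, stops: Sequence[int], last_col: int = 79) -> int:
    """Return the column a tab moves the cursor to, given sorted tab stops."""
    i = bisect_right(stops, col)  # index of the first stop strictly right of col
    # Assumption: with no stop remaining, the cursor goes to the right margin.
    return stops[i] if i < len(stops) else last_col


stops = [10, 12, 18, 27]
print(next_tab_stop(0, stops))   # 10
print(next_tab_stop(12, stops))  # 18
print(next_tab_stop(30, stops))  # 79
```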


When we were taught to type, we used long-stroke hammer typewriters. The tab stop was a bit of metal with notches & clips. You set the tab stop with a ruler, and each time you pressed tab it went to next stop. If your tab stop was worn out, the carriage would go right past the stop!


The Unix kernel has tab knowledge - see tcsetattr(), specifically the c_oflag OXTABS bit:

If OXTABS is set, tabs are expanded to the appropriate number of spaces (assuming 8 column tab stops).

(XTABS if you're not using a real unix.)


Ok, now do vertical tab


> grapheme clusters [...] portable

If only. Boundary locations are dependent on unicode version. If your terminal uses one and your compiler another—boom.

  * * *
In my C compiler I distinguish 'acol' and 'vcol'. The former stands for 'actual column', and the latter for 'visual column'. The actual column is a byte offset, which can be used to identify the offending location in source, while the visual column represents a physical offset in glyph-widths.

The issue given in TLA of tab widths being different can be resolved by making the compiler expand the tabs itself. This does nothing for e.g. emoji, if there is disagreement about the version of unicode in use.

Vim seems to do something similar. Given 'set ruler', if I type a tab followed by an em dash, I am told that I am in column '5-10'. 5 is 3 bytes from the em dash (in utf8) and 1 byte from the tab, plus 1. 10 is 8 glyph-widths from the tab and 1 glyph-width from the em dash, plus 1.
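The same arithmetic as a sketch, assuming UTF-8, 8-column tab stops, and Python's `unicodedata` tables for width (wide and full-width characters count as 2, combining marks as 0, everything else as 1). `columns_after` is a hypothetical helper, not anything from Vim or my compiler.

```python
import unicodedata


def columns_after(text: str, tab_width: int = 8) -> tuple:
    """Return the 1-based (actual, visual) column just past `text`."""
    acol = 0  # bytes consumed so far
    vcol = 0  # glyph-widths consumed so far
    for ch in text:
        acol += len(ch.encode("utf-8"))
        if ch == "\t":
            vcol += tab_width - vcol % tab_width
        elif unicodedata.combining(ch):
            pass  # combining marks occupy no column of their own
        elif unicodedata.east_asian_width(ch) in ("W", "F"):
            vcol += 2
        else:
            vcol += 1
    return acol + 1, vcol + 1


print(columns_after("\t\u2014"))  # (5, 10): tab followed by an em dash
```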

  * * *
However, my approach to error-handling should perhaps not be taken as representative. E.G.

> But nobody looks at an error message and manually navigates to the location using the column information!

I do. And I also dislike rustc's error messages, which apparently receive universal acclaim.


I want to know more about why you dislike rustc error messages. I always found them legible, precise and helpful at the same time.


They are exceedingly verbose.

The most important part of a compiler error, to me, is location information. Use of colour (and possibly other formatting) to draw the eye directly to the site of the error is valuable. Sometimes this is enough, and the issue immediately becomes obvious. Sometimes it is not.

In the case where it is, and there are multiple errors, the proliferation of text obscures the display of code itself. I can scroll, sure; but the human-computer communication channel has high latency, so it is desirable to minimize synchronization overhead and round-trips.

In the case where it is not, my main priority is to identify the problem. I don't want to know why some form is problematic, I want to know which problem it is. With terse error messages there's a decent chance I'll memorize them over time, and recognise a given error message all at once by shape, rather than reading them word by word. Just as I memorize words and recognise them all at once, by shape, rather than letter by letter.

  * * *
This conflict of 'why' vs 'which' is didactic. An inexperienced programmer might in fact want to know why their code is not working. But my own experience at this point is that, if it is not presently clear what is wrong with my code, what I need to study is my own code, not the error message.


> This does nothing for e.g. emoji, if there is disagreement about the version of unicode in use.

Incidentally, this can be 'corrected' by making the compiler reset the cursor position to where it thinks it should be after drawing each grapheme. Is that really better than the alternative? Can we just kill this infrastructure already? ¯\_(ツ)_/¯


Two things:

1) I recently implemented truncating long lines for a CLI tool and went with a hybrid approach using both graphemes and virtual columns -- I'd only truncate a line between graphemes, but when counting how much space was used up I would use virtual columns. In the case of something like , this means things tend to err on the "safe" side of truncating a line too early if the virtual column approach counts it as 4 columns wide rather than 2.

2) I wanted to test something with the scientist emoji and managed to crash Ruby's repl, irb, simply by pasting it into the repl and then backspacing over it. (It was clearly confused about the position of the cursor, and the stacktrace pointed to an error in a line_editor.rb file.) I was on Ruby 2.7.1, but it looks like it's been fixed in 3.0.0!
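A rough sketch of the hybrid approach from point 1. Big caveat: the standard library has no real grapheme segmentation, so `clusters` below is a crude stand-in that only glues combining marks, variation selectors, skin tone modifiers and ZWJ-joined code points onto the previous cluster. All three function names are invented for this example.

```python
import unicodedata

ZWJ = "\u200d"


def clusters(text: str) -> list:
    """Crude approximation of grapheme clusters (not full Unicode segmentation)."""
    out = []
    for ch in text:
        extends = (
            unicodedata.combining(ch) != 0
            or ch == ZWJ
            or "\ufe00" <= ch <= "\ufe0f"          # variation selectors
            or "\U0001f3fb" <= ch <= "\U0001f3ff"  # skin tone modifiers
        )
        if out and (extends or out[-1].endswith(ZWJ)):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def width(cluster: str) -> int:
    """Sum per-code-point widths; overestimates ZWJ sequences on purpose."""
    total = 0
    for ch in cluster:
        if unicodedata.combining(ch) or ch == ZWJ or "\ufe00" <= ch <= "\ufe0f":
            continue
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


def truncate(text: str, max_cols: int) -> str:
    """Cut only between clusters, budgeting space in virtual columns."""
    kept, used = [], 0
    for cluster in clusters(text):
        w = width(cluster)
        if used + w > max_cols:
            break
        kept.append(cluster)
        used += w
    return "".join(kept)


scientist = "\U0001F9D1\u200D\U0001F52C"  # person + ZWJ + microscope
print(width(scientist))                          # 4
print(truncate("ab" + scientist + "cd", 5))      # ab
```

Here the scientist is counted as 4 columns, so a 5-column budget drops it even though most terminals would draw it in 2. That is the early truncation described above.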


> Most Chinese, Japanese or Korean characters are rendered twice as wide as most other characters, even in a monospace font.

Not just CJK characters, but also a lot of non-Latin characters and symbols (a canonical example being ↑). In the East Asian Width standard [1] they are classified as "ambiguous", which can be half-width or full-width depending on the user's choice. (By the way, thank you very much for pointing this out; this is super non-obvious to non-CJK developers and consequently affects CJK developers!)

[1] https://unicode.org/reports/tr11/


> a canonical example being ↑

Renders as a single column for me, in a terminal, on both Terminator under Linux & iTerm2 under OS X. (And in the monospace font on the browser, too.)


It greatly depends on both the font and user settings. As an example of the latter, iTerm2 has the relevant configuration in the session preferences: "Ambiguous characters are double-width".
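A minimal sketch of how such a setting might feed into a width calculation, using the East Asian Width property from Python's `unicodedata`. The function `cell_width` and its `ambiguous_wide` flag are hypothetical, loosely mirroring the iTerm2 preference; real terminals consult more than this one property.

```python
import unicodedata


def cell_width(ch: str, ambiguous_wide: bool = False) -> int:
    """Return the number of terminal columns assumed for a single character."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):  # wide or full-width
        return 2
    if eaw == "A":         # ambiguous: depends on the user's setting
        return 2 if ambiguous_wide else 1
    return 1


print(unicodedata.east_asian_width("\u2191"))        # A (the up arrow)
print(cell_width("\u2191"))                          # 1
print(cell_width("\u2191", ambiguous_wide=True))     # 2
```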


If you're measuring text for terminal display, you might like my "widecharwidth" library. It tries to be what wcwidth should have been.

http://github.com/ridiculousfish/widecharwidth


> Emojis easily combine many many code points into one symbol. It begins with flag emojis such as , which is actually a special “E” followed by “U”, continues with emojis such as (scientist), which is (person) glued together with (microscope) using a special joiner code point, and ends at the absolute pinnacle of code point combinations - the family emoji . How do you make a family? Easy, you take a person (with optional skin tone and gender modifier) and glue it with another person, as well as their children. That way you can easily end up with a single “character” consisting of ten or more code points!

Why do we create unnecessary complexities and then refuse to dismantle them when we start having problems with them? We're so unable to admit mistakes?
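For reference, the counts in the quoted passage are easy to check. The sketch below spells the sequences with escapes (since HN strips the emoji themselves) and uses a family of four without modifiers; each optional modifier would add another code point. `describe` is just an illustrative helper.

```python
flag = "\U0001F1EA\U0001F1FA"              # regional indicators E + U
scientist = "\U0001F9D1\u200D\U0001F52C"   # person + ZWJ + microscope
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # man, woman, girl, boy


def describe(s: str) -> tuple:
    """Return (code points, UTF-8 bytes) for a string."""
    return len(s), len(s.encode("utf-8"))


print(describe(flag))       # (2, 8)
print(describe(scientist))  # (3, 11)
print(describe(family))     # (7, 25)
```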


We’re unable to fix mistakes that have become too widespread. HN strips emoji, afaik. Here is the unicorn: .


The title maybe should be enhanced to: "What is the Unicode unit of a text column number?"


There isn't one. Unless you're using a fixed-width font, character placement doesn't equal the human placement (length); and if you're limited by technical means, you usually care about the raw length (in bytes/octets) of the underlying encoding.

In both cases, for anyone who cares about the limit, the starting byte/octet offset of the next character is probably sufficient: update it as you move through the text, jumping multiple units when the traversed element is wide.
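A sketch of the byte-offset stepping, assuming the buffer is valid UTF-8 and that "character" means a single code point (not a grapheme cluster). `next_char_start` is a made-up name.

```python
def next_char_start(buf: bytes, i: int) -> int:
    """Return the byte offset of the code point after the one starting at i."""
    i += 1
    # UTF-8 continuation bytes look like 10xxxxxx; skip over them.
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return i


buf = "a\u2014b".encode("utf-8")  # 'a', em dash (3 bytes), 'b'
print(next_char_start(buf, 0))  # 1
print(next_char_start(buf, 1))  # 4
print(next_char_start(buf, 4))  # 5
```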



