> some languages are insanely complex to implement I don't understand this, is t...

dalke · on March 17, 2015

Languages aren't all just simple alphabets like English. Some languages use ligatures to combine characters. In English, ligatures like 'fi' and 'ffl' can be applied almost automatically and are optional, but other languages have stronger and more important rules.

As a simple example, in German the ligature ß is not a simple ligature for 'ss' but a combination of two previous ligatures: long s with round s ("ſs") and long s with (round) z ("ſʒ"). Various spelling reforms have simplified the orthography, but "Maßen" and "Massen" are still different words.
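One way to see the difference between an optional typographic ligature and ß is to run both through Unicode normalization. This is only a minimal sketch using Python's standard `unicodedata` module; the variable names are mine, chosen for illustration.

```python
import unicodedata

# Compatibility normalization (NFKC) unfolds purely typographic ligatures.
fi = unicodedata.normalize("NFKC", "\ufb01")   # U+FB01 LATIN SMALL LIGATURE FI
ffl = unicodedata.normalize("NFKC", "\ufb04")  # U+FB04 LATIN SMALL LIGATURE FFL

# ß has no decomposition, so normalization leaves it untouched.
eszett = unicodedata.normalize("NFKC", "ß")

# As plain strings the two German words are distinct...
words_differ = "Maßen" != "Massen"

# ...but case folding maps ß to "ss" and erases the distinction.
folded_equal = "Maßen".casefold() == "Massen".casefold()

print(fi, ffl, eszett, words_differ, folded_equal)
```

So the 'fi' ligature can be treated as a rendering choice, while ß is a letter in its own right that software can still collapse by accident when it folds case.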

Quoting from a Wikipedia page, "Urdu (one of the main languages of South Asia), which uses a calligraphic version of the Arabic-based Nasta`liq script, requires a great number of ligatures in digital typography. InPage, a widely used desktop publishing tool for Urdu, uses Nasta`liq fonts with over 20,000 ligatures"

Then there are rules for presentation. "Complex text layout ... refers to the typesetting of writing systems in which the shape or positioning of a grapheme depends on its relation to other graphemes." - http://en.wikipedia.org/wiki/Complex_text_layout . Cursive English is the closest we have to complex text layout; while there are "cursive" fonts where each of the characters is in cursive, the letters don't merge. Now imagine a language where smooth connections and fancy curlicues in the "right" places were essential for being seen as erudite, and where "right" depended on 5 years of learning.
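Arabic script is a concrete case of shape depending on neighbours. Unicode carries legacy "presentation form" code points for the positional shapes of each letter, and they all normalize back to the one abstract letter. A small sketch with Python's standard `unicodedata` module, using the letter beh as my own choice of example (`BEH_FORMS`, `describe` and `beh_report` are illustrative names):

```python
import unicodedata

# The four legacy presentation forms of ARABIC LETTER BEH (U+0628).
BEH_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

def describe(ch):
    """Return (official Unicode name, character after NFKC normalization)."""
    return unicodedata.name(ch), unicodedata.normalize("NFKC", ch)

beh_report = [describe(ch) for ch in BEH_FORMS]
for name, base in beh_report:
    # Each positional shape folds back to the same base letter.
    print(name, "->", f"U+{ord(base):04X}")
```

Text is normally stored as the base letter only; picking the isolated, initial, medial or final shape is left to the rendering stack.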

peterfirefly · on March 17, 2015

Yes, if the way the characters look depends on the other characters in the word including some that are nowhere near being neighbours. Especially if there were weird and complex rules about how this changed depending on the kind of word being written.

There are a number of writing systems that are evil like that, including several Indic ones.

The code that handles this complicated process is often called a shaper. Choosing and combining the correct glyphs involves a complicated dance between that and the font(s), possibly including large tables (and code!) in the font itself on top of what the shaper does.
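To give a feel for the simplest decision a shaper makes, here is a toy sketch in Python. It assumes every letter joins to both of its neighbours, which is not true of real Arabic (some letters never join to the following letter), and it ignores fonts entirely. The function name `joining_forms` is hypothetical; real shapers work from Unicode joining data plus the tables in the font.

```python
def joining_forms(word):
    """Pick a positional form for each letter of `word`, in logical order.

    Toy assumption: every letter connects to both of its neighbours.
    """
    last = len(word) - 1
    forms = []
    for i, _ in enumerate(word):
        if last == 0:
            forms.append("isolated")  # a letter standing alone
        elif i == 0:
            forms.append("initial")   # first letter of the word
        elif i == last:
            forms.append("final")     # last letter of the word
        else:
            forms.append("medial")    # joined on both sides
    return forms

print(joining_forms("ب"))
print(joining_forms("ببب"))
```

Even this toy version shows that the glyph for a letter cannot be chosen by looking at that letter alone.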

exelius · on March 17, 2015

It really depends on the language, and it's not totally about the glyphs. Text entry is a huge challenge for languages like Mandarin where glyphs can have multiple pronunciations and meanings depending on context. Consider that Mandarin (which shares many, but not all glyphs with Japanese kanji) has upwards of 20,000 different glyphs, and that other languages have a similar level of complexity, and it becomes hard to find an encoding standard capable of handling all of that complexity and variance.

What constitutes a "glyph" isn't even consistent - in some languages a glyph is a syllable, in some (like English) it's less than a syllable, and in yet others a single glyph can be an entire word.

In a language like Japanese, multiple glyphs are often combined to create new composite glyphs with different meanings. For example, the word for "forest" is a glyph composed of 3 "tree" glyphs, but has an unrelated pronunciation.

How do you handle text entry between these differences? It may seem like a pedantic question, but it makes sense to define the characters in the way they will be written, or else the text entry scheme will be so complex you'll need an interpreter to convert from some entry scheme into the Unicode format. I think this is the problem the Unicode Consortium is grappling with - and it's not an easy problem. I don't claim to have the answers here; but I do recognize the complexity.

beat · on March 17, 2015

User interface isn't the problem, though - bitwise representation is the problem. How do we represent all the valid characters in Unicode? Data entry is an entirely separate issue (as is display).
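For a sense of what "bitwise representation" means in practice, this small Python sketch encodes a few single code points and counts the bytes each one takes in two common encodings. The sample characters and the helper name `encoded_sizes` are my own picks for illustration.

```python
# One code point each: ASCII letter, German eszett, CJK "forest", an emoji.
samples = ["A", "ß", "森", "\U0001F600"]

def encoded_sizes(ch):
    """Return (bytes in UTF-8, bytes in UTF-16 little-endian) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

sizes = {ch: encoded_sizes(ch) for ch in samples}
for ch, (utf8_len, utf16_len) in sizes.items():
    print(f"U+{ord(ch):04X}: {utf8_len} bytes in UTF-8, {utf16_len} bytes in UTF-16")
```

The same abstract character costs a different number of bytes depending on the encoding, which is a separate question from how it is typed or drawn.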

Crito · on March 17, 2015

http://en.wikipedia.org/wiki/Complex_text_layout

Hypothetically you could construct a language where the glyphs are easy to generate procedurally on the fly by people who are fluent in that language, but whose full space of possible glyphs is staggeringly massive.

Suppose a language with tens or hundreds of thousands of "base" glyphs, but with a unique variant on each glyph depending on what is to the left and right of it. With that alone, for N base glyphs, you could have N^2 variants of each glyph.
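The arithmetic for that thought experiment can be sketched in a few lines of Python. The figure of 10,000 base glyphs is just an assumed value for illustration, and `contextual_variants` is a hypothetical helper; the size of the Unicode code space (U+0000 through U+10FFFF) is the only real constant here.

```python
UNICODE_CODE_SPACE = 0x10FFFF + 1  # 1,114,112 possible code points

def contextual_variants(n_base):
    """Count shapes if each glyph varies with its left and right neighbour."""
    per_glyph = n_base ** 2       # n_base choices on the left times n_base on the right
    total = n_base * per_glyph    # across all base glyphs
    return per_glyph, total

per_glyph, total = contextual_variants(10_000)
print(per_glyph, total, total > UNICODE_CODE_SPACE)
```

With those assumptions the total number of distinct shapes dwarfs the code space, so such a script would have to encode only the base glyphs and compute the variants at display time.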

I don't know if that sort of language exists. I don't see any reason why it couldn't though.