Han unification makes things *really* hard for programmers. You end up with code...

cplease · on March 17, 2015

You don't determine language based on codepage. I give you ASCII text; what language is it?

com2kid · on March 17, 2015

BINGO.

But because of Han unification I all of a sudden DO need to know the language.

The same Unicode code point needs to be rendered differently for a user in Mainland China versus a user in Japan or else the user may not be able to read the text! Even if the user can read the character, they are going to experience a degradation in reading speed and comprehension, and be generally frustrated. Not to mention showing the wrong character is insensitive to the customer's culture, and if I pick and stick to just one set of characters, I end up (being accused, at least, of) promoting cultural hegemony based on which character set I go with.

astrange · on March 17, 2015

In what situations do you need to do this, but don't need to show any other data (dates and times, localized UI, user timezone, culturally appropriate fonts, RTLness) that involves knowing the user's languages and locale?

This can happen if the user is intentionally reading mixed-language text or text not in their computer's UI language, of course. In that case different CJK languages also have different preferred fonts, so having language tagging or just guessing is pretty important.

com2kid · on March 17, 2015

> In what situations do you need to do this, but don't need to show any other data (dates and times, localized UI, user timezone, culturally appropriate fonts, RTLness) that involves knowing the user's languages and locale?

For drawing a given glyph, there is normally a lookup into a font table that involves solely the string of Unicode code points coming in.

Except if any characters are in the CJK Unified Ideographs range. Then my function call suddenly has to jump out to read environment variables, which are hopefully set up correctly.

My code to do a lookup into a font file should not depend upon the user's environment variables due to a space-saving optimization made two decades ago.
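A minimal Python sketch of the extra branch being described. It assumes only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) matters and ignores the extension blocks; the function names are made up for illustration.

```python
# Hypothetical helper names; only the basic CJK Unified Ideographs block is checked.
CJK_UNIFIED_START = 0x4E00
CJK_UNIFIED_END = 0x9FFF


def is_unified_ideograph(ch: str) -> bool:
    """True if ch falls in the basic CJK Unified Ideographs block."""
    return CJK_UNIFIED_START <= ord(ch) <= CJK_UNIFIED_END


def needs_language_hint(text: str) -> bool:
    """True if picking the expected glyphs for text needs the language."""
    return any(is_unified_ideograph(ch) for ch in text)
```

For plain ASCII such as "goto" this returns False, while a string containing U+76F4 (直) returns True, which is the point where the lookup stops being a pure function of the code points.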

astrange · on March 17, 2015

> For drawing a given glyph, there is normally a lookup into a font table that involves solely the string of Unicode code points coming in.

Why are you implementing OpenType? It's got working libraries already.

But if you are getting into that, glyphs in a font are stored by "glyph name", not necessarily by code point. There's a bunch more steps than that.

- Font substitution: Find fonts that cover every character in the text. The order of your search list depends on the language.

- Text layout and line breaking: for best results, you don't want to line break in the middle of a word, and you need to place punctuation on the correct side of right-to-left sentences. I think both of these need dictionaries.

- Choosing individual glyphs: it's complicated! http://ilovetypography.com/OpenType/opentype-features.html

You have to read the GSUB tables and do a bunch of expected features, like ligatures, automatic fractions, beginning of word special forms (see Zapfino), &c. This includes language specific glyphs, but fonts can also just choose glyphs with a random number generator.

- Drawing the glyph. Remember not to draw each one individually, or a translucent line of overlapping characters (like in Indian languages) will look bad.

Each glyph actually comes with a custom program to do the hinting! It's even more complicated: https://developer.apple.com/fonts/TrueType-Reference-Manual/...

Luckily I don't think it depends on much external state.
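The font substitution step in the list above is where the language enters. A rough Python sketch, with everything invented for illustration: the font names are placeholders and the coverage sets are toy data, whereas a real implementation would read each font's character map through an existing library.

```python
# Toy data: placeholder font names mapped to the characters each one covers.
COVERAGE = {
    "LatinFont": set("abcdefghijklmnopqrstuvwxyz "),
    "JapaneseFont": {"\u76f4", "\u306e"},  # 直, の
    "ChineseFont": {"\u76f4", "\u4eec"},   # 直, 们
}

# Hypothetical per-language search order for fallback fonts.
SEARCH_ORDER = {
    "ja": ["LatinFont", "JapaneseFont", "ChineseFont"],
    "zh": ["LatinFont", "ChineseFont", "JapaneseFont"],
}


def pick_fonts(text, lang):
    """Return (char, font) pairs; font is None when nothing covers char."""
    result = []
    for ch in text:
        # First font in the language's search order that covers the character.
        font = next((f for f in SEARCH_ORDER[lang] if ch in COVERAGE[f]), None)
        result.append((ch, font))
    return result
```

With this table the same code point U+76F4 resolves to JapaneseFont under "ja" and to ChineseFont under "zh", so the result depends on the language tag and not only on the text.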

cplease · on March 17, 2015

Sorry, Han glyphs render the same in Chinese and Japanese.

Regarding simplified versus traditional, no one is seriously unifying those.

There are some minor disagreements as to when a minor stylistic or historical variant deserves a separate glyph, but this isn't about rendering different glyphs in Chinese or Japanese. If Unicode is doing its job no one should have difficulty reading unified Han characters in one font regardless of language.

1ris · on March 17, 2015

Well, if you find Hiragana/Katakana it's Japanese, if you find Chữ Nôm it's Vietnamese. Otherwise it's Chinese (Well, given the definition of "language" is very hard in the context of Chinese).
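That heuristic can be sketched in a few lines of Python. The assumptions are loud ones: only the Hiragana and Katakana blocks (U+3040 to U+30FF) and the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are checked, the Chữ Nôm branch is left out because those characters cannot be told apart by a simple range check here, and `guess_language` is a made-up name.

```python
def guess_language(text: str) -> str:
    """Guess a language tag from the scripts present; 'und' if no CJK found."""
    has_han = False
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:   # Hiragana and Katakana blocks
            return "ja"
        if 0x4E00 <= cp <= 0x9FFF:   # basic CJK Unified Ideographs block
            has_han = True
    return "zh" if has_han else "und"
```

The weak spot shows up immediately: a Japanese string written only in kanji, such as 日本, contains no kana and is guessed as "zh".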

From a purely theoretical perspective, Han unification looks like a great idea. Imagine the horror of normalisation if it hadn't happened. ';' and the Greek question mark would have been a joke in comparison.
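For reference, the ';' case is the Greek question mark U+037E, which looks like a semicolon and is folded into U+003B by normalisation. A small Python check using the standard `unicodedata` module:

```python
import unicodedata

GREEK_QUESTION_MARK = "\u037e"  # renders like ';'

# NFC replaces the Greek question mark with the ordinary semicolon.
normalized = unicodedata.normalize("NFC", GREEK_QUESTION_MARK)
print(f"U+{ord(normalized):04X}")  # U+003B
```

So two strings that look identical can compare unequal until both are normalised.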

theon144 · on March 17, 2015

Isn't the idea technically that the code shouldn't even have to guess? Why isn't this the case?

Someone · on March 17, 2015

Imagine a world where the British always write the lowercase letter g as a single-story glyph (http://en.m.wikipedia.org/wiki/G#Typographic_variants). The colonies start writing it identically, but after a while, they start writing it as a double-story g.

After a century or so, nobody in the colonies writes the single-story variant, and all Brits always do.

The Unicode Consortium studies the case and concludes that there is only a single g with variations in the way it is written. Because of that, it creates a single code point for the lowercase 'g' character.

Now, suppose a web page stores the text 'goto' in Unicode as the code points for 'g', 'o', 't', and 'o'. To write the code that renders that string, you will find you need to know whether the text is written in British English or in colonial English.

Tomte · on March 17, 2015

No, you don't. It's the same letter and it's always "goto".

Your local setup determines the look of the glyph, so nobody sees an unfamiliar form.

But maybe you'd like to encode typefaces/fonts in the Unicode code points, as well? To make sure that I'm seeing the exact same arrangement of pixels you want me to see?

dalke · on March 17, 2015

This is the core of the Han Unification debate.

"G" and "g" are the same letter. They started off as stylistic forms of a unicameral alphabet. Over time they took on separate meanings, and now we have a bicameral alphabet, where the two forms have different code points.

Of course, over the last 2000 years, we've developed rules for how to use them. "I was reading a nice book on Polish polish on the way from Reading to Nice" contains three pairs of words where the capitalization changes the meaning and pronunciation. (In simplified form, "What do you know about polish?" is different than "What do you know about Polish?")

If there were a simple rule to specify capitalization, e.g., only the first letter of a sentence, and it were easy to detect the start of a sentence, then you might say it's pointless to have both "g" and "G"; we should have only a single form and let the local setup determine how to display it. (Something like the Greek sigma, which has the form ς when used at the end of a word, though Unicode has them as two different code points.)
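The sigma case can be seen directly in Python, whose `str.lower` applies the positional rule in CPython 3. This is only an illustration of the one-letter, two-forms situation, not a claim about how a renderer should handle it.

```python
FINAL_SIGMA = "\u03c2"   # ς, used at the end of a word
MEDIAL_SIGMA = "\u03c3"  # σ, used elsewhere
CAPITAL_SIGMA = "\u03a3" # Σ

# Both lowercase forms share one uppercase letter.
same_upper = FINAL_SIGMA.upper() == MEDIAL_SIGMA.upper() == CAPITAL_SIGMA

# Lowercasing picks the form from the position in the word ("ΟΔΟΣ").
word = "\u039f\u0394\u039f\u03a3".lower()
ends_with_final = word.endswith(FINAL_SIGMA)
```

Here the choice of form is recoverable from context alone, which is exactly what is not true for a unified Han character without a language tag.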

In Someone's nice example, it's easy to think of how the two divergent forms of 'g' might take on their own meaning. Perhaps the Americans have decided that double-story g was the sign of true patriots, and that single-story g was for traitors. (Akin to the shibboleth of how to say 'H' in Northern Ireland; aitch was Protestant, haitch was Catholic, and using the wrong version could get you into trouble.) Perhaps they started to use the new 'g' preference as a currency symbol, in the way that £ is the same letter as L, from the Latin libra pondo.

Tomte · on March 17, 2015

Absolutely. But the two "g" haven't diverged, yet.

We don't give out code points to speculative future developments.

If and when they diverge one will get its very own code point.

dalke · on March 17, 2015

The premise of Someone's hypothetical was to explain to theon144 why it might be both hard and important to guess. The hypothetical assumed that the difference already existed. It echoes the larger context of Han unification that com2kid started, but with an example that's a lot easier for native English speakers to understand.

So while I agree that they haven't yet diverged, that's outside the context of the hypothetical, where they have diverged.

Tomte · on March 17, 2015

That's wrong. Someone's hypothetical had no diverging meaning involved, only a stylistic choice in the presentation of the same letter.

And he is still wrong when it comes to the claimed necessity to "guess".

The user has set up his system correctly. "Guessing" only comes into play if you want to force your stylistic variation on others.

And that is obviously a bad idea, for all the reasons called out before, like familiarity and readability.

Let's not kid ourselves. When it comes to Han unification opposition there is mostly one issue at play: plain racism. "Our holy script shall not be defiled by those dirty bastards". And that works in all directions.

Someone · on March 17, 2015

The issue is that, in this example, both the colonials and the British think the two 'g' characters are different letters, just as people nowadays think 'g' and 'G' are different characters (historically, that is at least up for debate: http://en.m.wikipedia.org/wiki/Letter_case#History). Americans will want to see a different character when quoting Shakespeare inside American-English text (globalization starts with a different letter than globalisation).

Because Unicode has only one character, writing that text in a text editor or storing it in a text column in a database becomes impossible.

Yes, there are workarounds such as using escape characters or html, but those are a nuisance that could be avoided by including both variants in Unicode.

The Unicode Consortium is entitled to think differently, but they cannot expect everybody to be happy with their choice.

Anderkent · on March 17, 2015

Isn't this the question of a font? In which case the client chooses if they want to use a font with a double-story or a single-story g?

Someone · on March 17, 2015

When reading 'gas', the client will have to figure out whether that is about a liquid, in which case it has to choose a colonial 'g' (when written with a British 'g' 'gas' always is a liquid). If the meaning is that of a gaseous substance, the client will have to do additional work to determine what kind of 'g' to write.

alblue · on March 18, 2015

You seem to be misguided about British English, in which gas is actually gaseous and not in fact an abbreviation for petrol.