
I respectfully disagree. If Japanese ideograms and Chinese ideograms actually used different code points (i.e. no "Han unification"), then the problem wouldn't exist - the phone could trivially use a Japanese font for Japanese text, and a Chinese font for Chinese text.


No. Using different code points for the same character used in different languages creates big problems. It would be like having different code points for 'A' depending on whether it was used in English, Spanish, German, etc. If you somehow ended up writing "color" with both 'o' characters from the Spanish ABCs and the others from the English ABCs, you'd have a real mess when it came to sorting, searching, name matching (what language is "Hans"?) etc. It is far more convenient to allow the character sequence "color" or "Hans" to be language independent, even if the font choices, pronunciation, sort order, etc., are language dependent.
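Unicode already contains a real instance of this hypothetical: Cyrillic has separate codepoints even for letters that look identical to Latin ones. A short Python sketch (standard library only, my own illustration) of how a visually identical "color" breaks equality and lookup:

```python
import unicodedata

latin = "color"
mixed = "c\u043elor"  # first 'o' replaced by CYRILLIC SMALL LETTER O

# The two strings render identically in most fonts, but they are not
# equal, so sorting, searching, and key matching all treat them as
# different words.
print(latin == mixed)              # False
print(unicodedata.name(latin[1]))  # LATIN SMALL LETTER O
print(unicodedata.name(mixed[1]))  # CYRILLIC SMALL LETTER O
print(mixed in {"color": 1})       # False: dictionary lookup misses
```

This is exactly the "two identical-looking keys" failure mode, just with Cyrillic instead of hypothetical per-language Latin letters.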

Chinese, Japanese, and Korean writers face similar issues. The characters used to write the names of China and Japan, the ten digits, the characters for year, month, and day in dates, and thousands of others are ones that writers of all three languages consider to be the same characters. That is not all characters, but it covers so many that insisting on different code points per language would make a real mess. Hong Kong has many characters unique to HK Cantonese. So, should Cantonese get a full set of all Chinese characters as the "Cantonese characters"? How about Shanghainese, then? Or Hakka? Teochew (Chaozhou), or a dozen other Chinese languages? Full, independent sets of all Chinese characters for each? Suppose you accidentally used the wrong input method in HK and wrote the name of some Beijing gov't ministry using characters that looked identical to their Mandarin counterparts but were entirely different codepoints. Now what? You can't find your search term? You corrupt the database with two identical-looking keys?

No, Han unification is not conceptually different from unifying ABCs used by English and Spanish speakers, Cyrillic used by Russians or Serbs, etc., except that there are many more characters, so the boundary between what should be unified and what shouldn't contains more items in the gray zone to cause debate. Having no Han unification at all wouldn't solve all problems, it would create all sorts of absurdity.


Sure, having no unification at all would be bad, but the issue is with the gray zone. Some characters are written identically in each CJK language, but among those that aren't the amount of difference varies widely. The trouble is that Unicode leaves separate codepoints for each version that somebody, somewhere decided were "different enough" (even when they are the same character historically and linguistically) but merges many characters with (consistent, well-defined) differences because somebody felt they were close enough for horseshoes. People often think that characters were only merged if they were linguistically the same, but that's not the case.

Also, comparisons like "different ABCs for English and Spanish" are spurious and unhelpful. If you could tell an English "b" from a Spanish one by looking at it, the comparison would be sound.


Actually, Cyrillic is an interesting case. The Unicode standard does define completely separate codepoints for the Cyrillic letters, even for the ones that look "just like" letters of the Latin alphabet. Greek letters that look exactly like Latin letters get the same treatment.
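This is easy to verify from the character database itself. A quick Python check (using the standard `unicodedata` module) shows three visually identical capital letters with three distinct codepoints:

```python
import unicodedata

# Latin A, Cyrillic А, and Greek Α: indistinguishable glyphs in most
# fonts, yet each script keeps its own codepoint.
for ch in ("A", "\u0410", "\u0391"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0041  LATIN CAPITAL LETTER A
# U+0410  CYRILLIC CAPITAL LETTER A
# U+0391  GREEK CAPITAL LETTER ALPHA
```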

It's difficult to come up with a logical explanation for why European languages that use their own alphabet get their own codepoints, but ideographic languages need to be "unified", even though the actual letters as used in those languages look different.

The "Han unification" was fundamentally a bad idea, and persists for historical reasons. Back when (some) people thought a fixed-width 16-bit character representation would be "enough", it made sense to try to reduce the number of "redundant" code points. Now that Unicode has expanded to a much larger code space, I would think they'd choose differently.

Unfortunately, that kind of sweeping change is unlikely any time soon.


When you phrase it as "using different code points for the same characters", it sounds obvious, but the problem is precisely whether they are the same characters or not. Are they like sans serif (used in English) versus Gothic (traditionally used in German), the same letters in different styles, or are they like the Roman alphabet and the Cyrillic alphabet, different but from the same root?

Gather five linguists and you won't get them to agree on that. Unicode says they're the same, but not everyone agrees, and the practical problems are real.


But... you already posted the solution to your problem: use a Chinese font for Chinese and a Japanese font for Japanese. There is no problem. I mean, you would also presumably use a Western font for Latin and another separate font for Cyrillic, since Japanese fonts universally have absolutely dreadful kerning on Latin (and often omit Cyrillic entirely).


Having to know metadata about text is precisely what made the pre-unicode days so bad. If you're writing a word document, sure, choose your fonts, but if you're rendering a web page that doesn't happen to declare its language, things aren't so simple. (And if you're writing software meant to correctly handle user input in multiple languages, good luck...)
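To make the metadata point concrete: a Han-unified codepoint carries no language information of its own, so a renderer has nothing intrinsic to base a Japanese-versus-Chinese font choice on. A small Python illustration (the choice of U+76F4, a character whose Japanese and Chinese reference glyphs differ noticeably, is my example, not the poster's):

```python
import unicodedata

# U+76F4 is a single "unified" codepoint, even though Japanese and
# Chinese typography draw it with visibly different strokes. Its
# Unicode name identifies the codepoint, not a language.
ch = "\u76f4"
print(f"U+{ord(ch):04X}: {unicodedata.name(ch)}")
# U+76F4: CJK UNIFIED IDEOGRAPH-76F4

# Nothing here says "Japanese" or "Chinese"; the renderer must get
# that from out-of-band metadata such as an HTML lang attribute.
```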



