Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I agree.

Most devanagari glyphs don't have their own codepoint. Marathi/Hindi/Sanskrit (which use devanagari) have a bunch of basic consonants and some vowels (which can appear independently or as modifiers). All the glyphs are logically formed by mixing the two, so the glyph for "foo" would be the consonant for "f"[1] plus the vowel modifier for "oo". When typing this in Unicode, you would do the same, type फ then ू, creating फू.

It gets interesting when we get to consonant clusters. As mentioned in [1], the consonants have a schwa by default, so the consonant for s followed by the consonant for k with a vowel modifier for the "y" sound would not be "sky", but instead something like "səky" (suh-ky).

So how do we write these? We can do this in two ways. One way is to use the no-vowel modifier, which looks like a straight tail[3] (on स, the consonant for "s", or "sə", the tail looks like स्), and follow that by the other consonant. So "sky" would be स् कै [2]. This method is rarely used, and the straight-tail is only used when you want to end a word with a consonant[4].

The more common way of doing multiple consonants are by using glyphs known as consonant clusters or conjuncts[5]. For "sky", we get स्कै, which is a partial glyph for स stuck to the glyph for कै. For most clusters you can take the known "partial" form of the first glyph and stick it to the full form of the second glyph, but there are tons of exceptions, eg द+द=द्ध, ह+म=ह्म (the second character was broken), and whatnot. See http://en.wikipedia.org/wiki/Devanagari#Biconsonantal_conjun... if you want a full table.

There aren't individual Unicode codepoints for this, not even codepoints for the straight-tail form of the consonants. I typed स्कै as स्+कै which was itself typed as स + ् + क + ै. This isn't an irregular occurrence either, consonant clusters (with a vowel modifier!) are pretty common in these languages[6].

I personally don't see anything wrong with having to use combining characters to get a single glyph. If it's possible and logical for it to be broken up that way, it's fine. With this trick, it's possible to represent Devanagari as a 128-codepoint block (http://en.wikipedia.org/wiki/Devanagari_%28Unicode_block%29), including a lot of the archaic stuff. It's either that, or you make a characters for every combined glyph there is, which is a lot[7]. One could argue that things like o-umlaut get their own codepoint, but स्क doesn't, but o-umlaut is one of maybe 10 such characters for a given European language, whereas स्क is one of around 700 (and that number is being conservative with the "usefulness" of the glyph, see [7]).

The article is never quite clear about which glyph Aditya finds lacking for his name (sadly, I don't know Bengali so I can't figure it out), but from the comments it seems like it it's something which can be inputted in Unicode, just not as a single character. That's okay, I guess. And if it's not showing up properly, that's a fault of the font. (And if it's hard to input, a fault of the keyboard).

It becomes a Unicode problem when:

- There is no way to input the glyph as a set of unicode code points, or - The input method for the glyph as a set of unicode code points can also mean and look like something else given a context (fonts can only implement one, so it's not fair to blame them)

[1]: well, fə, since the consonants are schwa'd by default. Pronounced "fuh" (ish)

[2]: the space is intentional here so that I can type this without it becoming something else, but in written form you wouldn't have the space. Also it's not exactly "sky", but close enough.

[3]: called paimodi ("broken foot") in Marathi

[4]: which is pretty rare in these languages. In some cases however, words that end with consonant-vowel combinations do get pronounced as if they end with a consonant (http://en.wikipedia.org/wiki/Schwa_deletion_in_Indo-Aryan_la...), but they're still written as if they ended with a vowel (this is true for my own name too, the schwa at the end is dropped). Generally words ending with consonants are only seen in onomatopoeia and whatnot.

[5]: called jodakshar ("joined word") in Marathi

[6]: Almost as common as having two side by side consonants in English. We like vowels, so it's a bit less common, but still common.

[7]: technically infinite, though that table of 700-odd biconsonantal conjuncts would contain all the common ones (assuming we still have the vowel modifying diacritics as separate codepoints), and adding a few triconsonantal conjuncts would represent almost all of what you need to write Marathi/Hindi/Sanskrit words. It wouldn't let you represent all sounds though, unless you still have the ् modifier, in which case why not just use that in the first place?



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: