
You are correct. The lesson is that code points are a low-level representation, and in Unicode their semantics do not transfer across languages. If you don't know the context, you can get characters that don't render correctly, or you may split a string into parts in the wrong place.
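To make the splitting problem concrete, here is a minimal sketch using only Python's standard `unicodedata` module. The helper names `naive_truncate` and `safe_truncate` are made up for illustration, and `safe_truncate` only skips over combining marks; it is a rough approximation, not full grapheme cluster segmentation as Unicode defines it.

```python
import unicodedata

# "é" written as a base letter plus a combining acute accent (two code points).
decomposed = "e\u0301"
# The same user-perceived character as one precomposed code point.
composed = unicodedata.normalize("NFC", decomposed)


def naive_truncate(text: str, n: int) -> str:
    """Cut after n code points, ignoring combining sequences."""
    return text[:n]


def safe_truncate(text: str, n: int) -> str:
    """Cut after n code points, then extend the cut over any combining
    marks that follow, so a base letter keeps its accents.
    Hypothetical helper: it does not handle emoji sequences, Hangul
    jamo, or other grapheme cluster rules."""
    end = n
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return text[:end]


word = decomposed + "x"
print(len(decomposed), len(composed))      # code point counts differ
print(repr(naive_truncate(word, 1)))       # accent is dropped
print(repr(safe_truncate(word, 1)))        # accent stays with its base
```

The two spellings of "é" look identical on screen but have different lengths, and the naive cut silently strips the accent.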

UTF-8 strings can be treated as drop-in replacements for ASCII or Latin-1 strings. If you want to handle full Unicode in all cases, you need locale information.
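Case mapping is one place where the locale matters. Python's `str.lower()` is locale-independent, so it gives the wrong result for Turkish, where dotted and dotless "i" are distinct letters. The sketch below uses a made-up helper, `turkish_lower`, that hard-codes just that one tailoring; a real application would more likely use a library built on CLDR data.

```python
def default_lower(text: str) -> str:
    """Locale-independent lowercasing, as built into Python."""
    return text.lower()


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercasing with the Turkish tailoring.
    U+0130 (dotted capital I) maps to "i", and plain "I" maps to
    U+0131 (dotless small i). All other characters use the default."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


print(repr(default_lower("I")))        # default mapping for capital I
print(repr(turkish_lower("I")))        # Turkish mapping for capital I
print(repr(default_lower("\u0130")))   # default mapping yields two code points
print(repr(turkish_lower("\u0130")))   # Turkish mapping yields plain "i"
```

Without knowing the language of the text, there is no way to choose between the two functions, which is the point about needing a locale.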

Exercise for the reader: try to write a radix tree and a rope data structure that work with every language, in all cases, with Unicode.

http://cldr.unicode.org/ is your friend.


