
> It is unclear whether unpaired surrogate byte sequences are supposed to be well-formed in CESU-8.

According to Unicode Technical Report #26, which defines CESU-8 [1], CESU-8 is a Compatibility Encoding Scheme for UTF-16 ("CESU"). In fact, the way the encoding is defined, the source data must be represented in UTF-16 before it is converted to CESU-8. Since well-formed UTF-16 cannot contain unpaired surrogates, I think it's safe to say that CESU-8 cannot represent them either.
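To make the "UTF-16 first, then CESU-8" argument concrete, here is a rough Python sketch. The function name `cesu8_encode` is my own (hypothetical, not from the report), and the rejection of lone surrogates comes from Python's strict `utf-16-be` codec, which I'm using as a stand-in for "the input must be valid UTF-16". Each 16-bit code unit is then written out with the usual 1-, 2-, or 3-byte UTF-8 bit layout, so a surrogate pair becomes two 3-byte sequences.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 by going through UTF-16 code units (a sketch)."""
    # Step 1: represent the data in UTF-16. Python's strict codec raises
    # UnicodeEncodeError if the string contains an unpaired surrogate.
    utf16 = text.encode("utf-16-be")

    out = bytearray()
    # Step 2: encode every 16-bit code unit on its own, using the
    # UTF-8 bit patterns for 1-, 2-, and 3-byte sequences.
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Includes surrogate code units 0xD800..0xDFFF, which is
            # where CESU-8 differs from UTF-8.
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


# U+10400 is D801 DC00 in UTF-16, so it becomes six bytes in CESU-8.
print(cesu8_encode("\U00010400").hex(" "))  # ed a0 81 ed b0 80
```

With this sketch, `cesu8_encode("\ud800")` never reaches step 2: the UTF-16 step already fails, which is the same reasoning as above.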

[1] http://www.unicode.org/reports/tr26/





