
> Perhaps counting code points is useful in that it gives a lower bound on the length in any reasonable encoding.

But "á" is two code points (U+61, U+301). If you're looking for some lower bound (whatever that means), shouldn't it be 1? I imagine if you're looking for something like information density, the count of UTF-8 code units would at least be somewhat more informative than the count of code points.

I guess the crux of the matter is that a sequence of code points is arbitrary in much the same way a sequence of bytes is: neither "code point" nor "byte" necessarily corresponds to something a user would see as a unit of human text. So why are we not using the simpler abstraction?


