
Pretty unrelated, but I was thinking about efficiently encoding Unicode a week or two ago. I think there might be some value in a fixed-length encoding, but UTF-32 seems a bit wasteful. With Unicode requiring 21 bits per code point (log2 of 1,114,112 is about 20.09), packing three code points into 64 bits seems an obvious idea. But would it be worth the hassle, for example as the internal encoding in an operating system? It requires all the extra shifting, dealing with a potentially partially filled last 64-bit word, and encoding and decoding to and from the external world. Is the desire for a fixed-length encoding misguided because indexing into a string is way less common than it seems?
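For concreteness, here is a minimal Python sketch of the three-per-word layout described above. The names `pack` and `code_point_at` and the slot order (first code point in the lowest 21 bits) are assumptions made for illustration, not an established format.

```python
MASK = (1 << 21) - 1  # 21 bits cover U+0000..U+10FFFF


def pack(text):
    """Pack code points three per 64-bit word (hypothetical layout).

    Slot 0 occupies bits 0-20, slot 1 bits 21-41, slot 2 bits 42-62.
    Unused slots in the last word stay 0, so the string length has to be
    stored separately: a zero slot is indistinguishable from U+0000.
    """
    words = []
    for i in range(0, len(text), 3):
        word = 0
        for slot, ch in enumerate(text[i:i + 3]):
            word |= ord(ch) << (21 * slot)
        words.append(word)
    return words


def code_point_at(words, index):
    """O(1) lookup: one division, one shift, one mask."""
    word, slot = divmod(index, 3)
    return (words[word] >> (21 * slot)) & MASK
```

Indexing stays constant-time, but every access pays for a division by three plus a shift and mask, and the top bit of each word goes unused.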


Opinions: no, it's not worth the hassle. Yes, "fixed length" is misguided. O(1) indexing of code points is not that useful because code points are not what people think of as "characters" (see combining code points). http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/
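A quick Python illustration of the combining code point issue, using only the standard `unicodedata` module. The example string is my own choice.

```python
import unicodedata

# One user-perceived character, written two ways.
decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

print(len(decomposed))         # 2
print(len(composed))           # 1
print(decomposed == composed)  # False
```

Both strings render as the same "character", yet index 1 of the decomposed form lands on a bare accent, so code point indexing does not give you character indexing.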


When you use an encoding based on whole bytes, you can use hardware-accelerated, often parallelized bulk byte moves such as "memcpy" to manipulate your strings.

But inserting a code point with your approach would require all downstream bits to be shifted within and across bytes, which would be a much bigger computational burden. It's unlikely that anyone would consider saddling themselves with that for a space savings of only about a third (64 bits instead of 96 per three code points) over the dead-simple and memcpy-able UTF-32.
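A rough Python sketch of that difference, assuming a three-code-points-per-64-bit-word layout with the first code point in the lowest bits. The helper `pack3` is hypothetical, and it re-packs from scratch instead of shifting in place, purely to show which stored words end up changed after an insertion at the front.

```python
import struct


def pack3(cps):
    """Pack code points three per 64-bit word (hypothetical layout)."""
    words = []
    for i in range(0, len(cps), 3):
        word = 0
        for slot, cp in enumerate(cps[i:i + 3]):
            word |= cp << (21 * slot)
        words.append(word)
    return words


cps = [ord(c) for c in "abcdefghi"]
before = pack3(cps)
after = pack3([ord("X")] + cps)

# Packed: every existing word is rewritten, because each code point
# moves to a different slot.
changed = sum(1 for old, new in zip(before, after) if old != new)

# UTF-32: the old bytes reappear untouched at a 4-byte offset,
# which is exactly what a plain memmove produces.
u32_before = struct.pack(f"<{len(cps)}I", *cps)
u32_after = struct.pack(f"<{len(cps) + 1}I", ord("X"), *cps)
tail_intact = u32_after[4:] == u32_before

print(changed, len(before))  # 3 3
print(tail_intact)           # True
```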


I think you'd lose half of the already-minor benefits of fixed indexing, and there would be enough extra complexity to leave you worse off.

In addition, there's a 95% chance you're not dealing with enough text for UTF-32 to hurt. If you're in the other 5%, then a packing scheme that's 1/3 more efficient is still going to hurt. There's no good use case.

Coding for variable-width takes more effort, but it gives you a better result. You can divide strings in whatever way suits the use. Sometimes that's code points, but more often it's probably characters or bytes.

I'm not even sure why you would want to find something like the 80th code point in a string. It's rare enough to not be a top priority.



