Pretty unrelated but I was thinking about efficiently encoding Unicode a week or...

SimonSapin · on May 27, 2015

Opinions: no, it’s not worth the hassle. Yes, "fixed length" is misguided. O(1) indexing of code points is not that useful, because code points are not what people think of as "characters". (See combining code points.) http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/
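To make the combining code point remark concrete, here is a minimal Python sketch. The sample string is an arbitrary illustration, not something from the linked article. It shows one visible "é" occupying two code points, so O(1) code point indexing can land on a bare accent.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: renders as a single "é"
decomposed = "e\u0301"

# NFC normalization happens to have a precomposed form for this pair (U+00E9),
# but many base + mark sequences have no single-code-point equivalent.
composed = unicodedata.normalize("NFC", decomposed)

# Python's len() and indexing on str operate on code points
decomposed_length = len(decomposed)   # two code points, one visible character
composed_length = len(composed)       # one code point, same visible character

# Indexing by code point splits the visible character apart
base_only = decomposed[0]             # plain "e", accent lost
mark_only = decomposed[1]             # the combining accent on its own
```

Both strings look identical on screen, yet they differ in length and compare unequal unless normalized first.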

SiVal · on May 28, 2015

When you use an encoding based on whole bytes, you can manipulate your strings with "memcpy"-style bulk byte moves, which are hardware-accelerated and often parallelized.

But inserting a codepoint with your approach would require all downstream bits to be shifted within and across bytes, something that would be a much bigger computational burden. It's unlikely that anyone would consider saddling themselves with that for a mere 25% space savings over the dead-simple and memcpy-able UTF-32.
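A rough Python sketch of the contrast described above. The parent comment's exact scheme is not shown in this thread, so the 21-bit width (`BITS`) and all function names here are assumptions made only for illustration. With UTF-32, an insert is a byte-aligned splice. With back-to-back 21-bit packing, everything after the insertion point has to be re-shifted, which this sketch does the simple way by unpacking and repacking.

```python
BITS = 21  # assumed packed width: enough for U+0000..U+10FFFF


def pack(code_points):
    """Pack code points back to back, BITS bits each, zero-padded to whole bytes."""
    acc = 0
    for cp in code_points:
        acc = (acc << BITS) | cp
    total_bits = BITS * len(code_points)
    pad = (-total_bits) % 8  # padding bits needed to reach a byte boundary
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


def unpack(data, count):
    """Recover `count` code points from bytes produced by pack()."""
    pad = (-(BITS * count)) % 8
    acc = int.from_bytes(data, "big") >> pad
    mask = (1 << BITS) - 1
    return [(acc >> (BITS * (count - 1 - i))) & mask for i in range(count)]


def insert_packed(data, count, index, cp):
    """Insert into the packed form: every later code point moves by 21 bits,
    which is not a whole number of bytes, so the tail must be rebuilt."""
    cps = unpack(data, count)
    return pack(cps[:index] + [cp] + cps[index:])


def insert_utf32(data, index, cp):
    """Insert into UTF-32-LE: a plain byte splice, the tail is copied unchanged."""
    return data[:4 * index] + cp.to_bytes(4, "little") + data[4 * index:]


text = "héllo"
cps = [ord(c) for c in text]
packed = pack(cps)                  # 5 * 21 = 105 bits, padded to 14 bytes
utf32 = text.encode("utf-32-le")    # 5 * 4 = 20 bytes

new_packed = insert_packed(packed, len(cps), 1, 0x1F600)
new_utf32 = insert_utf32(utf32, 1, 0x1F600)
```

In the UTF-32 case the bytes after the insertion point are byte-for-byte identical to the original tail, which is what makes a memcpy-style move possible. In the packed case the tail is shifted by a non-byte-aligned amount.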

Dylan16807 · on May 27, 2015

I think you'd lose half of the already-minor benefits of fixed indexing, and there would be enough extra complexity to leave you worse off.

In addition, there's a 95% chance you're not dealing with enough text for UTF-32 to hurt. If you're in the other 5%, then a packing scheme that's 1/3 more efficient is still going to hurt. There's no good use case.

Coding for variable-width takes more effort, but it gives you a better result. You can divide strings in whatever way suits the use. Sometimes that's code points, but more often it's probably characters or bytes.
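As an illustration of dividing the same string by bytes, by code points, or by characters, here is a small Python sketch. The `rough_clusters` helper is hypothetical and only a crude approximation: it glues combining marks onto the preceding code point and does not implement real grapheme cluster segmentation (emoji sequences, Hangul jamo, and so on are not handled).

```python
import unicodedata


def rough_clusters(s):
    """Group each code point with any combining marks that follow it.

    Crude stand-in for user-perceived characters; not full Unicode
    text segmentation.
    """
    clusters = []
    for ch in s:
        # unicodedata.combining() returns a nonzero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


text = "cafe\u0301 ok"             # "café ok" with a decomposed é

as_bytes = text.encode("utf-8")    # unit for storage, I/O, memcpy-style moves
as_code_points = list(text)        # unit Python's str indexes by
as_clusters = rough_clusters(text) # closer to what a reader calls characters

byte_count = len(as_bytes)
code_point_count = len(as_code_points)
cluster_count = len(as_clusters)
```

The three counts differ for the same text, so which one is "the length" depends on whether the code is sizing a buffer, walking code points, or truncating for display.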

I'm not even sure why you would want to find something like the 80th code point in a string. It's rare enough to not be a top priority.