There are aspects I didn't fully understand; maybe someone can explain. On one hand, I understand that object arrays were very slow and raised issues, that PyArrow strings solved the problem (how?) at the expense of a huge dependency, and that the approach here also solves it in a lighter way. But it seems to work in somewhat the same way as object arrays: "So, the actual array buffer doesn’t contain any string data—it contains pointers to the actual string data. This is similar to how object arrays work, except instead of pointers to Python objects, they’re pointers to strings" => OK so... if they're stored in a similar way, how is this more performant or better than regular object arrays?
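You can see the "buffer contains pointers" part directly in NumPy: an object array's buffer holds one pointer-sized slot per element no matter how long the string is, while a fixed-width unicode array stores the characters inline. A quick illustration (the example strings are mine):

```python
import numpy as np

# An object array's buffer holds PyObject* pointers, not string data:
# every element occupies one pointer-sized slot regardless of string length.
arr = np.array(["a", "a much longer string than one byte"], dtype=object)
print(arr.itemsize)    # pointer size, e.g. 8 on a 64-bit system

# A fixed-width unicode array, by contrast, stores the characters inline,
# padded to the width of the longest entry (4 bytes per code point).
fixed = np.array(["a", "a much longer string than one byte"])
print(fixed.dtype, fixed.itemsize)
```

So with `dtype=object` the actual string bytes live wherever Python allocated them on the heap, and the array only points at them.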
PyArrow string arrays store the entries (string values) contiguously in memory, so access is quicker, while object arrays have pointers to scattered memory locations in the heap.
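The Arrow-style layout can be sketched as one contiguous byte buffer plus an offsets array, so entry `i` is `data[offsets[i]:offsets[i+1]]`. This is a minimal illustration of the idea, not PyArrow's actual internals:

```python
# Arrow-style string storage sketch: all string bytes packed into one
# contiguous buffer, with an offsets array marking entry boundaries.
def build(strings):
    offsets = [0]
    chunks = []
    for s in strings:
        b = s.encode("utf-8")
        chunks.append(b)
        offsets.append(offsets[-1] + len(b))
    return b"".join(chunks), offsets

def get(data, offsets, i):
    # Entry i occupies the half-open byte range [offsets[i], offsets[i+1]).
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

data, offsets = build(["foo", "barbaz", ""])
print(data)                   # b'foobarbaz' -- no per-string heap objects
print(get(data, offsets, 1))  # 'barbaz'
```

Because the bytes are packed together, scanning all entries is a linear pass over one buffer (cache-friendly, vectorizable), instead of chasing a pointer into the heap for every element.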
I agree, though I couldn't really figure out how the new NumPy string dtype makes it work.
They’re stored on the DType instance. This requires that there’s only one DType instance per “owned” array buffer, which I figured out how to do along with Sebastian Berg and others using the new DType system.
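A toy model of that idea, to make it concrete: the DType instance owns an arena holding the actual string bytes, and the array buffer stores small fixed-size handles (offset, length) into that arena instead of raw heap pointers. Since there is exactly one DType instance per owned buffer, a handle unambiguously identifies its string. This is my illustrative sketch, not NumPy's real StringDType implementation (which is in C and handles mutation, missing values, etc.):

```python
class ToyStringDType:
    """Toy DType that owns the storage for 'its' array's strings."""

    def __init__(self):
        self.arena = bytearray()  # all string data lives here, contiguously

    def pack(self, s):
        # Append the bytes to the arena and return a fixed-size handle;
        # the handle is what would sit in the array buffer.
        b = s.encode("utf-8")
        offset = len(self.arena)
        self.arena.extend(b)
        return (offset, len(b))

    def unpack(self, handle):
        offset, length = handle
        return bytes(self.arena[offset:offset + length]).decode("utf-8")

dt = ToyStringDType()
handles = [dt.pack(s) for s in ["hello", "numpy strings"]]
print(handles)                 # fixed-size (offset, length) pairs
print(dt.unpack(handles[1]))   # 'numpy strings'
```

The payoff over object arrays is that the string data is owned and laid out by the array's dtype rather than scattered across individually allocated Python objects, so no per-element reference counting or pointer chasing into arbitrary heap locations is needed.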