Either way, I suggest that readers who feel upset by this statement explore something outside of C and C++; when it comes to strings, liking those two is nothing short of Stockholm syndrome.
I'm working on a UTF-8 string library for C#, and over the last 6-8 months I've explored string design in Rust, Swift, Go, C, C++, and a little in other languages. C and C++ were by far the most horrifying, both in the number of footguns and in the average effort required to perform trivial operations (including the transcoding discussed here).
Strings are not easy. But that does not mean their complexity has to be unjustified or unreasonable, which it is in both C++ and C (for reasons that differ somewhat but overlap). The problem is that C and C++ do not enjoy the benefit of hindsight that Rust had when designing its string type around being exclusively UTF-8, with special types to express opaque, ANSI, or UTF-16 encodings for the situations where UTF-8 won't do.
But I assure you, there will be a strong negative correlation here between complaining about string complexity and using Rust, C#/Java, or even Go. Keep in mind that Go's strings are still a poor design that lets you arbitrarily tear code points, and they forgo the richness and safety of Rust strings. The same applies, to an extent, to C# and Java strings, though those are mostly safe through a quirk of UTF-16: you can only ever tear non-BMP code points, which rarely occur at the edges of substrings or string slices, since the offsets come from scanning or from known-good constants.
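To make the "tearing" concrete: Go strings are indexed and sliced by byte, so a slice boundary can land in the middle of a multi-byte code point. A minimal sketch:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "héllo" // 'é' occupies two bytes in UTF-8: 0xC3 0xA9
	torn := s[:2] // byte-wise slice cuts 'é' in half: "h\xc3"

	fmt.Println(utf8.ValidString(s))    // true
	fmt.Println(utf8.ValidString(torn)) // false: a torn code point
}
```

For comparison, the equivalent `&s[..2]` in Rust panics at runtime because byte offset 2 is not a char boundary; Go silently hands back the invalid bytes.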
If, at your own peril, you still wish to stay with C++, you may want to look at Qt's QString, which shows what a decent string-type UX should look like.
Go's strings aren't a poor design. The only difference between a Go string and a Rust &str/String is that the latter is required to be valid UTF-8. In Go, a string is only conventionally valid UTF-8; it is permitted to contain invalid UTF-8. This is a feature, not a bug, because it more closely represents the reality of data encoded in a file on Unix. Of course, this feature comes with a trade-off, because Rust's guarantee that &str/String is valid UTF-8 is also a feature and not a bug.
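A small sketch of that difference: converting arbitrary bytes to a Go string never validates and never fails, whereas Rust's `String::from_utf8` would return an error for the same bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	b := []byte{'f', 'o', 'o', 0xff} // "foo" plus a byte that is never valid UTF-8
	s := string(b)                   // no validation, no error: the byte is kept as-is

	fmt.Println(len(s))              // 4
	fmt.Println(utf8.ValidString(s)) // false, and that's allowed
}
```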
I mention gecko as an example repository that contains data that isn't valid UTF-8. But it isn't unique. The cpython repository does too. When you make your string type have the invariant that it must be valid UTF-8, you're giving up something when it comes to writing tools that process the contents of arbitrary files.
Go strings aren't necessarily text. Rust strings are text, as long as you consider things like emoji or Egyptian hieroglyphics to be text. Lots of confusion has come from the imprecise meaning of "string": whether it refers to arbitrary byte sequences, restricted byte sequences (e.g. not containing 0x00), arbitrary sequences of characters in some encoding, restricted sequences of characters in some encoding, or something else is often unclear. And when it's a restricted sequence, what those restrictions are is also often unclear.
You sometimes need a way to operate on entirely arbitrary sequences of bytes. This is mostly easy: it's been a long time since non-octet bytes were relevant in most situations, so the vast majority of the time you can just assume they're all octets.
You sometimes need a way to operate on arbitrary text. This inherently requires knowing how that text is encoded, but as long as you know that it's mostly easy.
You sometimes need a way to operate on text-like things that aren't necessarily text, like the output of old CLI programs that used the BEL character to alert the user to events. Or POSIX filenames. Or text where you don't know the encoding. This is where the bugs lie, where we make unchecked assumptions about the data that turn out to be invalid.
You didn't really respond directly to anything I said, nor to anything I said in the blog I linked (which I also wrote). You also seem to be speaking to me as if I'm some spring chicken. I'm not. I'm on the Rust libs-api team and I'm in favor of the &str/String API design (including its UTF-8 requirement). I wrote ripgrep. I've spent 10 years working on regex engines. I understand text encodings and the design space of string data types. I've implemented string data types. It might help you understand things a little better to peruse the bstr crate API[1]. Notice that it doesn't require valid UTF-8, yet assumes by convention that the string is UTF-8. And this assumption provides a path to implementing things like "iterate over all grapheme clusters" with sensible semantics when invalid UTF-8 is seen.
You'll notice that I didn't say "Go's string design is good and we should all use it." I made an argument that Go's string design is not poor and explained why. In particular, I described trade-offs and one pragmatic point on which dropping the UTF-8 requirement makes for a more seamless experience when dealing with arbitrary file content.
> but as long as you know that it's mostly easy. [..snip..] Or text where you don't know the encoding.
You don't know. That was my whole point! I gave real-world concrete examples of popular things (Mozilla and CPython repositories) that contain text files that aren't entirely valid UTF-8. They are only mostly valid UTF-8. If I instead treated them as malformed and refused to process them in my command line utilities or libraries, I would get instant bug reports.
> Go strings aren't necessarily text.
I would generally consider this to be an incorrect statement. The more precise statement is that Go strings may contain invalid UTF-8. But the operations defined on strings treat strings as text. For example, if you iterate over the codepoints in a Go string, you'll get U+FFFD for bytes that are invalid UTF-8. By your own reasoning, U+FFFD must be considered text because it can also appear in a Rust &str/String. Despite the fact that a Go string and a []byte can represent arbitrary sequences of bytes, a Go string is not the same thing as a []byte. Aside from mutability and growability, the operations on them (both those provided as a library and those provided by the language definition itself) are what distinguish them. They are what make a `string` text, even when it contains invalid UTF-8.
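For instance, here is a minimal sketch of the iteration behavior described above: ranging over a Go string decodes it as UTF-8, surfacing each invalid byte as U+FFFD while leaving the underlying bytes intact.

```go
package main

import "fmt"

func main() {
	s := "a\xffb" // the middle byte is invalid UTF-8
	for i, r := range s {
		fmt.Printf("offset %d: %q\n", i, r)
	}
	// The range loop reports the bad byte as U+FFFD (utf8.RuneError),
	// but the underlying byte is untouched:
	fmt.Println(s[1] == 0xff) // true
}
```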
There are deep trade-offs here, but the UTF-8-is-required approach does have downsides that UTF-8-by-convention does not have. And of course, vice versa.
A lot of what makes C string handling hard to use is the decision that APIs should write into a user-supplied buffer rather than allocate one for you.