Either way, I suggest that readers who feel upset by this statement explore something outside of C and C++; when it comes to strings, liking those two is nothing short of Stockholm syndrome.
I'm working on a UTF-8 string library for C#, and over the last 6-8 months I've explored string design in Rust, Swift, Go, C, C++, and a little in other languages. C and C++ were by far the most horrifying, both in the number of footguns and in the average effort required to perform trivial operations (including the transcoding discussed here).
Strings are not easy. But that does not mean their complexity has to be unjustified or unreasonable, which it is in both C++ and C (for reasons that differ somewhat but overlap). The problem is that C and C++ do not enjoy the benefit of hindsight that Rust had when designing its string type around being exclusively UTF-8, with special types to express opaque, ANSI, or UTF-16 encodings for the situations where UTF-8 won't do.
But I assure you, there will be a strong negative correlation here between complaining about string complexity and using Rust, C#/Java, or even Go. Keep in mind that Go's strings are still a poor design that lets you arbitrarily tear code points, and they forgo the richness and safety of Rust strings. The same applies, to an extent, to C# and Java strings, though those are mostly safe through a quirk of UTF-16: you can only ever tear non-BMP code points, which rarely occur at the edges of substrings or string slices, since the offsets come from scanning or from known-good constants.
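To make the "tearing" concrete: Go strings are indexed and sliced by byte, so a slice boundary can land in the middle of a multi-byte code point. A minimal sketch:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "héllo" // 'é' occupies two bytes in UTF-8: 0xC3 0xA9
	torn := s[:2] // byte-wise slice cuts 'é' in half: "h\xc3"

	fmt.Println(utf8.ValidString(s))    // true
	fmt.Println(utf8.ValidString(torn)) // false: a torn code point
}
```

For comparison, the equivalent `&s[..2]` in Rust panics at runtime because byte offset 2 is not a char boundary; Go silently hands back the invalid bytes.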
If, at your own peril, you still wish to stay with C++, you may want to look at Qt's QString, which shows what a decent string-type UX should look like.
Go's strings aren't a poor design. The only difference between a Go string and a Rust &str/String is that the latter is required to be valid UTF-8. In Go, a string is only conventionally valid UTF-8; it is permitted to contain invalid UTF-8. This is a feature, not a bug, because it more closely represents the reality of data encoded in a file on Unix. Of course, this feature comes with a trade-off, because Rust's guarantee that &str/String is valid UTF-8 is also a feature and not a bug.
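A small sketch of that difference: converting arbitrary bytes to a Go string never validates and never fails, whereas Rust's `String::from_utf8` would return an error for the same bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	b := []byte{'f', 'o', 'o', 0xff} // "foo" plus a byte that is never valid UTF-8
	s := string(b)                   // no validation, no error: the byte is kept as-is

	fmt.Println(len(s))              // 4
	fmt.Println(utf8.ValidString(s)) // false, and that's allowed
}
```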
I mention gecko as an example repository that contains data that isn't valid UTF-8. But it isn't unique. The cpython repository does too. When you make your string type have the invariant that it must be valid UTF-8, you're giving up something when it comes to writing tools that process the contents of arbitrary files.
Go strings aren't necessarily text. Rust strings are text, as long as you consider things like emoji or Egyptian hieroglyphics to be text. Lots of confusion has come from the imprecise meaning of "string": whether it refers to arbitrary byte sequences, restricted byte sequences (e.g. not containing 0x00), arbitrary sequences of characters in some encoding, restricted sequences of characters in some encoding, or something else is often unclear. And when it's a restricted sequence, what those restrictions are is also often unclear.
You sometimes need a way to operate on entirely arbitrary sequences of bytes. This is mostly easy: it's been a long time since non-octet bytes were relevant in most situations, so the vast majority of the time you can just assume they're all octets.
You sometimes need a way to operate on arbitrary text. This inherently requires knowing how that text is encoded, but as long as you know that it's mostly easy.
You sometimes need a way to operate on text-like things that aren't necessarily text, like the output of old CLI programs that used the BEL character to alert the user to events. Or POSIX filenames. Or text where you don't know the encoding. This is where the bugs lie, where we make unchecked assumptions about the data that turn out to be invalid.
You didn't really respond directly to anything I said, nor to anything I said in the blog I linked (which I also wrote). You also seem to be speaking to me as if I'm some spring chicken. I'm not. I'm on the Rust libs-api team and I'm in favor of the &str/String API design (including its UTF-8 requirement). I wrote ripgrep. I've spent 10 years working on regex engines. I understand text encodings and the design space of string data types. I've implemented string data types. It might help you understand things a little better to peruse the bstr crate API[1]. Notice that it doesn't require valid UTF-8, yet assumes by convention that the string is UTF-8. And this assumption provides a path to implementing things like "iterate over all grapheme clusters" with sensible semantics when invalid UTF-8 is seen.
You'll notice that I didn't say "Go's string design is good and we should all use it." I made an argument that Go's string design is not poor and explained why. In particular, I described trade-offs and one pragmatic point on which dropping the UTF-8 requirement makes for a more seamless experience when dealing with arbitrary file content.
> but as long as you know that it's mostly easy. [..snip..] Or text where you don't know the encoding.
You don't know. That was my whole point! I gave real-world concrete examples of popular things (Mozilla and CPython repositories) that contain text files that aren't entirely valid UTF-8. They are only mostly valid UTF-8. If I instead treated them as malformed and refused to process them in my command line utilities or libraries, I would get instant bug reports.
> Go strings aren't necessarily text.
I would generally consider this to be an incorrect statement. The more precise statement is that Go strings may contain invalid UTF-8. But the operations defined on strings treat strings as text. For example, if you iterate over the codepoints in a Go string, you'll get U+FFFD for bytes that are invalid UTF-8. By your own reasoning, U+FFFD must be considered text because it can also appear in a Rust &str/String. Despite the fact that a Go string and a []byte can represent arbitrary sequences of bytes, a Go string is not the same thing as a []byte. Aside from mutability and growability, the operations on them (both those provided as a library and those provided by the language definition itself) are what distinguish them. They are what make a `string` text, even when it contains invalid UTF-8.
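For instance, here is a minimal sketch of the iteration behavior described above: ranging over a Go string decodes it as UTF-8, surfacing each invalid byte as U+FFFD while leaving the underlying bytes intact.

```go
package main

import "fmt"

func main() {
	s := "a\xffb" // the middle byte is invalid UTF-8
	for i, r := range s {
		fmt.Printf("offset %d: %q\n", i, r)
	}
	// The range loop reports the bad byte as U+FFFD (utf8.RuneError),
	// but the underlying byte is untouched:
	fmt.Println(s[1] == 0xff) // true
}
```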
There are deep trade-offs here, but the UTF-8-is-required approach does have downsides that UTF-8-by-convention does not have. And of course, vice versa.
A lot of what makes C string handling hard to use is the decision that APIs should write into a user-supplied buffer rather than allocate one for you.