It's not an AI issue, just a small matter of having lots of rules. Moreover this ...

lilyball · on March 17, 2015

> Unicode implementations are required to evaluate these two versions as being equal in string comparisons

What do you mean by "required"? There are different forms of string equality. It's plausible to have string equality that compares the actual codepoint sequence, string equality that compares NFC or NFD forms, and string equality that compares NFKC or NFKD forms. And heck, there's also comparing strings while ignoring diacritics.
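To make those distinctions concrete, here is a minimal sketch using Python's standard `unicodedata` module. The function names are made up for illustration, and picking NFC and NFKC (rather than NFD and NFKD) is an arbitrary choice, since either member of each pair gives the same equality result.

```python
import unicodedata

def equal_codepoints(a: str, b: str) -> bool:
    # Strict equality: the two strings must be the same code point sequence.
    return a == b

def equal_canonical(a: str, b: str) -> bool:
    # Canonical equivalence: compare after normalizing both sides to NFC.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def equal_compatibility(a: str, b: str) -> bool:
    # Compatibility equivalence: compare after normalizing both sides to NFKC.
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

composed = "\u00e9"     # é as one precomposed code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"     # the "fi" ligature, a compatibility character

print(equal_codepoints(composed, decomposed))   # False
print(equal_canonical(composed, decomposed))    # True
print(equal_canonical(ligature, "fi"))          # False
print(equal_compatibility(ligature, "fi"))      # True
```

Each function answers a different question, which is why "equal" has no single meaning until you say which relation you want.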

Any well-behaving software that's operating on user text should indeed do something other than just comparing the codepoints. In the case of searching a document, it's reasonable to do a diacritic-insensitive search, so if you search for "e" you could find "é" and "ê". But that's not true of all cases.
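A rough sketch of the diacritic-insensitive search case, again with Python's `unicodedata`. The helpers `fold` and `find_insensitive` are hypothetical names, and dropping every character with a nonzero combining class is a simplification: it does nothing for letters such as ø that have no decomposition.

```python
import unicodedata

def fold(text: str) -> str:
    # Decompose to NFD, then drop combining marks (nonzero combining class).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

def find_insensitive(needle: str, candidates: list[str]) -> list[str]:
    # Return the candidates that contain the needle once both are folded.
    target = fold(needle)
    return [c for c in candidates if target in fold(c)]

print(find_insensitive("e", ["\u00e9", "\u00ea", "a"]))  # ['é', 'ê']
```

This is fine for a "find in document" box, but it would be the wrong relation for, say, checking whether two usernames or filenames are identical.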

PaulAJ · on March 18, 2015

It's part of the Unicode standard. See http://en.wikipedia.org/wiki/Unicode_equivalence for details.

(OK, so "required" might be overstating it; you are perfectly free to write a program that doesn't conform to the standard. But most people will consider that a bug unless there is a good reason for it.)

lilyball · on March 18, 2015

Unicode defines equivalence relations, yes. But nowhere is a program that uses Unicode required to use an equivalence relation whenever it wishes to compare two strings. It probably should use one, but there are various reasons why it might want strict equality for certain operations.

stevejones · on March 17, 2015

In some languages those accented characters would be different letters, sometimes appearing far away from each other in collation order. In other cases they are basically the same letter. And in Hungarian, 'dzs' is a single letter.

lilyball · on March 17, 2015

Different languages can define different collation rules even when they use the same graphemes. For example, in Swedish z < ö, but in German ö < z. Same graphemes, different collation.
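As a toy illustration of the z/ö example, here are two hand-written sort keys in Python. Both are simplified assumptions, not real locale data: the Swedish key places å, ä, ö after z, and the German key treats ö as o. Real software would normally rely on locale-aware collation tables rather than anything like this.

```python
import unicodedata

# Assumed simplified Swedish alphabet: a-z followed by å, ä, ö.
SWEDISH_ORDER = {
    ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6")
}

# Assumed simplified German folding: umlauts sort as their base vowel.
GERMAN_FOLD = str.maketrans(
    {"\u00e4": "a", "\u00f6": "o", "\u00fc": "u", "\u00df": "ss"}
)

def swedish_key(word: str) -> list[int]:
    word = unicodedata.normalize("NFC", word).lower()
    # Characters outside the table sort after everything else.
    return [SWEDISH_ORDER.get(ch, len(SWEDISH_ORDER)) for ch in word]

def german_key(word: str) -> str:
    word = unicodedata.normalize("NFC", word).lower()
    return word.translate(GERMAN_FOLD)

letters = ["\u00f6", "z"]
print(sorted(letters, key=swedish_key))  # ['z', 'ö']
print(sorted(letters, key=german_key))   # ['ö', 'z']
```

The input strings are identical in both calls; only the key function, standing in for the locale, changes the result.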

vidarh · on March 18, 2015

And we may even have more than one set of collation rules within the same language.

E.g. Norwegian had two common ways of collating æ,ø,å and their alternative forms ae, oe and aa. Phone books used to collate "ae" with æ, "oe" with ø and "aa" with å, while in other contexts "ae", "oe" and "aa" would often be collated based on their constituent parts. It's a lot less common these days for the pairs to be collated with æøå, but still not unheard of.
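The two Norwegian conventions can be sketched as one sort key with a switch. Everything here is an illustrative assumption: the `norwegian_key` function, its `digraphs_as_letters` flag, and the naive substring replacement, which would also rewrite "ae", "oe" or "aa" where they are not really stand-ins for æ, ø, å.

```python
import unicodedata

# Norwegian alphabet order: a-z followed by æ, ø, å.
NORWEGIAN_ORDER = {
    ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5")
}

DIGRAPHS = (("ae", "\u00e6"), ("oe", "\u00f8"), ("aa", "\u00e5"))

def norwegian_key(word: str, digraphs_as_letters: bool = False) -> list[int]:
    word = unicodedata.normalize("NFC", word).lower()
    if digraphs_as_letters:
        # Phone-book style: collate "ae", "oe", "aa" with æ, ø, å.
        for digraph, letter in DIGRAPHS:
            word = word.replace(digraph, letter)
    # Characters outside the table sort after everything else.
    return [NORWEGIAN_ORDER.get(ch, len(NORWEGIAN_ORDER)) for ch in word]

names = ["Zahl", "Aasen", "Berg"]
print(sorted(names, key=norwegian_key))
# ['Aasen', 'Berg', 'Zahl']
print(sorted(names, key=lambda n: norwegian_key(n, digraphs_as_letters=True)))
# ['Berg', 'Zahl', 'Aasen']
```

With the flag off, "Aasen" sorts under A by its constituent letters; with it on, it moves to the end of the alphabet alongside å.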

Of course it truly becomes entertaining to try to sort things out when mixing in "foreign" characters. E.g. I would be inclined to collate ö together with ø if collating predominantly Norwegian strings, since ö used to be fairly commonly used in Norway too, but these days you might also find it collated with "o".