
Swift string diacriticInsensitive not working correctly


I am doing diacritic-insensitive folding on a string. With a Swedish locale it converts the letters åäö to aao, but the iPhone keyboard has åäö as separate keys. I don't understand why these three letters were converted. Is there an error in my code? Shouldn't letters that appear on the keyboard be left unconverted?

import Foundation

print("åäö".folding(options: .diacriticInsensitive, locale: Locale(identifier: "sv"))) // prints "aao"

My iPhone keyboard: [image]


Solution

  • This is exactly what diacriticInsensitive means. UTR #30 covers this: "diacritic removal" includes "stroke, hook, descender" and all other "diacritics," returning the "related base character." While Swedish treats å as a separate letter for sorting purposes, it still has a "base character" of (Latin) a. (Similarly for ä and ö.) Whether these count as letters in their own right is a complex question in Swedish, but the results should not be surprising.

    The ultimate rules are in Unicode's DiacriticFolding. These rules are not locale specific. It's possible that Foundation applies some additional locale rules, but clearly not in this case. The relevant Unicode folding rule is:

    0061 030A;  0061    # å → a LATIN SMALL LETTER A, COMBINING RING ABOVE → LATIN SMALL LETTER A
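
    The rule can be checked from Swift. The following is a minimal sketch, not a description of how Foundation is implemented: it decomposes å into the two scalars named in the rule, folds it, and then repeats the question's call with a few locales (the choice of "en_US" and nil is only for illustration).

    ```swift
    import Foundation

    let precomposed = "\u{00E5}"   // å as a single scalar
    let decomposed = "a\u{030A}"   // a + COMBINING RING ABOVE

    // Canonical decomposition exposes the two scalars named in the folding rule.
    let scalars = precomposed.decomposedStringWithCanonicalMapping
        .unicodeScalars
        .map { String($0.value, radix: 16, uppercase: true) }
    print(scalars)                 // ["61", "30A"]

    // Folding drops the combining mark and keeps the base letter.
    let folded = decomposed.folding(options: .diacriticInsensitive, locale: nil)
    print(folded)                  // a

    // Same call as in the question, with different locales (nil means no locale).
    let locales: [Locale?] = [Locale(identifier: "sv"), Locale(identifier: "en_US"), nil]
    let results = locales.map {
        "åäö".folding(options: .diacriticInsensitive, locale: $0)
    }
    print(results)
    ```

    If the folding is locale independent for these letters, as described above, every entry in results is "aao".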
    

    Many cultures have subtle definitions of what is "a letter" vs "an extension of another letter" vs "a half-letter" vs "a non-letter symbol." When computing diacritics, the Turkish "İ" has a base form of "I", but "i" does not have a base form of "ı". That's bizarre, but true, because the folding treats "Basic Latin" as the base alphabet. ("Basic Latin" is itself a bizarre classification, with the letters j, u, and w being somewhat modern additions. But still we call it "Latin.")
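
    The Turkish case can be tried the same way. This is a small sketch under the assumption that Foundation's folding follows the canonical decomposition of İ (U+0130) into I plus COMBINING DOT ABOVE; the variable names are only illustrative.

    ```swift
    import Foundation

    // İ (U+0130) carries a dot that counts as a diacritic on top of "I".
    let dottedCapital = "\u{0130}".folding(options: .diacriticInsensitive, locale: nil)
    print(dottedCapital)   // I

    // Plain "i" is already a Basic Latin letter, so there is nothing to remove.
    let plainSmall = "i".folding(options: .diacriticInsensitive, locale: nil)
    print(plainSmall)      // i
    ```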

    Unicode tries to "thread the needle" on these complex issues, with varying success. It tends to be biased towards Romance languages (and particularly Western European countries). But it does try. And it has a focus on what users will expect. So should a search for "halla" find "Hallå"? I'm betting that most Swedes would consider that "close enough."
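
    In search code, this is the difference between passing .diacriticInsensitive and leaving it out. A minimal sketch, with a made-up sample string:

    ```swift
    import Foundation

    let text = "Hallå världen"   // hypothetical sample text

    // Case- and diacritic-insensitive search: "halla" matches "Hallå".
    let looseMatch = text.range(
        of: "halla",
        options: [.caseInsensitive, .diacriticInsensitive]
    ) != nil
    print(looseMatch)    // true

    // Without .diacriticInsensitive, å and a are different characters.
    let strictMatch = text.range(of: "halla", options: .caseInsensitive) != nil
    print(strictMatch)   // false
    ```

    If å must stay distinct from a in your app, omit .diacriticInsensitive rather than trying to fix it through the locale.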

    Keyboards are designed to be useful to the cultures they're created for, so whether a particular symbol appears on the keyboard shouldn't be assumed to be making any strong claim about how the alphabet works. The iOS Arabic keyboard includes the half-letter "ء". That isn't making a claim about how the alphabet works. It's just saying that ء is somewhat commonly typed when writing Arabic.