
Why does splitting a string with accents crash?


This should be simple enough…

import Foundation

let str:String = "Beyonce\u{301} and Tay"
print(str)

print(str.components(separatedBy: CharacterSet(charactersIn: "e")))

This compiles fine, but crashes when I run the executable:

// Beyoncé and Tay
// Illegal instruction (core dumped)

I suspect Swift has a hard time dealing with the combining acute accent '\u{301}', but given how much the language stresses its grapheme-based string model, I expected splitting "Beyonce\u{301} and Tay" on 'e' to give ["B", "yonce\u{301} and Tay"], since 'e\u{301}' should be treated as a single grapheme rather than an 'e' followed by a combining acute.

Splitting on a string separator instead of a CharacterSet does not crash:

print(str.components(separatedBy: "e"))
// ["B", "yoncé and Tay"]

My Swift version is:

swiftc -version
Swift version 3.0-dev (LLVM 3e3d712024, Clang 09ad59b006, Swift fdf6ee20e4)
Target: x86_64-unknown-linux-gnu

Solution

  • It looks like there is a bug in the Linux port for Swift. I won't address that in my answer. The code below was tested on Mac OS X.

    You ran into a Unicode normalization problem. The letter é can be expressed in two ways, which Swift considers equal:

    let s1 = "e\u{301}" // letter e + combining acute accent
    let s2 = "\u{0e9}"  // small letter e with acute
    
    s1.characters.count // 1
    s2.characters.count // 1
    s1 == s2            // true
    

    That's because Swift's String, like its predecessor NSString, has very good support for Unicode. But if you delve deeper, you start to see some differences:

    s1.utf16.count // 2
    s2.utf16.count // 1
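
To make the difference visible, here is a minimal sketch that lists the Unicode scalars behind each string. The helper name scalarList is hypothetical and only for illustration; it relies on nothing beyond the standard library's unicodeScalars view.

```swift
let s1 = "e\u{301}" // decomposed: letter e + combining acute accent
let s2 = "\u{e9}"   // precomposed: small letter e with acute

// Hypothetical helper: formats every Unicode scalar of a string as U+XXXX.
func scalarList(_ s: String) -> [String] {
    return s.unicodeScalars.map { "U+" + String($0.value, radix: 16, uppercase: true) }
}

print(scalarList(s1)) // ["U+65", "U+301"]
print(scalarList(s2)) // ["U+E9"]
```

The plain letter e (U+65) is present as its own scalar in s1 but not in s2, which is what matters for anything that works below the grapheme level.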
    

    So even though s1 and s2 compare equal, they are stored differently: s1 uses two code points and s2 uses one. components(separatedBy:) with a CharacterSet is blind to this fact. It iterates over the code points in your string and splits wherever it finds the letter e. Converting from one form to the other is called normalization, and it affects how the function works:

    let str1 = "Beyonce\u{301} and Tay"
    let str2 = str1.precomposedStringWithCanonicalMapping // normalize the string to Form C
    
    let charset = CharacterSet(charactersIn: "e")
    str1.components(separatedBy: charset) // ["B", "yonc", "́ and Tay"]
    str2.components(separatedBy: charset) // ["B", "yoncé and Tay"]
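
If normalizing is not desirable, one alternative is to split at the grapheme level instead. The sketch below assumes Swift 4 or later, where String is a collection of Character values; that is newer than the Swift 3 toolchain in the question, and it does not go through Foundation at all.

```swift
let str = "Beyonce\u{301} and Tay"
let separator: Character = "e"

// split(separator:) compares whole Characters (grapheme clusters),
// so the "e" inside "e\u{301}" is not treated as a separator.
let parts = str.split(separator: separator).map { String($0) }

print(parts.count) // 2
print(parts[0])    // B
```

Because the comparison happens per Character, the second piece keeps the accented é intact without any prior normalization.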
    

    References: