This should be simple enough…
import Foundation
let str:String = "Beyonce\u{301} and Tay"
print(str)
print(str.components(separatedBy: CharacterSet(charactersIn: "e")))
This compiles fine, but crashes when I run the executable:
// Beyoncé and Tay
// Illegal instruction (core dumped)
I suspect Swift has a hard time dealing with the combining acute accent '\u{301}', but given how much the language stresses its grapheme-based string model, I expected that splitting "Beyonce\u{301} and Tay" on 'e' would just give ["B", "yonce\u{301} and Tay"], since 'e\u{301}' should be interpreted as a single grapheme rather than an 'e' plus a combining acute.
Splitting on the string "e" rather than a CharacterSet does not crash:
print(str.components(separatedBy: "e"))
// ["B", "yoncé and Tay"]
My Swift version is:
swiftc -version
Swift version 3.0-dev (LLVM 3e3d712024, Clang 09ad59b006, Swift fdf6ee20e4)
Target: x86_64-unknown-linux-gnu
It looks like there is a bug in the Linux port for Swift. I won't address that in my answer. The code below was tested on Mac OS X.
You ran into a Unicode normalization problem. The letter é can be expressed in two ways, which Swift considers identical:
let s1 = "e\u{301}" // letter e + combining acute accent
let s2 = "\u{0e9}" // small letter e with acute
s1.characters.count // 1
s2.characters.count // 1
s1 == s2 // true
That's because Swift's String, like its predecessor NSString, has very good support for Unicode. But if you delve deeper, you start to see some differences:
s1.utf16.count // 2
s2.utf16.count // 1
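To see exactly what differs, you can list the Unicode scalar values behind each string. This is only a small sketch; the names scalars1 and scalars2 are made up for illustration, and it assumes a toolchain where String exposes the unicodeScalars view (which Swift 3 and later do).

```swift
let s1 = "e\u{301}" // letter e + combining acute accent
let s2 = "\u{0e9}"  // small letter e with acute

// Collect the numeric value of every Unicode scalar in each string.
let scalars1 = s1.unicodeScalars.map { $0.value }
let scalars2 = s2.unicodeScalars.map { $0.value }

// Print the values in hexadecimal so they match the \u{...} escapes.
print(scalars1.map { String($0, radix: 16) }) // ["65", "301"]
print(scalars2.map { String($0, radix: 16) }) // ["e9"]
```

The first string really does contain a plain e (U+0065) followed by U+0301, which is what a code-point-level search for e will find.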
So even though s1 and s2 compare equal, they are stored differently: as 2 code points or 1. components(separatedBy:) is blind to this fact. It iterates over all the code points in your string and splits wherever it finds the letter e. Converting from one form to another is called normalization, and it affects how the function works:
let str1 = "Beyonce\u{301} and Tay"
let str2 = str1.precomposedStringWithCanonicalMapping // normalize the string to Form C
let charset = CharacterSet(charactersIn: "e")
str1.components(separatedBy: charset) // ["B", "yonc", "́ and Tay"]
str2.components(separatedBy: charset) // ["B", "yoncé and Tay"]
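If you want the grapheme-based behaviour the question expected without normalizing first, one option is to split on a Character instead of a CharacterSet. The sketch below assumes Swift 4 or later, where String is itself a collection of Character (in Swift 3 you would go through str.characters); the name parts is just for illustration. Note that split drops empty pieces by default, unlike components(separatedBy:).

```swift
let str = "Beyonce\u{301} and Tay"

// Split on the Character "e". The grapheme "e\u{301}" is a different
// Character, so it is not treated as a separator.
let parts = str.split(separator: "e").map { String($0) }

print(parts.count)                                // 2
print(parts == ["B", "yonce\u{301} and Tay"])     // true
```

This works on Character boundaries, so the combining accent is never separated from its base letter.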