The character 👩‍👩‍👧‍👦 (family with two women, one girl, and one boy) is encoded as such:
U+1F469 WOMAN, U+200D ZWJ, U+1F469 WOMAN, U+200D ZWJ, U+1F467 GIRL, U+200D ZWJ, U+1F466 BOY
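The encoding listed above can be checked directly. This is a minimal sketch that builds the string from escapes and lists its Unicode scalars; the names `family` and `familyScalars` are only illustrative and do not come from the original post.

```swift
// The seven code points listed above, written as escapes so the source
// does not depend on how an editor renders the emoji.
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

// Hex value of every Unicode scalar, in order of appearance.
let familyScalars = family.unicodeScalars.map {
    String($0.value, radix: 16, uppercase: true)
}

print(familyScalars)
// ["1F469", "200D", "1F469", "200D", "1F467", "200D", "1F466"]
```

It relies only on the `unicodeScalars` view, so it does not depend on how the string is split into characters.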
So it's very interestingly-encoded; the perfect target for a unit test. However, Swift doesn't seem to know how to treat it. Here's what I mean:
"👩‍👩‍👧‍👦".contains("👩‍👩‍👧‍👦") // true
"👩‍👩‍👧‍👦".contains("👩") // false
"👩‍👩‍👧‍👦".contains("\u{200D}") // false
"👩‍👩‍👧‍👦".contains("👧") // false
"👩‍👩‍👧‍👦".contains("👦") // true
So, Swift says it contains itself (good) and a boy (good!). But it then says it does not contain a woman, girl, or zero-width joiner. What's happening here? Why does Swift know it contains a boy but not a woman or girl? I could understand if it treated it as a single character and only recognized it containing itself, but the fact that it got one subcomponent and no others baffles me.
This does not change if I use something like "👩".characters.first!.
Even more confounding is this:
let manual = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
Array(manual.characters) // ["👩‍", "👩‍", "👧‍", "👦"]
Even though I placed the ZWJs in there, they aren't reflected in the character array. What followed was a little telling:
manual.contains("👩") // false
manual.contains("👧") // false
manual.contains("👦") // true
So I get the same behavior with the character array... which is supremely annoying, since I know what the array looks like.
This also does not change if I use something like "👩".characters.first!.
This has to do with how the String type works in Swift, and how the contains(_:) method works.
The '👩‍👩‍👧‍👦' is what's known as an emoji sequence, which is rendered as one visible character in a string. The sequence is made up of Character objects, and at the same time it is made up of UnicodeScalar objects.
If you check the character count of the string, you'll see that it is made up of four characters, while the Unicode scalar count gives a different result:
print("👩‍👩‍👧‍👦".characters.count) // 4
print("👩‍👩‍👧‍👦".unicodeScalars.count) // 7
Now, if you iterate over the characters and print them, you'll see what look like normal characters, but in fact the first three characters each contain both an emoji and a zero-width joiner in their UnicodeScalarView:
for char in "👩‍👩‍👧‍👦".characters {
    print(char)
    let scalars = String(char).unicodeScalars.map({ String($0.value, radix: 16) })
    print(scalars)
}
// 👩‍
// ["1f469", "200d"]
// 👩‍
// ["1f469", "200d"]
// 👧‍
// ["1f467", "200d"]
// 👦
// ["1f466"]
As you can see, only the last character does not contain a zero-width joiner, so when using the contains(_:) method, it works as you'd expect. Since you aren't comparing against emoji containing zero-width joiners, the method won't find a match for any but the last character.
To expand on this, if you create a String which is composed of an emoji character ending with a zero-width joiner, and pass it to the contains(_:) method, it will also evaluate to false. This has to do with contains(_:) being the exact same as range(of:) != nil, which tries to find an exact match to the given argument. Since characters ending with a zero-width joiner form an incomplete sequence, the method tries to find a match for the argument while combining characters ending with a zero-width joiner into a complete sequence. This means that the method won't ever find a match if the argument ends with a zero-width joiner, since such an argument is itself an incomplete sequence.
To demonstrate:
let s = "\u{1f469}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}" // 👩‍👩‍👧‍👦
s.range(of: "\u{1f469}\u{200d}") != nil // false
s.range(of: "\u{1f469}\u{200d}\u{1f469}") != nil // false
However, since the comparison only looks ahead, you can find several other complete sequences within the string by working backwards:
s.range(of: "\u{1f466}") != nil // true
s.range(of: "\u{1f467}\u{200d}\u{1f466}") != nil // true
s.range(of: "\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}") != nil // true
// Same as the above:
s.contains("\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}") // true
The easiest solution would be to provide a specific compare option to the range(of:options:range:locale:) method. The option String.CompareOptions.literal performs the comparison on an exact character-by-character equivalence. As a side note, what's meant by character here is not the Swift Character, but the UTF-16 representation of both the instance and comparison string; however, since String doesn't allow malformed UTF-16, this is essentially equivalent to comparing the Unicode scalar representation.
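To make the UTF-16 versus Unicode scalar point concrete, here is a small sketch using only the standard library views. The names `joined`, `scalarCount`, `utf16Count`, and `sameScalars` are illustrative and not part of any API.

```swift
let joined = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

// Seven Unicode scalars: four emoji and three zero-width joiners.
let scalarCount = joined.unicodeScalars.count   // 7

// Eleven UTF-16 code units: each emoji lies outside the BMP and needs a
// surrogate pair (4 x 2), while each zero-width joiner fits in one unit (3 x 1).
let utf16Count = joined.utf16.count             // 11

// Decoding the UTF-16 code units yields the same scalars again, which is
// why comparing code units and comparing scalars agree for a valid String.
let roundTripped = String(decoding: joined.utf16, as: UTF16.self)
let sameScalars = roundTripped.unicodeScalars.elementsEqual(joined.unicodeScalars) // true
```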
Here I've overloaded the Foundation method, so if you need the original one, rename this one or something:
extension String {
    func contains(_ string: String) -> Bool {
        return self.range(of: string, options: String.CompareOptions.literal) != nil
    }
}
Now the method works as it "should" with each character, even with incomplete sequences:
s.contains("👩") // true
s.contains("👩\u{200d}") // true
s.contains("\u{200d}") // true
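If you would rather not shadow contains(_:), or Foundation isn't available, a scalar-by-scalar search can be sketched with the standard library alone. The helper name `containsScalars(of:)` is hypothetical, and this is a naive search written for clarity rather than speed; it is not how Foundation implements the literal option.

```swift
extension String {
    /// Hypothetical helper: returns true if the Unicode scalars of `other`
    /// occur as one contiguous run inside this string's Unicode scalars.
    func containsScalars(of other: String) -> Bool {
        let haystack = Array(unicodeScalars)
        let needle = Array(other.unicodeScalars)
        // Treating an empty needle as "not found" is an arbitrary choice
        // made for this sketch.
        guard !needle.isEmpty, needle.count <= haystack.count else { return false }
        // Try every possible starting offset and compare scalar by scalar.
        for start in 0...(haystack.count - needle.count) {
            let window = haystack[start..<(start + needle.count)]
            if window.elementsEqual(needle) { return true }
        }
        return false
    }
}

let sample = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
let hasWoman = sample.containsScalars(of: "\u{1F469}")                 // true
let hasWomanAndJoiner = sample.containsScalars(of: "\u{1F469}\u{200D}") // true
let hasJoiner = sample.containsScalars(of: "\u{200D}")                 // true
let hasMan = sample.containsScalars(of: "\u{1F468}")                   // false
```

Because it never looks at Character boundaries, it gives the same answers whether or not the joiners are grouped with the neighbouring emoji.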