I have a string that can contain arbitrary Unicode characters and I want to get a prefix of that string whose UTF-8 encoded length is as close as possible to 32 bytes, while still being valid UTF-8 and without changing the characters' meaning (i.e. not cutting off an extended grapheme cluster).
Consider this CORRECT example:
let string = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0073}\u{E0063}\u{E0074}\u{E007F}\u{1F1EA}\u{1F1FA}"
print(string) // 🏴🇪🇺
print(string.count) // 2
print(string.utf8.count) // 36
let prefix = string.utf8Prefix(32) // <-- function I want to implement
print(prefix) // 🏴
print(prefix.count) // 1
print(prefix.utf8.count) // 28
print(string.hasPrefix(prefix)) // true
And this example of a WRONG implementation:
let string = "ar\u{1F3F4}\u{200D}\u{2620}\u{FE0F}\u{1F3F4}\u{200D}\u{2620}\u{FE0F}\u{1F3F4}\u{200D}\u{2620}\u{FE0F}"
print(string) // ar🏴☠️🏴☠️🏴☠️
print(string.count) // 5
print(string.utf8.count) // 41
let prefix = string.wrongUTF8Prefix(32) // <-- wrong implementation
print(prefix) // ar🏴☠️🏴☠️🏴
print(prefix.count) // 5
print(prefix.utf8.count) // 32
print(string.hasPrefix(prefix)) // false
What's an elegant way to do this? (besides trial&error)
I discovered that String
and String.UTF8View
share the same indices, so I managed to create a very simple (and efficient?) solution, I think:
extension String {
func utf8Prefix(_ maxLength: Int) -> Substring {
if self.utf8.count <= maxLength {
return Substring(self)
}
var index = self.utf8.index(self.startIndex, offsetBy: maxLength+1)
self.formIndex(before: &index)
return self.prefix(upTo: index)
}
}
maxLength == 32
and startIndex == 0
):The first case (utf8.count <= maxLength
) should be clear, that's where no work is needed.
For the second case we first get the utf8-index 33
, which is either
So if we now move our index back one character (with formIndex(before:)
) this will jump to the first extended grapheme cluster boundary before index
which in case A and B is one character before and in C the start of that character.
I any case, the utf8-index will now be guaranteed to be at most 32
and at an extended grapheme cluster boundary, so prefix(upTo: index)
will safely create a prefix with length ≤32.
…but it's not perfect.
In theory this should also be always the optimal solution, i.e. the prefix's count
is as close as possible to maxLength
but sometimes when the string ends with an extended grapheme cluster consisting of more than one Unicode scalar, formIndex(before: &index)
goes back one character too many than would be necessary, so the prefix ends up shorter. I'm not exactly sure why that's the case.
EDIT: A not as elegant but in exchange completely "correct" solution would be this (still only O(n)):
extension String {
func utf8Prefix(_ maxLength: Int) -> Substring {
if self.utf8.count <= maxLength {
return Substring(self)
}
let endIndex = self.utf8.index(self.startIndex, offsetBy: maxLength)
var index = self.startIndex
while index <= endIndex {
self.formIndex(after: &index)
}
self.formIndex(before: &index)
return self.prefix(upTo: index)
}
}