Search code examples
swiftstringunicodeutf-8

Swift 5: String prefix with a maximum UTF-8 length


I have a string that can contain arbitrary Unicode characters and I want to get a prefix of that string whose UTF-8 encoded length is as close as possible to 32 bytes, while still being valid UTF-8 and without changing the characters' meaning (i.e. not cutting off an extended grapheme cluster).

Consider this CORRECT example:

let string = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0073}\u{E0063}\u{E0074}\u{E007F}\u{1F1EA}\u{1F1FA}"
print(string)                    // 🏴󠁧󠁢󠁳󠁣󠁴󠁿🇪🇺
print(string.count)              // 2
print(string.utf8.count)         // 36

let prefix = string.utf8Prefix(32)  // <-- function I want to implement 
print(prefix)                    // 🏴󠁧󠁢󠁳󠁣󠁴󠁿
print(prefix.count)              // 1
print(prefix.utf8.count)         // 28

print(string.hasPrefix(prefix))  // true

And this example of a WRONG implementation:

let string = "ar\u{1F3F4}\u{200D}\u{2620}\u{FE0F}\u{1F3F4}\u{200D}\u{2620}\u{FE0F}\u{1F3F4}\u{200D}\u{2620}\u{FE0F}"
print(string)                    // ar🏴‍☠️🏴‍☠️🏴‍☠️
print(string.count)              // 5
print(string.utf8.count)         // 41

let prefix = string.wrongUTF8Prefix(32)  // <-- wrong implementation 
print(prefix)                    // ar🏴‍☠️🏴‍☠️🏴
print(prefix.count)              // 5
print(prefix.utf8.count)         // 32

print(string.hasPrefix(prefix))  // false

What's an elegant way to do this? (besides trial&error)


Solution

  • I discovered that String and String.UTF8View share the same indices, so I managed to create a very simple (and efficient?) solution, I think:

    extension String {
        func utf8Prefix(_ maxLength: Int) -> Substring {
            if self.utf8.count <= maxLength {
                return Substring(self)
            }
    
            var index = self.utf8.index(self.startIndex, offsetBy: maxLength+1)
            self.formIndex(before: &index)
            return self.prefix(upTo: index)
        }
    }
    

    Explanation (assuming maxLength == 32 and startIndex == 0):

    The first case (utf8.count <= maxLength) should be clear, that's where no work is needed.
    For the second case we first get the utf8-index 33, which is either

    • A: the endIndex of the string (if it's exactly 33 bytes long),
    • B: an index at the start of a character (after 33 bytes of previous characters)
    • C: an index somewhere in the middle of a character (after <33 bytes of previous characters)

    So if we now move our index back one character (with formIndex(before:)) this will jump to the first extended grapheme cluster boundary before index which in case A and B is one character before and in C the start of that character.
    I any case, the utf8-index will now be guaranteed to be at most 32 and at an extended grapheme cluster boundary, so prefix(upTo: index) will safely create a prefix with length ≤32.


    …but it's not perfect.
    In theory this should also be always the optimal solution, i.e. the prefix's count is as close as possible to maxLength but sometimes when the string ends with an extended grapheme cluster consisting of more than one Unicode scalar, formIndex(before: &index) goes back one character too many than would be necessary, so the prefix ends up shorter. I'm not exactly sure why that's the case.

    EDIT: A not as elegant but in exchange completely "correct" solution would be this (still only O(n)):

    extension String {
        func utf8Prefix(_ maxLength: Int) -> Substring {
            if self.utf8.count <= maxLength {
                return Substring(self)
            }
    
            let endIndex = self.utf8.index(self.startIndex, offsetBy: maxLength)
            var index = self.startIndex
            while index <= endIndex {
                self.formIndex(after: &index)
            }
            self.formIndex(before: &index)
            return self.prefix(upTo: index)
        }
    }