
Index distance to FileHandle pointer and characters encoding in Swift 4


I have this function to return (and seek) a FileHandle pointer at a specific word:

func getFilePointerIndex(atWord word: String, inFile file: FileHandle) -> UInt64? {
    let offset = file.offsetInFile
    if let str = String(data: file.readDataToEndOfFile(), encoding: .utf8) {
        if let range = str.range(of: word) {
            let intIndex = str.distance(from: str.startIndex, to: range.lowerBound)
            file.seek(toFileOffset: offset + UInt64(intIndex))
            return UInt64(intIndex) + offset
        }
    }
    return nil
}

When applied to some UTF-8 text files, it yields offsets far from the actual location of the word passed in. I suspect the cause is the character encoding (variable-width characters), since seek(toFileOffset:) takes an offset in bytes rather than in characters.

Any idea how to fix it?


Solution

  • let intIndex = str.distance(from: str.startIndex, to: range.lowerBound)
    

    measures the distance in Characters, i.e. “extended Unicode grapheme clusters”. For example, the character "€" is stored as three bytes (0xE2 0x82 0xAC) in UTF-8 encoding, but counts as a single Character. Every multi-byte character before the match therefore makes the computed offset too small.

    To measure the distance in UTF-8 code units, use

    let intIndex = str.utf8.distance(from: str.utf8.startIndex, to: range.lowerBound)
    

    See also “Strings in Swift 2” on the Swift blog for an overview of grapheme clusters and the different views of a Swift string.
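
To make the difference concrete, here is a minimal sketch that compares the two distances on a small string and then applies the UTF-8 distance to the function from the question. The names utf8Offset(of:in:) and seekToWord(_:in:) are hypothetical helpers introduced only for this illustration, and the sample string is made up; the sketch assumes the file content from the current offset onward is valid UTF-8.

```swift
import Foundation

// Hypothetical helper (not from the original post): offset of the first
// occurrence of `word` in `text`, measured in UTF-8 code units (bytes).
func utf8Offset(of word: String, in text: String) -> Int? {
    guard let range = text.range(of: word) else { return nil }
    return text.utf8.distance(from: text.utf8.startIndex, to: range.lowerBound)
}

// Hypothetical rewrite of the question's function using the byte distance.
// As in the original, the file pointer ends up at end-of-file on failure.
func seekToWord(_ word: String, in file: FileHandle) -> UInt64? {
    let start = file.offsetInFile
    guard let text = String(data: file.readDataToEndOfFile(), encoding: .utf8),
          let byteOffset = utf8Offset(of: word, in: text) else { return nil }
    let target = start + UInt64(byteOffset)
    file.seek(toFileOffset: target)
    return target
}

// Made-up sample: "€" is one Character but three UTF-8 bytes.
let sample = "Price: 5 € for tea"

// Distance in Characters (what the question's code measured).
let charOffset = sample.range(of: "tea").map {
    sample.distance(from: sample.startIndex, to: $0.lowerBound)
}

// Distance in UTF-8 code units (what seek(toFileOffset:) needs).
let byteOffset = utf8Offset(of: "tea", in: sample)

print(charOffset as Any, byteOffset as Any)
```

In this sample the two offsets differ by two, because the single "€" before the match occupies three bytes. In a longer file with many non-ASCII characters the gap grows accordingly, which would match the "far from the location" symptom described in the question.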