I have this function that seeks a FileHandle to a specific word and returns the resulting file offset:
func getFilePointerIndex(atWord word: String, inFile file: FileHandle) -> UInt64? {
    let offset = file.offsetInFile
    if let str = String(data: file.readDataToEndOfFile(), encoding: .utf8) {
        if let range = str.range(of: word) {
            let intIndex = str.distance(from: str.startIndex, to: range.lowerBound)
            file.seek(toFileOffset: offset + UInt64(intIndex))
            return UInt64(intIndex) + offset
        }
    }
    return nil
}
When applied to some UTF-8 text files, it yields offsets far from the location of the word passed in. I suspect the cause is the character encoding (variable-width characters), since seek(toFileOffset:) takes a byte offset, as with Data objects, while my index is computed on a String.
Any idea how to fix it?
let intIndex = str.distance(from: str.startIndex, to: range.lowerBound)
measures the distance in Characters, i.e. “extended Unicode grapheme clusters”. For example, the character "€" would be stored as three bytes "0xE2 0x82 0xAC" in UTF-8 encoding, but counts as a single Character.
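To make the mismatch concrete, here is a small sketch comparing the two distances on a made-up string. The sample text and the constant names are assumptions for illustration only; they do not come from the question.

```swift
import Foundation

// Hypothetical sample text containing one multi-byte character ("€").
let sample = "Price: 5 € for word"

// Force-unwrapping is acceptable here only because the sample is known to contain "word".
let wordRange = sample.range(of: "word")!

// Distance counted in Characters (grapheme clusters): "€" counts as 1.
let characterOffset = sample.distance(from: sample.startIndex, to: wordRange.lowerBound)

// Distance counted in UTF-8 code units (bytes): "€" counts as 3.
let utf8Offset = sample.utf8.distance(from: sample.utf8.startIndex, to: wordRange.lowerBound)

print(characterOffset, utf8Offset)
```

The two values differ by two here, one for each extra byte of "€". Every multi-byte character before the match widens the gap, which would explain offsets that drift further off in longer files.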
To measure the distance in UTF-8 code units, use
let intIndex = str.utf8.distance(from: str.utf8.startIndex, to: range.lowerBound)
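Put together, the function from the question with this one change might look like the sketch below. The guard-based layout, the temporary file, and the sample text are assumptions added for illustration; only the utf8 distance line is the actual fix.

```swift
import Foundation

func getFilePointerIndex(atWord word: String, inFile file: FileHandle) -> UInt64? {
    let offset = file.offsetInFile
    guard let str = String(data: file.readDataToEndOfFile(), encoding: .utf8),
          let range = str.range(of: word) else {
        return nil
    }
    // Count UTF-8 code units (bytes), not Characters, so the result is a valid file offset.
    let byteIndex = str.utf8.distance(from: str.utf8.startIndex, to: range.lowerBound)
    let target = offset + UInt64(byteIndex)
    file.seek(toFileOffset: target)
    return target
}

// Usage with a hypothetical temporary file (file name and contents are made up).
let url = FileManager.default.temporaryDirectory
    .appendingPathComponent("sample-\(UUID().uuidString).txt")
try! "Price: 5 € for word".write(to: url, atomically: true, encoding: .utf8)

let handle = try! FileHandle(forReadingFrom: url)
let position = getFilePointerIndex(atWord: "word", inFile: handle)
print(position as Any)

// After the call, the handle is positioned at the start of the match.
let tail = String(data: handle.readDataToEndOfFile(), encoding: .utf8)
handle.closeFile()
try? FileManager.default.removeItem(at: url)
```

Note that decoding can still fail (returning nil) if the handle's starting offset falls in the middle of a multi-byte character, since the remaining data would then not be valid UTF-8.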
See also Strings in Swift 2 on the Swift blog for an overview of grapheme clusters and the different views of a Swift string.