I know that NSRegularExpression works on Unicode code points and (normal) JavaScript regex works on UTF-16 code units, but I don't know what I should change in my regex.
Regex: <text[^>]+>([^<]+)<\/text>
Works here: regex101
My parsing method:
func parseCaptions(text: String) -> String? {
    let textRange = NSRange(location: 0, length: text.count)
    let regex = try! NSRegularExpression(pattern: "<text[^>]+>([^<]+)<\\/text>")
    let matches = regex.matches(in: text, range: textRange)
    var result: String?
    for match in matches {
        let range = match.range
        let first = text.index(text.startIndex, offsetBy: range.location)
        let last = text.index(text.startIndex, offsetBy: range.location + range.length)
        var string = String(text[first...last])
        string = string.replacingOccurrences(of: "\n", with: " ")
        string = string.replacingOccurrences(of: "&#39;", with: "'")
        string = string.replacingOccurrences(of: "&quot;", with: "\"")
        string.append("\n")
        result = string
    }
    return result
}
The regex isn't the issue; it's what you do with the matches.
You do:
var result: String?
for match in matches {
    let range = match.range
    let first = text.index(text.startIndex, offsetBy: range.location)
    let last = text.index(text.startIndex, offsetBy: range.location + range.length)
    var string = String(text[first...last])
    ...
    result = string
}
return result
So you're overwriting result on every iteration, and only the last match is returned.
A solution:
func parseCaptions(text: String) -> String {
    // NSRange is based on NSString, which counts UTF-16 code units, while Swift's `String.count` counts Characters (grapheme clusters), so `text.count` might be wrong
    let textRange = NSRange(location: 0, length: text.utf16.count)
    let regex = try! NSRegularExpression(pattern: "<text[^>]+>([^<]+)<\\/text>")
    let matches = regex.matches(in: text, range: textRange)
    var result: String = ""
    for match in matches {
        // Capture group 1 is the text between the tags; range(at: 0) would be the whole match
        let textNSRange = match.range(at: 1)
        let textRange = Range(textNSRange, in: text)!
        var string = String(text[textRange])
        string = string.replacingOccurrences(of: "\n", with: " ")
        string = string.replacingOccurrences(of: "&#39;", with: "'")
        string = string.replacingOccurrences(of: "&quot;", with: "\"")
        string.append("\n")
        result.append(string)
    }
    return result
}
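To see why the length passed to NSRange matters, here is a minimal sketch comparing the two counts. The sample string is made up for illustration; any text containing characters outside the Basic Multilingual Plane (emoji, for instance) should show the same kind of difference.

```swift
import Foundation

// Hypothetical sample string: one emoji followed by ASCII text.
let sample = "👍 ok"

// String.count counts Characters (extended grapheme clusters).
let characterCount = sample.count

// utf16.count counts UTF-16 code units, which is the unit NSRange uses.
let utf16Count = sample.utf16.count

// Building the NSRange from String indices avoids counting by hand.
let fullRange = NSRange(sample.startIndex..<sample.endIndex, in: sample)

print(characterCount, utf16Count, fullRange.length) // prints "4 5 5"
```

With `text.count` the range would stop one code unit short here, so a match near the end of the string could be missed.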
So, with input:
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<transcript>
<text start="9.462" dur="1.123">Aaaah</text>
<text start="70.507" dur="5.51">So guys, apparently we control Rewind this year.</text>
<text start="76.017" dur="4.842">
Y'all we can do whatever we want. What do we do?
</text>
</transcript>
We get:
Aaaah
So guys, apparently we control Rewind this year.
Y'all we can do whatever we want. What do we do?
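If you'd rather avoid the force unwraps, the same idea can be sketched with `compactMap`. This is only one possible variant, not the only way to do it: the helper name `captionLines(in:)` and the shortened sample input are made up for illustration, and it returns an array of lines instead of one joined string.

```swift
import Foundation

// Hypothetical helper (not part of the code above): one cleaned caption per match.
func captionLines(in text: String) -> [String] {
    // Same pattern as in the question; try? returns nil instead of crashing on a bad pattern.
    guard let regex = try? NSRegularExpression(pattern: "<text[^>]+>([^<]+)<\\/text>") else {
        return []
    }
    // NSRange covering the whole string, measured in UTF-16 code units.
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, range: fullRange).compactMap { match -> String? in
        // Capture group 1 holds the text between the tags.
        guard let range = Range(match.range(at: 1), in: text) else { return nil }
        return text[range]
            .replacingOccurrences(of: "\n", with: " ")
            .replacingOccurrences(of: "&#39;", with: "'")
            .replacingOccurrences(of: "&quot;", with: "\"")
    }
}

// Shortened, made-up input in the same shape as the transcript above.
let xml = """
<transcript>
<text start="9.462" dur="1.123">Aaaah</text>
<text start="76.017" dur="4.842">Y&#39;all we can do whatever we want.</text>
</transcript>
"""
let lines = captionLines(in: xml)
print(lines.joined(separator: "\n"))
```

Joining the array at the call site keeps the trailing-newline decision out of the parsing function.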