
Regular expression doesn't work in Swift, but works in other languages


I know that NSRegularExpression works on Unicode code points and (normal) JavaScript regex works on UTF-16 code units, but I don't know what I should change in my regex.

Regex: <text[^>]+>([^<]+)<\/text>

Works here: regex101

My parsing method:

func parseCaptions(text: String) -> String? {
        let textRange = NSRange(location: 0, length: text.count)
        let regex = try! NSRegularExpression(pattern: "<text[^>]+>([^<]+)<\\/text>")
        let matches = regex.matches(in: text, range: textRange)
        
        var result: String?
        
        for match in matches {
            let range = match.range
            
            let first = text.index(text.startIndex, offsetBy: range.location)
            let last = text.index(text.startIndex, offsetBy: range.location + range.length)
            
            var string = String(text[first...last])
            
            string = string.replacingOccurrences(of: "\n", with: " ")
            string = string.replacingOccurrences(of: "&amp;#39;", with: "'")
            string = string.replacingOccurrences(of: "&amp;quot;", with: "\"")
            string.append("\n")
            
            result = string
        }
        
        return result
    }

Solution

  • The regex isn't the issue; it's what you do with the matches.

    You do:

    var result: String?
    
    for match in matches {
        let range = match.range
        let first = text.index(text.startIndex, offsetBy: range.location)
        let last = text.index(text.startIndex, offsetBy: range.location + range.length)
    
        var string = String(text[first...last])
        ...
        result = string
    }
    return result
    

    So on every iteration you overwrite result, and only the last match is returned.
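
Two smaller details of that loop are also worth checking. The sketch below uses a made-up one-line input (`sample` is hypothetical, not from the question) to show that `match.range` covers the whole match, tags included, and that the closed range `first...last` picks up one character past the match. It assumes ASCII input, so that Character offsets and UTF-16 offsets happen to agree.

```swift
import Foundation

// Hypothetical sample input, chosen so the match does not end at the end of the string.
let sample = "x<text a=\"1\">hi</text>y"
let regex = try! NSRegularExpression(pattern: "<text[^>]+>([^<]+)<\\/text>")
let fullRange = NSRange(location: 0, length: sample.utf16.count)
let match = regex.firstMatch(in: sample, range: fullRange)!

// Same index arithmetic as the loop in the question (only valid here because the sample is ASCII).
let whole = match.range
let first = sample.index(sample.startIndex, offsetBy: whole.location)
let last = sample.index(sample.startIndex, offsetBy: whole.location + whole.length)

let closed = String(sample[first...last])    // whole match plus one extra character
let halfOpen = String(sample[first..<last])  // whole match, tags included
let captured = String(sample[Range(match.range(at: 1), in: sample)!])  // capture group 1 only

print(closed)    // <text a="1">hi</text>y
print(halfOpen)  // <text a="1">hi</text>
print(captured)  // hi
```

If the match ended at the very end of the string, `last` would equal `endIndex` and the closed range would trap at runtime, so a half-open range or `Range(_:in:)` is the safer choice.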

    A solution:

    func parseCaptions(text: String) -> String {
        // NSRange counts UTF-16 code units (as NSString does), while `text.count` counts Characters, so `text.count` can give the wrong length
        let textRange = NSRange(location: 0, length: text.utf16.count)
        let regex = try! NSRegularExpression(pattern: "<text[^>]+>([^<]+)<\\/text>")
        let matches = regex.matches(in: text, range: textRange)
    
        var result: String = ""
        for match in matches {
            let textNSRange = match.range(at: 1)
            let textRange = Range(textNSRange, in: text)!
            var string = String(text[textRange])
            string = string.replacingOccurrences(of: "\n", with: " ")
            string = string.replacingOccurrences(of: "&#39;", with: "'")
            string = string.replacingOccurrences(of: "&amp;quot;", with: "\"")
            string.append("\n")
            result.append(string)
        }
        return result
    }
    

    So, with input:

    This XML file does not appear to have any style information associated with it. The document tree is shown below.
    <transcript>
    <text start="9.462" dur="1.123">Aaaah</text>
    <text start="70.507" dur="5.51">So guys, apparently we control Rewind this year.</text>
    <text start="76.017" dur="4.842">
    Y&#39;all we can do whatever we want. What do we do?
    </text>
    </transcript>
    

    We get:

    Aaaah
    So guys, apparently we control Rewind this year.
     Y'all we can do whatever we want. What do we do?
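
To see why the length passed to NSRange matters, here is a minimal sketch with a hypothetical input that starts with an emoji (one Character, two UTF-16 code units). It is only an illustration of the `text.count` versus `text.utf16.count` difference, not part of the original answer.

```swift
import Foundation

// Hypothetical input: the emoji is 1 Character but 2 UTF-16 code units.
let text = "👍<text start=\"0\">ok</text>"
let characterCount = text.count    // counts Characters (grapheme clusters)
let utf16Count = text.utf16.count  // counts UTF-16 code units, which NSRange expects

let regex = try! NSRegularExpression(pattern: "<text[^>]+>([^<]+)<\\/text>")

// A range built from text.count is one unit too short here: the final ">" falls
// outside the searched range, so the pattern cannot match.
let shortRange = NSRange(location: 0, length: characterCount)
let fullRange = NSRange(location: 0, length: utf16Count)

let shortMatches = regex.matches(in: text, range: shortRange).count
let fullMatches = regex.matches(in: text, range: fullRange).count

// NSRange(_:in:) builds the UTF-16 based range directly from String indices.
let convenienceRange = NSRange(text.startIndex..., in: text)

print(characterCount, utf16Count)  // 26 27
print(shortMatches, fullMatches)   // 0 1
```

With ASCII-only captions both counts are equal, which is why the mistake can go unnoticed until the text contains emoji or other characters outside the Basic Multilingual Plane.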