Search code examples
algorithmswiftrealmtext-segmentation

How can I fix this memory issue in my maximum matching algorithm with RealmSwift?


I wrote my own maximum matching function in Swift to divide Chinese sentences into words. It works fine, except with abnormally long sentences the memory usage goes up over 1 gb. I need help figuring out how to modify my code so that there isn't this memory problem. I'm not sure if it has to do with how I am using RealmSwift or if it is my algorithm in general.

Here is my code:

    func splitSentenceIntoWordsWithDictionaryMaximumMatching(string: String) -> [String] {
    var string = string
    var foundWordsArray: [String] = []
    var position = count(string)

    while position > 0
    {
        var index = advance(string.startIndex, position)
        let partialString = string.substringToIndex(index)
        if let found = Realm().objects(Word).filter("simplified == '\(partialString)'").first
        {
            foundWordsArray.append(partialString)
            position = position - 1
            var partialStringCount = count(partialString)
            while partialStringCount > 0
            {
                string = dropFirst(string)
                partialStringCount -= 1
            }
            position = count(string)
            index = advance(string.startIndex, position)
        }
        else if count(partialString) == 1
        {
            addNewEntryToDictionaryInTransaction(partialString, "", [partialString], partialString)
            foundWordsArray.append(partialString)
            var partialStringCount = count(partialString)
            while partialStringCount > 0
            {
                string = dropFirst(string)
                partialStringCount -= 1
            }
            position = count(string)
            index = advance(string.startIndex, position)
        }
        else
        {
            position = position - 1
            index = advance(string.startIndex, position)
        }
    }

    return foundWordsArray
}

Solution

  • In this case, you should use autoreleasepool (see: Use Local Autorelease Pool Blocks to Reduce Peak Memory Footprint) in the loop:

        while position > 0 {
            autoreleasepool {
                var index = advance(string.startIndex, position)
                ...
            }
        }
    

    BTW, your code has too many memory copy operations and O(N) operations on String.Index (count() and advance()), that may cause serious performance problems. Instead, you should effectively use String.Index something like this :

    import Foundation
    
    var  words:Set = ["今日", "献立", "魚", "味噌汁", "定食", "焼き魚", "です"]
    func splitSentenceIntoWordsWithDictionaryMaximumMatching(string: String) -> [String] {
        var foundWordsArray: [String] = []
        var start = string.startIndex
        var end = string.endIndex
    
        while start != end {
            autoreleasepool { // In this case(using builtin `Set`), I think we don't need `autoreleasepool` here. But this is just a demo :)
                let partialString = string[start ..< end]
                if !words.contains(partialString) {
                    if end.predecessor() != start { // faster way of `count(partialString) == 1`
                        end = end.predecessor()
                        return // we cannot use `continue` here because we are in `autoreleasepool` closure
                    }
                    words.insert(partialString)
                }
                foundWordsArray.append(partialString)
                start = end
                end = string.endIndex
            }
        }
    
        return foundWordsArray
    }
    
    var str = "今日の献立は焼き魚定食と味噌汁です"
    let result = splitSentenceIntoWordsWithDictionaryMaximumMatching(str)
    
    debugPrintln(result) // ["今日", "の", "献立", "は", "焼き魚", "定食", "と", "味噌汁", "です"]
    debugPrintln(words) // Set(["献立", "焼き魚", "今日", "魚", "味噌汁", "定食", "は", "と", "です", "の"])
    

    I used builtin Set, but I think you can easily adopts your Realm codes here :)


    ADDED: in reply to the comments:

    reversed version:

    var  words:Set = ["研究", "研究生", "生命", "起源"]
    func splitSentenceIntoWordsWithDictionaryMaximumMatchingReversed(string: String) -> [String] {
        var foundWordsArray: [String] = []
        var start = string.startIndex
        var end = string.endIndex
    
        while start != end {
            autoreleasepool {
                let partialString = string[start ..< end]
                if !words.contains(partialString) {
                    if start.successor() != end {
                        start = start.successor()
                        return
                    }
                    words.insert(partialString)
                }
                foundWordsArray.append(partialString)
                end = start
                start = string.startIndex
            }
        }
    
        return foundWordsArray.reverse()
    }
    
    var str = "研究生命起源"
    let result = splitSentenceIntoWordsWithDictionaryMaximumMatchingReversed(str)
    
    debugPrintln(result) // ["研究", "生命", "起源"]