I wrote my own maximum matching function in Swift to divide Chinese sentences into words. It works fine, except with abnormally long sentences the memory usage goes up over 1 gb. I need help figuring out how to modify my code so that there isn't this memory problem. I'm not sure if it has to do with how I am using RealmSwift or if it is my algorithm in general.
Here is my code:
func splitSentenceIntoWordsWithDictionaryMaximumMatching(string: String) -> [String] {
var string = string
var foundWordsArray: [String] = []
var position = count(string)
while position > 0
{
var index = advance(string.startIndex, position)
let partialString = string.substringToIndex(index)
if let found = Realm().objects(Word).filter("simplified == '\(partialString)'").first
{
foundWordsArray.append(partialString)
position = position - 1
var partialStringCount = count(partialString)
while partialStringCount > 0
{
string = dropFirst(string)
partialStringCount -= 1
}
position = count(string)
index = advance(string.startIndex, position)
}
else if count(partialString) == 1
{
addNewEntryToDictionaryInTransaction(partialString, "", [partialString], partialString)
foundWordsArray.append(partialString)
var partialStringCount = count(partialString)
while partialStringCount > 0
{
string = dropFirst(string)
partialStringCount -= 1
}
position = count(string)
index = advance(string.startIndex, position)
}
else
{
position = position - 1
index = advance(string.startIndex, position)
}
}
return foundWordsArray
}
In this case, you should use autoreleasepool
(see: Use Local Autorelease Pool Blocks to Reduce Peak Memory Footprint) in the loop:
while position > 0 {
autoreleasepool {
var index = advance(string.startIndex, position)
...
}
}
BTW, your code has too many memory copy operations and O(N)
operations on String.Index
(count()
and advance()
), that may cause serious performance problems. Instead, you should effectively use String.Index
something like this :
import Foundation
var words:Set = ["今日", "献立", "魚", "味噌汁", "定食", "焼き魚", "です"]
func splitSentenceIntoWordsWithDictionaryMaximumMatching(string: String) -> [String] {
var foundWordsArray: [String] = []
var start = string.startIndex
var end = string.endIndex
while start != end {
autoreleasepool { // In this case(using builtin `Set`), I think we don't need `autoreleasepool` here. But this is just a demo :)
let partialString = string[start ..< end]
if !words.contains(partialString) {
if end.predecessor() != start { // faster way of `count(partialString) == 1`
end = end.predecessor()
return // we cannot use `continue` here because we are in `autoreleasepool` closure
}
words.insert(partialString)
}
foundWordsArray.append(partialString)
start = end
end = string.endIndex
}
}
return foundWordsArray
}
var str = "今日の献立は焼き魚定食と味噌汁です"
let result = splitSentenceIntoWordsWithDictionaryMaximumMatching(str)
debugPrintln(result) // ["今日", "の", "献立", "は", "焼き魚", "定食", "と", "味噌汁", "です"]
debugPrintln(words) // Set(["献立", "焼き魚", "今日", "魚", "味噌汁", "定食", "は", "と", "です", "の"])
I used builtin Set
, but I think you can easily adopts your Realm
codes here :)
ADDED: in reply to the comments:
reversed version:
var words:Set = ["研究", "研究生", "生命", "起源"]
func splitSentenceIntoWordsWithDictionaryMaximumMatchingReversed(string: String) -> [String] {
var foundWordsArray: [String] = []
var start = string.startIndex
var end = string.endIndex
while start != end {
autoreleasepool {
let partialString = string[start ..< end]
if !words.contains(partialString) {
if start.successor() != end {
start = start.successor()
return
}
words.insert(partialString)
}
foundWordsArray.append(partialString)
end = start
start = string.startIndex
}
}
return foundWordsArray.reverse()
}
var str = "研究生命起源"
let result = splitSentenceIntoWordsWithDictionaryMaximumMatchingReversed(str)
debugPrintln(result) // ["研究", "生命", "起源"]