
How can I split multiple joined words?


I have an array of 1000 or so entries, with examples below:

wickedweather
liquidweather
driveourtrucks
gocompact
slimprojector

I would like to be able to split these into their respective words, as:

wicked weather
liquid weather
drive our trucks
go compact
slim projector

I was hoping a regular expression might do the trick. But since there is no boundary to stop on, nor any sort of capitalization that I could key on, I am thinking that some sort of reference to a dictionary might be necessary?

I suppose it could be done by hand, but why - when it can be done with code! =) But this has stumped me. Any ideas?


Solution

  • The Viterbi algorithm is much faster. It computes the same scores as the recursive search in Dmitry's answer above, but in O(n) time. (Dmitry's search takes exponential time; Viterbi does it by dynamic programming.)

    import re
    from collections import Counter
    
    def viterbi_segment(text):
        # probs[i] holds the probability of the best segmentation of
        # text[:i]; lasts[i] holds where that segmentation's last word starts.
        probs, lasts = [1.0], [0]
        for i in range(1, len(text) + 1):
            prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                            for j in range(max(0, i - max_word_length), i))
            probs.append(prob_k)
            lasts.append(k)
        # Walk lasts backwards from the end to recover the best word sequence.
        words = []
        i = len(text)
        while 0 < i:
            words.append(text[lasts[i]:i])
            i = lasts[i]
        words.reverse()
        return words, probs[-1]
    
    def word_prob(word): return dictionary[word] / total
    def words(text): return re.findall('[a-z]+', text.lower())
    # big.txt is any large plain-text corpus to draw word frequencies from
    # (e.g. the big.txt from Norvig's spelling-corrector article).
    dictionary = Counter(words(open('big.txt').read()))
    max_word_length = max(map(len, dictionary))
    total = float(sum(dictionary.values()))
    

    Testing it:

    >>> viterbi_segment('wickedweather')
    (['wicked', 'weather'], 5.1518198982768158e-10)
    >>> ' '.join(viterbi_segment('itseasyformetosplitlongruntogetherblocks')[0])
    'its easy for me to split long run together blocks'
    

    To be practical you'll likely want a couple of refinements:

    • Add log probabilities instead of multiplying raw probabilities; this avoids floating-point underflow.
    • Your inputs will in general use words that aren't in your corpus. These substrings must be assigned a nonzero probability as words, or you end up with no solution or a bad solution. (That's just as true for the exponential search algorithm above.) This probability has to be siphoned off the corpus words' probabilities and distributed plausibly among all other word candidates: the general topic is known as smoothing in statistical language models. (You can get away with some pretty rough hacks, though.) This is where the O(n) Viterbi algorithm blows away the search algorithm, because considering non-corpus words blows up the branching factor. A sketch combining both refinements follows this list.
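
    For reference, here's a minimal sketch combining both refinements. It reuses dictionary, total, and max_word_length from the code above; the unseen-word penalty 10 / (total * 10**len(word)) is just one of those rough hacks, assumed here for illustration rather than a principled smoothing scheme:

    import math
    
    def smoothed_log_prob(word):
        # Known words: log of their relative corpus frequency.
        # Unknown words: a rough hack that penalizes longer gibberish harder.
        # Every candidate substring now gets a nonzero probability.
        count = dictionary[word]
        if count:
            return math.log(count / total)
        return math.log(10.0 / (total * 10.0 ** len(word)))
    
    def viterbi_segment_log(text):
        # Same dynamic program as viterbi_segment, but scores are sums of
        # log probabilities, so long inputs can't underflow to 0.0.
        scores, lasts = [0.0], [0]
        for i in range(1, len(text) + 1):
            score_k, k = max(
                (scores[j] + smoothed_log_prob(text[j:i]), j)
                for j in range(max(0, i - max_word_length), i))
            scores.append(score_k)
            lasts.append(k)
        words = []
        i = len(text)
        while 0 < i:
            words.append(text[lasts[i]:i])
            i = lasts[i]
        words.reverse()
        return words, scores[-1]
    

    It's called the same way as viterbi_segment; the second return value is now a log probability, so more negative means less likely.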