Search code examples
pythonalgorithmtext-segmentation

Text segmentation: Algorithm to match input with the longest words from the dictionary


I need to split a string into words such that each word comes from a dictionary. Also make sure that longest possible word from the left is chosen. Hence

thisisinsane => this is insane (correct as longest possible word from left)
thisisinsane => this is in sane(wrong) 
Assuming 'this', 'is', 'in', 'insane' are all words in the dictionary.

I managed to solved this problem by traversing from the end of the string to the beginning matching longest word possible. But problem started cropping us for problems like these..

shareasale => share as ale(wrong as 'ale' not in the dictionary)
shareasale => share a sale(correct)
Assuming 'share', 'a', 'sale' are all words in the dictionary unlike 'ale'. 

I tried to solve this problem by removing the valid segments found before encountering the error i.e.

shareasale => 'share' and 'as' (error = 'ale')

and removing them once from the dictionary and then solve the problem. So

shareasale => no valid segments when share is removed
shareasale => share a sale (when 'as' is removed from the dictionary.

Thus I managed to solve this problem too. But then I am unable to solve this

asignas => 'as' ( error = 'ignas')

My solution will then remove 'as' from the dictionary and try to solve it

asignas => 'a' 'sign' (error = 'as')

Because in the new recursive call 'as' has been removed from the dictionary. The function I wrote is in this link. I hope someone can go through it and help me find a better algorithm to solve this else suggest modification to my existing algorithm.


Solution

  • Essentially your problem is a tree problem, where at every level all the words that form a prefix of the tree form branches. The branch that leaves no part of the string is a correct solution.

                          thisisinsane
                              |
                              |
                         (this)isinsane
                         /            \
                        /              \
              (this,i)sinsane         (this,is)insane
                  /                     /           \
                 /                     /             \
      (this,i,sin)ane          (this,is,in)sane    (this,is,insane)
                                    /
                                   /
                           (this,is,in,sane)
    

    So in this example there are two solutions, but we want to select the solution using the longest words, that is we want to explore the tree from the right using a depth-first-search strategy.

    So our algorithm should:

    1. Sort the dictionary by descending length.
    2. Find all prefixes of the current string. If there are none, return False.
    3. Set prefix to the longest unexplored prefix.
    4. Remove it from the string. If the string is empty, we found a solution, return a list of all prefixes.
    5. Recurse to 2.
    6. This branch has failed, return False.

    A sample implementation of this solution:

    def segment(string,wset):
        """Segments a string into words prefering longer words givens
        a dictionary wset."""
        # Sort wset in decreasing string order
        wset.sort(key=len, reverse=True)
        result = tokenize(string, wset, "")
        if result:
            result.pop() # Remove the empty string token
            result.reverse() # Put the list into correct order
            return result
        else:
            raise Exception("No possible segmentation!")
    
    def tokenize(string, wset, token):
        """Returns either false if the string can't be segmented by 
        the current wset or a list of words that segment the string
        in reverse order."""
        # Are we done yet?
        if string == "":
            return [token]
        # Find all possible prefixes
        for pref in wset:
            if string.startswith(pref):
                res = tokenize(string.replace(pref, '', 1), wset, pref)
                if res:
                    res.append(token)
                    return res
        # Not possible
        return False
    
    print segment("thisisinsane", ['this', 'is', 'in', 'insane']) # this is insane
    print segment("shareasale", ['share', 'a', 'sale', 'as'])     # share a sale
    print segment("asignas", ['as', 'sign', 'a'])                 # a sign as