Search code examples
stringalgorithmlanguage-agnosticdata-structures

How to break down a given text into words from the dictionary?


This is an interview question. Suppose you have a string text and a dictionary (a set of strings). How do you break down text into substrings such that each substring is found in the dictionary.

For example you can break down "thisisatext" into ["this", "is", "a", "text"] using /usr/share/dict/words.

I believe backtracking can solve this problem (in pseudo-Java):

void solve(String s, Set<String> dict, List<String> solution) {
   if (s.length == 0)
      return
   for each prefix of s found in dict
      solve(s without prefix, dict, solution + prefix)
}

List<String> solution = new List<String>()
solve(text, dict, solution)

Does it make sense? Would you optimize the step of searching the prefixes in the dictionary? What data structures would you recommend?


Solution

  • This solution assumes the existence of Trie data structure for the dictionary. Further, for each node in Trie, assumes the following functions:

    1. node.IsWord() : Returns true if the path to that node is a word
    2. node.IsChild(char x): Returns true if there exists a child with label x
    3. node.GetChild(char x): Returns the child node with label x
    Function annotate( String str, int start, int end, int root[], TrieNode node):
    i = start
    while i<=end:
        if node.IsChild ( str[i]):
            node = node.GetChild( str[i] )
            if node.IsWord():
                root[i+1] = start
            i+=1
        else:
            break;
    
    end = len(str)-1
    root = [-1 for i in range(len(str)+1)]
    for start= 0:end:
        if start = 0 or root[start]>=0:
            annotate(str, start, end, root, trieRoot)
    
    index  0  1  2  3  4  5  6  7  8  9  10  11
    str:   t  h  i  s  i  s  a  t  e  x  t
    root: -1 -1 -1 -1  0 -1  4  6 -1  6 -1   7
    

    I will leave the part for you to list the words that make up the string by reverse traversing the root.

    Time complexity is O(nk) where n is the length of the string and k is the length of the longest word in the dictionary.

    PS: I am assuming following words in the dictionary: this, is, a, text, ate.