I need to split a string into words such that each word comes from a dictionary. Also make sure that longest possible word from the left is chosen. Hence
thisisinsane => this is insane (correct as longest possible word from left)
thisisinsane => this is in sane(wrong)
Assuming 'this', 'is', 'in', 'insane' are all words in the dictionary.
I managed to solved this problem by traversing from the end of the string to the beginning matching longest word possible. But problem started cropping us for problems like these..
shareasale => share as ale(wrong as 'ale' not in the dictionary)
shareasale => share a sale(correct)
Assuming 'share', 'a', 'sale' are all words in the dictionary unlike 'ale'.
I tried to solve this problem by removing the valid segments found before encountering the error i.e.
shareasale => 'share' and 'as' (error = 'ale')
and removing them once from the dictionary and then solve the problem. So
shareasale => no valid segments when share is removed
shareasale => share a sale (when 'as' is removed from the dictionary.
Thus I managed to solve this problem too. But then I am unable to solve this
asignas => 'as' ( error = 'ignas')
My solution will then remove 'as' from the dictionary and try to solve it
asignas => 'a' 'sign' (error = 'as')
Because in the new recursive call 'as' has been removed from the dictionary. The function I wrote is in this link. I hope someone can go through it and help me find a better algorithm to solve this else suggest modification to my existing algorithm.
Essentially your problem is a tree problem, where at every level all the words that form a prefix of the tree form branches. The branch that leaves no part of the string is a correct solution.
thisisinsane
|
|
(this)isinsane
/ \
/ \
(this,i)sinsane (this,is)insane
/ / \
/ / \
(this,i,sin)ane (this,is,in)sane (this,is,insane)
/
/
(this,is,in,sane)
So in this example there are two solutions, but we want to select the solution using the longest words, that is we want to explore the tree from the right using a depth-first-search strategy.
So our algorithm should:
False
.prefix
to the longest unexplored prefix.False
.A sample implementation of this solution:
def segment(string,wset):
"""Segments a string into words prefering longer words givens
a dictionary wset."""
# Sort wset in decreasing string order
wset.sort(key=len, reverse=True)
result = tokenize(string, wset, "")
if result:
result.pop() # Remove the empty string token
result.reverse() # Put the list into correct order
return result
else:
raise Exception("No possible segmentation!")
def tokenize(string, wset, token):
"""Returns either false if the string can't be segmented by
the current wset or a list of words that segment the string
in reverse order."""
# Are we done yet?
if string == "":
return [token]
# Find all possible prefixes
for pref in wset:
if string.startswith(pref):
res = tokenize(string.replace(pref, '', 1), wset, pref)
if res:
res.append(token)
return res
# Not possible
return False
print segment("thisisinsane", ['this', 'is', 'in', 'insane']) # this is insane
print segment("shareasale", ['share', 'a', 'sale', 'as']) # share a sale
print segment("asignas", ['as', 'sign', 'a']) # a sign as