The LZW algorithm finds repeated patterns among input symbols. But can it look for patterns among words? I mean that the alphabet would be made up of words instead of single symbols. For example, for the input:
'abcd', 'abcd', 'fasf' , 'asda', 'abcd' , 'fasf' ...
I would like to get an output like:
'abcd', '1', 'fasf' , 'asda' , '1', '2' ...
Or is there any other compression algorithm that does the trick?
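keys = []  # word dictionary, shared across calls

def lzw(text):
    tokens = text.split()
    # Unique tokens in order of first appearance (dicts keep insertion order on Python 3.7+).
    new_keys = dict.fromkeys(tokens).keys()
    keys.extend([key for key in new_keys if key not in keys])
    # Encode every token as its index in the dictionary...
    # (assumes the words themselves are not numeric strings, which would be ambiguous with indices)
    encoded = [str(keys.index(tok)) for tok in tokens]
    # ...then put the word itself back at the first occurrence of each index.
    for i, key in enumerate(keys):
        try:
            encoded[encoded.index(str(i))] = key
        except ValueError:
            pass  # this key does not occur in the current text
    return " ".join(encoded)

print(lzw("abcd abcd fasf asda abcd fasf"))
# outputs (Python 3.7+): abcd 0 fasf asda 0 1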
This is a pretty simple implementation of that idea.
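The snippet above swaps repeated words for dictionary indices. For comparison, here is a minimal sketch of what LZW itself could look like when whole words are treated as the symbols, so that repeated sequences of words also get their own codes. The names `lzw_encode_words` and `lzw_decode_words` are made up for this illustration, and the sketch assumes the list of distinct words is sent along with the codes as the initial dictionary. A real format would need to agree on that some other way.

```python
def lzw_encode_words(tokens):
    """LZW over a list of words; returns (alphabet, codes)."""
    # Assumption: the initial dictionary is every distinct word,
    # in order of first appearance. The decoder needs the same list.
    alphabet = list(dict.fromkeys(tokens))
    table = {(word,): code for code, word in enumerate(alphabet)}
    codes = []
    phrase = ()
    for word in tokens:
        candidate = phrase + (word,)
        if candidate in table:
            # Keep growing the current phrase while it is known.
            phrase = candidate
        else:
            # Emit the longest known phrase and learn the new one.
            codes.append(table[phrase])
            table[candidate] = len(table)
            phrase = (word,)
    if phrase:
        codes.append(table[phrase])
    return alphabet, codes


def lzw_decode_words(alphabet, codes):
    """Inverse of lzw_encode_words; returns the list of words."""
    table = {code: (word,) for code, word in enumerate(alphabet)}
    if not codes:
        return []
    previous = table[codes[0]]
    output = list(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the phrase that is being defined right now.
            entry = previous + (previous[0],)
        else:
            raise ValueError("invalid LZW code: %r" % code)
        output.extend(entry)
        # Mirror the encoder: previous phrase plus first word of this one.
        table[len(table)] = previous + (entry[0],)
        previous = entry
    return output


words = "abcd abcd fasf asda abcd fasf".split()
alphabet, codes = lzw_encode_words(words)
print(alphabet)  # ['abcd', 'fasf', 'asda']
print(codes)     # [0, 0, 1, 2, 4]
print(lzw_decode_words(alphabet, codes) == words)  # True
```

In this run the last code, 4, stands for the two-word phrase "abcd fasf", which is the kind of pattern a plain word-to-index substitution cannot capture. On an input this short the saving is negligible; the approach only starts to pay off when word sequences repeat often.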