There are two lists to be matched, li_a
is the given list consists of sequence of characters of a sentence, whereas li_b
is the collection of words.
li_a = ['T','h','e','s','e','45','a','r','e','c','a','r','s']
li_b = ['T','Th','The','Thes','These','a','ar','are','c','ca','car','cars']
The process is to match items of li_a
iteratively, with li_b
items. If the first character of li_a
is similar to li_b
items, the first character of li_a
joins with next character, and redo the process until it reaches to its longest match. Then, the longest term should be split, and the process will continue till the end. As the unknown characters and words of li_a
which do not appear in li_b
will be appended as they were before.
The final work should be like this:
new_li = ['These','45','are','cars']
The attempt getting so far, but this works for two strings not for Lists, and it doesn't retrieve unidentified words.
def longest_matched_substring(s1, s2):
m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
longest, x_longest = 0, 0
for x in xrange(1, 1 + len(s1)):
for y in xrange(1, 1 + len(s2)):
if s1[x - 1] == s2[y - 1]:
m[x][y] = m[x - 1][y - 1] + 1
if m[x][y] > longest:
longest = m[x][y]
x_longest = x
else:
m[x][y] = 0
return s1[x_longest - longest: x_longest]
You can do it using two for loops
and a temp variable
as the following:
def longest_matched_substring(li1, li2):
new_li = []
tmp = ''
for a in li1:
tmp += a
count = 0
for b in li2:
if tmp == b:
count += 1
if count == 0:
tmp1 = tmp.replace(a, '')
new_li.append(tmp1)
tmp = a
if li2.__contains__(tmp):
new_li.append(tmp)
return new_li
INPUT:
li_a = ['T','h','e','s','e','45','a','r','e','c','a','r','s']
li_b = ['T','Th','The','Thes','These','a','ar','are','c','ca','car','cars']
print longest_matched_substring(li_a, li_b)
OUTPUT:
['These', '45', 'are', 'cars']
as for the new scenario, you can modify the function as the following:
def longest_matched_substring(li1, li2):
new_li = []
tmp = ''
for a in li1:
tmp += a
count = 0
for b in li2:
if tmp == b:
count += 1
if count == 0:
tmp1 = tmp.replace(a, '')
new_li.append(tmp1)
tmp = a
if li_b.__contains__(tmp):
new_li.append(tmp)
for e1 in new_li:
tmp2 = e1
rm = []
for e2 in new_li:
if e1 != e2:
tmp2 += e2
rm.append(e2)
if tmp2 in li2:
new_li.insert(new_li.index(e1), tmp2) # if order matters
#new_li.append(tmp2) if order doesn't matter
for r in rm:
new_li.remove(r)
new_li.remove(e1)
rm = []
break
return new_li
INPUT:
li_a = ['T','h','e','s','e','45','a','r','e','c','a','r','s']
li_b = ['T','Th','The','These','a','ar','are','c','ca','car','cars']
print longest_matched_substring(li_a, li_b)
OUTPUT:
['These', '45', 'are', 'cars']