I have two lists, each a collection of strings, and I want to check whether an item of list(A) occurs within an item of list(B). List(A) holds the criteria words and phrases that should be found in list(B).

I filled list(A) with entries such as "innovation", "innovative", "new ways to go" and lemmatized them: ['innovation'], ['innovative'], ['new', 'way', 'go'].

List(B) contains the tokenized and lemmatized sentences of a text, e.g. ['time', 'new', 'way', 'go'].
With this setup I try to analyze whether, and how often, the given words and phrases appear in a text.
To match the patterns, I read that each list element has to be converted to a string itself, so that I can check whether it is a substring of the strings in list(B).
list_a = [['innovation'], ['innovative'], ['new', 'way', 'go'], ['set', 'trend']]
list_b = [['time', 'innovation'], ['time', 'go', 'new', 'way'], ['look', 'innovative', 'creative', 'people']]
for x in range(len(list_a)):
    for j in range(len(list_b)):
        a = " ".join(list_a[x])
        if any(a in s for s in list_b[j]):
            print("word of list a: ", a, " appears in list b: ", list_b[j])
The actual output is:
word of list a: innovation appears in list b: ['time', 'innovation']
word of list a: innovative appears in list b: ['look', 'innovative', 'creative', 'people']
The output I am aiming for is:
word of list a: innovation appears in list b: ['time', 'innovation']
word of list a: innovative appears in list b: ['look', 'innovative', 'creative', 'people']
word of list a: new way go appears in list b: ['time', 'go', 'new', 'way']
Converting the items of list(B) to strings, as I tried with list(A), did not help.
Thanks for your help!
The first mistake: don't build a string out of your list of words. Work with sets of words and the set methods (here: issubset). Convert each list of words to a set, then loop over the sets of list_a and check each one against every set in list_b (not using any, otherwise we cannot know which set contains the current set; a simple loop will do). Like this:
list_a = [['innovation'], ['innovative'], ['new', 'way', 'go'], ['set', 'trend']]
list_b = [['time', 'innovation'], ['time', 'go', 'new', 'way'], ['look', 'innovative', 'creative', 'people']]

# convert every list of words to a set of words
list_a = [set(x) for x in list_a]
list_b = [set(x) for x in list_b]

for subset in list_a:
    for other_subset in list_b:
        if subset.issubset(other_subset):
            print("{} appears in list b: {}".format(subset, other_subset))
prints:
{'innovation'} appears in list b: {'time', 'innovation'}
{'innovative'} appears in list b: {'look', 'creative', 'innovative', 'people'}
{'new', 'go', 'way'} appears in list b: {'time', 'new', 'go', 'way'}
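To see why the string version missed the phrase, here is a small sketch using one phrase and one sentence from the example data. The variable names (phrase, sentence, joined) are made up for this illustration only.

```python
# Illustrative names only; the data is taken from the example lists above.
phrase = ['new', 'way', 'go']
sentence = ['time', 'go', 'new', 'way']

joined = " ".join(phrase)  # 'new way go'

# Substring test against each single word of the sentence:
# no individual word can contain the whole joined phrase.
string_match = any(joined in word for word in sentence)

# Set test: every word of the phrase occurs somewhere in the sentence.
# issubset accepts any iterable, so the list can be passed directly.
set_match = set(phrase).issubset(sentence)

print(string_match)  # False
print(set_match)     # True
```

Single words such as 'innovation' only matched in the string version because a one-word phrase is a substring of an identical word.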
Now, if you want to preserve order but still benefit from the advantages of a set for element testing, just build a list of (set, list) tuples for list_b, because it is iterated several times. There is no need to do the same for list_a, as it is iterated only once:
# list_a is now unchanged
list_b = [(set(x), x) for x in list_b]

for sublist in list_a:
    subset = set(sublist)
    for other_subset, other_sublist in list_b:
        if subset.issubset(other_subset):
            print("{} appears in list b: {}".format(sublist, other_sublist))
result:
['innovation'] appears in list b: ['time', 'innovation']
['innovative'] appears in list b: ['look', 'innovative', 'creative', 'people']
['new', 'way', 'go'] appears in list b: ['time', 'go', 'new', 'way']
The algorithm is still costly: O(n**3), but not O(n**4), because testing whether one list of words is included in the other takes O(n) with sets (each membership lookup is O(1) on average, compared to O(n) for a list).
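Since the question also asks how often a phrase appears, the same subset test could be turned into a count. This is only a sketch: count_phrase_matches is a hypothetical helper name, it counts the number of sentences that contain all words of a phrase (not repeated occurrences inside one sentence), and like the code above it ignores word order.

```python
from collections import Counter


def count_phrase_matches(phrases, sentences):
    """For each phrase, count the sentences containing all of its words."""
    # Build the sentence sets once; they are reused for every phrase.
    sentence_sets = [set(s) for s in sentences]
    counts = Counter()
    for phrase in phrases:
        wanted = set(phrase)
        # Lists are not hashable, so the phrase is stored as a tuple key.
        counts[tuple(phrase)] = sum(wanted.issubset(s) for s in sentence_sets)
    return counts


list_a = [['innovation'], ['innovative'], ['new', 'way', 'go'], ['set', 'trend']]
list_b = [['time', 'innovation'], ['time', 'go', 'new', 'way'], ['look', 'innovative', 'creative', 'people']]

counts = count_phrase_matches(list_a, list_b)
for phrase, n in counts.items():
    print(" ".join(phrase), n)
```

With the example data, every phrase except 'set trend' is found in exactly one sentence, and 'set trend' is reported with a count of 0.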