Comparing if Item in List(A) exists as Sub-Item in List(B)

I have 2 lists, each as a collection of Strings and want to check if an Item of list(A) exists in another item of list(B). So in list(A) there are the criteria-words and phrases that should be found in list(B). I filled List(A) with this (e.g. "innovation", "innovative", "new ways to go") and lemmatized it (['innovation'], ['innovative'], ['new', 'way', 'go'].

In list(B) there are tokenized and lemmatized sentences of a text ('time', new', 'way', 'go'].

In that schema I try to analyze text whether and how often given words and phrases appear in it.

To match the patterns I read that its needed to convert each list-element itself to a string to check if it's a substring of the strings in list(b).

    list_a = [['innovation'], ['innovative'], ['new', 'way', 'go'], ['set', 'trend']]
    list_b = [['time', 'innovation'], ['time', 'go', 'new', 'way'],  ['look', 'innovative', 'creative', 'people']]

    for x in range(len(list_a)):
        for j in range(len(list_b)):
            a = " ".join(list_a[x])
            if any(a in s for s in list_b[j]):
                print("word of list a: ", a, " appears in list b: ", list_b[j])    `

actual output is:

word of list a:  innovation  appears in list b:  ['time', 'innovation']
word of list a:  innovative  appears in list b:  ['look', 'innovative', 'creative', 'people']

my aimed output would be:

word of list a:  innovation  appears in list b:  ['time', 'innovation']
word of list a:  innovative  appears in list b:  ['look', 'innovative', 'creative', 'people']
word of list a: new way go appears in list b: ['time', 'go', 'new', 'way']

Converting the items of list(b) to strings like i tried with list(a) did not help me.

Thanks for your help!

Solution

The first mistake is: don't create a string out of your list of words. Work with set of words and the set methods (here: issubset)

convert your list of lists of words to lists of set of words
loop in sets of first list (a) and check if the set is contained in one of the sets of list_b (not using any else we cannot know which set contains the current set, a simple loop will do)

Like this:

list_a = [['innovation'], ['innovative'], ['new', 'way', 'go'], ['set', 'trend']]
list_b = [['time', 'innovation'], ['time', 'go', 'new', 'way'],  ['look', 'innovative', 'creative', 'people']]

list_a = [set(x) for x in list_a]
list_b = [set(x) for x in list_b]

for subset in list_a:
    for other_subset in list_b:
        if subset.issubset(other_subset):
            print("{} appears in list b: {}".format(subset,other_subset))

prints:

{'innovation'} appears in list b: {'time', 'innovation'}
{'innovative'} appears in list b: {'look', 'creative', 'innovative', 'people'}
{'new', 'go', 'way'} appears in list b: {'time', 'new', 'go', 'way'}

now if you want to preserve order, but still want to benefit from the advantages of a set for element testing, just create list of tuples instead for list_b because it's iterated several times. No need to do the same for list_a as it's iterated only once:

# list_a is now unchanged
list_b = [(set(x),x) for x in list_b]

for sublist in list_a:
    subset = set(sublist)
    for other_subset,other_sublist in list_b:
        if subset.issubset(other_subset):
            print("{} appears in list b: {}".format(sublist,other_sublist))

result:

['innovation'] appears in list b: ['time', 'innovation']
['innovative'] appears in list b: ['look', 'innovative', 'creative', 'people']
['new', 'way', 'go'] appears in list b: ['time', 'go', 'new', 'way']

Algorithm is still costly: O(n**3) but not O(n**4) thanks to O(n) set lookup (compared to list lookup) to test if a list of words is included in the other one.