Search code examples
pythonliststring-comparison

Comparing if Item in List(A) exists as Sub-Item in List(B)


I have 2 lists, each as a collection of Strings and want to check if an Item of list(A) exists in another item of list(B). So in list(A) there are the criteria-words and phrases that should be found in list(B). I filled List(A) with this (e.g. "innovation", "innovative", "new ways to go") and lemmatized it (['innovation'], ['innovative'], ['new', 'way', 'go'].

In list(B) there are tokenized and lemmatized sentences of a text ('time', new', 'way', 'go'].

In that schema I try to analyze text whether and how often given words and phrases appear in it.

To match the patterns I read that its needed to convert each list-element itself to a string to check if it's a substring of the strings in list(b).

    list_a = [['innovation'], ['innovative'], ['new', 'way', 'go'], ['set', 'trend']]
    list_b = [['time', 'innovation'], ['time', 'go', 'new', 'way'],  ['look', 'innovative', 'creative', 'people']]

    for x in range(len(list_a)):
        for j in range(len(list_b)):
            a = " ".join(list_a[x])
            if any(a in s for s in list_b[j]):
                print("word of list a: ", a, " appears in list b: ", list_b[j])    `

actual output is:

word of list a:  innovation  appears in list b:  ['time', 'innovation']
word of list a:  innovative  appears in list b:  ['look', 'innovative', 'creative', 'people']

my aimed output would be:

word of list a:  innovation  appears in list b:  ['time', 'innovation']
word of list a:  innovative  appears in list b:  ['look', 'innovative', 'creative', 'people']
word of list a: new way go appears in list b: ['time', 'go', 'new', 'way']

Converting the items of list(b) to strings like i tried with list(a) did not help me.

Thanks for your help!


Solution

  • The first mistake is: don't create a string out of your list of words. Work with set of words and the set methods (here: issubset)

    • convert your list of lists of words to lists of set of words
    • loop in sets of first list (a) and check if the set is contained in one of the sets of list_b (not using any else we cannot know which set contains the current set, a simple loop will do)

    Like this:

    list_a = [['innovation'], ['innovative'], ['new', 'way', 'go'], ['set', 'trend']]
    list_b = [['time', 'innovation'], ['time', 'go', 'new', 'way'],  ['look', 'innovative', 'creative', 'people']]
    
    list_a = [set(x) for x in list_a]
    list_b = [set(x) for x in list_b]
    
    for subset in list_a:
        for other_subset in list_b:
            if subset.issubset(other_subset):
                print("{} appears in list b: {}".format(subset,other_subset))
    

    prints:

    {'innovation'} appears in list b: {'time', 'innovation'}
    {'innovative'} appears in list b: {'look', 'creative', 'innovative', 'people'}
    {'new', 'go', 'way'} appears in list b: {'time', 'new', 'go', 'way'}
    

    now if you want to preserve order, but still want to benefit from the advantages of a set for element testing, just create list of tuples instead for list_b because it's iterated several times. No need to do the same for list_a as it's iterated only once:

    # list_a is now unchanged
    list_b = [(set(x),x) for x in list_b]
    
    for sublist in list_a:
        subset = set(sublist)
        for other_subset,other_sublist in list_b:
            if subset.issubset(other_subset):
                print("{} appears in list b: {}".format(sublist,other_sublist))
    

    result:

    ['innovation'] appears in list b: ['time', 'innovation']
    ['innovative'] appears in list b: ['look', 'innovative', 'creative', 'people']
    ['new', 'way', 'go'] appears in list b: ['time', 'go', 'new', 'way']
    

    Algorithm is still costly: O(n**3) but not O(n**4) thanks to O(n) set lookup (compared to list lookup) to test if a list of words is included in the other one.