Python: Find combination of keywords in text


I am using the following function to determine if a text has words (or expressions) from a list:

def is_in_text(text, lista=[]):
    return any(i in text for i in lista)

I can pass to this function a list of words and expressions that I would like to find in a text. For example, the following code:

text_a = 'There are white clouds in the sky'
print(is_in_text(text_a, ['clouds in the sky']))

Will return

True

This works if I'm interested in texts that mention "clouds" and "sky". However, if the text varies slightly, I may no longer detect it. For example:

text_b = 'There are white clouds in the beautiful sky'
print(is_in_text(text_b, ['clouds in the sky']))

Will return False.

How can I modify this function to find texts that contain both words, but not necessarily adjacent or in a predetermined order? In this example, I would like to look for 'clouds' + 'sky'.

Just to be clear, I am interested in texts that contain both words. I would like a function that searches for this kind of combination without my having to enter all the conditions manually.


Solution

  • You can rewrite is_in_text to check that every word in the list you pass occurs in the text:

    def is_in_text(text, lista=()):
        # True only if every word in lista occurs in text as a substring
        return all(word in text for word in lista)
    

    E.g.

    text_a = 'There are white clouds in the sky'
    print(is_in_text(text_a, ['cloud', 'sky']))
    

    returns True

    while

    text_a = 'There are white clouds in the sky'
    print(is_in_text(text_a, ['dog', 'sky']))
    

    returns False

    This requires you to know in advance which words you want to match, though. If you instead want to check every word of a phrase, you can split the phrase on spaces.

    E.g.

    text_b = 'There are white clouds in the beautiful sky'
    print(is_in_text(text_b, 'clouds in the sky'.split(' ')))
    

    now returns True

    Edit:

    I think you should probably rethink what you're trying to do, since plain substring matching is pretty fragile (for example, 'cloud' also matches 'clouds'), but based on what you're describing, this works:

    def is_in_text(text, lista=()):
        # True if, for at least one entry in lista, every
        # space-separated word of that entry is a substring of text
        return any(
            all(substr in text for substr in string.split(' '))
            for string in lista
        )
    

    E.g.

    text_a = 'There are white clouds in the sky'
    print(is_in_text(text_a, ['rain', 'cloud sky']))
    

    evaluates to True

    while

    text_a = 'There are white clouds in the sky'
    print(is_in_text(text_a, ['rain', 'dog sky']))
    

    evaluates to False
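
If the substring behaviour is a problem (for instance, 'sky' would also match 'skyline'), one possible variation is to compare whole words instead. The following is only a sketch under some assumptions that are not in the question: the name contains_all_words is made up for illustration, matching is case-insensitive, and a "word" is whatever the regular expression \w+ matches.

```python
import re

def contains_all_words(text, combos=()):
    """Return True if, for at least one entry in combos, every
    space-separated word of that entry occurs in text as a whole word.

    Hypothetical helper: case-insensitive, words are runs of \\w characters.
    """
    # Tokenize the text once; a set makes each membership test cheap.
    tokens = set(re.findall(r"\w+", text.lower()))
    for combo in combos:
        words = combo.lower().split()
        # Skip empty entries so that '' never counts as a match.
        if words and all(word in tokens for word in words):
            return True
    return False

text_b = 'There are white clouds in the beautiful sky'
print(contains_all_words(text_b, ['rain', 'clouds sky']))  # True
print(contains_all_words(text_b, ['rain', 'cloud sky']))   # False: 'cloud' is not a whole word in text_b
```

Note the trade-off: this version no longer matches 'cloud' against 'clouds', so you would have to list each word form you care about, or normalise the words first.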