python, string, pandas, textblob

How to check if a string contains a substring when both are stored in lists in Python?


My main string is in a DataFrame and the substrings are stored in a list. I want to find which substrings match. Here is the code I am using:

sentence2 = "Previous study: 03/03/2018 (other hospital)  Findings:   Lung parenchyma: The study reveals evidence of apicoposterior segmentectomy of LUL showing soft tissue thickening adjacent surgical bed at LUL, possibly post operation." 
blob_sentence = TextBlob(sentence2)
noun = blob_sentence.noun_phrases
df1 = pd.DataFrame(noun)
comorbidity_keywords = ["segmentectomy","lobectomy"]
matches =[]
for comorbidity_keywords[0] in df1:
    if comorbidity_keywords[0] in df1 and comorbidity_keywords[0] not in matches:
       matches.append(comorbidity_keywords)

This returns a value that is not an actual match. The output should be "segmentectomy", but I get [0,'lobectomy']. I tried to follow the answer posted under "Check if multiple strings exist in another string". What am I doing incorrectly?


Solution

  • I don't really use TextBlob, but here are a few approaches that might help you get to your goal. The first two split the sentence on spaces and iterate over the resulting words to look for matches. One returns a list of the matched words, the other a dictionary mapping each matched word's position to the word.

    ### If you just want a list of words
    def find_keyword_matches(sentence, keyword_list):
        # Split on single spaces, then keep only the words that are keywords
        s1 = sentence.split(' ')
        return [i for i in s1 if i in keyword_list]
    

    Then:

    find_keyword_matches(sentence2, comorbidity_keywords)
    

    Output:

    ['segmentectomy']
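
One caveat with splitting on spaces is that a word only matches when it is exactly equal to a keyword, so "Segmentectomy," (capitalised, trailing comma) would be missed. A minimal sketch of a more forgiving variant, assuming case-insensitive matching on alphabetic words is what you want; the function name `find_keyword_matches_normalized` and the sample sentence are made up for illustration:

```python
import re

def find_keyword_matches_normalized(sentence, keyword_list):
    # Lowercase the text and keep only runs of letters, so punctuation
    # and capitalisation do not prevent a match.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    keywords = {k.lower() for k in keyword_list}
    # Collect matches in first-seen order without duplicates.
    matches = []
    for tok in tokens:
        if tok in keywords and tok not in matches:
            matches.append(tok)
    return matches

# Hypothetical sample sentence, not taken from the question
print(find_keyword_matches_normalized(
    "Post Segmentectomy, no lobectomy.", ["segmentectomy", "lobectomy"]
))
# prints ['segmentectomy', 'lobectomy']
```

Note that this returns the lowercased form of each match, which may or may not suit your data.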
    

    For a dictionary:

    def find_keyword_matches(sentence, keyword_list):
        s1 = sentence.split(' ')
        # Map each matching word's position in the split list to the word
        return {s1.index(i): i for i in s1 if i in keyword_list}
    

    Output:

    {17: 'segmentectomy'}
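
Be aware that `list.index` always returns the position of the first occurrence, so a keyword that appears more than once ends up under a single key. If you need every position, a sketch using `enumerate` could look like this (the name `find_keyword_positions` and the sample sentence are my own, for illustration only):

```python
def find_keyword_positions(sentence, keyword_list):
    tokens = sentence.split(' ')
    # enumerate yields each token's own position, so repeats are all kept
    return {idx: tok for idx, tok in enumerate(tokens) if tok in keyword_list}

# Hypothetical sample sentence with a repeated keyword
print(find_keyword_positions(
    "lobectomy then segmentectomy then lobectomy",
    ["segmentectomy", "lobectomy"],
))
# prints {0: 'lobectomy', 2: 'segmentectomy', 4: 'lobectomy'}
```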
    

    Finally, a function that searches the whole sentence for a keyword as a substring and also prints where it was found, if at all:

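    def word_range(sentence, keyword):
        try:
            # str.index raises ValueError when the keyword is absent
            idx_start = sentence.index(keyword)
            idx_end = idx_start + len(keyword)
            print(f"Word '{keyword}' found within index range {idx_start} to {idx_end}")
            # Return the keyword even when it starts at index 0
            return keyword
        except ValueError:
            # Not found: return None so the caller can filter it out
            return None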
    

    Then do a nested list comprehension to get rid of None values:

    found_words = [x for x in [word_range(sentence2, i) for i in comorbidity_keywords] if x is not None]
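
As for what goes wrong in the original loop: `for comorbidity_keywords[0] in df1` iterates over the DataFrame's column labels and assigns each one to `comorbidity_keywords[0]`, which is why the list turns into `[0, 'lobectomy']`. `in df1` likewise tests column labels rather than cell values, and `matches.append(comorbidity_keywords)` appends the whole list instead of one keyword. If you want to stay with the noun phrases, a rough sketch is to loop over the keywords and test each against every phrase. The `phrases` list below is a hypothetical stand-in for `blob_sentence.noun_phrases`; I have not checked what TextBlob actually returns for your sentence.

```python
def keywords_in_phrases(phrases, keyword_list):
    # Keep each keyword that occurs as a substring of at least one phrase
    return [kw for kw in keyword_list if any(kw in phrase for phrase in phrases)]

# Hypothetical phrases standing in for blob_sentence.noun_phrases
phrases = ["apicoposterior segmentectomy", "soft tissue thickening"]
comorbidity_keywords = ["segmentectomy", "lobectomy"]

print(keywords_in_phrases(phrases, comorbidity_keywords))
# prints ['segmentectomy']
```

This skips the DataFrame entirely, since a plain list of phrases is enough for a substring check.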