python regex pandas keyword keyword-search

Python matching various keyword from dictionary issues


I have a complex text where I am categorizing different keywords stored in a dictionary:

    text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'

    sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}

The following finds my keywords and categorizes them, with some limitations:

    import re

    pattern = r'[a-zA-Z0-9]+'

    [cat for cat in sector if any(x in re.findall(pattern, text) for x in sector[cat])]

The limitations that I cannot solve are:

  1. Keywords that contain a space, such as "Drug Delivery", are not recognized and therefore not categorized.

  2. I was not able to make the pattern case-insensitive, so words like MEDICINE are not recognized. I tried adding (?i) to the pattern, but it doesn't work.

  3. The categorized keywords go into a pandas df, but they appear wrapped in []. I tried looping over the results again to strip the brackets, but they are still there.

Data to pandas df:

    ind_list = []
    for site in url_list:
        ind = [cat for cat in indication if any(x in re.findall(pattern,soup_string) for x in indication[cat])]
        ind_list.append(ind)

    websites['Indication'] = ind_list

Current output:

Website                                  Sector                              Sub-sector                                 Therapeutical Area Focus URL status
0     url3.com                              [med tech]                                      []                                                 []          []         []
1     www.url1.com                    [med tech, services]                                      []                       [oncology, gastroenterology]          []         []
2     www.url2.com                    [med tech, services]                                      []                                        [orthopedy]          []         []

In the output I get [] that I'd like to avoid.

Can you help me with these points?

Thanks!


Solution

  • Here are some hints on the problems that can be readily spotted:

    1. Why are keywords such as "Drug Delivery", which contain a space, not matched? Because the regex pattern r'[a-zA-Z0-9]+' does not match a space. You can change it to r'[a-zA-Z0-9 ]+' (note the space added after the 9) to match spaces as well; to support other kinds of whitespace (e.g. \t, \n), the pattern needs further changes. Keep in mind that x in re.findall(...) tests whether x equals a whole element of the returned list, so a phrase is only found when an entire run of matched characters equals it.

    2. Why is the match not case-insensitive? Your code fragment any(x in re.findall(pattern,text) for x in sector[cat]) requires x to have exactly the same upper/lower case in BOTH the result of re.findall and sector[cat]. This constraint cannot be bypassed even by setting flags=re.I in the re.findall() call, because the flag changes what the pattern accepts, not the case of the strings returned. I suggest converting both sides to the same case before checking, for example to lower case: any(x.lower() in re.findall(pattern,text.lower()) for x in sector[cat]) Here .lower() is applied to both text and x.

    With these two changes, you should be able to capture some of the categorized keywords.
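
To see what the two hints do in practice, here is a minimal sketch using the question's text and a sector dictionary with the comma after 'Drug Delivery' in place (an assumption about the intended data). The names tokens and matches are illustrative only.

```python
import re

text = ('data-ls-static="1">Making Bio Implants, Drug Delivery and '
        '3D Printing in Medicine,MEDICINE</h3>')
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine',
                       'medical technology', 'bio cell']}

# Pattern from hint 1: letters, digits and plain spaces.
pattern = r'[a-zA-Z0-9 ]+'

# Hint 2: lower-case the text once, then tokenize it.
tokens = re.findall(pattern, text.lower())

# `x.lower() in tokens` is an exact list-membership test, so a keyword
# only matches when a whole run of matched characters equals it.
matches = [cat for cat in sector
           if any(x.lower() in tokens for x in sector[cat])]
```

In this run the category is found only through the stand-alone token 'medicine'. The phrase 'drug delivery' sits inside a longer run of letters and spaces, so it is not an element of tokens.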

    Actually, for this particular case, you may not need regular expressions and re.findall at all. You can simply check whether each keyword, lower-cased, is a substring of the lower-cased text, e.g. sector[cat][i].lower() in text.lower(). That is, change the list comprehension as follows:

    [cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
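
One caveat with a plain substring test is that it also matches inside longer words (for instance 'medicine' inside 'biomedicines'). If that matters, a possible alternative is one case-insensitive regex per category with word boundaries. This is only a sketch: the helper name categorize and the sample strings are hypothetical and not part of the original answer.

```python
import re

sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine',
                       'medical technology', 'bio cell']}

# One compiled pattern per category: \b...\b keeps matches on word
# boundaries, re.escape protects special characters in keywords, and
# re.IGNORECASE makes the match case-insensitive.
compiled = {
    cat: re.compile(
        r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b',
        re.IGNORECASE,
    )
    for cat, keywords in sector.items()
}

def categorize(text):
    """Return the categories whose keywords occur in text as whole words."""
    return [cat for cat, rx in compiled.items() if rx.search(text)]

hits = categorize('Bio Implants, Drug Delivery and 3D Printing, MEDICINE')
misses = categorize('new biomedicines pipeline')
```

Compiling the patterns once, outside the loop over sites, avoids rebuilding them for every page.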
    

    Edit

    Test Run with 2-word phrase:

    text = 'drug delivery'
    sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
    [cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
    
    Output:       # Successfully got the categorizing keyword even with dictionary values of different upper/lower cases
    ['med tech']
    
    text = 'Drug Store fast delivery'
    [cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
    
    Output:    # Correctly doesn't match when there are extra words in between
    
    []
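
The third point of the question (the [] shown in the DataFrame) is not covered above. The brackets appear because each cell holds a Python list. One possible way to avoid them, offered here as an assumption rather than as part of the original answer, is to join each list into a string before storing it. The sample values below are made up to mirror the question's output.

```python
# Sample per-site results shaped like the question's ind_list (made-up values).
ind_list = [['med tech'], ['med tech', 'services'], []]

# Join each list into one comma-separated string; an empty list becomes ''.
ind_strings = [', '.join(ind) for ind in ind_list]
```

Assigning ind_strings instead of ind_list to the column (for example websites['Indication'] = ind_strings) would then store plain strings, with empty cells where nothing matched.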