python pandas dataframe pandas-apply

Match all words from string in another string (words can be in different positions)


I have a list of strings that I have to match against a dataframe column.

The list looks as follows:

list = ['golden village lte', 'pones wcdma', 'coral gbts', 'street view gbts', 'street view wcdma']

The column in the dataframe looks like this:

data = {'COLUMN': ['wcdma street view disconnected', 'gbts planned work street view', 'lte atn golden village optical invalid', 'wcdma street view planned work']}

I'd like to find every row that contains every word of a string from the list, so that I end up with the following dataframe:

  COLUMN                               |  String    
 wcdma street view disconnected        | street view wcdma  
 gbts planned work street view         | street view gbts  
 lte atn golden village optical invalid| golden village lte  
 wcdma street view planned work        | street view wcdma   

What I tried was to pass a string from the list as a list of words (like ['street', 'view', 'wcdma']) and search with:

df.apply(lambda x: all(er in x.COLUMN for er in list), axis=1)

But it returns nothing, even when I know there must be at least one match. It does return something if I change all() to any(), but that's not what I need.


Solution

    import pandas as pd

    list1 = ['golden village lte', 'pones wcdma', 'coral gbts', 'street view gbts', 'street view wcdma']
    # Split each phrase into its words
    list2 = [x.split(' ') for x in list1]
    data = {'COLUMN': ['wcdma street view disconnected', 'gbts planned work street view', 'lte atn golden village optical invalid', 'wcdma street view planned work']}
    data = pd.DataFrame(data)

    def search(x):
        # Words of the current row
        words = x.split(' ')
        for y in list2:
            # True only if every word of the phrase occurs in the row
            if all(item in words for item in y):
                return ' '.join(y)
        return None

    data['matched'] = data['COLUMN'].transform(search)
    

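    Explanation: each string is first split on spaces into a list of words. transform() then calls search() on every value of 'COLUMN', and all() checks whether every word of a phrase 'y' from 'list2' occurs among the words of that row. If so, the phrase is returned as a string; if no phrase matches, the result is None.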
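
The same idea can also be written with sets, which turns the "every word is present" check into a subset test. This is only a minimal sketch of that variant, not the answer's original code: the names `phrases`, `phrase_sets` and `first_match` are made up for illustration, and the output column is called `String` only because the question's desired table uses that header. Like the version above, it matches whole words, so 'views' would not count as 'view'.

```python
import pandas as pd

# Phrases from the question (illustrative variable name).
phrases = ['golden village lte', 'pones wcdma', 'coral gbts',
           'street view gbts', 'street view wcdma']

# Split each phrase into a set of words once, keeping the original text.
phrase_sets = [(p, set(p.split())) for p in phrases]

df = pd.DataFrame({'COLUMN': [
    'wcdma street view disconnected',
    'gbts planned work street view',
    'lte atn golden village optical invalid',
    'wcdma street view planned work',
]})

def first_match(text):
    """Return the first phrase whose words all occur in text, else None."""
    words = set(text.split())
    for phrase, needed in phrase_sets:
        # Subset test: every word of the phrase is among the row's words.
        if needed <= words:
            return phrase
    return None

df['String'] = df['COLUMN'].apply(first_match)
print(df)
```

If a row could match several phrases, this returns only the first one in list order. Collecting all matches in a list inside `first_match` would be a small change.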