Search code examples
pythonpandasgroup-bycomparison

Python Pandas: How to groupby and compare columns


Here is my datafarme 'df':

match           name                   group  
adamant         Adamant Home Network   86   
adamant         ADAMANT, Ltd.          86   
adamant bild    TOV Adamant-Bild       86   
360works        360WORKS               94   
360works        360works.com           94

Per group number I want to compare the names one by one and see if they are matched to a same word from the 'match' column.

So desired output will be counts:

 If they match we count it as 'TP' and if not we count it as 'FN'.

I had an idea of counting number of match words per group number but that would not help completely with what I want:

df.groupby(group).count() 

Does any body have an idea how to do it?


Solution

  • If I understood well your question, this should do the work:

    import re
    import pandas
    
    
    df = pandas.DataFrame([['adamant', 'Adamant Home Network', 86], ['adamant', 'ADAMANT, Ltd.', 86],
                           ['adamant bild', "TOV Adamant-Bild", 86], ['360works', '360WORKS', 94],
                           ['360works ', "360works.com ", 94]], columns=['match', 'name', 'group'])
    
    
    def my_function(group):
        for i, row in group.iterrows():
            if ''.join(re.findall("[a-zA-Z]+", row['match'])).lower() not in ''.join(
                    re.findall("[a-zA-Z]+", row['name'])).lower():
                # parsing the names in each columns and looking for an inclusion
                # if one of the inclusion fails, we return 'FN'
                return 'FN'
        # if all inclusions succeed, we return 'TP'
        return 'TP'
    
    
    res_series = df.groupby('group').apply(my_function)
    res_series.name = 'count'
    res_df = res_series.reset_index()
    print res_df
    

    This will give you this DataFrame:

         group     count
    1    86        'TP'
    2    94        'TP'