Search code examples
pythonpandasmatchtopic-modeling

Find matching rows in dataframes based on number of matching items


I have two topic models, topics1 and topics2. They were created from very similar but different datasets. As a result, the words representing each topic/cluster as well as the topic numbers will be different for each dataset. A toy example looks like:

import pandas as pd
topics1 = pd.DataFrame({'topic_num':[1,2,3],
                        'words':[['red','blue','green'],
                                 ['blue','sky','cloud'],
                                 ['eat','food','nomnom']]
                        })
topics2 = pd.DataFrame({'topic_num':[1,2,3],
                        'words':[['blue','sky','airplane'],
                                 ['blue','green','yellow'],
                                 ['mac','bit','byte']]
                        })

For each topic in topics1, I would like to find the topic in topics2 with the maximum number of matches. In the above example, in topics1 topic_num 1 would match topic_num 2 intopics2 and topic_num 2 in topics1 would match topic_num 1 in topics2. In both of these cases, 2 of the 3 words in each row match across dataframes.

Is there a way to find this using built-in pandas functions such as eq()? My solution just iterated across every word in topics1 and eery word in topics2.


Solution

  • You will have to compute all combinations in any case.

    You could use sets and broadcast the comparisons:

    # set the topic_num as index
    # convert the Series of lists to Series of sets
    s1 = topics1.set_index('topic_num')['words'].map(set)
    s2 = topics2.set_index('topic_num')['words'].map(set)
    
    # convert to numpy arrays
    # broadcast the intersection of s1 and s2
    # to form a square array, convert back to DataFrame
    # get the length of the intersection
    tmp = (pd.DataFrame(s1.to_numpy()[:, None]
                        &s2.to_numpy(),
                        index=s1.index, columns=s2.index
                       )
             .map(len) #.applymap(len)
           )
    
    # get the col id with the max length per row
    # mask if the length was 0
    out = tmp.idxmax(axis=1).where(tmp.gt(0).any(axis=1))
    

    Output:

    topic_num
    1    2.0
    2    1.0
    3    NaN
    dtype: float64
    

    Intermediate tmp:

    topic_num  1  2  3
    topic_num         
    1          1  2  0
    2          2  1  0
    3          0  0  0
    

    Note that this logic can be wrapped by cdist as rightly suggested by @Onyambu, however I would first convert to set to avoid doing it repeatedly in the cdist (which would be expensive). This should be done in a bit weird way by encapsulating the sets in list since cdist requires a 2D input:

    from scipy.spatial.distance import cdist
    
    dst = lambda x, y : len(x[0] & y[0])
    val = cdist([[set(x)] for x in topics1['words']],
                [[set(x)] for x in topics2['words']], dst)
    
    pos = val.argmax(1)
    out = pd.DataFrame({'topic1': topics1['topic_num'],
                        'topic2': (topics2['topic_num']
                                   .iloc[pos]
                                   .where(val[topics1.index, pos]!=0)
                                   .values
                                  )})
    #    topic1  topic2
    # 0       1     2.0
    # 1       2     1.0
    # 2       3     NaN