Search code examples
pythonpandasdataframeassociations

Obtaining which element in a df column appears most frequently for each individual element in another df column (list of various individual strings)


I have a column called 'tags' in my pandas dataframe that is a list of multiple strings.

[abc, 123, xyz]
[456, 123]
[abc, 123, xyz]

and I have another column tech that has a single string each

win
mac
win

Please let me know if there is a way for me to get which element in tech appears most frequently for each element in tags. For example, 'abc' is most frequently associated with 'win' as compared to the other techs. So the output should look like :

abc win
123 win
xyz win
456 mac

Solution

  • IIUC , you can explode the Tags column and use crosstab with idxmax:

    Input:

    d = {'Tags':[['abc', 123, 'xyz'],[456, 123],['abc', 123, 'xyz']],
         'tech':['win','mac','win']}
    df = pd.DataFrame(d)
    print(df)
    
                  Tags tech
    0  [abc, 123, xyz]  win
    1       [456, 123]  mac
    2  [abc, 123, xyz]  win
    

    Solution:

    m = df.explode('Tags')
    out = pd.crosstab(m['Tags'],m['tech']).idxmax(1)
    
    
    Tags
    123    win
    456    mac
    abc    win
    xyz    win
    dtype: object