Search code examples
pythonpandasdataframesetunordered

How to handle unordered sets of tags in Python Pandas Dataframe in terms of merging


I have two Dataframes which contain tags in one of the columns. Tags can be present or absent in the list and are stored in an unordered manner. I would like to get the merge of the two datasets with rows that have the same tags. In plain python I can store them as the set datatype.

tags1 = {"tag1","tag2"}
tags2 = {"tag2","tag1"}
tags1 == tags2
True

However I cant do the same in Pandas since sets are not hashable.

df1 = pd.DataFrame([{'name': 'foo', 'tags': {"tag1","tag2"}}, {'name': 'bar', 'tags': {"tag3","tag4"}}])
df2 = pd.DataFrame([{'name': 'baz', 'tags': {"tag4","tag3"}}, {'name': 'qux', 'tags': {"tag5","tag6"}}])
df1.merge(df2, how="inner", indicator=True, on="tags")

What I would like to see

     name_x  value           name_y _merge
0    bar     {"tag3","tag4"} baz    both

What I get

TypeError: unhashable type: 'set'

What would be the best approach here? Unfortunatelly I cant define a fixed amount of columns since the quantity of the tags is not fixed.


Solution

  • Use frozensets and assign to left_on and right_on parameters if need original sets columns:

    out = (df1.merge(df2, how="inner", indicator=True, 
                    left_on=df1['tags'].apply(frozenset),
                    right_on=df2['tags'].apply(frozenset)))
    print (out)
              key_0 name_x        tags_x name_y        tags_y _merge
    0  (tag3, tag4)    bar  {tag3, tag4}    baz  {tag3, tag4}   both
    

    Or assign frozensets to tags columns:

    out = (df1.assign(tags=df1['tags'].apply(frozenset))
              .merge(df2.assign(tags=df2['tags'].apply(frozenset)), 
                    how="inner", 
                    indicator=True, 
                    on="tags"))
    print (out)
      name_x          tags name_y _merge
    0    bar  (tag3, tag4)    baz   both