I have two Dataframes which contain tags in one of the columns. Tags can be present or absent in the list and are stored in an unordered manner. I would like to get the merge of the two datasets with rows that have the same tags. In plain python I can store them as the set datatype.
tags1 = {"tag1","tag2"}
tags2 = {"tag2","tag1"}
tags1 == tags2
True
However I cant do the same in Pandas since sets are not hashable.
df1 = pd.DataFrame([{'name': 'foo', 'tags': {"tag1","tag2"}}, {'name': 'bar', 'tags': {"tag3","tag4"}}])
df2 = pd.DataFrame([{'name': 'baz', 'tags': {"tag4","tag3"}}, {'name': 'qux', 'tags': {"tag5","tag6"}}])
df1.merge(df2, how="inner", indicator=True, on="tags")
What I would like to see
name_x value name_y _merge
0 bar {"tag3","tag4"} baz both
What I get
TypeError: unhashable type: 'set'
What would be the best approach here? Unfortunatelly I cant define a fixed amount of columns since the quantity of the tags is not fixed.
Use frozenset
s and assign to left_on
and right_on
parameters if need original set
s columns:
out = (df1.merge(df2, how="inner", indicator=True,
left_on=df1['tags'].apply(frozenset),
right_on=df2['tags'].apply(frozenset)))
print (out)
key_0 name_x tags_x name_y tags_y _merge
0 (tag3, tag4) bar {tag3, tag4} baz {tag3, tag4} both
Or assign frozenset
s to tags
columns:
out = (df1.assign(tags=df1['tags'].apply(frozenset))
.merge(df2.assign(tags=df2['tags'].apply(frozenset)),
how="inner",
indicator=True,
on="tags"))
print (out)
name_x tags name_y _merge
0 bar (tag3, tag4) baz both