Tags: python, pandas, union, frozenset

Frozenset union of two columns


I have a dataset with two columns of frozensets, and I would like to take the row-wise union of the two. I can do this with a for loop, but the dataset has more than 27 million rows, so I am looking for a way to avoid the loop. Any thoughts?

Data

import pandas as pd
import numpy as np
d = {'ID1': [frozenset(['a', 'b']), frozenset(['a','c']), frozenset(['c','d'])],
    'ID2': [frozenset(['c', 'g']), frozenset(['i','f']), frozenset(['t','l'])]}
df = pd.DataFrame(data=d)

Code with for loop

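from functools import reduce
# Collect each row's union in a list and assign the column once. Writing via
# df['frozenset'].iloc[i] = ... is chained assignment and may not update df.
unions = []
for i in range(len(df)):
    unions.append(reduce(frozenset.union, [df['ID1'][i], df['ID2'][i]]))
df['frozenset'] = unions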

Desired output

    ID1      ID2     frozenset
0   (a, b)  (c, g)  (a, c, g, b)
1   (a, c)  (f, i)  (a, c, f, i)
2   (c, d)  (t, l)  (c, d, t, l)

Solution

  • It doesn't seem like you need functools.reduce here. A direct union of each pair of frozensets should suffice.

    If you want the most speed possible for this sort of operation, I recommend a list comprehension (see "For loops with pandas - When should I care?" for an exhaustive discussion).

    df['union'] = [x | y for x, y in zip(df['ID1'], df['ID2'])]
    df
    
          ID1     ID2         union
    0  (a, b)  (c, g)  (c, a, b, g)
    1  (c, a)  (f, i)  (c, a, i, f)
    2  (c, d)  (l, t)  (c, l, d, t)
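
The element order printed above differs from the desired output in the question, which is expected: frozensets are unordered and compare by content. A minimal sketch of a sanity check, assuming the sample data from the question (the names `fast`, `slow` and `same` are illustrative, not part of the original answer):

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({
    'ID1': [frozenset(['a', 'b']), frozenset(['a', 'c']), frozenset(['c', 'd'])],
    'ID2': [frozenset(['c', 'g']), frozenset(['i', 'f']), frozenset(['t', 'l'])],
})

# List-comprehension version: pair up the two columns and union each pair.
fast = [x | y for x, y in zip(df['ID1'], df['ID2'])]

# Row-by-row version in the spirit of the question's loop, kept in a plain list.
slow = [reduce(frozenset.union, [df['ID1'][i], df['ID2'][i]])
        for i in range(len(df))]

# frozensets compare by content, so differing display order does not matter.
same = fast == slow
print(same)
```

If the two lists compare equal, the list comprehension can stand in for the loop regardless of how the sets happen to be displayed.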
    

    If you want this to generalise to more than two columns, you can unpack each row into frozenset.union(), which accepts any number of sets.

    df['union2'] = [frozenset.union(*X) for X in df[['ID1', 'ID2']].values]
    df
    
          ID1     ID2         union        union2
    0  (a, b)  (c, g)  (c, a, b, g)  (c, a, b, g)
    1  (c, a)  (f, i)  (c, a, i, f)  (c, a, i, f)
    2  (c, d)  (l, t)  (c, l, d, t)  (c, l, d, t)
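
The multi-column idea can also be written without `.values`, by zipping the selected columns so that each row arrives as a tuple of frozensets. This is only a sketch: the helper `union_columns` and the third column `ID3` are hypothetical and added purely for illustration.

```python
import pandas as pd


def union_columns(frame, columns):
    """Return a list holding the row-wise union of the frozensets in `columns`."""
    # zip(*...) walks the selected columns in lockstep, yielding one tuple per row.
    rows = zip(*(frame[col] for col in columns))
    # Starting from an empty frozenset keeps the result type a frozenset.
    return [frozenset().union(*row) for row in rows]


df = pd.DataFrame({
    'ID1': [frozenset(['a', 'b']), frozenset(['a', 'c']), frozenset(['c', 'd'])],
    'ID2': [frozenset(['c', 'g']), frozenset(['i', 'f']), frozenset(['t', 'l'])],
    # Hypothetical extra column, not in the original question.
    'ID3': [frozenset(['x']), frozenset(['y']), frozenset(['z'])],
})

df['union_all'] = union_columns(df, ['ID1', 'ID2', 'ID3'])
print(df['union_all'].tolist())
```

Passing the column names as a list keeps the call the same whether two or twenty columns are involved.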