python pandas merge dask cartesian-product

Cross merge / Cartesian product in Dask


How can I perform the equivalent of this cross merge in Dask?

merged_df = pd.merge(df, df, how='cross', suffixes=('', '_y'))

For example, say I have this dataframe, A:

#Niusup Niucust
#1        a
#1        b 
#1        c
#2        d
#2        e

and I want to obtain this one:

#Niusup Niucust_x Niucust_y
#1        a       a
#1        a       b
#1        a       c
#1        b       a 
#1        b       b
#1        b       c
#1        c       a 
#1        c       b
#1        c       c
#2        d       d 
#2        d       e
#2        e       d
#2        e       e

I need Dask because dataframe A contains 5,000,000 observations, so I expect the Cartesian product to be very large.
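To gauge how large the result could get, it may help to count rows before merging. A full cross merge of n rows produces n * n rows (2.5e13 for n = 5,000,000), while pairing rows only within each #Niusup value, as in the expected output above, produces the sum of the squared group sizes. The sketch below illustrates this on the toy frame; the variable names are illustrative.

```python
import pandas as pd

# Same toy frame as dataframe A in the question.
df = pd.DataFrame({
    '#Niusup': ['#1', '#1', '#1', '#2', '#2'],
    'Niucust': ['a', 'b', 'c', 'd', 'e'],
})

# A full cross merge pairs every row with every row: n * n rows.
full_cross_rows = len(df) ** 2

# Pairing rows only within each '#Niusup' group yields sum(n_g ** 2) rows.
group_sizes = df.groupby('#Niusup').size()
per_group_rows = int((group_sizes ** 2).sum())

print(full_cross_rows)  # 25
print(per_group_rows)   # 13
```

The second number matches the 13 rows of the expected output, so the final size depends mostly on how the rows are distributed across #Niusup values.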

Thank you.


Solution

  • Example

    import pandas as pd

    # Sample frame matching dataframe A from the question
    data = {'#Niusup': {0: '#1', 1: '#1', 2: '#1', 3: '#2', 4: '#2'},
     'Niucust': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'}}
    df = pd.DataFrame(data)
    

    Code

    # Cross-merge each '#Niusup' group with itself, then stack the results
    pd.concat([df_i.merge(df_i.drop('#Niusup', axis=1), how='cross', suffixes=('', '_y'))
               for _, df_i in df.groupby('#Niusup')])
    

    Output:

        #Niusup Niucust Niucust_y
    0   #1      a       a
    1   #1      a       b
    2   #1      a       c
    3   #1      b       a
    4   #1      b       b
    5   #1      b       c
    6   #1      c       a
    7   #1      c       b
    8   #1      c       c
    0   #2      d       d
    1   #2      d       e
    2   #2      e       d
    3   #2      e       e
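The code above runs in pandas only. Since the expected output pairs rows only within each #Niusup value, an inner self-merge on that column should give the same rows, and Dask's DataFrame.merge accepts the same on and suffixes arguments. A minimal sketch under those assumptions follows; the helper names within_group_pairs and within_group_pairs_dask and the npartitions=2 default are hypothetical choices for illustration, not part of the original answer.

```python
import pandas as pd


def within_group_pairs(frame, key):
    """Pair every row with every row that shares the same `key` value."""
    # An inner self-merge on the key yields the per-group Cartesian product.
    return frame.merge(frame, on=key, suffixes=('', '_y'))


def within_group_pairs_dask(pdf, key, npartitions=2):
    """Same merge through dask.dataframe (assumes dask is installed)."""
    import dask.dataframe as dd  # optional dependency, imported lazily

    ddf = dd.from_pandas(pdf, npartitions=npartitions)
    # The result is lazy; call .compute() or write it out to execute.
    return within_group_pairs(ddf, key)


# Same toy frame as in the question.
df = pd.DataFrame({
    '#Niusup': ['#1', '#1', '#1', '#2', '#2'],
    'Niucust': ['a', 'b', 'c', 'd', 'e'],
})

pairs = within_group_pairs(df, '#Niusup')
print(len(pairs))            # 13
print(list(pairs.columns))   # ['#Niusup', 'Niucust', 'Niucust_y']
```

With Dask available, within_group_pairs_dask(df, '#Niusup').compute() would be expected to return the same 13 pairs, though possibly in a different row order. For the real 5,000,000-row frame, writing the lazy result to disk (for example with to_parquet) instead of calling compute() is likely the safer choice, because the merged result may not fit in memory.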