Tags: pandas, rapids, cudf

User defined function to combine CUDF dataframe columns


As per the title, I am trying to combine the row values from different cudf.DataFrame columns. The following code works for a standard pandas.DataFrame:

import pandas as pd
data = {'a': [1], 'b': [2], 'c': [3], 'd': [4]}
df = pd.DataFrame.from_dict(data)

def f(row):
    return {'dictfromcolumns': [row['a'], row['b'], row['c'], row['d']]}

df['new'] = df.apply(f, axis=1)

The equivalent code with cudf should look like:

import cudf
dfgpu = cudf.DataFrame(df)
dfgpu['new'] = dfgpu.apply(f, axis=1)

But this will throw the following ValueError exception:

ValueError: user defined function compilation failed.

Is there an alternative way to combine cudf columns? In my case I need to create a dict and store it as the value in a new column.

Thanks!


Solution

  • pandas allows storing arbitrary data structures inside columns (such as a dictionary of lists, in your case). cuDF does not. However, cuDF provides an explicit data type called struct, which is common in big data processing engines and may be what you want in this case.

    Your UDF is failing because Numba's CUDA target (numba.cuda), which cuDF uses to compile UDFs, doesn't understand Python dictionary/list data structures.

    The best way to do this is to first collect your data into a single column as a list (cuDF also provides an explicit list data type). You can do this by melting your data from wide to long (and adding a key column to keep track of the original rows) and then doing a groupby collect operation. Then, create the struct column.

    import pandas as pd
    import cudf
    import numpy as np
    
    data = {'a': [1, 10], 'b': [2, 11], 'c': [3, 12], 'd': [4, 13]}
    df = pd.DataFrame.from_dict(data)
    
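    gdf = cudf.from_pandas(df)
    gdf["key"] = np.arange(len(gdf))  # row identifier so each original row can be regrouped
    
    # Wide to long format: one row per (key, original column) pair
    melted = gdf.melt(id_vars=["key"], value_name="struct_key_name")
    # Collect each row's values into a list column, then wrap that column in a struct
    gdf["new"] = melted.groupby("key").collect()[["struct_key_name"]].to_struct()
    gdf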
        a   b   c   d   key     new
    0   1   2   3   4   0   {'struct_key_name': [1, 4, 2, 3]}
    1   10  11  12  13  1   {'struct_key_name': [10, 13, 11, 12]}
    

    Note that the struct column in cuDF is not the same as "a dictionary in a column". It's a much more efficient, explicit type meant for storing and manipulating columnar {key : value} data. cuDF provides a "struct accessor" to manipulate structs, which you can access at df[col].struct.XXX. It currently supports selecting individual fields (keys) and the explode operation. You can also carry structs around in other operations (including I/O).
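
If you want to check the melt-then-collect logic without a GPU, the same reshape can be sketched in plain pandas. This is only a CPU stand-in and not the cuDF code path: it assumes `agg(list)` as the pandas counterpart of the groupby collect step, and it stores an ordinary Python dict per row (object dtype) rather than a real struct column. The names `collected` and `values` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 10], "b": [2, 11], "c": [3, 12], "d": [4, 13]})
df["key"] = range(len(df))  # row identifier, as in the cuDF version

# Wide to long: one row per (key, original column) pair.
melted = df.melt(id_vars=["key"], value_name="struct_key_name")

# Collect each original row's values back into a single list per key.
collected = melted.groupby("key")["struct_key_name"].agg(list)

# CPU stand-in for the struct column: one plain dict per row (object dtype).
df["new"] = collected.map(lambda values: {"struct_key_name": values})

print(df)
```

In this pandas version the collected values come out in column order (a, b, c, d). The cuDF output shown above lists them in a different order, so it is safer not to rely on the position of an element inside the collected list.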