As per the title, I am trying to combine the row values from different cudf.DataFrame columns. The following code works for a standard pandas.DataFrame:
import pandas as pd
data = {'a': [1], 'b': [2], 'c': [3], 'd': [4]}
df = pd.DataFrame.from_dict(data)
def f(row):
    return {'dictfromcolumns': [row['a'], row['b'], row['c'], row['d']]}
df['new'] = df.apply(f, axis=1)
The equivalent code with cudf should look like:
import cudf
dfgpu = cudf.DataFrame(df)
dfgpu['new'] = dfgpu.apply(f, axis=1)
But this throws the following ValueError exception:
ValueError: user defined function compilation failed.
Is there an alternative way to combine cudf columns? In my case, I need to create a dict from the row values and store it as the value in a new column.
Thanks!
pandas allows storing arbitrary data structures inside columns (such as a dictionary of lists, in your case). cuDF does not. However, cuDF provides an explicit data type called struct, which is common in big data processing engines and may be what you want in this case.
Your UDF is failing because Numba's CUDA compiler, which cuDF uses to compile UDFs, doesn't understand Python dictionary or list data structures.
The best way to do this is to first collect your data into a single column as a list (cuDF also provides an explicit list data type). You can do this by melting your data from wide to long (and adding a key column to keep track of the original rows) and then doing a groupby collect operation. Then, create the struct column.
import pandas as pd
import cudf
import numpy as np
data = {'a': [1, 10], 'b': [2, 11], 'c': [3, 12], 'd': [4, 13]}
df = pd.DataFrame.from_dict(data)
gdf = cudf.from_pandas(df)
gdf["key"] = np.arange(len(gdf))
melted = gdf.melt(id_vars=["key"], value_name="struct_key_name") # wide to long format
gdf["new"] = melted.groupby("key").collect()[["struct_key_name"]].to_struct()
gdf
    a   b   c   d  key                                    new
0   1   2   3   4    0      {'struct_key_name': [1, 4, 2, 3]}
1  10  11  12  13    1  {'struct_key_name': [10, 13, 11, 12]}
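To make the data flow of the melt and collect steps concrete, here is a minimal plain-Python sketch of the same idea. It does not use cuDF or a GPU; the names `melted`, `collected`, and `new_column` are only illustrative. Note that this sketch keeps the values in column order, whereas the cuDF output above shows them in a different order within each list ([1, 4, 2, 3]), so don't rely on the collected order matching the column order.

```python
# Plain-Python illustration (no cuDF needed) of the melt -> collect flow.
data = {"a": [1, 10], "b": [2, 11], "c": [3, 12], "d": [4, 13]}
n_rows = len(data["a"])

# "Melt": wide to long, one (key, variable, value) record per original cell.
melted = [
    (key, column, values[key])
    for column, values in data.items()
    for key in range(n_rows)
]

# "Groupby collect": gather all values that share a key into one list.
collected = {}
for key, _column, value in melted:
    collected.setdefault(key, []).append(value)

# Wrap each list in a single-field record, mirroring the struct column.
new_column = [{"struct_key_name": collected[key]} for key in range(n_rows)]
print(new_column)
```

Each original row ends up as one record with a single named field holding the list of that row's values, which is the shape the struct column has in the cuDF version.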
Note that the struct column in cuDF is not the same as "a dictionary in a column". It's a much more efficient, explicit type meant for storing and manipulating columnar {key : value} data. cuDF provides a "struct accessor" to manipulate structs, which you can access at df[col].struct.XXX. It currently supports selecting individual fields (keys) and the explode operation. You can also carry structs around in other operations (including I/O).