Tags: python, pandas, dataframe, cudf

Assigning lists as elements of a cuDF DataFrame


With pandas, I can store lists as elements without issues, as in

import pandas as pd

A = {"cls": "A"}
B = {"cls": "B"}
C = {"cls": ["A", "B"]}

df = pd.DataFrame([A,B,C])
type(df.iloc[2]["cls"])   # Returns `list`

But cudf.DataFrame does not accept a list, as we can see here:

import cudf
cu_df = cudf.DataFrame([A, B, C])

Fails with ArrowTypeError: Expected bytes, got a 'list' object

If we leave out C, it works:

import cudf
cu_df = cudf.DataFrame([A, B])

(no error)

Converting from a regular pandas DataFrame does not work either:

cu_df = cudf.DataFrame(df)

(fails with the same ArrowTypeError)

Any ideas on how to work around this?


Solution

  • According to the documentation and this GitHub issue,

    list operations are somewhat limited, and a column of lists can't be treated the same as a column of ndarrays in Pandas.

    So you might try converting the list into a string:

    A = {"cls": "A"}
    B = {"cls": "B"}
    C = {"cls": str(["A", "B"])}
    

    and use it in cudf:

    df = pd.DataFrame([A, B, C])
    cu_df = cudf.DataFrame(df)
    

    If that does not help, the same issue suggests the following:

    explode each list column into a flat column, perform the binary operation, then construct a list column back from the result
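
The explode, operate, rebuild recipe quoted from the issue can be sketched as follows. This is a minimal illustration written with pandas so it runs without a GPU. cuDF also exposes an explode method, but whether each step carries over unchanged to your cuDF version is an assumption to verify. The lower-casing step is only a stand-in for whatever operation you actually need, and wrapping the scalar values in one-item lists is my own choice so that every element of the column has the same type.

```python
import pandas as pd

# Every element is a list; scalars are wrapped in one-item lists (an assumption
# made here so the column has a single, uniform element type).
df = pd.DataFrame({"cls": [["A"], ["B"], ["A", "B"]]})

# 1. Explode: one row per list element; the original row label is repeated.
flat = df["cls"].explode()

# 2. Operate on the flat column (placeholder operation: lower-case each label).
flat = flat.str.lower()

# 3. Rebuild: group by the original row label and collect the values into lists.
rebuilt = flat.groupby(level=0).agg(list)

print(rebuilt.tolist())
```

Grouping on the index level works because explode repeats the original row label for every element that came from the same list.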
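
For the string workaround suggested earlier in this answer, it helps to pick an encoding that can be reversed later. The sketch below swaps str() for json.dumps from the standard library, which is a substitution on my part and not what the answer used. If you keep str(), ast.literal_eval can parse the result back. Encoding every value, not just the list, is also my assumption, made so that decoding is uniform.

```python
import json

rows = [{"cls": "A"}, {"cls": "B"}, {"cls": ["A", "B"]}]

# Serialize every value so the column contains only strings.
encoded = [{"cls": json.dumps(row["cls"])} for row in rows]

# `encoded` is what would be passed to pd.DataFrame / cudf.DataFrame.

# After processing, decode the strings back into the original Python objects.
decoded = [{"cls": json.loads(row["cls"])} for row in encoded]

print(encoded[2]["cls"])
print(decoded == rows)
```

Keep in mind that the encoded column is plain text, so list-aware operations are not available on it until the values are decoded again.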