python pandas apply dask

Using Dask with an apply that returns several columns (i.e. a DataFrame)


I'm trying to use Dask on an apply with a function that outputs 5 floats. I'll simplify it in an example here.

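import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.multiprocessing import get  # assumption: the `get` used below is the multiprocessing scheduler

def func1(row, param):
    # Val1 is stored as a string, so convert it before doing arithmetic
    return float(row.Val1) * param, float(row.Val1) * np.power(param, 2)

data = pd.DataFrame(np.array([["A01", 12], ["A02", 24], ["A03", 13]]), columns=["ID", "Val1"])

data2 = dd.from_pandas(data, npartitions=2).map_partitions(lambda df: df.apply(lambda row: func1(row, 2), axis=1, result_type="expand"), meta=pd.DataFrame()).compute(scheduler=get)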

If I omit meta, I get this error message:

ValueError: Metadata inference failed in `lambda`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
ValueError("could not convert string to float: 'foo'", 'occurred at index 0')

And if I pass a meta (maybe not the appropriate one, though...), I get this one:

ValueError: The columns in the computed data do not match the columns in the provided metadata

Can anyone help? :)


Solution

  • The empty DataFrame that you pass as meta has no columns, but the output of your function does. Dask compares the two and reports the mismatch, which is the source of your second error.

    The meta value should match the column names and dtypes of your expected output.