
Dask map_partitions meta when using lambda function to add column


I am using Dask to apply a function myfunc, which adds two new columns new_col_1 and new_col_2 to my Dask dataframe ddata. The function uses two existing columns, a1 and a2, to compute the new ones.

ddata[['new_col_1', 'new_col_2']] = ddata.map_partitions(
    lambda df: df.apply(lambda row: myfunc(row['a1'], row['a2']),
                        axis=1, result_type="expand")).compute()

This gives the following error:

    ValueError: Metadata inference failed in `lambda`.

    You have supplied a custom function and Dask is unable to
    determine the type of output that that function returns.

    To resolve this please provide a meta= keyword.

How can I provide the meta keyword for this scenario?


Solution

  • meta can be provided via the meta= keyword argument to .map_partitions:

    some_result = dask_df.map_partitions(some_func, meta=expected_df)
    

    expected_df can be constructed manually (e.g. an empty pandas DataFrame with the right column labels and dtypes, or a dict mapping labels to dtypes), or alternatively you can compute it by running the function on a small sample of the data, in which case it will be a pandas DataFrame. A worked sketch for the question's scenario follows below.

    There are more details in the docs.
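
For the scenario in the question, a minimal sketch might look like the following (assuming a reasonably recent Dask version that supports multi-column assignment; myfunc here is a hypothetical stand-in that returns two values per row, since the real function is not shown). Because df.apply(..., axis=1, result_type="expand") produces integer-labelled columns 0 and 1, the meta maps those labels to the expected dtypes:

    import pandas as pd
    import dask.dataframe as dd

    # Hypothetical stand-in for the question's myfunc: returns two values per row.
    def myfunc(a1, a2):
        return a1 + a2, a1 * a2

    pdf = pd.DataFrame({'a1': [1.0, 2.0, 3.0], 'a2': [4.0, 5.0, 6.0]})
    ddata = dd.from_pandas(pdf, npartitions=2)

    # result_type="expand" turns each returned tuple into columns 0 and 1,
    # so meta describes a two-column frame with those labels and dtypes.
    expanded = ddata.map_partitions(
        lambda df: df.apply(lambda row: myfunc(row['a1'], row['a2']),
                            axis=1, result_type="expand"),
        meta={0: 'f8', 1: 'f8'})

    ddata[['new_col_1', 'new_col_2']] = expanded
    print(ddata.compute())

Note that the lazy result is assigned before calling .compute(), so the new columns stay part of the task graph and everything is materialized once at the end. To derive meta from a sample instead, as the answer suggests, the same apply can be run on a few rows of the underlying pandas frame and the resulting (non-empty) DataFrame passed as meta:

    sample = pdf.head(2).apply(lambda row: myfunc(row['a1'], row['a2']),
                               axis=1, result_type="expand")
    expanded = ddata.map_partitions(
        lambda df: df.apply(lambda row: myfunc(row['a1'], row['a2']),
                            axis=1, result_type="expand"),
        meta=sample)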