Search code examples
pythonbigdatadaskmultiple-conditionsdask-dataframe

Dask "Column assignment doesn't support type numpy.ndarray"


I'm trying to use Dask instead of pandas since the data size I'm analyzing is quite large. I wanted to add a flag column based on several conditions.

import dask.array as da
data['Flag'] = da.where((data['col1']>0) & (data['col2']>data['col4'] | data['col3']>data['col4']), 1, 0).compute()

But, then I got the following error message. The above code works perfectly when using np.where with pandas dataframe, but didn't work with dask.array.where.

enter image description here


Solution

  • If numpy works and the operation is row-wise, then one solution is to use .map_partitions:

    def create_flag(data):
        data['Flag'] = np.where((data['col1']>0) & (data['col2']>data['col4'] | data['col3']>data['col4']), 1, 0)
        return data
    
    ddf = ddf.map_partitions(create_flag)