Tags: python, list, numpy, dask, dask-dataframe

Creating a new column in Dask (arrays, list)


What is the Dask equivalent of the following expression?

df['x'] = np.where(df['y'].isin(a_list), 'yes', 'no')

Here df will be a Dask DataFrame with n partitions, and a_list is just a list of items.

If I simply change np.where to da.where while using the Dask DataFrame, I get an error saying that the number of partitions does not match (1 != n).


Solution

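  • This can be achieved without NumPy by mapping the boolean result of isin to labels (use "yes"/"no" in the mapping to match the labels in the question):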

    df["x"] = df["y"].isin(a_list).map({False: "No", True: "Yes"})
    

    Here's a reproducible example:

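    import dask
    
    # Built-in demo dataset; it already has a float column "x", which is overwritten below.
    df = dask.datasets.timeseries(seed=123)
    
    # isin yields a boolean Series; map turns each boolean into a label.
    df["x"] = df["name"].isin(["Bob", "Tim"]).map({False: "No", True: "Yes"})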
    
    print(df.head(10))
    #                        id      name    x         y
    # timestamp
    # 2000-01-01 00:00:00  1064     Wendy   No  0.921843
    # 2000-01-01 00:00:01   983     Edith   No -0.196625
    # 2000-01-01 00:00:02  1028     Alice   No -0.512889
    # 2000-01-01 00:00:03  1000       Tim  Yes -0.378292
    # 2000-01-01 00:00:04  1022     Wendy   No -0.640633
    # 2000-01-01 00:00:05  1024       Bob  Yes  0.664895
    # 2000-01-01 00:00:06  1011     Quinn   No  0.940216
    # 2000-01-01 00:00:07   971   Norbert   No -0.750241
    # 2000-01-01 00:00:08  1035    Hannah   No -0.335760
    # 2000-01-01 00:00:09  1041  Patricia   No  0.984533