python, dask, cudf

How to create a unique ID column in dask_cudf


How do I create a unique ID column in a dask_cudf dataframe across all the partitions? So far I am using the following technique, but if I increase the data to more than 100 million (10 crore) rows it gives me a memory error.

import cupy

def unique_id(df):
    # cupy.arange (not "arrange") builds the sequential IDs on the GPU
    rag = cupy.arange(len(df))
    df['unique_id'] = rag
    return df

part = data.npartitions
data = data.repartition(npartitions=1)
cols_meta = {c: str(data[c].dtype) for c in data.columns}
data = data.map_partitions(unique_id, meta={**cols_meta, 'unique_id': 'int64'})
data = data.repartition(npartitions=part)

If there's any other way, or any modification in code, please suggest. Thank you for help


Solution

  • I was doing that because I wanted to create IDs sequentially, up to the length of the data.

    The other suggestions will likely work. However, one of the easiest ways to do this is to create a temporary column with the value 1 and use cumsum, like the following:

    import cudf
    import dask_cudf

    df = cudf.DataFrame({
        "a": ["dog"]*10
    })
    ddf = dask_cudf.from_cudf(df, 3)

    ddf["temp"] = 1
    ddf["monotonic_id"] = ddf["temp"].cumsum()
    del ddf["temp"]

    print(ddf.partitions[2].compute())
         a  monotonic_id
    8  dog             9
    9  dog            10
    

    As expected, the two rows in partition index 2 have IDs 9 and 10. If you need the IDs to start at 0, you can subtract 1.
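    For example, a zero-based variant of the same cumsum trick (a sketch reusing the setup from the snippet above; `zero_based_id` is an illustrative name) would be:

    ```python
    import cudf
    import dask_cudf

    df = cudf.DataFrame({"a": ["dog"] * 10})
    ddf = dask_cudf.from_cudf(df, 3)

    # Same cumsum trick, shifted so the IDs run 0..9 instead of 1..10
    ddf["temp"] = 1
    ddf["zero_based_id"] = ddf["temp"].cumsum() - 1
    del ddf["temp"]
    ```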