How to create a unique ID column in a dask_cudf dataframe across all partitions? So far I am using the following technique, but if I increase the data to more than 10 crore (100 million) rows it gives me a memory error:
import cupy

def unique_id(df):
    rag = cupy.arange(len(df))  # sequential ids for this partition
    df['unique_id'] = rag
    return df
part = data.npartitions
data = data.repartition(npartitions=1)
cols_meta = {c: str(data[c].dtype) for c in data.columns}
data = data.map_partitions(lambda df: unique_id(df), meta={**cols_meta, 'unique_id': 'int64'})
data = data.repartition(npartitions=part)
If there is any other way, or any modification to the code above, please suggest. Thank you for the help.
I was doing it this way because I wanted to create the IDs sequentially, from 0 up to the length of the data.
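One possible alternative that avoids collapsing everything into a single partition is to compute each partition's length up front, then shift a per-partition arange by the cumulative offset. This is only a sketch of that idea, not a tested solution; add_offset_ids is an illustrative helper, and it assumes data is a dask_cudf DataFrame as in the question:

import dask
import cupy

def add_offset_ids(df, offsets, partition_info=None):
    # Illustrative helper: shift this partition's local arange by the
    # partition's global offset. partition_info is supplied by dask at
    # run time (it is None during meta inference, hence the guard).
    start = offsets[partition_info["number"]] if partition_info else 0
    df = df.copy()
    df["unique_id"] = cupy.arange(len(df)) + start
    return df

# one pass to learn the partition sizes, then cumulative offsets
sizes = dask.compute(*[dask.delayed(len)(p) for p in data.to_delayed()])
offsets = [0]
for n in sizes[:-1]:
    offsets.append(offsets[-1] + n)

cols_meta = {c: str(data[c].dtype) for c in data.columns}
data = data.map_partitions(
    add_offset_ids, offsets,
    meta={**cols_meta, "unique_id": "int64"},
)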
The other suggestions will likely work. However, one of the easiest ways to do this is to create a temporary column with the value 1 and use cumsum, like the following:
import cudf
import dask_cudf

df = cudf.DataFrame({
    "a": ["dog"] * 10
})
ddf = dask_cudf.from_cudf(df, npartitions=3)

# a cumulative sum over a column of ones yields 1..N across all partitions
ddf["temp"] = 1
ddf["monotonic_id"] = ddf["temp"].cumsum()
del ddf["temp"]

print(ddf.partitions[2].compute())
a monotonic_id
8 dog 9
9 dog 10
As expected, the two rows in partition index 2 have IDs 9 and 10. If you need the IDs to start at 0, you can subtract 1.
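For instance, a minimal follow-up (the nunique assertion is only an optional sanity check you might add; it is not part of the answer above):

# shift the 1-based IDs down so they start at 0
ddf["monotonic_id"] = ddf["monotonic_id"] - 1

# optional check: every row received a distinct ID
assert ddf["monotonic_id"].nunique().compute() == len(ddf)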