Imagine I have a Dask DataFrame, created with read_csv or built some other way.

How can I make a unique index for the Dask DataFrame?
Note: reset_index builds a monotonically ascending index in each partition, i.e. (0, 1, 2, 3, 4, 5, ...) for partition 1, (0, 1, 2, 3, 4, 5, ...) for partition 2, (0, 1, 2, 3, 4, 5, ...) for partition 3, and so on.

I would like an index that is unique for every row in the DataFrame (across all partitions).
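A minimal sketch of that behavior (the sample data and partition count are made up for illustration):

import pandas as pd
import dask.dataframe as dd

# 9 rows split into 3 partitions (example data)
pdf = pd.DataFrame({'x': range(9)})
ddf = dd.from_pandas(pdf, npartitions=3)

# reset_index numbers each partition independently, so the
# values 0, 1, 2 repeat once per partition
print(ddf.reset_index(drop=True).compute().index.tolist())
# -> [0, 1, 2, 0, 1, 2, 0, 1, 2]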
This is my approach (a function) for building a unique index with map_partitions and truly random numbers, since a plain reset_index only yields a monotonically ascending index within each partition:
import sys
import random

from dask.distributed import Client

client = Client()

def createDDF_u_idx(ddf):
    def create_u_idx(df):
        # Tag this partition with a random ID from a cryptographic source,
        # so the prefixes are (practically) unique across partitions.
        rng = random.SystemRandom()
        p_id = str(rng.randint(0, sys.maxsize))
        # Prefix + separator + per-row counter; the 'a' separator ensures
        # two different prefixes can never produce the same index string.
        df['idx'] = [p_id + 'a' + str(x) for x in range(len(df))]
        return df

    # strings are held as object dtype in the meta
    cols_meta = {c: str(ddf[c].dtype) for c in ddf.columns}
    ddf = ddf.map_partitions(create_u_idx, meta={**cols_meta, 'idx': 'object'})
    # Persist so the random prefixes are computed exactly once; set_index
    # triggers computation and would otherwise regenerate different random
    # values on each pass over the data.
    ddf = client.persist(ddf)
    ddf = ddf.set_index('idx')
    return ddf
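A quick usage sketch (the sample data is made up; with SystemRandom over 0..sys.maxsize, a duplicate partition prefix is astronomically unlikely, though not strictly impossible):

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({'x': range(9)}), npartitions=3)
ddf = createDDF_u_idx(ddf)

result = ddf.compute()
assert result.index.is_unique  # every row got a distinct index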