How to create a unique index in a Dask DataFrame?


Imagine I have a Dask DataFrame from read_csv or created another way.

How can I make a unique index for the dask dataframe?

Note:

reset_index builds a monotonically ascending index within each partition, restarting at 0 every time. That means (0,1,2,3,4,5,...) for Partition 1, (0,1,2,3,4,5,...) for Partition 2, (0,1,2,3,4,5,...) for Partition 3 and so on.

I would like a unique index for every row in the dataframe (across all partitions).


Solution

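  • This is my approach (a function) for building a unique index with map_partitions and random numbers, since reset_index on its own only creates a monotonically ascending index within each partition. Each partition gets a random prefix drawn from random.SystemRandom, so a collision between two partitions is extremely unlikely, though not strictly impossible.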

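    import sys
    import random
    from dask.distributed import Client
    
    client = Client()
    
    def createDDF_u_idx(ddf):
        """Return ddf indexed by a new string column 'idx' that is unique across partitions."""
    
        def create_u_idx(df):
            # One random prefix per partition, drawn from the OS entropy source
            rng = random.SystemRandom()
            p_id = str(rng.randint(0, sys.maxsize))
    
            # Copy so the input partition is not mutated in place
            df = df.copy()
            # Label = partition prefix + 'a' + row position within the partition
            df['idx'] = [p_id + 'a' + str(x) for x in range(df.index.size)]
    
            return df
    
        # Describe the output columns so Dask does not have to infer them
        cols_meta = {c: str(ddf[c].dtype) for c in ddf.columns}
        ddf = ddf.map_partitions(create_u_idx, meta={**cols_meta, 'idx': 'str'})
        # Compute up to here and keep the results in memory, so the random
        # prefixes are drawn once and not redrawn if the graph runs again
        ddf = client.persist(ddf)
        ddf = ddf.set_index('idx')
    
        return ddf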