Search code examples
pythonpandasdatetimetimedelta

Add random number of microseconds to each index in a dataframe's datetime index


I have a dataframe that uses a DateTime index. Due to the nature of the data, there are duplicate rows in the raw data that produce duplicate indices. I would like to add a random number of microseconds to each index to make them unique.

So, my index is currently:

2021-06-01 08:58:47+00:00

and I want it to be:

2021-06-01 08:58:47.011356+00:00

I have a column named TimeUTC which I used to create the index. I tried adding a random number of microseconds to it:

    df.index = df.index + datetime.timedelta(microseconds= random.randrange(0,1000000,1))

but that added the same number of microseconds to all rows. I had the same result when trying to use apply + lambda.

  1. How can I add a random number of microseconds to each row's TimeUTC?
  2. Can I do this directly to the index instead of updating TimeUTC and then reindexing?

Thank you.


Solution

  • you need to use to_timedelta with unit='us' for microseconds, and an array of random number of the same length than the dataframe

    # dummy data
    df = pd.DataFrame({
        'a':range(10), 
        'TimeUTC':pd.to_datetime(['2021-10-01 00:00:00']*3+['2021-10-01 00:00:01']*5
                                 +['2021-10-01 00:00:02']*2, utc=True)}
    )
    
    # set seed for reproducibility
    np.random.seed(10)
    
    # set the index from TimeUTC and add microseconds with to_timedelta
    df.index = df['TimeUTC']+pd.to_timedelta(np.random.randint(1,1000000,len(df)),unit='us')
    
    print(df)
                                      a                   TimeUTC
    2021-10-01 00:00:00.345354+00:00  0 2021-10-01 00:00:00+00:00
    2021-10-01 00:00:00.760958+00:00  1 2021-10-01 00:00:00+00:00
    2021-10-01 00:00:00.881168+00:00  2 2021-10-01 00:00:00+00:00
    2021-10-01 00:00:01.443713+00:00  3 2021-10-01 00:00:01+00:00
    2021-10-01 00:00:01.617842+00:00  4 2021-10-01 00:00:01+00:00
    2021-10-01 00:00:01.105596+00:00  5 2021-10-01 00:00:01+00:00 
    2021-10-01 00:00:01.533661+00:00  6 2021-10-01 00:00:01+00:00
    2021-10-01 00:00:01.927706+00:00  7 2021-10-01 00:00:01+00:00
    2021-10-01 00:00:02.299742+00:00  8 2021-10-01 00:00:02+00:00
    2021-10-01 00:00:02.804337+00:00  9 2021-10-01 00:00:02+00:00
    

    Note that adding random number may not conserve the time order of your dataframe, see in this case for the value a=5, the datetime index is lower than the previous two rows