python timestamp normal-distribution

Distributing the random data records across the day using Python


I'm designing a data simulator that generates a number of records set by a limit. The limit can be anything from 100 to 10000.

limit = 100

The records should be distributed across the whole day, e.g. 15% of the records in the 0th hour, 20% in the 1st hour, 5% in the 2nd hour, and so on.

How can I simulate this kind of distribution in Python, and which library would help with the logic?

Right now I can generate records like the ones below:

t_id    t_amount    gateway    transaction_date
101     30          Master     11/05/2016
102     10          Amex       11/05/2016

The transaction date has no time component. I want timestamps like the records below, with all 100 records distributed across the whole day. How can I achieve this?

t_id    t_amount    gateway    transaction_date
101     30          Master     11/05/2016 00:21:42
102     10          Amex       11/05/2016 01:22:42

Solution

  • Here's one way to generate something along the lines of what you describe. Note that limit can be made random, as can the weights per hour.

    In [78]: df.tail()
    Out[78]:
                        gateway  t_amount  t_id
    transaction_date
    2016-11-05 03:00:00    Amex        68   195
    2016-11-05 03:00:00    Amex        41   196
    2016-11-05 03:00:00  Master        66   197
    2016-11-05 03:00:00    Amex        59   198
    2016-11-05 03:00:00    Amex        45   199
    

    The code below pregenerates the hourly timestamps from the desired number of observations (limit) and the weights per hour. It then uses NumPy's random module to generate the sample data. See the NumPy documentation for other distributions.

    import numpy as np
    import pandas as pd
    
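    #total number of observations:
    limit = 10**2
    N = limit
    #share of transactions in each hour, converted to whole counts.
    #the counts must sum to limit so the index lines up with the random data.
    weights_per_hour = np.round(np.array([.35, .25, .25, .15])*limit).astype(int)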
    
    #generate time range using Pandas datetime functions
    #(pandas 2.2 and later spell the hourly alias 'h'; 'H' is deprecated there)
    time_range = pd.date_range(start = '20161105',freq='H', periods=4)
    
    #generate data index according to the hour distribution.
    time_indx  = time_range.repeat(weights_per_hour)
    
    #create temp data frame as a housing unit.
    dat_dict =  {"t_id":[x+100 for x in range(N)], "transaction_date":time_indx}
    temp_df = pd.DataFrame(dat_dict)
    
    #enter the choices for transaction type
    gateway_choice = np.array(['Master', 'Amex'])
    
    #generate random data
    rnd_df = pd.DataFrame({"t_amount":np.random.randint(low=1, high=100,size=limit), "gateway":np.random.choice(gateway_choice,limit)})
    
    #attach random data to temp_df
    df = pd.concat([rnd_df, temp_df], axis=1)
    df.set_index('transaction_date', inplace=True)
    

    In the code above, the index is a DatetimeIndex, so the timestamps are stored even when the default printout does not show them in the format you want. To turn the index back into an ordinary column, call df.reset_index(). To render the timestamps as text in a specific format, use df.index.strftime() with a format string.
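
The code above stamps every record exactly on the hour, whereas the question asks for times such as 00:21:42. One possible variation, offered as a sketch rather than as part of the original answer, is to draw an hour for each record with probability equal to that hour's weight and then add a random number of seconds within the hour. The function name random_timestamps, the seed argument, and the weights for hours 3 to 23 are assumptions made for illustration. Because the hours are sampled, the per-hour shares match the weights only approximately.

```python
import numpy as np
import pandas as pd


def random_timestamps(day, hour_probs, limit, seed=None):
    """Draw `limit` timestamps on `day`; hour h is chosen with probability hour_probs[h]."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(hour_probs, dtype=float)
    probs = probs / probs.sum()  # normalise so the weights sum to 1
    # pick an hour for every record according to the weights
    hours = rng.choice(len(probs), size=limit, p=probs)
    # add a uniformly random offset, in seconds, inside the chosen hour
    seconds = rng.integers(0, 3600, size=limit)
    offsets = pd.to_timedelta(hours * 3600 + seconds, unit="s")
    return (pd.Timestamp(day) + offsets).sort_values()


# Assumed weights: 15%, 20% and 5% for hours 0 to 2 (from the question),
# with the remaining 60% spread evenly over hours 3 to 23 (made up here).
hour_probs = [0.15, 0.20, 0.05] + [0.60 / 21] * 21

stamps = random_timestamps("2016-11-05", hour_probs, limit=100, seed=0)
# format the first few timestamps in the style used in the question
print(stamps.strftime("%m/%d/%Y %H:%M:%S")[:3])
```

The returned DatetimeIndex can be used in place of time_indx when building the data frame. If exact per-hour counts matter more than randomness, the repeat-based approach above can be kept and only the random seconds offset added to it.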