I'm designing a data simulator that generates a number of records set by a limit. The limit can be anything from 100 to 10000:
limit = 100
The records should be distributed across the whole day, e.g. 15% of the records in the 0th hour, 20% in the 1st hour, 5% in the 2nd hour, and so on.
How can I simulate this kind of distribution in Python, and which library would help with the logic?
Right now I am able to simulate records like the ones below:
t_id t_amount gateway transaction_date
101 30 Master 11/05/2016
102 10 Amex 11/05/2016
As you can see, the transaction date doesn't have a timestamp. I want timestamps like in the records below, where all 100 records are distributed across the whole day. How can I achieve this?
t_id t_amount gateway transaction_date
101 30 Master 11/05/2016 00:21:42
102 10 Amex 11/05/2016 01:22:42
Here's one way to generate something along the lines of what you describe. Note that limit
can be made random, as can the weights per hour.
In [78]: df.tail()
Out[78]:
gateway t_amount t_id
transaction_date
2016-11-05 03:00:00 Amex 68 195
2016-11-05 03:00:00 Amex 41 196
2016-11-05 03:00:00 Master 66 197
2016-11-05 03:00:00 Amex 59 198
2016-11-05 03:00:00 Amex 45 199
The code below pregenerates the hours given the desired number of observations limit
and the weights per hour. It then uses NumPy's random module to generate the sample data. Check out its documentation for other distributions.
import numpy as np
import pandas as pd
#total number of observations:
limit = 10**2
#number of ids to generate; must equal limit so the columns line up
N = limit
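#share of transactions during each hour (should sum to 1), converted to counts.
#round before casting so floating-point error does not truncate a count.
weights_per_hour = np.round(np.array([.35, .25, .25, .15])*limit).astype(int)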
#generate time range using Pandas datetime functions
time_range = pd.date_range(start = '20161105',freq='H', periods=4)
#generate data index according to the hour distribution.
time_indx = time_range.repeat(weights_per_hour)
#create temp data frame as a housing unit.
dat_dict = {"t_id":[x+100 for x in range(N)], "transaction_date":time_indx}
temp_df = pd.DataFrame(dat_dict)
#enter the choices for transaction type
gateway_choice = np.array(['Master', 'Amex'])
#generate random data
rnd_df = pd.DataFrame({"t_amount":np.random.randint(low=1, high=100,size=limit), "gateway":np.random.choice(gateway_choice,limit)})
#attach random data to temp_df
df = pd.concat([rnd_df, temp_df], axis=1)
df.set_index('transaction_date', inplace=True)
In the code above, the index is in a timestamp format. You may have to play around to get it to print the way you want, but the time is certainly stored. If the index ever needs converting to datetimes, use pd.to_datetime(df.index),
and use df.reset_index()
to put it back into the dataframe as a regular column.
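The code above puts every record exactly on the hour. If you also want minutes and seconds, as in the question, one possible approach is to add a random offset inside each hour. The following is only a sketch under a few assumptions: the helper names hour_counts and simulate_transactions are made up for illustration, the offsets are drawn uniformly within the hour, and leftover records from rounding are handed to the hours with the largest fractional share so the counts always add up to limit.

```python
import numpy as np
import pandas as pd

def hour_counts(limit, weights):
    """Turn per-hour weights into integer counts that sum exactly to limit.

    Hypothetical helper: uses largest-remainder rounding.
    """
    weights = np.asarray(weights, dtype=float)
    exact = weights / weights.sum() * limit      # ideal (fractional) counts
    counts = np.floor(exact).astype(int)
    leftover = limit - counts.sum()              # records lost to flooring
    # give the leftover records to the hours with the largest fractional parts
    order = np.argsort(exact - counts)[::-1]
    counts[order[:leftover]] += 1
    return counts

def simulate_transactions(limit, weights, day="2016-11-05", seed=None):
    """Hypothetical helper: build limit records spread over the hours of day."""
    rng = np.random.default_rng(seed)
    counts = hour_counts(limit, weights)
    # hour number for every record, e.g. 35 zeros, 25 ones, ...
    hours = np.repeat(np.arange(len(counts)), counts)
    # uniform random offset in seconds within the hour (an assumption)
    seconds = rng.integers(0, 3600, size=limit)
    offsets = pd.to_timedelta(hours * 3600 + seconds, unit="s")
    stamps = (pd.Timestamp(day) + offsets).sort_values()
    return pd.DataFrame({
        "t_id": np.arange(101, 101 + limit),
        "t_amount": rng.integers(1, 100, size=limit),
        "gateway": rng.choice(["Master", "Amex"], size=limit),
        "transaction_date": stamps,
    })

# weights for the first four hours only; pass 24 values to cover the whole day
df = simulate_transactions(100, [.35, .25, .25, .15], seed=0)
# format the timestamps the way the question shows them
df["transaction_date_str"] = df["transaction_date"].dt.strftime("%m/%d/%Y %H:%M:%S")
```

Passing a different limit (say, anything from 100 to 10000) or a 24-element weight list should work the same way. If approximate shares are acceptable, drawing the hours with rng.choice(..., p=weights) would be a simpler alternative to the exact counts used here.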