I have extracted some variables from my Python data set, and I want to generate a larger data set from the distributions I have. The difficulty is that I want to introduce some variability into the new data set while maintaining similar behaviour. This is an example of my extracted data, which consists of 400 observations:
Value   Observation Count   Ratio of Entries
1       352                 0.88
2       28                  0.07
3       8                   0.02
4       4                   0.01
7       4                   0.01
13      4                   0.01
Now I am trying to use this information to generate a similar data set with 2,000 observations. I am aware of the numpy.random.choice and random.choice functions, but I do not want to use the exact same distribution. Instead, I would like to generate random values (the Value column) based on the distribution but with more variability. An example of what I want my larger data set to look like:
Value   Observation Count   Ratio of Entries
1       1763                0.8815
2       151                 0.0755
3       32                  0.0160
4       19                  0.0095
5       10                  0.0050
6       8                   0.0040
7       2                   0.0010
8       4                   0.0020
9       2                   0.0010
10      3                   0.0015
11      1                   0.0005
12      1                   0.0005
13      1                   0.0005
14      2                   0.0010
15      1                   0.0005
So the new distribution is something that could be estimated if I fitted my original data with an exponential decay function; however, I am not interested in continuous variables. How do I get around this, and is there a particular mathematical method relevant to what I am trying to do?
It sounds like you want to generate data based on the PDF described in the second table. The PDF is something like
    0                  for x <= B
    A*exp(-A*(x-B))    for x > B
A is the decay rate, so 1/A sets the width of your distribution, which is always normalized to have an area of 1. B is the horizontal offset, which is zero in your case. You can make it an integer distribution by binning with ceil.
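To make the binning step concrete, here is a short derivation of the probability that ceil assigns to each integer. It assumes B = 0, and the symbol N = ceil(X) is introduced here only for illustration:

```latex
\begin{aligned}
P(N = n) &= \int_{n-1}^{n} A\, e^{-A x}\, dx \\
         &= e^{-A (n-1)} - e^{-A n} \\
         &= \left(1 - e^{-A}\right) e^{-A (n-1)}, \qquad n = 1, 2, 3, \dots
\end{aligned}
```

In other words, the binned values follow a geometric distribution on the integers starting at 1, with success probability 1 - exp(-A).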
The CDF of a normalized decaying exponential is 1 - exp(-A*(x-B)) for x > B. Generally, a simple way to sample from a custom distribution is to generate uniform numbers and map them through the inverse of the CDF.
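As a rough illustration of that general approach, the sketch below draws uniform numbers, maps them through the inverse of the exponential CDF, and bins the result with ceil. It assumes B = 0; the function name sample_binned_exponential and the rate a=2.0 are hypothetical choices for the example, not values fitted to the data.

```python
import numpy as np

def sample_binned_exponential(a, size, rng=None):
    """Inverse-transform sampling of an exponential (B = 0), binned with ceil."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(size)          # uniform draws in [0, 1)
    x = -np.log1p(-u) / a         # inverse CDF: solve u = 1 - exp(-a*x) for x
    # ceil maps the interval (n-1, n] onto n; the maximum guards the edge case u == 0
    return np.maximum(np.ceil(x), 1).astype(int)

# a=2.0 is an arbitrary illustrative rate (assumption, not a fitted value)
samples = sample_binned_exponential(a=2.0, size=2000, rng=np.random.default_rng(0))
```

With a rate of 2.0, roughly 1 - exp(-2) of the samples should land on the value 1, which is in the same range as the ratio in the question's tables.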
Fortunately, you won't have to do that, since scipy.stats.expon already provides the implementation you are looking for. All you have to do is fit to the data in your last column to get A (B is clearly zero). You can easily do this with curve_fit. Keep in mind that A maps to 1.0/scale in scipy PDF language.
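The mapping between A and scale can be checked numerically. The snippet below is a small sanity check, assuming an arbitrary illustrative rate A = 2.0 (not fitted to the data) and B = 0, which corresponds to loc=0 in scipy.stats.expon.

```python
import numpy as np
from scipy.stats import expon

A = 2.0                        # illustrative rate (assumption, not a fitted value)
x = np.linspace(0.1, 5.0, 50)  # evaluation points with x > B = 0

# Hand-written PDF and CDF from the formulas above
pdf_manual = A * np.exp(-A * x)
cdf_manual = 1 - np.exp(-A * x)

# scipy parameterizes the same distribution by loc (B) and scale (1/A)
pdf_scipy = expon.pdf(x, loc=0, scale=1.0 / A)
cdf_scipy = expon.cdf(x, loc=0, scale=1.0 / A)

match = np.allclose(pdf_manual, pdf_scipy) and np.allclose(cdf_manual, cdf_scipy)
```

If match is True, passing scale=1/A to expon reproduces the PDF and CDF written out by hand.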
Here is some sample code. I've added an extra layer of complexity by computing the integral of the objective function from n-1 to n for integer inputs, which takes the binning into account when doing the fit.
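import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import expon

def model(x, a):
    # Probability mass of the integer bin (x-1, x] under an exponential with rate a
    return np.exp(-a * (x - 1)) - np.exp(-a * x)
    # Alternative:
    # return -np.diff(np.exp(-a * np.concatenate(([x[0] - 1], x))))

x = np.arange(1, 16)
# Ratios from the last column of the second table (middle values elided here)
p = np.array([0.8815, 0.0755, ..., 0.0010, 0.0005])
# curve_fit returns (optimal parameters, covariance); keep only the fitted rate
(a,), _ = curve_fit(model, x, p, p0=0.01)
samples = np.ceil(expon.rvs(scale=1/a, size=2000)).astype(int)
samples[samples == 0] = 1  # guard against a draw of exactly 0
data = np.bincount(samples)[1:]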
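For reference, here is a self-contained variant of the same idea with the ratios written out in full from the second table in the question. It is a sketch, not a definitive implementation: the names binned_model, values, ratios and rate are hypothetical, the starting guess p0=1.0 and the seed 42 are assumptions made for the example, and passing a NumPy Generator as random_state assumes a reasonably recent SciPy.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import expon

def binned_model(x, a):
    # Integral of a*exp(-a*t) from x-1 to x: the mass ceil assigns to integer x
    return np.exp(-a * (x - 1)) - np.exp(-a * x)

values = np.arange(1, 16)
# "Ratio of Entries" column of the second table in the question
ratios = np.array([0.8815, 0.0755, 0.0160, 0.0095, 0.0050,
                   0.0040, 0.0010, 0.0020, 0.0010, 0.0015,
                   0.0005, 0.0005, 0.0005, 0.0010, 0.0005])

# curve_fit returns (parameters, covariance); p0=1.0 is an assumed starting guess
(rate,), _ = curve_fit(binned_model, values, ratios, p0=1.0)

rng = np.random.default_rng(42)  # fixed seed only for reproducibility
draws = expon.rvs(scale=1.0 / rate, size=2000, random_state=rng)
samples = np.maximum(np.ceil(draws), 1).astype(int)  # integers starting at 1

counts = np.bincount(samples)[1:]  # counts[i] is the number of samples equal to i+1
for value, count in enumerate(counts, start=1):
    print(value, count, count / samples.size)
```

The printed rows have the same three columns as the tables in the question (value, count, ratio), so the generated data can be compared against them directly.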