Tags: python, random, statistics, normal-distribution, weibull

Generate random data based on existing data


Is there a way in Python to generate random data based on the distribution of already existing data?

Here are the statistical parameters of my dataset:

Data
count   209.000000
mean    1.280144
std     0.374602
min     0.880000
25%     1.060000
50%     1.150000
75%     1.400000
max     4.140000

Since the data is not normally distributed, I can't do this with np.random.normal. Any ideas?

[Figure: Distribution]

Thank you.

Edit: Performing KDE:

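import seaborn as sns
from sklearn.neighbors import KernelDensity

# Gaussian KDE fitted to column 'y' of the DataFrame `data`
kde = KernelDensity(kernel='gaussian', bandwidth=0.525566).fit(data['y'].to_numpy().reshape(-1, 1))
# Draw 2400 new points and plot them (sns.distplot is deprecated; histplot replaces it)
sns.histplot(kde.sample(2400).ravel(), kde=True)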

[Figure: KDE]


Solution

  • In general, real-world data doesn't exactly follow a "nice" distribution like the normal or Weibull distributions.

    Similarly to machine learning, there are generally two steps to sampling from a distribution of data points:

    • Fit a data model to the data.

    • Then, predict a new data point based on that model, with the help of randomness.
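
The two steps above can be sketched with a kernel density estimate. This is only a minimal illustration, not the asker's actual setup: the array `data` is a synthetic right-skewed placeholder (an assumption, since the real dataset is not available here), and SciPy's `gaussian_kde` with its default bandwidth is just one possible choice of model.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Placeholder data (assumption): a right-skewed sample standing in for the real column.
data = 0.88 + rng.lognormal(mean=-1.2, sigma=0.7, size=209)

# Step 1: fit a model to the data (a Gaussian KDE with the default Scott's-rule bandwidth).
kde = gaussian_kde(data)

# Step 2: draw new points from the fitted model.
# resample returns an array of shape (1, 2400) for 1-D data, so take the first row.
samples = kde.resample(size=2400, seed=1)[0]

print("original mean:", data.mean(), "sampled mean:", samples.mean())
```

Because a Gaussian kernel has unbounded support, some sampled values can fall below the smallest observed value. If the data has a hard lower bound, the samples may need to be filtered or the model changed.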

    There are several ways to estimate the distribution of data and sample from that estimate:

    • Kernel density estimation.
    • Gaussian mixture models.
    • Histograms.
    • Regression models.
    • Other machine learning models.
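
As one example from this list, a histogram can serve as the fitted model: choose a bin with probability proportional to its count, then draw a uniform value inside that bin. The sketch below uses only NumPy. The `data` array is again a synthetic placeholder, and the bin count of 20 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder data (assumption): stands in for the real dataset.
data = 0.88 + rng.lognormal(mean=-1.2, sigma=0.7, size=209)

# Fit: the histogram's bin counts and edges are the model.
counts, edges = np.histogram(data, bins=20)
probs = counts / counts.sum()

# Sample: pick bins in proportion to their counts,
# then draw a uniform value between each chosen bin's edges.
bins = rng.choice(len(counts), size=2400, p=probs)
samples = rng.uniform(edges[bins], edges[bins + 1])

print("sample range:", samples.min(), samples.max())
```

Unlike the KDE approach, these samples never leave the observed range, at the cost of a blockier estimate whose shape depends on the bin count.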

    In addition, methods such as maximum likelihood estimation make it possible to fit a known distribution (such as the normal distribution) to data, but the estimated distribution is generally rougher than with kernel density estimation or other machine learning models.
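
A minimal sketch of the maximum likelihood route, assuming SciPy is available: fit a parametric distribution with its `fit` method, then sample from the fitted parameters. The log-normal family and the fixed location `floc=0` are assumptions made for illustration (the data is positive and right-skewed); nothing in the question establishes that this family is the right one, and `data` is a synthetic placeholder.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder data (assumption): stands in for the real dataset.
data = 0.88 + rng.lognormal(mean=-1.2, sigma=0.7, size=209)

# Maximum likelihood fit of a log-normal distribution, with the location fixed at 0.
shape, loc, scale = stats.lognorm.fit(data, floc=0)

# Sample from the fitted distribution.
samples = stats.lognorm.rvs(shape, loc=loc, scale=scale, size=2400, random_state=1)

print("fitted parameters:", shape, loc, scale)
```

Other families, such as `stats.weibull_min`, expose the same `fit` and `rvs` methods, so candidates can be compared by plotting the fitted density over a histogram of the data.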

    See also my section "Random Numbers from a Distribution of Data Points".