I am trying to write pandas code that samples a DataFrame using a normal distribution. The most convenient way would be to use the random_state parameter of the sample method, but somehow employ numpy.random.Generator.normal so that the samples are drawn from a normal (Gaussian) distribution.
import pandas as pd
import numpy as np
import random
# Generate a list of unique random numbers
temp = random.sample(range(1, 101), 100)
df = pd.DataFrame({'temperature': temp})
# Sample normal
rng = np.random.default_rng()
df.sample(n=10, random_state=rng.normal())
This obviously doesn't work. There is an issue with random_state=rng.normal().
Passing a Generator to sample just changes the way the generator is initialized; it won't change the distribution that is used. Random sampling is uniform (choice is used internally [source]) and you can't change that directly with the random_state parameter.
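A quick way to check this yourself is sketched below. The seeds and the sequential temperature column are arbitrary choices for the demonstration, not part of the question. Two generators built from the same seed select the same rows, and over many draws every row comes up about equally often:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"temperature": np.arange(1, 101)})

# Two generators created from the same seed select exactly the same rows,
# so random_state only controls reproducibility.
first = df.sample(n=10, random_state=np.random.default_rng(0))
second = df.sample(n=10, random_state=np.random.default_rng(0))
same_rows = first.index.equals(second.index)

# Over many draws with replacement, each row is picked about equally often,
# i.e. the sampling distribution is uniform whatever generator is passed.
counts = (
    df.sample(n=100_000, replace=True, random_state=np.random.default_rng(1))
    .index.value_counts()
)
print(same_rows, counts.min(), counts.max())
```

With 100 rows and 100,000 draws, each count should sit close to 1,000.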
Also note that normal sampling doesn't really make sense for discrete values (like the rows of a DataFrame).
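To illustrate why: a normal draw is a continuous value, so it has to be mapped onto row positions before it can select anything. A rough sketch of one such mapping, assuming a mean at the middle row and a scale of 10 (both arbitrary, not taken from the question), rounds and clips the draws. This is not what sample does internally, just one way to discretize:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"temperature": rng.permutation(np.arange(1, 101))})

n_rows = len(df)
# Continuous normal draws centred on the middle row (assumed mean and scale).
draws = rng.normal(loc=(n_rows - 1) / 2, scale=10, size=10)
# Round to the nearest integer position and clip so every position is valid.
positions = np.clip(np.rint(draws), 0, n_rows - 1).astype(int)
# The same position can appear more than once, so this samples with replacement.
sampled = df.iloc[positions]
print(sampled)
```

Note that clipping piles the tails of the distribution onto the first and last rows, which is one reason the weights approach is usually cleaner.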
Now, assuming you want to sample the rows of your DataFrame in a non-uniform way (for example with weights that follow a normal distribution), you could use the weights parameter to pass custom weights for each row.
Here is an example with normal weights (although I'm not sure if this makes much sense):
rng = np.random.default_rng()
weights = abs(rng.normal(size=len(df)))
sampled = df.sample(n=10000, replace=True, weights=weights)
Another example based on this Q/A. Here we'll give higher probabilities to the rows from the middle of the DataFrame:
from scipy.stats import norm
N = len(df)
weights = norm.pdf(np.arange(N)-N//2, scale=5)
df.sample(n=10, weights=weights).sort_index()
Output (mostly rows around 50):
    temperature
43           94
44           50
47           80
48           99
50           63
51           52
52            1
53           20
54           41
63            3
Probabilities of sampling with a bias for the center (and sampled points):
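The probabilities themselves can be reproduced without a plot. The sketch below makes a few assumptions: it uses an unnormalized Gaussian written with NumPy instead of scipy.stats.norm.pdf (the two give the same probabilities once normalized), a sequential temperature column, and an arbitrary seed. sample normalizes the weights to sum to 1, so dividing by the sum gives the per-row probability:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"temperature": np.arange(1, 101)})
N = len(df)

# Gaussian-shaped weights centred on row N // 2 with a standard deviation of 5.
offsets = np.arange(N) - N // 2
weights = np.exp(-0.5 * (offsets / 5) ** 2)

# Normalizing the weights gives the probability of picking each row.
probabilities = weights / weights.sum()
peak_row = int(np.argmax(probabilities))
central_mass = probabilities[40:61].sum()  # rows 40 to 60 inclusive

# Empirical check: draw many rows with replacement and see where they land.
draws = df.sample(
    n=10_000, replace=True, weights=weights,
    random_state=np.random.default_rng(0),
)
central_fraction = ((draws.index >= 40) & (draws.index <= 60)).mean()
print(peak_row, central_mass, central_fraction)
```

Most of the probability mass sits within ten rows of the center, which matches the sampled output above.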