I have a existing distribution of values and I want to draw samples of size 5, but those 5 samples need to have a std of X within some tolerance. For example, I need 5 samples that have a std of 10 (even though the overall distribution is std=~32).
The example code below somewhat works, but is quite slow for large dataset. It randomly samples the distribution until it finds something close to the target std, then removes those elements so they can't be drawn again.
Is there a smarter way to do this properly and faster? It works ok for some target_std (above 6), but it isn't accurate below 6.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(23)
# Create a distribution
d1 = np.random.normal(95, 5, 200)
d2 = np.random.normal(125, 5, 200)
d3 = np.random.normal(115, 10, 200)
d4 = np.random.normal(70, 10, 100)
d5 = np.random.normal(160, 5, 200)
d6 = np.random.normal(170, 20, 100)
dist = np.concatenate((d1, d2, d3, d4, d5, d6))
print(f"Full distribution: len={len(dist)}, mean={np.mean(dist)}, std={np.std(dist)}")
plt.hist(dist, bins=100)
plt.title("Full Distribution")
plt.show();
batch_size = 5
num_batches = math.ceil(len(dist)/batch_size)
target_std = 10
tolerance = 1
# how many samples to search
num_samples = 100
result = []
# Find samples of batch_size that are closest to target_std
for i in range(num_batches):
samples = []
idxs = np.arange(len(dist))
for j in range(num_samples):
indices = np.random.choice(idxs, size=batch_size, replace=False)
sample = dist[indices]
std = sample.std()
err = abs(std - target_std)
samples.append((sample, indices, std, err, np.mean(sample), max(sample), min(sample)))
if err <= tolerance:
# close enough, stop sampling
break
# sort by smallest err first, then take the first/best result
samples = sorted(samples, key=lambda x: x[3])
best = samples[0]
if i % 100 == 0:
pass
print(f"{i}, std={best[2]}, err={best[3]}, nsamples={num_samples}")
result.append(best)
# remove the data from our source
dist = np.delete(dist, best[1])
df_samples = pd.DataFrame(result, columns=["sample", "indices", "std", "err", "mean", "max", "min"])
df_samples["err"].plot(title="Errors (target_std - batch_std)")
batch_std = df_samples["std"].mean()
batch_err = df_samples["err"].mean()
print(f"RESULT: Target std: {target_std}, Mean batch std: {batch_std}, Mean batch err: {batch_err}")
Since your problem is not restricted to a certain distribution, I use a normally random distribution, but this should work for any distribution. However the run time will depend on the population size.
population = np.random.randn(1000)*32
std = 10.
tol = 1.
n_samples = 5
samples = list(np.random.choice(population, n_samples))
while True:
center = np.mean(samples)
dis = [abs(i-center) for i in samples]
if np.std(samples)>(std+tol):
samples.pop(dis.index(max(dis)))
elif np.std(samples)<(std-tol):
samples.pop(dis.index(min(dis)))
else:
break
samples.append(np.random.choice(population, 1)[0])
Here is how the code works.
First, draw n_samples
, probably the std is not in the range you want, so we calculate the mean and absolute distance of each sample to the mean. Then if the std is larger than the desired value plus tolerance, we kick the furthest sample and draw a new one and vice versa.
Note that if this takes too much time to calculate for your data, after kicking the outlier out, you can calculate what should be the range of the next element that should be drawn in the population, instead of randomly taking one. Hopefully this works for you.
DISCLAIMER: This is not a random draw anymore, and you should be aware that the draw is biased and is not representative of the population.