I have the following values that describe a dataset:
Number of Samples: 5388
Mean: 4173
Median: 4072
1st Decile: 2720
9th Decile: 5676
I need to generate a dataset (any dataset) that fits these values. All the examples I found require the standard deviation, which I don't have. How can this be done? Thanks!
Interesting question! Based on Scott's suggestions I gave it a quick try.
Inputs:
import random
import pandas as pd
import numpy as np
# fixing the random seed
random.seed(a=1, version=2)
# formatting floats
pd.options.display.float_format = '{:.1f}'.format
# given inputs
count = 5388
mean = 4173
median = 4072
lower_percentile = 10
lower_percentile_value = 2720
upper_percentile = 90
upper_percentile_value = 5676
# not given in the question: assumed bounds for the largest and smallest values
max_value = 6325
min_value = 2101
The Function:
def generate_dataset(count, mean, median, lower_percentile, upper_percentile,
                     lower_percentile_value, upper_percentile_value,
                     min_value, max_value):
    # note: `mean` is not used directly; the mean is steered via mean_adjuster below
    # number of values in each of the four segments:
    # below the lower percentile, lower percentile to median,
    # median to upper percentile, above the upper percentile
    p_1_size = int(float(lower_percentile) * float(count) / 100)
    p_4_size = int(count - (float(upper_percentile) * float(count) / 100))
    p_2_size = int((count / 2) - p_1_size)
    p_3_size = int((count / 2) - p_4_size)
    # upper bound of the third segment; raise or lower it to adjust the mean
    mean_adjuster = 5790
    # randomly pick the right number of integers from each segment's range
    p_1 = random.choices(range(min_value, lower_percentile_value), k=p_1_size)
    p_2 = random.choices(range(lower_percentile_value, median), k=p_2_size)
    p_3 = random.choices(range(median, mean_adjuster), k=p_3_size)
    p_4 = random.choices(range(upper_percentile_value, max_value), k=p_4_size)
    return p_1 + p_2 + p_3 + p_4
dataset = generate_dataset(
    count, mean, median, lower_percentile, upper_percentile,
    lower_percentile_value, upper_percentile_value, min_value, max_value
)
Comparison:
# converting into DataFrame
df = pd.DataFrame({"x": dataset})
new_count = len(df)
new_mean = np.mean(df.x)
new_median = np.quantile(df.x, 0.5)
new_lower_percentile = np.quantile(df.x, lower_percentile/100)
new_upper_percentile = np.quantile(df.x, upper_percentile/100)
compare = pd.DataFrame(
    {
        "value": ["count", "mean", "median", "low_p", "high_p"],
        "original": [count, mean, median, lower_percentile_value, upper_percentile_value],
        "new": [new_count, new_mean, new_median, new_lower_percentile, new_upper_percentile]
    }
)
print(compare)
Output:
    value  original    new
0   count      5388 5388.0
1    mean      4173 4173.4
2  median      4072 4072.5
3   low_p      2720 2720.4
4  high_p      5676 5743.0
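Getting the values to be exactly equal is a bit tricky when all your values are integers rather than floats. The larger gap in high_p has a different cause: the third segment is drawn from the range up to mean_adjuster = 5790, which is above the 9th decile of 5676, so more than 10% of the values end up above 5676.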
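If floats are acceptable, one way to get an exact match is to pin the order statistics that NumPy interpolates between and then tune the mean. This is only a sketch under assumptions: `exact_dataset` and its parameter names are made up for illustration, the `min_value`/`max_value` bounds are the same hand-picked ones as above, and the quantiles are assumed to be computed with `np.quantile`'s default linear interpolation. The power transform `u ** gamma` is just one convenient way to slide the mean; it was not part of the original attempt.

```python
import numpy as np

def exact_dataset(count, mean, median, p10, p90, min_value, max_value, seed=1):
    """Floats with pinned 10th/50th/90th percentiles and a tuned mean (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # With linear interpolation, quantile q lies between sorted positions
    # floor(q*(n-1)) and floor(q*(n-1)) + 1. Setting both to the target
    # value makes the quantile exact.
    pins = [int(np.floor(q * (count - 1))) for q in (0.1, 0.5, 0.9)]
    edges = [min_value, p10, median, p90, max_value]
    starts = [0] + [k + 2 for k in pins]
    sizes = [stop - start for start, stop in zip(starts, pins + [count])]
    # Per-value lower and upper bounds for the free (non-pinned) values.
    lo = np.repeat(edges[:-1], sizes).astype(float)
    hi = np.repeat(edges[1:], sizes).astype(float)
    u = rng.random(lo.size)
    pinned = np.repeat([p10, median, p90], 2).astype(float)

    def build(gamma):
        # u**gamma stays in [0, 1), so every value stays inside its segment.
        return np.concatenate([lo + (hi - lo) * u ** gamma, pinned])

    # The mean decreases as gamma grows, so bisect on log(gamma).
    log_lo, log_hi = np.log(1e-3), np.log(1e3)
    if not build(np.exp(log_hi)).mean() < mean < build(np.exp(log_lo)).mean():
        raise ValueError("target mean is not reachable with these bounds")
    for _ in range(200):
        mid = 0.5 * (log_lo + log_hi)
        if build(np.exp(mid)).mean() > mean:
            log_lo = mid  # mean too high: need a larger gamma
        else:
            log_hi = mid
    return build(np.exp(0.5 * (log_lo + log_hi)))

data = exact_dataset(5388, 4173, 4072, 2720, 5676, 2101, 6325)
print(len(data), data.mean(), np.median(data),
      np.quantile(data, 0.1), np.quantile(data, 0.9))
```

The result is not uniformly distributed inside each segment anymore, and the array comes back unshuffled (free values first, pinned values last), so shuffle it if the order matters.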
You can add another variable to control the mean with two numbers, or change the random seed and see whether you get closer values. Alternatively, you can write a function that keeps changing the seed until the values are equal (this might take a couple of minutes or a couple of centuries :)
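A rough sketch of the seed-search idea, with a cap on the number of attempts so it cannot run for centuries. The names `generate`, `error_for` and `search_seed` are hypothetical, `generate` is a compact stand-in for the function above (it assumes the same hand-picked bounds and 10%/90% cut-offs), and the score only looks at the mean and the median, since the 9th decile is biased by `mean_adjuster` regardless of the seed.

```python
import random
import statistics

def generate(seed, count=5388, median=4072, p10=2720, p90=5676,
             min_value=2101, max_value=6325, mean_adjuster=5790):
    """Compact stand-in for generate_dataset, with a private RNG per seed."""
    rng = random.Random(seed)
    tail = count // 10        # values below the 1st / above the 9th decile
    body = count // 2 - tail  # values between a decile and the median
    return (rng.choices(range(min_value, p10), k=tail)
            + rng.choices(range(p10, median), k=body)
            + rng.choices(range(median, mean_adjuster), k=body)
            + rng.choices(range(p90, max_value), k=tail))

def error_for(seed, target_mean=4173, target_median=4072):
    """Worst absolute deviation of mean and median from their targets."""
    data = generate(seed)
    return max(abs(statistics.fmean(data) - target_mean),
               abs(statistics.median(data) - target_median))

def search_seed(max_seeds=200, tol=0.5):
    """Try seeds 0..max_seeds-1; stop early once the error is within tol."""
    best_seed, best_err = 0, error_for(0)
    for seed in range(1, max_seeds):
        if best_err <= tol:
            break
        err = error_for(seed)
        if err < best_err:
            best_seed, best_err = seed, err
    return best_seed, best_err

best_seed, best_err = search_seed(max_seeds=50)
print(best_seed, round(best_err, 2))
```

Raising `max_seeds` or loosening `tol` trades run time against how close the best seed gets; an exact integer match is not guaranteed to exist at all.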
Cheers!