Tags: python, pandas, numpy, statistics, data-science

How to generate a dataset based on mean, median, and 1st & 9th decile values?


I have the following values that describe a dataset:

Number of Samples: 5388
Mean: 4173
Median: 4072
1st Decile: 2720
9th Decile: 5676

I need to generate a dataset that fits these values. All the examples I found require the standard deviation, which I don't have. How can this be done? Thanks!


Solution

  • Interesting question! Based on Scott's suggestions I gave it a quick try.

    Inputs:

    import random
    import pandas as pd
    import numpy as np
    
    # fixing the random seed
    random.seed(a=1, version=2)
    # formatting floats
    pd.options.display.float_format = '{:.1f}'.format
    
    # given inputs
    count = 5388
    mean = 4173
    median = 4072
    
    lower_percentile = 10
    lower_percentile_value = 2720
    
    upper_percentile = 90
    upper_percentile_value = 5676
    
    # bounds chosen for illustration (not given in the question)
    max_value = 6325
    min_value = 2101
    

    The Function:

    def generate_dataset(count, mean, median, lower_percentile, upper_percentile,
        lower_percentile_value, upper_percentile_value,
        min_value, max_value
        ):
            
        # Calculate the number of values that fall within each percentile
        p_1_size = int(float(lower_percentile) * float(count) / 100)
        p_4_size = int(count - (float(upper_percentile) * float(count) / 100))
        p_2_size = int((count / 2) - p_1_size)
        p_3_size = int((count / 2) - p_4_size)
        
        # upper bound of the third segment; tune it to adjust the mean
        mean_adjuster = 5790
    
        # randomly pick values of right size from a range 
        p_1 = random.choices(range(min_value, lower_percentile_value), k=p_1_size)
        p_2 = random.choices(range(lower_percentile_value, median), k=p_2_size)
        p_3 = random.choices(range(median, mean_adjuster), k=p_3_size)
        p_4 = random.choices(range(upper_percentile_value, max_value), k=p_4_size)
        
        return p_1 + p_2 + p_3 + p_4
        
    dataset = generate_dataset(
        count, mean, median, lower_percentile, upper_percentile,
        lower_percentile_value, upper_percentile_value, min_value, max_value
        )
    

    Comparison:

    # converting into DataFrame
    df = pd.DataFrame({"x": dataset})
    
    new_count = len(df)
    new_mean = np.mean(df.x)
    new_median = np.quantile(df.x, 0.5)
    new_lower_percentile = np.quantile(df.x, lower_percentile/100)
    new_upper_percentile = np.quantile(df.x, upper_percentile/100)
    
    compare = pd.DataFrame(
        {
            "value": ["count", "mean", "median", "low_p", "high_p"],
            "original": [count, mean, median, lower_percentile_value, upper_percentile_value],
            "new":[new_count, new_mean, new_median, new_lower_percentile, new_upper_percentile]
        }
    )
    
    print(compare)
    

    Output:

       value  original    new
    0   count      5388 5388.0
    1    mean      4173 4173.4
    2  median      4072 4072.5
    3   low_p      2720 2720.4
    4  high_p      5676 5743.0
    

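    Getting the values to match exactly is a bit tricky when all your values are integers rather than floats. The 9th decile is also off (5743 vs. 5676) because the third segment is sampled up to mean_adjuster = 5790, so some of its values land above 5676.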
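
If floats are acceptable, one way to hit the targets is to pin the order statistics that `np.quantile` interpolates between and then stretch one tail until the sum matches the mean. This is a minimal sketch under assumptions, not the method used above: `dataset_from_summary`, `tail`, and `seed` are hypothetical names, `tail` (the initial width of each tail) is an arbitrary choice, and the pinning relies on `np.quantile`'s default linear interpolation and on `count` being large enough that the pinned positions do not overlap.

```python
import numpy as np

def dataset_from_summary(count, mean, median, d1, d9, tail=600.0, seed=0):
    """Return `count` floats matching the mean, median and 1st/9th deciles."""
    rng = np.random.default_rng(seed)
    # np.quantile (linear method) reads sorted positions floor(q * (count - 1))
    # and the one after it; pinning both to the target fixes that quantile.
    i1, i5, i9 = (int(q * (count - 1)) for q in (0.1, 0.5, 0.9))
    x = np.empty(count)
    x[:i1] = rng.uniform(d1 - tail, d1, i1)               # lower tail
    x[i1:i1 + 2] = d1                                     # pins the 1st decile
    x[i1 + 2:i5] = rng.uniform(d1, median, i5 - i1 - 2)
    x[i5:i5 + 2] = median                                 # pins the median
    x[i5 + 2:i9] = rng.uniform(median, d9, i9 - i5 - 2)
    x[i9:i9 + 2] = d9                                     # pins the 9th decile
    x[i9 + 2:] = rng.uniform(d9, d9 + tail, count - i9 - 2)  # upper tail
    # Stretch one tail away from its decile so the total equals mean * count.
    # Stretching never moves a value across a pinned quantile.
    deficit = mean * count - x.sum()
    if deficit >= 0:
        upper = x[i9 + 2:] - d9
        x[i9 + 2:] = d9 + upper * (1 + deficit / upper.sum())
    else:
        lower = d1 - x[:i1]
        x[:i1] = d1 - lower * (1 - deficit / lower.sum())
    rng.shuffle(x)
    return x

data = dataset_from_summary(5388, 4173, 4072, 2720, 5676)
print(data.mean(), np.quantile(data, [0.1, 0.5, 0.9]))
```

The mean matches only up to floating-point rounding, and the stretched tail can end up wider than `tail`, so check the minimum and maximum if the data has natural bounds.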

    You can add another variable to control the mean with two numbers, or change the random seed and see if you get closer values. Alternatively, you can write a function that changes the seed until the values are equal (this might take a couple of minutes or a couple of centuries :).
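
The seed search could look roughly like the sketch below. It is bounded rather than run-until-equal: it tries a fixed number of seeds and keeps the one whose mean is closest to the target. `sample`, `best_seed`, and `tries` are hypothetical names, and `sample` is a compact copy of the piecewise sampling from the function above with the same illustrative bounds, assuming an even `count`.

```python
import random
import statistics

def sample(seed, count=5388, median=4072, d1=2720, d9=5676,
           min_value=2101, max_value=6325, mean_adjuster=5790):
    """Piecewise-uniform integer sample, seeded independently per call."""
    rng = random.Random(seed)
    tail = int(count * 10 / 100)   # values below the 1st / above the 9th decile
    body = count // 2 - tail       # values between a decile and the median
    return (rng.choices(range(min_value, d1), k=tail)
            + rng.choices(range(d1, median), k=body)
            + rng.choices(range(median, mean_adjuster), k=body)
            + rng.choices(range(d9, max_value), k=tail))

def best_seed(target_mean, tries=50):
    """Return the seed in range(tries) whose sample mean is closest to the target."""
    return min(range(tries),
               key=lambda s: abs(statistics.fmean(sample(s)) - target_mean))

seed = best_seed(4173)
print(seed, statistics.fmean(sample(seed)))
```

Raising `tries` narrows the gap slowly. The search only scores the mean, so the deciles still need to be checked separately.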

    Cheers!