python pandas dataframe mean standard-deviation

Retrieving the average of averages in a pandas DataFrame


I have a large pandas DataFrame df:

year          count
1983          5
1983          4
1983          7
...
2009          8
2009          11
2009          30

I want to sample 10 data points per year, repeat that 100 times, and get the mean and standard deviation of count per year. The sign of each count value is assigned at random.


I can randomly sample 10 data points per year with:

import pandas as pd

# Sample 10 rows from each year, then combine the pieces once at the end.
frames = []
for year in df['year'].unique():
  frames.append(df[df['year'] == year].sample(n=10))
new_df = pd.concat(frames)
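
For what it's worth, the per-year sampling step can probably be done without an explicit loop. Below is a minimal sketch using `groupby(...).sample`, which needs pandas 1.1 or later. The synthetic `df` and the fixed `random_state` are assumptions for illustration only, not the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df (assumption: 50 rows per year, counts in 0..99).
rng = np.random.default_rng(0)
years = np.repeat(np.arange(1983, 2010), 50)
df = pd.DataFrame({"year": years, "count": rng.integers(0, 100, size=years.size)})

# Draw 10 rows from every year in a single call (requires pandas >= 1.1).
new_df = df.groupby("year").sample(n=10, random_state=0)

# Each year should now contribute exactly 10 rows.
print(new_df.groupby("year").size().head())
```

Drop `random_state` to get a fresh sample on every call.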

Then, I assign a sign to count randomly (so that by random chance the count could be positive or negative) and rename it to value, which can be done by:

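from random import randint

vlist = []

for i in range(len(new_df)):
  # Flip a coin for each row: keep the count as-is or negate it.
  if randint(0,1) == 0:
    vlist.append(new_df['count'].iloc[i])
  else:
    vlist.append(new_df['count'].iloc[i] * -1)

new_data = new_df.assign(value=vlist)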
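
The sign assignment can also be written without a Python-level loop. Here is a small sketch that draws one sign per row with NumPy and multiplies the whole column at once. The four-row frame is a made-up stand-in, and the names `rng` and `signs` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Small stand-in for the sampled frame (assumption: illustrative values only).
new_df = pd.DataFrame({"year": [1983, 1983, 2009, 2009], "count": [5, 4, 8, 11]})

# One random sign (+1 or -1) per row, applied to the whole column at once.
signs = rng.choice([-1, 1], size=len(new_df))
new_df["value"] = new_df["count"].to_numpy() * signs

print(new_df)
```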

Getting the mean and standard deviation for each year is then simple:

xdf = new_data.groupby("year").agg(["mean", "std"]).reset_index()

But I can't find an efficient way to repeat this sampling 100 times per year, store the mean values, and get the mean and standard deviation of those 100 means per year. I could use a for loop, but it would take too long to run.

Essentially, the output should be in the form of the following (the values are arbitrary here):

year      mean_of_100_means  total_sd
1983      4.22               0.43
1984      -6.39              1.25
1985      2.01               0.04
...
2007      11.92              3.38
2008      -5.27              1.67
2009      1.85               0.99

Any insights would be appreciated.


Solution

  • Try:

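    import numpy as np


    def fn(x):
        # x holds the counts for one year: draw 10 of them, 100 times, keeping each mean.
        _100_means = [x.sample(10).mean() for _ in range(100)]
        return {
            "mean_of_100_means": np.mean(_100_means),
            "total_sd": np.std(_100_means),
        }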
    
    
    print(df.groupby("year")["count"].apply(fn).unstack().reset_index())
    

    EDIT: Changed the computation of means.

    Prints:

        year  mean_of_100_means   total_sd
    0   1983             48.986   8.330787
    1   1984             48.479  10.384896
    2   1985             48.957   7.854900
    3   1986             50.821  10.303847
    4   1987             50.198   9.835832
    5   1988             47.497   8.678749
    6   1989             46.763   9.197387
    7   1990             49.696   8.837589
    8   1991             46.979   8.141969
    9   1992             48.555   8.603597
    10  1993             50.220   8.263946
    11  1994             48.735   9.954741
    12  1995             49.759   8.532844
    13  1996             49.832   8.998654
    14  1997             50.306   9.038316
    15  1998             49.513   9.024341
    16  1999             50.532   9.883166
    17  2000             49.195   9.177008
    18  2001             50.731   8.309244
    19  2002             48.792   9.680028
    20  2003             50.251   9.384759
    21  2004             50.522   9.269677
    22  2005             48.090   8.964458
    23  2006             49.529   8.250701
    24  2007             47.192   8.682196
    25  2008             50.124   9.337356
    26  2009             47.988   8.053438
    

    The dataframe was created with:

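    import numpy as np
    import pandas as pd

    # 1000 random counts in [0, 100) for each year from 1983 to 2009.
    data = []
    for y in range(1983, 2010):
        for i in np.random.randint(0, 100, size=1000):
            data.append({"year": y, "count": i})
    df = pd.DataFrame(data)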
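
The answer above samples the raw counts and does not apply the random sign described in the question. As a rough sketch of how that step might be folded in, the code below draws all 100 samples for a year at once with NumPy and flips signs before averaging. The helper name `sign_flipped_means`, its parameters, and the synthetic `df` are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd


def sign_flipped_means(values, n_draws, n_points, rng):
    """Return n_draws means, each over n_points distinct values with random signs."""
    # Sorting random keys row by row gives an independent shuffle per draw;
    # the first n_points positions are a sample without replacement.
    idx = rng.random((n_draws, values.size)).argsort(axis=1)[:, :n_points]
    signs = rng.choice([-1, 1], size=(n_draws, n_points))
    return (values[idx] * signs).mean(axis=1)


rng = np.random.default_rng(0)

# Synthetic stand-in for df (assumption: 200 rows per year, counts in 0..99).
years = np.repeat(np.arange(1983, 2010), 200)
df = pd.DataFrame({"year": years, "count": rng.integers(0, 100, size=years.size)})

rows = []
for year, counts in df.groupby("year")["count"]:
    means = sign_flipped_means(counts.to_numpy(), n_draws=100, n_points=10, rng=rng)
    rows.append({
        "year": year,
        "mean_of_100_means": means.mean(),
        # ddof=0 matches np.std; pass ddof=1 for the sample standard deviation.
        "total_sd": means.std(),
    })

result = pd.DataFrame(rows)
print(result.head())
```

Because each sign is equally likely to be positive or negative, `mean_of_100_means` should sit close to zero here, unlike the unsigned output printed above.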