python pandas dataframe mean standard-deviation

Retrieving the average of averages in a pandas DataFrame

I have a large pandas DataFrame df:

year          count
1983          5
1983          4
1983          7
...
2009          8
2009          11
2009          30

and I want to draw 10 data points per year, repeat that 100 times, and get the mean and standard deviation of count per year. The sign of each count value is assigned at random.

Randomly sampling 10 data points per year can be done with:

import pandas as pd

# Sample 10 rows from each year, then stack the pieces into one DataFrame.
samples = []
for year in df['year'].unique():
  samples.append(df[df['year'] == year].sample(n=10))
new_df = pd.concat(samples)
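
As a side note, the per-year loop can probably be collapsed into a single call. This is a minimal sketch, assuming pandas 1.1 or later (where `DataFrameGroupBy.sample` is available); the small df built here is a made-up stand-in for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real df: 3 years with 50 rows each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": np.repeat([1983, 1984, 1985], 50),
    "count": rng.integers(0, 100, size=150),
})

# Draw 10 rows from every year in one call (sampling without replacement).
new_df = df.groupby("year").sample(n=10)

# Each year should now contribute exactly 10 rows.
print(new_df.groupby("year").size())
```

Every year needs at least 10 rows for this to work, since the default is sampling without replacement.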

Then, I assign a sign to count randomly (so that by random chance the count could be positive or negative) and rename it to value, which can be done by:

from random import randint

vlist = []

# Use new_df['count'] here: new_df.count is the DataFrame.count method, not the column.
for i in range(len(new_df)):
  if randint(0, 1) == 0:
    vlist.append(new_df['count'].iloc[i])
  else:
    vlist.append(new_df['count'].iloc[i] * -1)

new_df['value'] = vlist

Getting the mean and standard deviation for each year is then simple:

xdf = new_df.groupby("year")["value"].agg(["mean", "std"]).reset_index()
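
The sign assignment and the per-year aggregation can also be written without a row-by-row loop. This is a sketch under assumptions: the small new_df is a made-up stand-in for the sampled frame, and NumPy's `Generator.choice` is used in place of `randint`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the sampled frame: 10 rows for each of 2 years.
new_df = pd.DataFrame({
    "year": np.repeat([1983, 1984], 10),
    "count": rng.integers(1, 50, size=20),
})

# One random sign (+1 or -1) per row, applied in a single multiplication.
signs = rng.choice([-1, 1], size=len(new_df))
new_df["value"] = new_df["count"] * signs

# Per-year mean and sample standard deviation (pandas "std" uses ddof=1).
xdf = new_df.groupby("year")["value"].agg(["mean", "std"]).reset_index()
print(xdf)
```

The column is accessed as `new_df["count"]` because attribute access (`new_df.count`) resolves to the `DataFrame.count` method.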

But I can't find an efficient way to repeat this sampling 100 times per year, store the mean of each run, and get the mean and standard deviation of those 100 means per year. I could use a for loop, but it would take too long to run.

Essentially, the output should be in the form of the following (the values are arbitrary here):

year      mean_of_100_means  total_sd
1983      4.22               0.43
1984      -6.39              1.25
1985      2.01               0.04
...
2007      11.92              3.38
2008      -5.27              1.67
2009      1.85               0.99

Any insights would be appreciated.

Solution

Try:

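import numpy as np

def fn(x):
    # Draw 10 values from this year's counts 100 times and keep each sample mean.
    _100_means = [x.sample(10).mean() for _ in range(100)]
    return {
        "mean_of_100_means": np.mean(_100_means),
        "total_sd": np.std(_100_means),  # population SD (ddof=0)
    }


# df is the DataFrame from the question; a test version is built further down.
print(df.groupby("year")["count"].apply(fn).unstack().reset_index())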

EDIT: Changed the computation of means.

Prints:

    year  mean_of_100_means   total_sd
0   1983             48.986   8.330787
1   1984             48.479  10.384896
2   1985             48.957   7.854900
3   1986             50.821  10.303847
4   1987             50.198   9.835832
5   1988             47.497   8.678749
6   1989             46.763   9.197387
7   1990             49.696   8.837589
8   1991             46.979   8.141969
9   1992             48.555   8.603597
10  1993             50.220   8.263946
11  1994             48.735   9.954741
12  1995             49.759   8.532844
13  1996             49.832   8.998654
14  1997             50.306   9.038316
15  1998             49.513   9.024341
16  1999             50.532   9.883166
17  2000             49.195   9.177008
18  2001             50.731   8.309244
19  2002             48.792   9.680028
20  2003             50.251   9.384759
21  2004             50.522   9.269677
22  2005             48.090   8.964458
23  2006             49.529   8.250701
24  2007             47.192   8.682196
25  2008             50.124   9.337356
26  2009             47.988   8.053438

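The test DataFrame was created with:

import numpy as np
import pandas as pd

# 1000 random counts in [0, 100) for each year from 1983 to 2009.
data = []
for y in range(1983, 2010):
    for i in np.random.randint(0, 100, size=1000):
        data.append({"year": y, "count": i})
df = pd.DataFrame(data)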
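
Note that fn averages the raw count values and does not apply the random sign described in the question. Below is a sketch of one way to include it, assuming every sampled value gets an independent sign with probability 1/2. The helper name `signed_mean_stats` and its parameters `n_samples` and `n_draws` are illustrative and not part of the original answer, and the frame is a smaller synthetic one in the same shape as the test data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def signed_mean_stats(counts, n_samples=10, n_draws=100):
    """Return (mean, SD) of n_draws sample means, each with random signs."""
    values = counts.to_numpy()
    means = np.empty(n_draws)
    for k in range(n_draws):
        # Sample without replacement, then flip each sign with probability 1/2.
        sample = rng.choice(values, size=n_samples, replace=False)
        signs = rng.choice([-1, 1], size=n_samples)
        means[k] = (sample * signs).mean()
    # ndarray.std defaults to ddof=0, matching np.std in the answer above.
    return means.mean(), means.std()

# Small synthetic frame: 200 random counts for each of 3 years.
df = pd.DataFrame({
    "year": np.repeat(np.arange(1983, 1986), 200),
    "count": rng.integers(0, 100, size=600),
})

# Build one output row per year.
rows = []
for year, counts in df.groupby("year")["count"]:
    mean, sd = signed_mean_stats(counts)
    rows.append({"year": year, "mean_of_100_means": mean, "total_sd": sd})

result = pd.DataFrame(rows)
print(result)
```

With symmetric random signs the means should scatter around zero rather than around the average count, so the output will look quite different from the table printed above.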