I'm trying to calculate the standard deviation of the 'Cost' column at each row for each unique Client ID. Each row adds a new value, which determines the new standard deviation. I'm looking to generate the 'Standard Deviation' column shown below:
Client ID  Session  Cost  Standard Deviation
        1        0    10                 NaN
        1        1    11              0.5000
        1        2    14              1.6997
        2        0    15                 NaN
        2        1    16              0.5000
        2        2    14              0.8165
        2        3    22              3.1122
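For reference, here is a reproducible version of the example data:

```python
import pandas as pd

# Reproducible version of the example data from the question
df = pd.DataFrame({
    "Client ID": [1, 1, 1, 2, 2, 2, 2],
    "Session": [0, 1, 2, 0, 1, 2, 3],
    "Cost": [10, 11, 14, 15, 16, 14, 22],
})
```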
This would do the trick:
df["Standard Deviation"] = df.groupby("Client ID")["Cost"].expanding(2).std(ddof=0).reset_index()["Cost"]
   Client ID  Session  Cost  Standard Deviation
0          1        0    10                 NaN
1          1        1    11            0.500000
2          1        2    14            1.699673
3          2        0    15                 NaN
4          2        1    16            0.500000
5          2        2    14            0.816497
6          2        3    22            3.112475
You can rephrase your problem as:
Finding the cumulative standard deviation of the "Cost" column grouped by the "Client ID" column.
Pandas conveniently has built-in functions that handle both cumulative and group by computations.
Group By
A group by to compute the standard deviation looks like this:
df.groupby("Client ID")["Cost"].std()
Client ID
1 2.081666
2 3.593976
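As a sanity check, the per-group value for client 1 matches the sample standard deviation computed by hand (the example data is re-created here so the snippet is self-contained):

```python
import math
import pandas as pd

df = pd.DataFrame({
    "Client ID": [1, 1, 1, 2, 2, 2, 2],
    "Cost": [10, 11, 14, 15, 16, 14, 22],
})

# Per-group standard deviation; .std() defaults to ddof=1 (sample std)
per_group = df.groupby("Client ID")["Cost"].std()

# Manual sample standard deviation for client 1's costs: 10, 11, 14
costs = [10, 11, 14]
mean = sum(costs) / len(costs)
sample_std = math.sqrt(sum((c - mean) ** 2 for c in costs) / (len(costs) - 1))

print(round(per_group[1], 6))  # 2.081666
print(round(sample_std, 6))    # 2.081666
```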
Cumulative
The cumulative standard deviation can be computed like this. Note that we use ddof=0 to get the standard deviation of the population as a whole, which is what we want here. We also use min_periods=2, otherwise the first row would have a value of 0.0 instead of NaN:
df.expanding(min_periods=2)["Cost"].std(ddof=0)
0         NaN
1    0.500000
2    1.699673
3    2.061553
4    2.315167
5    2.134375
6    3.619674
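To see why min_periods=2 matters: with expanding's default min_periods=1, the population standard deviation of a single observation is defined (and equal to 0.0), so the first row would be 0.0 rather than NaN. A quick sketch:

```python
import pandas as pd

costs = pd.Series([10, 11, 14, 15, 16, 14, 22], name="Cost")

# Default min_periods=1: a single value has a population std of 0.0
default_first = costs.expanding().std(ddof=0).iloc[0]

# min_periods=2: not enough observations yet, so we get NaN
guarded_first = costs.expanding(min_periods=2).std(ddof=0).iloc[0]

print(default_first, guarded_first)  # 0.0 nan
```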
Group By + Cumulative
Combining the two, we get our result (note, we reset the index to drop the group-by indexing so the result aligns with the original default index):
df.groupby("Client ID")["Cost"].expanding(2).std(ddof=0).reset_index()["Cost"]
0         NaN
1    0.500000
2    1.699673
3         NaN
4    0.500000
5    0.816497
6    3.112475
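Putting it all together as a self-contained sketch (the example data is re-created here). As an alternative to the reset_index() approach above, droplevel(0) strips only the group level from the MultiIndex and keeps the original row index, which is safer when the DataFrame's index is not a plain 0..n range:

```python
import pandas as pd

df = pd.DataFrame({
    "Client ID": [1, 1, 1, 2, 2, 2, 2],
    "Session": [0, 1, 2, 0, 1, 2, 3],
    "Cost": [10, 11, 14, 15, 16, 14, 22],
})

# Group by client, then take the expanding (cumulative) population std.
# The result carries a (Client ID, original index) MultiIndex; dropping
# the group level aligns it back with df's own index for assignment.
df["Standard Deviation"] = (
    df.groupby("Client ID")["Cost"]
      .expanding(min_periods=2)
      .std(ddof=0)
      .droplevel(0)
)
print(df)
```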