How do you recalculate Standard Deviation at each row in a Dataframe?


I'm trying to calculate the standard deviation of the 'Cost' column at each row for each unique Client ID. Each new row adds a value, which changes the standard deviation. I'm looking to generate the 'Standard Deviation' column shown below.

Client ID    Session  Cost    Standard Deviation
1            0        10      NaN
1            1        11      0.5000
1            2        14      1.6997
2            0        15      NaN
2            1        16      0.5000
2            2        14      0.8165
2            3        22      3.1122

Solution

  • This would do the trick:

    df["Standard Deviation"] = df.groupby("Client ID")["Cost"].expanding(2).std(ddof=0).reset_index()["Cost"]
    
       Client ID  Session  Cost  Standard Deviation
    0          1        0    10                 NaN
    1          1        1    11            0.500000
    2          1        2    14            1.699673
    3          2        0    15                 NaN
    4          2        1    16            0.500000
    5          2        2    14            0.816497
    6          2        3    22            3.112475
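
    To try the one-liner end to end, here is a minimal, self-contained sketch. It assumes the DataFrame is built by hand from the table in the question, sorted by "Client ID" and using the default 0..n-1 index; the line breaks are only for readability.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "Client ID": [1, 1, 1, 2, 2, 2, 2],
    "Session": [0, 1, 2, 0, 1, 2, 3],
    "Cost": [10, 11, 14, 15, 16, 14, 22],
})

# Expanding population std (ddof=0) within each client.
# min_periods=2 leaves the first row of every client as NaN.
df["Standard Deviation"] = (
    df.groupby("Client ID")["Cost"]
    .expanding(min_periods=2)
    .std(ddof=0)
    .reset_index()["Cost"]
)

print(df)
```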
    

    Explanation

    You can rephrase your problem as:

    Finding the cumulative standard deviation of the "Cost" column grouped by the "Client ID" column.

    Pandas conveniently has built-in functions that handle both cumulative and group by computations.

    Group By

    A group by to compute the standard deviation looks like this:

    df.groupby("Client ID")["Cost"].std()
    
    Client ID
    1    2.081666
    2    3.593976
    

    Cumulative

    The cumulative standard deviation can be computed like this. Note that we pass ddof=0 to get the population standard deviation, which is what the expected output in the question uses (pandas defaults to ddof=1, the sample standard deviation). We also use min_periods=2; otherwise the first row would have a value of 0.0 instead of NaN:

    df.expanding(min_periods=2)["Cost"].std(ddof=0)
    
    0         NaN
    1    0.500000
    2    1.699673
    3    2.061553
    4    2.315167
    5    2.134375
    6    3.619675
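
    The effect of the two parameters is easier to see side by side. The sketch below is only an illustration: the `cost` series and the column labels of `comparison` are names made up for this example, and the data is the ungrouped "Cost" column from the question.

```python
import pandas as pd

# The "Cost" column from the question, ignoring client boundaries.
cost = pd.Series([10, 11, 14, 15, 16, 14, 22], name="Cost")

comparison = pd.DataFrame({
    # pandas default: sample standard deviation (divides by n - 1).
    "sample (ddof=1)": cost.expanding(min_periods=2).std(),
    # Population standard deviation (divides by n).
    "population (ddof=0)": cost.expanding(min_periods=2).std(ddof=0),
    # With min_periods=1 the first row becomes 0.0 instead of NaN.
    "population, min_periods=1": cost.expanding(min_periods=1).std(ddof=0),
})

print(comparison)
```

    At the second row, the sample version gives 0.707107 while the population version gives 0.5, which is the value the question expects.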
    

    Group By + Cumulative

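    Combining the two, we get our result. The grouped expanding window returns a Series indexed by (Client ID, original row label), so we call reset_index() to replace that with a fresh 0..n-1 index. This lines up with the original rows here because the DataFrame is already sorted by Client ID and uses the default index: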

    df.groupby("Client ID")["Cost"].expanding(2).std(ddof=0).reset_index()["Cost"]
    
    0         NaN
    1    0.500000
    2    1.699673
    3         NaN
    4    0.500000
    5    0.816497
    6    3.112475
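
    If the rows are not sorted by "Client ID", or the DataFrame has a custom index, a fresh 0..n-1 index no longer matches the original rows. One possible variant, sketched here under the assumption of shuffled rows and made-up index labels "a" to "g", drops only the group level so that the assignment aligns on the original labels:

```python
import numpy as np
import pandas as pd

# Same costs as the question, but interleaved and with a non-default index
# (the labels "a".."g" are hypothetical, chosen only for this example).
df = pd.DataFrame(
    {
        "Client ID": [2, 1, 2, 1, 2, 1, 2],
        "Session": [0, 0, 1, 1, 2, 2, 3],
        "Cost": [15, 10, 16, 11, 14, 14, 22],
    },
    index=list("abcdefg"),
)

df["Standard Deviation"] = (
    df.groupby("Client ID")["Cost"]
    .expanding(min_periods=2)
    .std(ddof=0)
    .reset_index(level=0, drop=True)  # drop only the "Client ID" level
)

# Cross-check one value by hand: client 2's first three costs (np.std uses ddof=0).
assert np.isclose(df.loc["e", "Standard Deviation"], np.std([15, 16, 14]))

print(df)
```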