Search code examples
pythonpandascumulative-sumweighted-average

Cumulative sum of a pandas column until a maximum value is met, and average adjacent rows


I'm a biology student who is fairly new to python and was hoping someone might be able to help with a problem I have yet to solve

With some subsequent code I have created a pandas dataframe that looks like the example below:

Distance.     No. of values        Mean rSquared
    1                   500                  0.6
    2                    80                  0.3
    3                    40                  0.4
    4                    30                  0.2
    5                    50                  0.2
    6                    30                  0.1

I can provide my previous code to create this dataframe, but I didn't think it was particularly relevant.

I need to sum the number of values column until I achieve a value >= 100; and then combine the data of the rows of the adjacent columns, taking the weighted average of the distance and mean r2 values, as seen in the example below

Mean Distance.             No. Of values             Mean rSquared
1                          500                       0.6
(80*2+40*3)/120            (80+40) = 120             (80*0.3+40*0.4)/120
(30*4+50*5+30*6)/110       (30+50+30) = 110          (30*0.2+50*0.2+30*0.1)/110

etc...

I know pandas has it's .cumsum function, which I might be able to implement into a for loop with an if statement that checks the upper limit and resets the sum back to 0 when it is greater than or equal to the upper limit. However, I haven't a clue how to average the adjacent columns.

Any help would be appreciated!


Solution

  • You can use this code snippet to solve your problem.

    # First, compute some weighted values
    df.loc[:, "weighted_distance"] = df["Distance"] * df["No. of values"]
    df.loc[:, "weighted_mean_rSquared"] = df["Mean rSquared"] * df["No. of values"]
    
    
    min_threshold = 100
    indexes = []
    temp_sum = 0
    
    # placeholder for final result
    final_df = pd.DataFrame()
    columns = ["Distance", "No. of values", "Mean rSquared"]
    
    # reseting index to make the 'df' usable in following output
    df = df.reset_index(drop=True)
    
    # main loop to check and compute the desired output
    for index, _ in df.iterrows():
        temp_sum += df.iloc[index]["No. of values"]
        indexes.append(index)
    
        # if the sum exceeds 'min_threshold' then do some computation
        if temp_sum >= min_threshold:
            temp_distance = df.iloc[indexes]["weighted_distance"].sum() / temp_sum
            temp_mean_rSquared = df.iloc[indexes]["weighted_mean_rSquared"].sum() / temp_sum
        
            # create temporary dataframe and concatenate with the 'final_df'
            temp_df = pd.DataFrame([[temp_distance, temp_sum, temp_mean_rSquared]], columns=columns)
            final_df = pd.concat([final_df, temp_df])
        
            # reset the variables
            temp_sum = 0
            indexes = []