python pandas

Sum of columns based on range of values of other columns in a Pandas dataframe


This is my dataframe:

import pandas as pd

df = pd.DataFrame({'sym': ['a', 'b'], 'vol_1': [100, 50], 'price_1': [5, 150], 'vol_2': [1500, 2000], 'price_2': [20, 175],
                   'vol_3': [123, 500], 'price_3': [22, 1000], 'min': [18, 150], 'max': [23, 176]})

I want to add a column that sums vol_1, vol_2, and vol_3 for each row, counting a vol only if the price in the column next to it falls within the range given by the min and max columns (bounds included). For example, in the first row I want vol_2 and vol_3, because their prices are within min and max. My desired outcome looks like this:

  sym  vol_1  price_1  vol_2  price_2  vol_3  price_3  min  max  vol_sum
0   a    100        5   1500       20    123       22   18   23     1623
1   b     50      150   2000      175    500     1000  150  176     2050

Solution

  • Reshape the data to long form so that vol, price, min and max each have a single column. Then keep only the rows where price is between min and max, group by the sym column, sum vol, and assign the result to df.

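    # reshape to long form: one row per (sym, number) with vol and price columns
    df["vol_sum"] = (pd.wide_to_long(df,
                                     stubnames=["vol", "price"],
                                     i=["sym", "min", "max"],
                                     j="number",
                                     sep="_")
                      # keep rows whose price lies in [min, max]
                      .query("min <= price <= max", engine="python")
                      .groupby("sym")
                      .vol
                      .sum()
                      # .array assigns by position: this assumes df is sorted by sym
                      # and every sym has at least one price in range
                      .array
                     )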
    
    
    
      sym  vol_1  price_1  vol_2  price_2  vol_3  price_3  min  max  vol_sum
    0   a    100        5   1500       20    123       22   18   23     1623
    1   b     50      150   2000      175    500     1000  150  176     2050
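
The positional assignment with .array relies on every sym having at least one in-range price and on df already being sorted by sym. Below is a minimal sketch of a variant that aligns on sym instead, assuming sym values are unique. The third row ('c', with no price in range) and the names long_df, in_range and totals are made up for illustration and are not part of the original answer:

```python
import pandas as pd

# Same data as the question, plus a hypothetical row 'c' with no price in range
df = pd.DataFrame({
    "sym": ["a", "b", "c"],
    "vol_1": [100, 50, 10], "price_1": [5, 150, 1],
    "vol_2": [1500, 2000, 20], "price_2": [20, 175, 2],
    "vol_3": [123, 500, 30], "price_3": [22, 1000, 3],
    "min": [18, 150, 5], "max": [23, 176, 6],
})

# Long form: one row per (sym, number), with plain vol and price columns
long_df = pd.wide_to_long(
    df, stubnames=["vol", "price"], i=["sym", "min", "max"], j="number", sep="_"
).reset_index()

# Keep rows whose price lies in [min, max] (both bounds inclusive)
in_range = long_df[long_df["price"].between(long_df["min"], long_df["max"])]

# Sum vol per sym, then align back on sym; syms with no match get 0
totals = in_range.groupby("sym")["vol"].sum()
df["vol_sum"] = df["sym"].map(totals).fillna(0).astype(int)

print(df[["sym", "vol_sum"]])
```

Mapping on sym keeps the result aligned even when a group disappears after filtering, at the cost of one extra lookup per row.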
    

    Update (13 October 2021):

    If you would rather not convert to long form, a comprehension can help, though I am not sure about the speed:

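    # all vol columns
    vols = [col for col in df if col.startswith('vol')]

    # for each price column, a boolean Series keyed by the matching vol column name
    # (col[-2:] assumes a single-digit suffix such as '_1')
    bools = {f"vol{col[-2:]}" : df[col].between(df['min'], df['max'])
             for col in df
             if col.startswith('price')}

    # multiplying by False zeroes out the out-of-range volumes before summing
    df.assign(vol_sum = df[vols].mul(bools).sum(1))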
    
      sym  vol_1  price_1  vol_2  price_2  vol_3  price_3  min  max  vol_sum
    0   a    100        5   1500       20    123       22   18   23     1623
    1   b     50      150   2000      175    500     1000  150  176     2050
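
The same idea can be written as a self-contained sketch that masks each vol column individually rather than multiplying by a dict of booleans. This is an illustrative variant, not the snippet above; price_cols, masked and result are names chosen here, and it assumes every price_N column has a matching vol_N column:

```python
import pandas as pd

df = pd.DataFrame({
    "sym": ["a", "b"],
    "vol_1": [100, 50], "price_1": [5, 150],
    "vol_2": [1500, 2000], "price_2": [20, 175],
    "vol_3": [123, 500], "price_3": [22, 1000],
    "min": [18, 150], "max": [23, 176],
})

price_cols = [col for col in df.columns if col.startswith("price_")]

# For each price column, keep the paired vol where the price is in [min, max],
# and use 0 elsewhere
masked = [
    df[col.replace("price_", "vol_")].where(df[col].between(df["min"], df["max"]), 0)
    for col in price_cols
]

# Built-in sum adds the masked Series element-wise (starting from 0)
result = df.assign(vol_sum=sum(masked))
print(result)
```

Because out-of-range values are replaced with 0 rather than NaN, the summed column keeps its integer dtype.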
    

    Another option, using MultiIndexes:

    # select only the vol and price columns
    temp = df.filter(like='_')
    # create a MultiIndex on the columns
    temp.columns = temp.columns.str.split('_', expand=True)
    # conditions where price is between min and max
    cond1 = temp.price.le(df['max'], axis=0)
    cond2 = temp.price.ge(df['min'], axis=0)

    # assign the new column based on cond1 and cond2
    df.assign(vol_sum=temp.vol.where(cond1 & cond2).sum(1))

      sym  vol_1  price_1  vol_2  price_2  vol_3  price_3  min  max  vol_sum
    0   a    100        5   1500       20    123       22   18   23   1623.0
    1   b     50      150   2000      175    500     1000  150  176   2050.0
    

    This should be faster than pd.wide_to_long as the data grows, because the number of rows stays the same (pd.wide_to_long, by contrast, produces one row per original row for each vol/price pair).
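
To make the row-count point concrete, here is a small sketch comparing the shapes of the two intermediate frames on the question's data. It only counts rows and does not measure speed; long_df is a name introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "sym": ["a", "b"],
    "vol_1": [100, 50], "price_1": [5, 150],
    "vol_2": [1500, 2000], "price_2": [20, 175],
    "vol_3": [123, 500], "price_3": [22, 1000],
    "min": [18, 150], "max": [23, 176],
})

# Long form: one row per original row and per vol/price pair
long_df = pd.wide_to_long(
    df, stubnames=["vol", "price"], i=["sym", "min", "max"], j="number", sep="_"
)

# MultiIndex columns: same number of rows as df, columns split into (name, number)
temp = df.filter(like="_")
temp.columns = temp.columns.str.split("_", expand=True)

print(len(df), len(long_df), len(temp))  # 2 6 2
```

With three vol/price pairs the long frame has three times as many rows as df, while the MultiIndex frame keeps the original row count.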