Search code examples
pythonpandasdataframenumpypandas-groupby

Find positive and negative bin limits based on multiple other columns


I have a dataframe like as shown below

   ID   raw_val  var_name   constant   s_value   
   1    388       Qty        0.36       -0.032   
   2    120       Qty        0.36       -0.007
   3    34        Qty        0.36       0.16
   4    45        Qty        0.36       0.31
   1    110       F1         0.36       -0.232   
   2    1000      F1         0.36       -0.17
   3    318       F1         0.36       0.26
   4    419       F1         0.36       0.31

My objective is to

a) Find the upper and lower limits (of raw_val) for each value of var_name for s_value >=0

b) Find the upper and lower limits (of raw_val) for each value of var_name for s_value <0

I tried the below

df['sign'] = np.where[df['s_value']<0, 'neg', 'pos']
s = df.groupby(['var_name','sign'])['raw_val'].series
df['buckets'] = pd.IntervalIndex.from_arrays(s)

Please note that my real data is big data and has more than 200 unique values for var_name column. The distribution of positive and negative values (s_value) may be uneven for each value of the var_name columns. In sample df, I have shown even distribution of pos and neg values but it may not be the case in real life.

I expect my output to be like as below

var_name sign  low_limit   upp_limit
 Qty     neg     120          388
 F1      neg     110          1000
 Qty     pos     34            45
 Qty     pos     318          419

Solution

  • I think numpy.where with aggregate minimal and maximal values is way:

    df['sign'] = np.where(df['s_value']<0, 'neg', 'pos')
    df1 = (df.groupby(['var_name','sign'], sort=False, as_index=False)
             .agg(low_limit=('raw_val','min'), upp_limit=('raw_val','max')))
    print (df1)
      var_name sign  low_limit  upp_limit
    0      Qty  neg        120        388
    1      Qty  pos         34         45
    2       F1  neg        110       1000
    3       F1  pos        318        419