python-3.xpandasdataframesum

Sum columns in dynamic range which are from another two columns


I am working on making a dynamic table by adding columns in same row, however, range of columns is determined based on two columns colum_start and colum_stop:

df = pd.DataFrame({
   'Name': ['Apple', 'Banana', 'Orange', 'Cherry', 'Egg', 'Cheese'],
   'ID': ['F1', 'F1', 'F1', 'F1', 'V1', 'V2'],
  '202101': [1, 10, 20, 30, 40, 50],
  '202102': [20, 15, 12, 18, 32, 12],
  '202103': [3, 11, 25, 32, 13, 4],
  '202104': [32, 11, 9, 82, 2, 1],
  '202105': [9, 5, 11, 11, 2, 5],
  'colum_start ': [202102, 202101, 202102, 202103, 202101, 202103],
  'colum_stop': [202105, 202103, 202105, 202104, 202102, 202105],
})
df

enter image description here

I want to create a sum_column that sums the columns based on colum_start and colum_stop in column name.

desired output:

enter image description here

Appriciate your help!!


Solution

  • Use numpy broadcasting for select values between both columns by compare by columns, and chained cumulative sums compared for 1:

    a = df['colum_start'].astype(str).to_numpy()[:, None]
    b = df['colum_stop'].astype(str).to_numpy()[:, None]
    c = df.columns.to_numpy()
    
    mask = (np.cumsum(c == a, axis=1) == 1) & (np.cumsum(c[::-1] == b, axis=1)[:, ::-1] == 1)
    
    df['SUM_OF_RANGE'] = df.where(mask, 0).sum(axis=1)
    print (df)
    
         Name  ID  202101  202102  202103  202104  202105  colum_start  \
    0   Apple  F1       1      20       3      32       9       202102   
    1  Banana  F1      10      15      11      11       5       202101   
    2  Orange  F1      20      12      25       9      11       202102   
    3  Cherry  F1      30      18      32      82      11       202103   
    4     Egg  V1      40      32      13       2       2       202101   
    5  Cheese  V2      50      12       4       1       5       202103   
    
       colum_stop SUM_OF_RANGE  
    0      202105           64  
    1      202103           36  
    2      202105           57  
    3      202104          114  
    4      202102           72  
    5      202105           10  
        
    

    If small data and perfromance is not important use lsit comprehension with DataFrame.loc:

    c = df.columns.to_numpy()
    
    df['SUM_OF_RANGE'] = [df.loc[i, str(a):str(b)].sum() 
                          for i, a, b in zip(df.index, df['colum_start'], df['colum_stop'])]
    print (df)
         Name  ID  202101  202102  202103  202104  202105  colum_start  \
    0   Apple  F1       1      20       3      32       9       202102   
    1  Banana  F1      10      15      11      11       5       202101   
    2  Orange  F1      20      12      25       9      11       202102   
    3  Cherry  F1      30      18      32      82      11       202103   
    4     Egg  V1      40      32      13       2       2       202101   
    5  Cheese  V2      50      12       4       1       5       202103   
    
       colum_stop  SUM_OF_RANGE  
    0      202105            64  
    1      202103            36  
    2      202105            57  
    3      202104           114  
    4      202102            72  
    5      202105            10