Search code examples
pythonpandasindexingsum

How can I Sum specific columns in a dataframe based on index value provided in a different df. Return the sum and Boolean T/F


I have two pandas dataframes. df_lst contains a list of column names and expected value, and df has a series of data.

The column names in df_lst may change and I use the following script to look up the column index from df that aligns to the column name in df_lst Showing this code incase it is an extra step that might not be needed.

ind_dict = dict((k,i) for i,k in enumerate(d.columns))
inter = set(df_lst['Col_Name']).intersection(df)
df_lst['Index'] = [ ind_dict[x] for x in inter ]

The input for this task would look like this:

import random
import numpy as np
import pandas as pd

a = np.random.randint(12, size=(7, 11))
df = pd.DataFrame(a, ['foo','foo','bar', 'bar', 'bar', 'foo', 'foo'], ['a','b','f','g','h','j' ,'k', 'r', 's', 't', 'z'])

df_lst = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j', 'Col_r', 'Col_s'], 
                   'Expected Value': [100, 90, 122, 111],                                      
                   'Index': [4, 6, 8, 9]})

How can I use the new Index values to look at that corresponding column in df and sum the values and return both the summed value and a 'True' if greater than or 'False' if less than for each row in df_lst

df_out = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j', 'Col_r', 'Col_s'], 
                   'Expected Value': [100, 90, 122, 111],                                      
                   'Index': [4, 6, 8, 9],
                   'Sum of Col': ['sum of col_g', 'sum of col_j', 'sum of col_r', 'sum of col_s'],
                   'Bool': ['True or False', 'True or False', 'True or False', 'True or False']
                   })

Eventually this True/False data will be part of a while loop that checks something like "while 1 or more is false do X"


Solution

  • We can select the values from df using the values from df_lst['Index'] and iloc we will need to subtract 1 to convert from 1 based indexing to 0 based. Then sum the column and join back to the DataFrame. We can then compute the Bool column based on the new Sum of Col values:

    df_out = df_lst.join(
        df.iloc[:, df_lst['Index'] - 1].sum()
            .add_prefix('Col_')
            .rename('Sum of Col'),
        on='Col_Name'
    )
    
    df_out['Bool'] = df_out['Sum of Col'] > df_out['Expected Value']
    

    df_out:

      Col_Name  Expected Value  Index  Sum of Col   Bool
    0    Col_g             100      4         106   True
    1    Col_j              90      6          97   True
    2    Col_r             122      8          95  False
    3    Col_s             111      9         113   True
    

    Steps:

    Select with iloc notice that integer indexes start at 0 so the g column is at index 3 not 4:

    df.iloc[:, df_lst['Index'] - 1]
    
          g   j   r   s
    foo   0   7  14  16
    foo  23  13  12  12
    bar   5  13   3  16
    bar  17  13  24  16
    bar  24  14  11  23
    foo  17  19  24  17
    foo  20  18   7  13
    

    Sum Columns with sum:

    df.iloc[:, df_lst['Index'] - 1].sum()
    Out[3]: 
    g    106
    j     97
    r     95
    s    113
    dtype: int64
    

    add_prefix so columns match Col_Name column and rename the Series so that the new column has the correct name:

    df.iloc[:, df_lst['Index'] - 1].sum().add_prefix('Col_').rename('Sum of Col')
    
    Col_g    106
    Col_j     97
    Col_r     95
    Col_s    113
    Name: Sum of Col, dtype: int64
    

    join together with df_lst:

    df_lst.join(
        df.iloc[:, df_lst['Index'] - 1].sum()
            .add_prefix('Col_')
            .rename('Sum of Col'),
        on='Col_Name'
    )
    
      Col_Name  Expected Value  Index  Sum of Col
    0    Col_g             100      4         106
    1    Col_j              90      6          97
    2    Col_r             122      8          95
    3    Col_s             111      9         113
    

    Do whatever comparisons are needed and add any additional columns:

    df_out['Bool'] = df_out['Sum of Col'] > df_out['Expected Value']
    
      Col_Name  Expected Value  Index  Sum of Col   Bool
    0    Col_g             100      4         106   True
    1    Col_j              90      6          97   True
    2    Col_r             122      8          95  False
    3    Col_s             111      9         113   True
    

    Reproducible Setup:

    import pandas as pd
    from numpy.random import Generator, MT19937
    
    rng = Generator(MT19937(25))
    a = rng.integers(25, size=(7, 11))
    df = pd.DataFrame(a, ['foo', 'foo', 'bar', 'bar', 'bar', 'foo', 'foo'],
                      ['a', 'b', 'f', 'g', 'h', 'j', 'k', 'r', 's', 't', 'z'])
    
    df_lst = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j', 'Col_r', 'Col_s'],
                           'Expected Value': [100, 90, 122, 111],
                           'Index': [4, 6, 8, 9]})