I have two pandas dataframes. df_lst contains a list of column names and expected value, and df has a series of data.
The column names in df_lst may change and I use the following script to look up the column index from df that aligns to the column name in df_lst Showing this code incase it is an extra step that might not be needed.
ind_dict = dict((k,i) for i,k in enumerate(d.columns))
inter = set(df_lst['Col_Name']).intersection(df)
df_lst['Index'] = [ ind_dict[x] for x in inter ]
The input for this task would look like this:
import random
import numpy as np
import pandas as pd
a = np.random.randint(12, size=(7, 11))
df = pd.DataFrame(a, ['foo','foo','bar', 'bar', 'bar', 'foo', 'foo'], ['a','b','f','g','h','j' ,'k', 'r', 's', 't', 'z'])
df_lst = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j', 'Col_r', 'Col_s'],
'Expected Value': [100, 90, 122, 111],
'Index': [4, 6, 8, 9]})
How can I use the new Index values to look at that corresponding column in df and sum the values and return both the summed value and a 'True' if greater than or 'False' if less than for each row in df_lst
df_out = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j', 'Col_r', 'Col_s'],
'Expected Value': [100, 90, 122, 111],
'Index': [4, 6, 8, 9],
'Sum of Col': ['sum of col_g', 'sum of col_j', 'sum of col_r', 'sum of col_s'],
'Bool': ['True or False', 'True or False', 'True or False', 'True or False']
})
Eventually this True/False data will be part of a while loop that checks something like "while 1 or more is false do X"
We can select the values from df
using the values from df_lst['Index']
and iloc
we will need to subtract 1 to convert from 1 based indexing to 0 based. Then sum
the column and join
back to the DataFrame. We can then compute the Bool
column based on the new Sum of Col
values:
df_out = df_lst.join(
df.iloc[:, df_lst['Index'] - 1].sum()
.add_prefix('Col_')
.rename('Sum of Col'),
on='Col_Name'
)
df_out['Bool'] = df_out['Sum of Col'] > df_out['Expected Value']
df_out
:
Col_Name Expected Value Index Sum of Col Bool
0 Col_g 100 4 106 True
1 Col_j 90 6 97 True
2 Col_r 122 8 95 False
3 Col_s 111 9 113 True
Steps:
Select with iloc
notice that integer indexes start at 0 so the g
column is at index 3 not 4:
df.iloc[:, df_lst['Index'] - 1]
g j r s
foo 0 7 14 16
foo 23 13 12 12
bar 5 13 3 16
bar 17 13 24 16
bar 24 14 11 23
foo 17 19 24 17
foo 20 18 7 13
Sum Columns with sum
:
df.iloc[:, df_lst['Index'] - 1].sum()
Out[3]:
g 106
j 97
r 95
s 113
dtype: int64
add_prefix
so columns match Col_Name
column and rename
the Series so that the new column has the correct name:
df.iloc[:, df_lst['Index'] - 1].sum().add_prefix('Col_').rename('Sum of Col')
Col_g 106
Col_j 97
Col_r 95
Col_s 113
Name: Sum of Col, dtype: int64
join
together with df_lst
:
df_lst.join(
df.iloc[:, df_lst['Index'] - 1].sum()
.add_prefix('Col_')
.rename('Sum of Col'),
on='Col_Name'
)
Col_Name Expected Value Index Sum of Col
0 Col_g 100 4 106
1 Col_j 90 6 97
2 Col_r 122 8 95
3 Col_s 111 9 113
Do whatever comparisons are needed and add any additional columns:
df_out['Bool'] = df_out['Sum of Col'] > df_out['Expected Value']
Col_Name Expected Value Index Sum of Col Bool
0 Col_g 100 4 106 True
1 Col_j 90 6 97 True
2 Col_r 122 8 95 False
3 Col_s 111 9 113 True
Reproducible Setup:
import pandas as pd
from numpy.random import Generator, MT19937
rng = Generator(MT19937(25))
a = rng.integers(25, size=(7, 11))
df = pd.DataFrame(a, ['foo', 'foo', 'bar', 'bar', 'bar', 'foo', 'foo'],
['a', 'b', 'f', 'g', 'h', 'j', 'k', 'r', 's', 't', 'z'])
df_lst = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j', 'Col_r', 'Col_s'],
'Expected Value': [100, 90, 122, 111],
'Index': [4, 6, 8, 9]})