I would like to average certain column values depending on whether a condition is met in another column. Specifically, if column 1 in the below dataframe is < 1700, I want to include the corresponding value in that row from column 51 in my average calculation. And if column 2 < 1700, I want to also include the value in that row from column 52 in my average calculation.
So, for row 0, the new calculated column for that row would be 64 (average of 65 & 63). For row 1, the average would be just 80 (column 51 value) since neither columns 2 nor 3 were less than 1700 and hence not included in the average calculation.
This is a simplified example as my actual dataframe has about 10 columns for conditions with 10 corresponding columns of values to average.
As a potential complexity, the column headers are numbers rather than traditional text labels and do not refer to the order of that column in the dataframe since I've excluded certain columns when I imported the csv file. In other words, column 51 isn't the 51st column in the dataframe.
When I run the below code I'm getting the following error:
ValueError: ("No axis named 1 for object type ", 'occurred at index 0')
Is there a more efficient way to code this and avoid this error? Thanks for your help!
import pandas as pd
import numpy as np
test_df = pd.DataFrame({1:[1600,1600,1600,1700,1800],2:[1500,2000,1400,1500,2000],
3:[2000,2000,2000,2000,2000],51:[65,80,75,80,75],52:[63,82,85,85,75],53:[83,80,75,76,78]})
test_df
1 2 3 51 52 53
0 1600 1500 2000 65 63 83
1 1600 2000 2000 80 82 80
2 1600 1400 2000 75 85 75
3 1700 1500 2000 80 85 76
4 1800 2000 2000 75 75 78
def calc_mean_based_on_conditions(row):
list_of_columns_to_average = []
for i in range(1,4):
if row[i] < 1700:
list_of_columns_to_average.append(i+50)
if not list_of_columns_to_average:
return np.nan
else:
return row[(list_of_columns_to_average)].mean(axis=1)
test_df['MeanValue'] = test_df.apply(calc_mean_based_on_conditions, axis=1)
Something very relevant (supporting int as column names)- https://github.com/theislab/anndata/issues/31
Due to this bug/issue, I converted the column names to type string:
test_df = pd.DataFrame({'1':[1600,1600,1600,1700,1800],'2':[1500,2000,1400,1500,2000],
'3':[2000,2000,2000,2000,2000],'51':[65,80,75,80,75],'52':[63,82,85,85,75],'53':
[83,80,75,76,78]})
Created a new dataframe - new_df to meet out requirements
new_df = test_df[['1', '2', '3']].where(test_df[['1','2','3']]<1700).notnull()
new_df now looks like this
1 2 3
0 True True False
1 True False False
2 True True False
3 False True False
4 False False False
Then simply rename the column and check using 'where'
new_df = new_df.rename(columns={"1": "51", "2":"52", "3":"53"})
test_df['mean_value'] = test_df[['51', '52', '53']].where(new_df).mean(axis=1)
This should give you the desired output -
1 2 3 51 52 53 mean_value
0 1600 1500 2000 65 63 83 64.0
1 1600 2000 2000 80 82 80 80.0
2 1600 1400 2000 75 85 75 80.0
3 1700 1500 2000 80 85 76 85.0
4 1800 2000 2000 75 75 78 NaN