I have a large number of columns with the suffix 'mean' or 'sum'. Sometimes the value in a 'mean' column is NaN. When this happens, I would like to set the value in the matching 'sum' column to NaN as well. Because there are so many variables, I think I need to use a loop. I have created a fake dataframe and included the three approaches I have tried, based on similar posts here on SO. Unfortunately, none of them worked.
import numpy as np
import pandas as pd

original_data_set = pd.DataFrame(
    {
        'customerId': [1, 2],
        'usage_1_sum': [100, 200],
        'usage_1_mean': [np.nan, 100],
        'usage_2_sum': [420, 330],
        'usage_2_mean': [45, np.nan],
    }
)
print('original dataset')
original_data_set
desired_data_set = pd.DataFrame(
    {
        'customerId': [1, 2],
        'usage_1_sum': [np.nan, 200],
        'usage_1_mean': [np.nan, 100],
        'usage_2_sum': [420, np.nan],
        'usage_2_mean': [45, np.nan],
    }
)
print('desired dataset')
desired_data_set
holder_set = original_data_set.copy()
for number in range(1, 3):
    holder_set['usage_{}_sum'.format(number)] = (
        holder_set['usage_{}_sum'.format(number)]
        .where(holder_set['usage_{}_mean'.format(number)] == np.nan, np.nan)
    )
print('using an np.where statement changed all sum variables into NaN with no discretion')
holder_set
holder_set = original_data_set.copy()
for number in range(1, 3):
    conditions = [holder_set['usage_{}_mean'.format(number)] == np.nan]
    outcome = [np.nan]
    holder_set['usage_{}_sum'.format(number)] = np.select(
        conditions, outcome, default=holder_set['usage_{}_sum'.format(number)]
    )
print('using an np.select did not have any effect on the dataframe')
holder_set
holder_set = original_data_set.copy()
for number in range(1, 3):
    holder_set.loc[holder_set['usage_{}_mean'.format(number)] == np.nan, 'usage_{}_sum'.format(number)] = 12
print('using a loc did not have any effect on the dataframe')
holder_set
Assuming the original dataframe is df:
df = pd.DataFrame({'customerId': [1, 2],
                   'usage_1_sum': [100, 200], 'usage_1_mean': [np.nan, 100],
                   'usage_2_sum': [420, 330], 'usage_2_mean': [45, np.nan]})
Use Series.str.endswith to select the columns whose names end with _mean. Then, for each of those columns, set the values in the corresponding _sum column to NaN wherever the value in the _mean column is NaN:
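for col in df.columns[df.columns.str.endswith('_mean')]:
    # Slice off the '_mean' suffix; str.rstrip strips a set of characters, not a suffix
    sum_col = col[:-len('_mean')] + '_sum'
    df.loc[df[col].isna(), sum_col] = np.nan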
Result:
# print(df)
   customerId  usage_1_sum  usage_1_mean  usage_2_sum  usage_2_mean
0           1          NaN           NaN        420.0          45.0
1           2        200.0         100.0          NaN           NaN
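As for why the attempts in the question did not behave as expected: NaN never compares equal to anything, itself included, so `== np.nan` always gives an all-False mask. With `np.select` and `.loc` that mask matches no rows, and with `.where` (which keeps values only where the condition is True) it replaces every value. The sketch below illustrates this on the same sample data and, as one possible alternative rather than the only way, rewrites the question's loop with `Series.mask` and `isna()`. The names `eq_mask`, `na_mask` and `fixed` are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customerId': [1, 2],
    'usage_1_sum': [100, 200],
    'usage_1_mean': [np.nan, 100],
    'usage_2_sum': [420, 330],
    'usage_2_mean': [45, np.nan],
})

# Comparing with == np.nan is False for every row, even the row that holds NaN.
eq_mask = df['usage_1_mean'] == np.nan
# isna() is the NaN-aware test; here it flags only the first row.
na_mask = df['usage_1_mean'].isna()

# The question's loop, using Series.mask: values become NaN where the condition is True.
fixed = df.copy()
for number in range(1, 3):
    sum_col = 'usage_{}_sum'.format(number)
    mean_col = 'usage_{}_mean'.format(number)
    fixed[sum_col] = fixed[sum_col].mask(fixed[mean_col].isna())

print(fixed)
```

`mask` returns a new Series instead of writing NaN into the existing integer column in place, and the sum columns come back as float because NaN cannot be stored in an int64 column.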