python pandas loops conditional-statements calculated-columns

Change the value of a column to NaN based on the value of another column inside a loop


I have a large number of columns with the suffix 'mean' or 'sum'. Sometimes the value in a 'mean' column is NaN. When this happens, I would like to set the corresponding 'sum' column to NaN as well. Because there are so many variables, I think I need to use a loop. I have created a fake dataframe and included the three things I have tried, based on similar posts here on SO. Unfortunately, none of them has worked.

import numpy as np
import pandas as pd

original_data_set = pd.DataFrame(
    {
        'customerId': [1, 2],
        'usage_1_sum': [100, 200],
        'usage_1_mean': [np.nan, 100],
        'usage_2_sum': [420, 330],
        'usage_2_mean': [45, np.nan],
    }
)

print('original dataset')
original_data_set

desired_data_set = pd.DataFrame(
    {
        'customerId': [1, 2],
        'usage_1_sum': [np.nan, 200],
        'usage_1_mean': [np.nan, 100],
        'usage_2_sum': [420, np.nan],
        'usage_2_mean': [45, np.nan],
    }
)

print('desired dataset')
desired_data_set



holder_set = original_data_set.copy()

for number in range(1,3):
    holder_set['usage_{}_sum'.format(number)] = (
        holder_set['usage_{}_sum'.format(number)]
        .where(holder_set['usage_{}_mean'.format(number)] == np.nan, np.nan)
    )

print('using Series.where changed all sum variables into NaN indiscriminately')
holder_set


holder_set = original_data_set.copy()

for number in range(1,3):
    conditions = [holder_set['usage_{}_mean'.format(number)]==np.nan]
    outcome = [np.nan]
    holder_set['usage_{}_sum'.format(number)] = np.select(conditions, outcome, default=holder_set['usage_{}_sum'.format(number)])
    
    
print('using an np.select did not have any effect on the dataframe')
holder_set


holder_set = original_data_set.copy()

for number in range(1,3):
    holder_set.loc[holder_set['usage_{}_mean'.format(number)]==np.nan, 'usage_{}_sum'.format(number)] = 12

print('using a loc did not have any effect on the dataframe')
holder_set


Solution

  • Assuming the original dataframe is df:

    df = pd.DataFrame({'customerId': [1, 2], 'usage_1_sum': [100, 200], 'usage_1_mean': [
                      np.nan, 100], 'usage_2_sum': [420, 330], 'usage_2_mean': [45, np.nan]})
    

    Use Index.str.endswith (through df.columns.str) to select the columns whose names end with _mean. Then, for each of those columns, set the corresponding _sum column to NaN in the rows where the _mean column is NaN:

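    for col in df.columns[df.columns.str.endswith('_mean')]:
        # Slice off the '_mean' suffix; str.rstrip('_mean') would strip any of
        # the characters '_', 'm', 'e', 'a', 'n', not the suffix as a whole.
        df.loc[df[col].isna(), col[:-len('_mean')] + '_sum'] = np.nan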
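
The isna() test is what separates this loop from the three attempts in the question. NaN never compares equal to anything, itself included, so a mask built with == np.nan is False on every row. A minimal sketch of the difference, using a small made-up Series (the names s, eq_mask and na_mask are illustrative only):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 100.0])  # illustrative data, not taken from the question

eq_mask = s == np.nan  # element-wise equality: NaN != NaN, so every entry is False
na_mask = s.isna()     # dedicated missing-value test

print(eq_mask.tolist())  # [False, False]
print(na_mask.tolist())  # [True, False]
```

An all-False mask explains all three outcomes: .loc and np.select have no rows to act on, so nothing changes, while Series.where keeps values only where the mask is True and therefore replaces every value with NaN.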
    

    Result:

    # print(df)
       customerId  usage_1_sum  usage_1_mean  usage_2_sum  usage_2_mean
    0           1          NaN           NaN        420.0          45.0
    1           2        200.0         100.0          NaN           NaN
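
With many column pairs, the loop could also be replaced by a single masked assignment. This is a sketch of one possible alternative, not the method used in the answer. It assumes every _mean column has a matching _sum column, and the names mean_cols, sum_cols and missing are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customerId': [1, 2],
    'usage_1_sum': [100, 200],
    'usage_1_mean': [np.nan, 100],
    'usage_2_sum': [420, 330],
    'usage_2_mean': [45, np.nan],
})

# Pair each *_mean column with its *_sum counterpart (assumed to exist)
mean_cols = [c for c in df.columns if c.endswith('_mean')]
sum_cols = [c[:-len('_mean')] + '_sum' for c in mean_cols]

# Boolean array, True where the *_mean value is missing.
# Converting to NumPy makes the mask positional, so the differing
# column names (_mean vs _sum) do not interfere with alignment.
missing = df[mean_cols].isna().to_numpy()

# DataFrame.mask replaces values with NaN wherever the condition is True
df[sum_cols] = df[sum_cols].mask(missing)

print(df)
```

The positional mask relies on mean_cols and sum_cols being in the same order, which the list comprehension guarantees.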