Search code examples
pythonpandasdataframecalculated-columns

Specify number of columns consulted in reversed(row) function


CONTEXT

I have a df similar to:

df_a = pd.DataFrame({
    '2020_Q1': [2, 3, 6, 20, 20],
    '2020_Q2': [2, 3, 6, 20, 20],
    '2020_Q3': [5, 3, 6, 20, 20], 
    '2020_Q4': [5, 4, 7, 20, 20],
    '2021_Q1': [5, 3, 7, 20, 20],
    '2021_Q2': [5, 4, 0, 20, 20],
}, index = ['People', 'AA', 'BB', 'MM', '$$'])
df_a.columns = pd.to_datetime(df_a.columns).to_period('Q')
df_a

Out[1]: 
          2020Q1   2020Q2   2020Q3  2020Q4   2021Q1   2021Q2
People        2        2        5        5        5        5
AA            3        3        3        4        3        4
BB            6        6        6        7        7        0
MM           20       20       20       20       20       20
$$           20       20       20       20       20       20

And a function that sets the values of the year end depending on how each row is calculated. Rows df['People', 'AA', 'BB'] get calculated by the last quarter of the year with a value higher than zero, or zero if none is higher, the financial values df['MM', '$$'] are calculated as a SUMof the Quarters of the year

def compute_end_year_value(row):
    if row.name in ['People', 'AA', 'BB']:
        for val in reversed(row):
            if val > 0:
                return val
        return 0
    return sum(row)

THE PROBLEM

However if I run the function as is, it will consult the entire row. How can I add to the definition a way to specify the reversed function to consider only the last n columns from where it is inserted?

So that when I do something like the following code, I get that output

df_a.insert(4,'2020_Total', df_a.apply(compute_end_year_value, axis = 1, steps = 4))
df_a.insert(7,'2021_Total', df_a.apply(compute_end_year_value, axis = 1, steps = 2))
df_a 

Out[1]: 
         2020Q1   2020Q2   2020Q3   2020Q4   2020_Total   2021Q1  2021Q2  2021_Total
People        2        2        5        5           5        5        5           5
AA            3        3        3        4           4        3        4           4
BB            6        6        6        6           6        7        0           7
MM           20       20       20       20          80       20       20          40
$$           20       20       20       20          80       20       20          40
           

Solution

  • You are iterating over rows instead of columns in your apply function.

    To achieve the desired out come you need to specify the axis to be columns instead.

    df_a.insert(4,'2020_Total', df_a.apply(lambda col: compute_end_year_value(col), axis=1))
    df_a.insert(7,'2021_Total', df_a.apply(lambda col: compute_end_year_value(col), axis=1))
    
    >>>
            2020_Q1  2020_Q2  2020_Q3  2020_Q4  2020_Total  2021_Q1  2021_Q2  \
    People        2        2        5        5           5        5        5   
    AA            3        3        3        4           4        3        4   
    BB            6        6        6        7           7        7        0   
    MM           20       20       20       20         120       20       20   
    $$           20       20       20       20         120       20       20   
    
            2021_Total  
    People           5  
    AA               4  
    BB               7  
    MM             240  
    $$             240  
    

    For more information on the apply function see the docs.

    EDIT

    For dealing with each year, I assume easiest approach is telling the function which year to consider. I modified the compute_end_year_value method to the following:

    def compute_end_year_value(row,year):
        col_pos_year = [col.startswith(year) for col in df_a.columns]
        valid_year_columns = df_a.columns[col_pos_year]
        row = row[valid_year_columns]
        if row.name in ['People', 'AA', 'BB']:
              for val in reversed(row):
                  if val > 0:
                      return val
              return 0
        
        return sum(row)
    
    df_a.insert(4,'2020_Total', df_a.apply(lambda col: compute_end_year_value(col, "2020"), axis=1))
    df_a.insert(7,'2021_Total', df_a.apply(lambda col: compute_end_year_value(col, "2021"), axis=1))
    df_a
    >>>
            2020_Q1  2020_Q2  2020_Q3  2020_Q4  2020_Total  2021_Q1  2021_Q2  \
    People        2        2        5        5           5        5        5   
    AA            3        3        3        4           4        3        4   
    BB            6        6        6        7           7        7        0   
    MM           20       20       20       20          80       20       20   
    $$           20       20       20       20          80       20       20   
    
            2021_Total  
    People           5  
    AA               4  
    BB               7  
    MM              40  
    $$              40  
    

    I added a list that checks if the column starts with a particular year and selects only those columns to be iterated over.