python pandas loops dataframe lag

Quickest way to apply a formula to a column which requires the last output


I have a dataset for which I am calculating the "Hazard rate" defined by the below formula:

if t = 1:
     hr_t  = pd_t

else:
     hr_t = (pd_t * (t - (t-1)) + hr_(t-1) * (t-1)) / t

where t stands for time (indicated by Years)
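
To make the recurrence concrete, here is a minimal plain-Python sketch of the two cases above. The function name `hazard_rates` is made up for illustration, and it assumes the PD values are supplied in order for years 1, 2, ..., n with no gaps.

```python
def hazard_rates(pds):
    """Apply the recurrence to PD values ordered by year (years 1..n assumed)."""
    rates = []
    for t, pd_t in enumerate(pds, start=1):
        if t == 1:
            # First year: the hazard rate is just the PD
            hr_t = pd_t
        else:
            # Later years: combine this year's PD with the previous hazard rate
            hr_t = (pd_t * (t - (t - 1)) + rates[-1] * (t - 1)) / t
        rates.append(hr_t)
    return rates


rates = hazard_rates([0.1, 0.23, 0.22, 0.19, 0.10])
```

Each step only needs the previous result (`rates[-1]`), which is why a naive implementation ends up as a loop.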

The simplest way to do this would be to iterate over each row in the table. However, this will also be the slowest. Is there an efficient way to calculate the above two formulas based on the condition in a pandas dataframe?

import pandas as pd

# Create the year and PD values
data = {'Year': [1, 2, 3, 4, 5],
        'PD': [0.1, 0.23, 0.22, 0.19, 0.10]}
data

# Create a dataframe
df = pd.DataFrame(data)
df

The hazard rate for each period (each row is one value of t) is calculated with one of the two formulas above. For every year after the first, the hazard rate uses the PD from the same row and the hazard rate from the previous year's row.

Example output:

Year    PD    Hazard_rate
1       0.1   0.1
2       0.23  0.165
3       0.22  0.183333
4       0.19  0.185
5       0.10  0.168

The hazard rate for year 2 is calculated in the following way:

(PD * (Year - (Year - 1)) + hazard_rate_(Year-1) * (Year - 1)) / Year
(0.23 * (2 - (2 - 1)) + 0.1 * (2 - 1)) / 2 = 0.165

Solution

  • import pandas as pd

    # Create the year and PD values
    data = {'Year': [1, 2, 3, 4, 5],
            'PD': [0.1, 0.23, 0.22, 0.19, 0.10]}

    # Create a dataframe
    df = pd.DataFrame(data)

    # Initialize the column as float so the assignments below keep its dtype
    df['Hazard_rate'] = 0.0

    # Iterate over the data frame rows. A loop is used because each
    # calculation depends on the prior one; pandas.DataFrame.apply()
    # would also process the rows one at a time under the hood.
    # ASSUMPTIONS: hr_(t-1) is the hazard rate value for the prior year;
    #              every other "t" is just the row's Year value
    for index, row in df.iterrows():
        if row.Year == 1:
            df.loc[index, 'Hazard_rate'] = row.PD
        else:
            # The boolean mask returns a one-element Series; take the scalar with .iloc[0]
            prev_hr = df.loc[df.Year == (row.Year - 1), 'Hazard_rate'].iloc[0]
            hr = (row.PD * (row.Year - (row.Year - 1)) + prev_hr * (row.Year - 1)) / row.Year
            df.loc[index, 'Hazard_rate'] = hr
    

    EDIT: the asker reported an error with the DataFrame-based loop, so here is a list-based implementation that keeps the results in a plain list until the end (same assumptions as above).

    import pandas as pd

    # Create the year and PD values
    data = {'Year': [1, 2, 3, 4, 5],
            'PD': [0.1, 0.23, 0.22, 0.19, 0.10]}

    # Create a dataframe
    df = pd.DataFrame(data)

    # Relies on the default 0..n-1 index and rows sorted by Year,
    # so hazard_rates[index - 1] is the prior year's hazard rate
    hazard_rates = [0.0] * len(df.index)
    for index, row in df.iterrows():
        if row.Year == 1:
            hazard_rates[index] = row.PD
        else:
            hr = (row.PD * (row.Year - (row.Year - 1)) + hazard_rates[index - 1] * (row.Year - 1)) / row.Year
            hazard_rates[index] = float(hr)

    df['Hazard_rate'] = hazard_rates
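
A loop-free alternative may also be possible. Since t - (t-1) is always 1, multiplying the recurrence by t gives hr_t * t = pd_t + hr_(t-1) * (t-1), so hr_t * t is just the running sum of PD up to year t. The sketch below relies on that observation and on an assumption not stated in the question: Year runs 1, 2, ..., n with no gaps and the rows are sorted by Year. If that does not hold for the real data, the loop versions are the safer choice.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({'Year': [1, 2, 3, 4, 5],
                   'PD': [0.1, 0.23, 0.22, 0.19, 0.10]})

# Running sum of PD divided by t.
# ASSUMPTION: Year is 1..n, sorted, with no missing years.
df['Hazard_rate'] = df['PD'].cumsum() / df['Year']
```

Under the same assumption, `df['PD'].expanding().mean()` should give the same column, since the hazard rate is then the running mean of PD.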