I have a dataset for which I am calculating the "hazard rate", defined by the formula below:
if t = 1:
    hr_t = pd_t
else:
    hr_t = (pd_t * (t - (t-1)) + hr_(t-1) * (t-1)) / t
where t is the time in years, pd_t is the PD value for year t, and hr_(t-1) is the hazard rate of the previous year.
The simplest approach is to iterate over each row of the table, but that is also the slowest. Is there an efficient way to apply these two formulas conditionally in a pandas DataFrame?
import pandas as pd

# Create the year and PD values
data = {'Year': [1, 2, 3, 4, 5],
        'PD': [0.1, 0.23, 0.22, 0.19, 0.10]}
data
# Create a dataframe
df = pd.DataFrame(data)
df
The hazard rate for each time period (one row per year t) is calculated with one of the two formulas above. For every year after the first, the value depends on the current row's PD and on the hazard rate of the year before.
Example output:
Year  PD    Hazard_rate
1     0.1   0.1
2     0.23  0.165
3     0.22  0.226667
4     0.19  0.2125
5     0.10  0.172
The hazard rate for year 2 is calculated as follows:
(PD * (Year - (Year - 1)) + hazard_rate_(Year-1) * (Year - 1)) / Year
(0.23 * (2 - (2 - 1)) + 0.1 * (2 - 1)) / 2 = 0.165
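Since t - (t-1) is always 1, the recurrence can be written more compactly. The short derivation below is only a sketch that takes the formula exactly as stated, with hr_(t-1) read as the previous year's hazard rate, and works out years 2 and 3 by hand:

```latex
\begin{align*}
hr_t &= \frac{pd_t \,\bigl(t - (t-1)\bigr) + hr_{t-1}\,(t-1)}{t}
      = \frac{pd_t + (t-1)\,hr_{t-1}}{t}, \qquad t > 1, \\
hr_2 &= \frac{0.23 + 1 \cdot 0.1}{2} = 0.165, \\
hr_3 &= \frac{0.22 + 2 \cdot 0.165}{3} \approx 0.183333 .
\end{align*}
```

The example table lists 0.226667 for year 3, which equals (0.22 + 2 * 0.23) / 3, that is, the prior year's PD in place of hr_2. The same holds for years 4 and 5, so it may be worth double-checking which of the two readings is intended.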
import pandas as pd

# Create the year and PD values
data = {'Year': [1, 2, 3, 4, 5],
        'PD': [0.1, 0.23, 0.22, 0.19, 0.10]}
data
# Create a dataframe
df = pd.DataFrame(data)
df
# Initialize the column as float so the computed rates keep their decimals
df['Hazard_rate'] = 0.0
# Iterate over the data frame rows. A loop is the most direct way to do this,
# since each calculation depends on the prior one; pandas.DataFrame.apply()
# would just run a loop under the hood anyway.
# ASSUMPTIONS: hr_(t-1) is the hazard rate value for the prior year;
# every other "t" is just the row's Year value.
for index, row in df.iterrows():
    if row.Year == 1:
        df.loc[index, 'Hazard_rate'] = row.PD
    else:
        # The boolean mask returns a one-element Series; .iloc[0] extracts the scalar
        prev_hr = df.loc[df.Year == (row.Year - 1), 'Hazard_rate'].iloc[0]
        hr = (row.PD * (row.Year - (row.Year - 1)) + prev_hr * (row.Year - 1)) / row.Year
        df.loc[index, 'Hazard_rate'] = hr
EDIT: the asker reported an error with the loop above, so here is a list-based implementation that should work (same assumptions as above).
import pandas as pd

# Create the year and PD values
data = {'Year': [1, 2, 3, 4, 5],
        'PD': [0.1, 0.23, 0.22, 0.19, 0.10]}
# Create a dataframe
df = pd.DataFrame(data)
# Assumes the default 0..n-1 index and rows sorted by consecutive Year
hazard_rates = [0.0] * len(df.index)
for index, row in df.iterrows():
    if row.Year == 1:
        hazard_rates[index] = row.PD
    else:
        # The prior year's hazard rate is the previous entry in the list
        prev_hr = hazard_rates[index - 1]
        hr = (row.PD * (row.Year - (row.Year - 1)) + prev_hr * (row.Year - 1)) / row.Year
        hazard_rates[index] = float(hr)
df['Hazard_rate'] = hazard_rates
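A loop-free alternative, offered as a sketch under stated assumptions rather than a definitive implementation. Because t - (t-1) is always 1, multiplying the recurrence by t gives t * hr_t = pd_t + (t-1) * hr_(t-1), so t * hr_t is simply the running sum of PD and hr_t is the running mean. This assumes Year runs 1, 2, 3, ... without gaps and that the rows are sorted by Year. The column names Hazard_rate_vec and Hazard_rate_prev_pd are made up for illustration. The second column uses the prior year's PD instead of the prior hazard rate, which is the reading that reproduces the example table in the question.

```python
import pandas as pd

df = pd.DataFrame({'Year': [1, 2, 3, 4, 5],
                   'PD': [0.1, 0.23, 0.22, 0.19, 0.10]})

# Reading 1 (recurrence as written): hr_t is the running mean of PD.
# Assumes Year is 1, 2, 3, ... with no gaps, sorted ascending.
df['Hazard_rate_vec'] = df['PD'].cumsum() / df['Year']
# approximately 0.1, 0.165, 0.183333, 0.185, 0.168

# Reading 2 (assumption: the prior year's PD replaces hr_(t-1)).
# shift(1) aligns each row with the previous row's PD; the first row
# has no predecessor, and its term is multiplied by (Year - 1) = 0 anyway.
prev_pd = df['PD'].shift(1).fillna(0.0)
df['Hazard_rate_prev_pd'] = (df['PD'] + prev_pd * (df['Year'] - 1)) / df['Year']
# approximately 0.1, 0.165, 0.226667, 0.2125, 0.172

print(df)
```

Both columns are computed with whole-column operations, so no Python-level loop over the rows is involved.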