I'm new to python and i faced some difficulty on writing this certain of codes. So to give a background, i want to do an ever-default tagging for credit risk modelling.
0 for non-default accounts and 1 for default accounts
So the idea/concept is that throughout the date performance ( of 12 months), if the specific customer (ID) ever has one event of default (1) then the next performance after that event (for that customer) will be tagged as a default (1) even though the 'default tagging' is 0.
so the input like this:
ID | Date Performance | Default Tag |
---|---|---|
AAA | 2021-03-01 | 0 |
AAA | 2021-04-01 | 0 |
AAA | 2021-05-01 | 0 |
AAA | 2021-06-01 | 0 |
AAA | 2021-07-01 | 0 |
AAA | 2021-08-01 | 0 |
AAA | 2021-09-01 | 0 |
AAA | 2021-10-01 | 0 |
AAA | 2021-11-01 | 0 |
AAA | 2021-12-01 | 0 |
AAA | 2022-01-01 | 0 |
AAA | 2022-02-01 | 0 |
ABB | 2021-03-01 | 0 |
ABB | 2021-04-01 | 0 |
ABB | 2021-05-01 | 0 |
ABB | 2021-06-01 | 1 |
ABB | 2021-07-01 | 0 |
ABB | 2021-08-01 | 0 |
ABB | 2021-09-01 | 1 |
ABB | 2021-10-01 | 0 |
ABB | 2021-11-01 | 0 |
ABB | 2021-12-01 | 0 |
ABB | 2022-01-01 | 0 |
ABB | 2022-02-01 | 0 |
And the output would be on a new column like this:
ID | Date Performance | Ever Default Tag |
---|---|---|
AAA | 2021-03-01 | 0 |
AAA | 2021-04-01 | 0 |
AAA | 2021-05-01 | 0 |
AAA | 2021-06-01 | 0 |
AAA | 2021-07-01 | 0 |
AAA | 2021-08-01 | 0 |
AAA | 2021-09-01 | 0 |
AAA | 2021-10-01 | 0 |
AAA | 2021-11-01 | 0 |
AAA | 2021-12-01 | 0 |
AAA | 2022-01-01 | 0 |
AAA | 2022-02-01 | 0 |
ABB | 2021-03-01 | 0 |
ABB | 2021-04-01 | 0 |
ABB | 2021-05-01 | 0 |
ABB | 2021-06-01 | 1 |
ABB | 2021-07-01 | 1 |
ABB | 2021-08-01 | 1 |
ABB | 2021-09-01 | 1 |
ABB | 2021-10-01 | 1 |
ABB | 2021-11-01 | 1 |
ABB | 2021-12-01 | 1 |
ABB | 2022-01-01 | 1 |
ABB | 2022-02-01 | 1 |
Here is some code to do what your question asks:
(UPDATED to replace one-line lambda with more readable function for apply
, and also to correct an off-by-one error on when default begins.)
import pandas as pd
records = [
{'ID':'AAA', 'Date Performance': '2021-03-01', 'Default Tag': 0},
{'ID':'AAA', 'Date Performance': '2021-04-01', 'Default Tag': 0},
{'ID':'AAA', 'Date Performance': '2021-05-01', 'Default Tag': 0},
{'ID':'AAA', 'Date Performance': '2021-06-01', 'Default Tag': 0},
{'ID':'AAA', 'Date Performance': '2021-07-01', 'Default Tag': 0},
{'ID':'AAA', 'Date Performance': '2021-08-01', 'Default Tag': 0},
{'ID':'AAA', 'Date Performance': '2021-09-01', 'Default Tag': 0},
{'ID':'AAA', 'Date Performance': '2021-10-01', 'Default Tag': 0},
{'ID':'AAA', 'Date Performance': '2021-11-01', 'Default Tag': 0},
{'ID':'AAA', 'Date Performance': '2021-12-01', 'Default Tag': 0},
{'ID':'AAA', 'Date Performance': '2022-01-01', 'Default Tag': 0},
{'ID':'AAA', 'Date Performance': '2022-02-01', 'Default Tag': 0},
{'ID':'AAB', 'Date Performance': '2021-03-01', 'Default Tag': 0},
{'ID':'AAB', 'Date Performance': '2021-04-01', 'Default Tag': 0},
{'ID':'AAB', 'Date Performance': '2021-05-01', 'Default Tag': 0},
{'ID':'AAB', 'Date Performance': '2021-06-01', 'Default Tag': 1},
{'ID':'AAB', 'Date Performance': '2021-07-01', 'Default Tag': 0},
{'ID':'AAB', 'Date Performance': '2021-08-01', 'Default Tag': 0},
{'ID':'AAB', 'Date Performance': '2021-09-01', 'Default Tag': 1},
{'ID':'AAB', 'Date Performance': '2021-10-01', 'Default Tag': 0},
{'ID':'AAB', 'Date Performance': '2021-11-01', 'Default Tag': 0},
{'ID':'AAB', 'Date Performance': '2021-12-01', 'Default Tag': 0},
{'ID':'AAB', 'Date Performance': '2022-01-01', 'Default Tag': 0},
{'ID':'AAB', 'Date Performance': '2022-02-01', 'Default Tag': 0}
]
df = pd.DataFrame(records)
def hasEverDefaulted(df, x):
id, dt = 'ID', 'Date Performance'
hasSameIdAndNotLaterDate = (df[id] == x[id]) & (df[dt] <= x[dt])
return int(sum(df[hasSameIdAndNotLaterDate]['Default Tag']) > 0)
df['Default Tag'] = df.apply(lambda x: hasEverDefaulted(df, x), axis=1)
df.rename(columns={'Default Tag': 'Ever Default Tag'}, inplace=True)
print(df)
Output:
ID Date Performance Ever Default Tag
0 AAA 2021-03-01 0
1 AAA 2021-04-01 0
2 AAA 2021-05-01 0
3 AAA 2021-06-01 0
4 AAA 2021-07-01 0
5 AAA 2021-08-01 0
6 AAA 2021-09-01 0
7 AAA 2021-10-01 0
8 AAA 2021-11-01 0
9 AAA 2021-12-01 0
10 AAA 2022-01-01 0
11 AAA 2022-02-01 0
12 AAB 2021-03-01 0
13 AAB 2021-04-01 0
14 AAB 2021-05-01 0
15 AAB 2021-06-01 1
16 AAB 2021-07-01 1
17 AAB 2021-08-01 1
18 AAB 2021-09-01 1
19 AAB 2021-10-01 1
20 AAB 2021-11-01 1
21 AAB 2021-12-01 1
22 AAB 2022-01-01 1
23 AAB 2022-02-01 1
UPDATE #2:
Here's an explanation of what's happening with the call to apply()
.
When you call the apply()
method of an object of type DataFrame
in pandas with argument axis=1
, it repeatedly calls a specified function for each row of the DataFrame
object with a Series
object that contains the values in the columns of the DataFrame
object for the given row.
In the code in this answer, the function that gets called is a lambda
with the given row as argument x
, which in turn calls hasEverDefaulted()
with arguments df
and x
.
Within hasEverDefaulted()
, we do a vectorized logical calculation to obtain a column of boolean
values which scan the entire 'ID'
column of df
checking for equality with the 'ID'
value in row x
and also scan the entire 'Date Performance'
column checking for date values that are <=
the date value in row x
. We then use the vectorized logical "and" operator &
to combine these and select (by setting the result row corresponding to x
to True
) only values which match the current row's "id" and also have a "date" that is no later than that of the current row. We assign this vectorized (columnar) result to the variable hasSameIdAndNotLaterDate
.
We then use this variable as a boolean
filter to select from df
only the rows that are True
, and we sum the Default Tag
column values for these rows, create a boolean
value by testing this sum to check if it's > 0
(in which case default has occurred for that row's 'ID'
on or before that row's 'Date Performance'
) and return an int
version of this boolean
value (namely, 0
or 1
).
Back in the statement that calls apply()
we assign the columnar result of apply()
to the 'Default Tag'
column of df
. To make it clear that we have now changed the semantics of this column (which now indicates whether default occurred on or before the 'Date Performance'
date in each row), we call the rename()
method of the DataFrame
object df
to change the column name to 'Ever Default Tag'
.