Search code examples
pythonpandasmodeling

Ever Default Tagging - for Credit Risk Modelling- Python


I'm new to python and i faced some difficulty on writing this certain of codes. So to give a background, i want to do an ever-default tagging for credit risk modelling.

0 for non-default accounts and 1 for default accounts

So the idea/concept is that throughout the date performance ( of 12 months), if the specific customer (ID) ever has one event of default (1) then the next performance after that event (for that customer) will be tagged as a default (1) even though the 'default tagging' is 0.

so the input like this:

ID Date Performance Default Tag
AAA 2021-03-01 0
AAA 2021-04-01 0
AAA 2021-05-01 0
AAA 2021-06-01 0
AAA 2021-07-01 0
AAA 2021-08-01 0
AAA 2021-09-01 0
AAA 2021-10-01 0
AAA 2021-11-01 0
AAA 2021-12-01 0
AAA 2022-01-01 0
AAA 2022-02-01 0
ABB 2021-03-01 0
ABB 2021-04-01 0
ABB 2021-05-01 0
ABB 2021-06-01 1
ABB 2021-07-01 0
ABB 2021-08-01 0
ABB 2021-09-01 1
ABB 2021-10-01 0
ABB 2021-11-01 0
ABB 2021-12-01 0
ABB 2022-01-01 0
ABB 2022-02-01 0

And the output would be on a new column like this:

ID Date Performance Ever Default Tag
AAA 2021-03-01 0
AAA 2021-04-01 0
AAA 2021-05-01 0
AAA 2021-06-01 0
AAA 2021-07-01 0
AAA 2021-08-01 0
AAA 2021-09-01 0
AAA 2021-10-01 0
AAA 2021-11-01 0
AAA 2021-12-01 0
AAA 2022-01-01 0
AAA 2022-02-01 0
ABB 2021-03-01 0
ABB 2021-04-01 0
ABB 2021-05-01 0
ABB 2021-06-01 1
ABB 2021-07-01 1
ABB 2021-08-01 1
ABB 2021-09-01 1
ABB 2021-10-01 1
ABB 2021-11-01 1
ABB 2021-12-01 1
ABB 2022-01-01 1
ABB 2022-02-01 1

Solution

  • Here is some code to do what your question asks:

    (UPDATED to replace one-line lambda with more readable function for apply, and also to correct an off-by-one error on when default begins.)

    import pandas as pd
    records = [
        {'ID':'AAA', 'Date Performance': '2021-03-01', 'Default Tag': 0},
        {'ID':'AAA', 'Date Performance': '2021-04-01', 'Default Tag': 0},
        {'ID':'AAA', 'Date Performance': '2021-05-01', 'Default Tag': 0},
        {'ID':'AAA', 'Date Performance': '2021-06-01', 'Default Tag': 0},
        {'ID':'AAA', 'Date Performance': '2021-07-01', 'Default Tag': 0},
        {'ID':'AAA', 'Date Performance': '2021-08-01', 'Default Tag': 0},
        {'ID':'AAA', 'Date Performance': '2021-09-01', 'Default Tag': 0},
        {'ID':'AAA', 'Date Performance': '2021-10-01', 'Default Tag': 0},
        {'ID':'AAA', 'Date Performance': '2021-11-01', 'Default Tag': 0},
        {'ID':'AAA', 'Date Performance': '2021-12-01', 'Default Tag': 0},
        {'ID':'AAA', 'Date Performance': '2022-01-01', 'Default Tag': 0},
        {'ID':'AAA', 'Date Performance': '2022-02-01', 'Default Tag': 0},
        {'ID':'AAB', 'Date Performance': '2021-03-01', 'Default Tag': 0},
        {'ID':'AAB', 'Date Performance': '2021-04-01', 'Default Tag': 0},
        {'ID':'AAB', 'Date Performance': '2021-05-01', 'Default Tag': 0},
        {'ID':'AAB', 'Date Performance': '2021-06-01', 'Default Tag': 1},
        {'ID':'AAB', 'Date Performance': '2021-07-01', 'Default Tag': 0},
        {'ID':'AAB', 'Date Performance': '2021-08-01', 'Default Tag': 0},
        {'ID':'AAB', 'Date Performance': '2021-09-01', 'Default Tag': 1},
        {'ID':'AAB', 'Date Performance': '2021-10-01', 'Default Tag': 0},
        {'ID':'AAB', 'Date Performance': '2021-11-01', 'Default Tag': 0},
        {'ID':'AAB', 'Date Performance': '2021-12-01', 'Default Tag': 0},
        {'ID':'AAB', 'Date Performance': '2022-01-01', 'Default Tag': 0},
        {'ID':'AAB', 'Date Performance': '2022-02-01', 'Default Tag': 0}
    ]
    df = pd.DataFrame(records)
    def hasEverDefaulted(df, x):
        id, dt = 'ID', 'Date Performance'
        hasSameIdAndNotLaterDate = (df[id] == x[id]) & (df[dt] <= x[dt])
        return int(sum(df[hasSameIdAndNotLaterDate]['Default Tag']) > 0)
    df['Default Tag'] = df.apply(lambda x: hasEverDefaulted(df, x), axis=1)
    df.rename(columns={'Default Tag': 'Ever Default Tag'}, inplace=True)
    print(df)
    

    Output:

         ID Date Performance  Ever Default Tag
    0   AAA       2021-03-01                 0
    1   AAA       2021-04-01                 0
    2   AAA       2021-05-01                 0
    3   AAA       2021-06-01                 0
    4   AAA       2021-07-01                 0
    5   AAA       2021-08-01                 0
    6   AAA       2021-09-01                 0
    7   AAA       2021-10-01                 0
    8   AAA       2021-11-01                 0
    9   AAA       2021-12-01                 0
    10  AAA       2022-01-01                 0
    11  AAA       2022-02-01                 0
    12  AAB       2021-03-01                 0
    13  AAB       2021-04-01                 0
    14  AAB       2021-05-01                 0
    15  AAB       2021-06-01                 1
    16  AAB       2021-07-01                 1
    17  AAB       2021-08-01                 1
    18  AAB       2021-09-01                 1
    19  AAB       2021-10-01                 1
    20  AAB       2021-11-01                 1
    21  AAB       2021-12-01                 1
    22  AAB       2022-01-01                 1
    23  AAB       2022-02-01                 1
    

    UPDATE #2: Here's an explanation of what's happening with the call to apply().

    When you call the apply() method of an object of type DataFrame in pandas with argument axis=1, it repeatedly calls a specified function for each row of the DataFrame object with a Series object that contains the values in the columns of the DataFrame object for the given row.

    In the code in this answer, the function that gets called is a lambda with the given row as argument x, which in turn calls hasEverDefaulted() with arguments df and x.

    Within hasEverDefaulted(), we do a vectorized logical calculation to obtain a column of boolean values which scan the entire 'ID' column of df checking for equality with the 'ID' value in row x and also scan the entire 'Date Performance' column checking for date values that are <= the date value in row x. We then use the vectorized logical "and" operator & to combine these and select (by setting the result row corresponding to x to True) only values which match the current row's "id" and also have a "date" that is no later than that of the current row. We assign this vectorized (columnar) result to the variable hasSameIdAndNotLaterDate.

    We then use this variable as a boolean filter to select from df only the rows that are True, and we sum the Default Tag column values for these rows, create a boolean value by testing this sum to check if it's > 0 (in which case default has occurred for that row's 'ID' on or before that row's 'Date Performance') and return an int version of this boolean value (namely, 0 or 1).

    Back in the statement that calls apply() we assign the columnar result of apply() to the 'Default Tag' column of df. To make it clear that we have now changed the semantics of this column (which now indicates whether default occurred on or before the 'Date Performance' date in each row), we call the rename() method of the DataFrame object df to change the column name to 'Ever Default Tag'.