
Assigning higher weight to recent months' observations for an ML model


I have a highly imbalanced dataset and I want to assign weights to my observations by month.
For instance, if an observation is from January 2022 I'll give it 1/5, if it's from March 2022 I'll give it 1/3, and so on.

feature_1     date      weights
117        2016-11-12    0.015
...           ...         ...
123        2022-01-01    0.2
234        2022-01-02    0.2
...           ...         ...
345        2022-05-31    1.0


I'm using CatBoostClassifier, and I believe I can pass a list of weights for all my data to the weight parameter of Pool. It would look something like this:

model.fit(Pool(X_train,y_train,weight=train_weight))

The problem is that I can't think of an elegant way to build the weights column/list.
For now, I split my dataframe by month like this:

g = X_train.groupby(pd.Grouper(key='date', freq='M'))
dfs = [group for _,group in g]

and built the weight column like this:

for i, df in enumerate(dfs):
    weight = []
    for val in dfs[i].iterrows():
        weight.append(1 / (len(dfs)+2 - i))
    dfs[i]['weight'] = weight

Solution

  • Given the following toy dataframe:

    from datetime import datetime
    
    import pandas as pd
    
    df = pd.DataFrame(
        {
            "feature_1": [117, 123, 234, 345],
            "date": ["2016-11-12", "2022-01-01", "2022-01-02", "2022-05-31"],
        }
    )
    
    df["date"] = pd.to_datetime(df["date"])
    

    Define a helper function to calculate weights:

    def weight(current_date, previous_date):
        try:
            wgt = round(
                1
                / (
                    (current_date.year - previous_date.year) * 12
                    + current_date.month
                    - previous_date.month
                ),
                3,
            )
        except ZeroDivisionError:
            wgt = 1
        return wgt
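To make the helper's behaviour concrete, here is a small check on a few hand-picked dates. The dates and variable names are illustrative assumptions, and the helper is repeated only so the snippet runs on its own. Note that, as written, a date one month before the reference and a date in the reference month itself both end up with weight 1, and January 2022 gets 1/4 rather than the 1/5 mentioned in the question; adding 1 to the month difference would reproduce the question's scheme.

```python
from datetime import datetime


def weight(current_date, previous_date):
    # Same helper as above: 1 / (calendar months between the two dates),
    # falling back to 1 when both dates are in the same month.
    try:
        wgt = round(
            1
            / (
                (current_date.year - previous_date.year) * 12
                + current_date.month
                - previous_date.month
            ),
            3,
        )
    except ZeroDivisionError:
        wgt = 1
    return wgt


# Illustrative reference date and sample dates (assumptions for this check).
reference = datetime(2022, 5, 31)
w_nov_2016 = weight(reference, datetime(2016, 11, 12))  # 66 months back -> 0.015
w_jan_2022 = weight(reference, datetime(2022, 1, 1))    # 4 months back  -> 0.25
w_apr_2022 = weight(reference, datetime(2022, 4, 15))   # 1 month back   -> 1.0
w_may_2022 = weight(reference, datetime(2022, 5, 1))    # same month     -> 1 (fallback)
```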
    

    And so, assuming the most recent date is 31 May 2022:

    df["weight"] = df["date"].apply(lambda x: weight(datetime(2022, 5, 31), x))
    
    print(df)
    # Output
       feature_1       date  weight
    0        117 2016-11-12   0.015
    1        123 2022-01-01   0.250
    2        234 2022-01-02   0.250
    3        345 2022-05-31   1.000
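A vectorized variant may be worth considering on larger frames, since `apply` calls the helper once per row. The sketch below is one possible way to do it, not the answer's method: it assumes the reference point is the latest date in the data (instead of a hard-coded date) and adds 1 to the month difference, so that the latest month gets 1, the month before it 1/2, and January 2022 gets the 1/5 requested in the question. The names `latest` and `months_back` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "feature_1": [117, 123, 234, 345],
        "date": pd.to_datetime(
            ["2016-11-12", "2022-01-01", "2022-01-02", "2022-05-31"]
        ),
    }
)

# Assumption: the reference point is the most recent date in the data.
latest = df["date"].max()

# Calendar months between each row's month and the reference month.
months_back = (latest.year - df["date"].dt.year) * 12 + (
    latest.month - df["date"].dt.month
)

# Assumption: the +1 offset makes the latest month weigh 1, the previous
# month 1/2, and so on (the helper above uses no offset).
df["weight"] = (1 / (months_back + 1)).round(3)
```

With the toy data this yields 0.015, 0.2, 0.2 and 1.0. The resulting column could then be passed as the `weight` argument of `Pool`, as in the question, provided its rows line up with the training data.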