python, scikit-learn, sklearn-pandas

How to weigh data points with sklearn training algorithms


I am looking to train either a random forest or a gradient boosting algorithm using sklearn. My data is structured so that each data point carries a weight corresponding to the number of times that point occurs in the dataset. Is there a way to pass these weights to sklearn during training, or do I need to expand my dataset into a non-weighted version in which each duplicate data point is represented individually?


Solution

  • You can definitely specify weights while training these classifiers in scikit-learn; specifically, you pass them at the fit step. Here is an example using RandomForestClassifier, but the same applies to GradientBoostingClassifier:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    import numpy as np

    # Load a toy dataset and hold out a test set
    data = load_breast_cancer()
    X = data.data
    y = data.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    

    Here I define some arbitrary weights just for the sake of the example:

    # Stand-ins for your occurrence counts: each training point gets weight 1 or 2
    weights = np.random.choice([1, 2], len(y_train))
    

    And then you can fit your model with these weights:

    rfc = RandomForestClassifier(n_estimators=20, random_state=42)
    rfc.fit(X_train, y_train, sample_weight=weights)
    
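    The same keyword is accepted by gradient boosting. As a minimal sketch, reusing the X_train, y_train, and weights defined above:

    from sklearn.ensemble import GradientBoostingClassifier

    # GradientBoostingClassifier.fit also accepts sample_weight
    gbc = GradientBoostingClassifier(n_estimators=20, random_state=42)
    gbc.fit(X_train, y_train, sample_weight=weights)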

    You can then evaluate your model on your test data.
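
    For instance, a quick sanity check of test accuracy (a minimal sketch; for classifiers, score returns the mean accuracy):

    # Mean accuracy of the weighted forest on the held-out test set
    print(rfc.score(X_test, y_test))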

    Now, to your last point: you could indeed resample your training set according to the weights by duplication (a sketch of this follows the list below). But in most real-world cases this would end up being very tedious, because

    • you would need to make sure all your weights are integers before you could duplicate rows
    • you would needlessly multiply the size of your data, which consumes memory and is most likely going to slow down training
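
    For completeness, here is a minimal sketch of what that duplication would look like with np.repeat, assuming the integer weights defined above:

    # Repeat each training row according to its integer weight
    X_dup = np.repeat(X_train, weights, axis=0)
    y_dup = np.repeat(y_train, weights)

    # The duplicated arrays are strictly larger than the originals
    print(X_train.shape, X_dup.shape)

    Passing sample_weight gives the trees essentially the same emphasis without materializing these copies.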