python, scikit-learn, sklearn-pandas

How to weigh data points with sklearn training algorithms


I am looking to train either a random forest or a gradient boosting algorithm using sklearn. My data is structured so that each data point carries a weight corresponding to the number of times that point occurs in the dataset. Is there a way to pass these weights to sklearn during training, or do I need to expand my dataset into a non-weighted version in which each duplicate data point is represented individually?


Solution

  • You can definitely specify weights while training these classifiers in scikit-learn; specifically, you pass them at the fit step. Here is an example using RandomForestClassifier, but the same applies to GradientBoostingClassifier:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    import numpy as np

    # Load a toy dataset and hold out a test set
    data = load_breast_cancer()
    X = data.data
    y = data.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    

    Here I define some arbitrary weights just for the sake of the example:

    # Stand-ins for your occurrence counts: each training point gets weight 1 or 2
    weights = np.random.choice([1, 2], len(y_train))
    

    And then you can fit your model with these weights:

    rfc = RandomForestClassifier(n_estimators=20, random_state=42)
    rfc.fit(X_train, y_train, sample_weight=weights)
    
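    The same keyword is accepted by gradient boosting. As a minimal sketch, reusing the X_train, y_train, and weights defined above:

    from sklearn.ensemble import GradientBoostingClassifier

    # GradientBoostingClassifier.fit also accepts sample_weight
    gbc = GradientBoostingClassifier(n_estimators=20, random_state=42)
    gbc.fit(X_train, y_train, sample_weight=weights)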

    You can then evaluate your model on your test data.
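
    For instance, a quick sanity check of test accuracy (a minimal sketch; for classifiers, score returns the mean accuracy):

    # Mean accuracy of the weighted forest on the held-out test set
    print(rfc.score(X_test, y_test))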

    Now, to your last point: you could indeed resample your training set according to the weights by duplication (a sketch of this follows the list below). But in most real-world cases this would end up being very tedious, because

    • you would need to make sure all your weights are integers before you could duplicate rows
    • you would needlessly multiply the size of your data, which consumes memory and is most likely going to slow down training
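
    For completeness, here is a minimal sketch of what that duplication would look like with np.repeat, assuming the integer weights defined above:

    # Repeat each training row according to its integer weight
    X_dup = np.repeat(X_train, weights, axis=0)
    y_dup = np.repeat(y_train, weights)

    # The duplicated arrays are strictly larger than the originals
    print(X_train.shape, X_dup.shape)

    Passing sample_weight gives the trees essentially the same emphasis without materializing these copies.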