Tags: python, pandas, dataframe, sampling

Pandas stratified splitting into train, test, and validation sets based on the target variable and its clusters


I have a dataframe with some features and a target column with values in {0,1}. I need to split this dataset into training, testing and validation sets. The validation set must be 20% of the dataset, and the remaining 80% must be split so that 80% of it goes into the training set. This can easily be achieved with sklearn's train_test_split.

My problem is that the splitting must be done in a stratified way based on the clusters I computed for both target values.

To compute the clusters, I separated the entries for the two targets into two subsets, e.g.:

ones = df[df_numerical['Target'] == 1].copy()
zeroes = df[df_numerical['Target'] == 0].copy()

Then for each subset I used kmeans to compute their clusters, and added the clusters to the dataframe, e.g.:

# the number of clusters for both variables is not the same
clusters_1 = kmeans_1.predict(ones[NUMERICAL_FEATURES])
ones['Cluster'] = clusters_1

clusters_0 = kmeans_0.predict(zeroes[NUMERICAL_FEATURES])
zeroes['Cluster'] = clusters_0
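
The snippet above uses kmeans_1 and kmeans_0 without showing how they were fitted. A minimal sketch of one way this setup might look is given below. The synthetic data, the feature names f1 and f2, and the cluster counts (4 and 2) are assumptions made only for illustration; the real dataset and cluster counts are not shown in the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the real data (assumption: two numerical features).
rng = np.random.default_rng(0)
NUMERICAL_FEATURES = ["f1", "f2"]
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=NUMERICAL_FEATURES)
df["Target"] = (rng.random(200) < 0.8).astype(int)

# One subset per target value.
ones = df[df["Target"] == 1].copy()
zeroes = df[df["Target"] == 0].copy()

# Illustrative cluster counts; the question only says they differ per target.
kmeans_1 = KMeans(n_clusters=4, n_init=10, random_state=0).fit(ones[NUMERICAL_FEATURES])
kmeans_0 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(zeroes[NUMERICAL_FEATURES])

# Attach the cluster label of each row to its subset.
ones["Cluster"] = kmeans_1.predict(ones[NUMERICAL_FEATURES])
zeroes["Cluster"] = kmeans_0.predict(zeroes[NUMERICAL_FEATURES])
```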

Now how can I split the datasets such that they are stratified by cluster size?

The splitting I need works like this: assuming I have 100 records, 80 of class 1 and 20 of class 0, and I need a 70% / 30% split, then the 70% part must contain 56 (70% of 80) records of class 1 and 14 (70% of 20) records of class 0. I know this can be done using the stratify parameter of train_test_split, but my problem is that, in addition to this, the splitting must also be stratified w.r.t. the clusters of each target value.
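
The arithmetic in this example can be checked with a small sketch. The toy dataframe below is made up purely for illustration (80 rows of class 1, 20 of class 0), and random_state is set only to make the run repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 80 records of class 1 and 20 of class 0 (illustrative data only).
toy = pd.DataFrame({"Target": [1] * 80 + [0] * 20})

# A 70% / 30% split stratified on the target column.
train, test = train_test_split(
    toy, train_size=0.7, stratify=toy["Target"], random_state=0
)

# Per-class counts in each part; the class ratio of the full frame is preserved.
train_counts = train["Target"].value_counts().to_dict()
test_counts = test["Target"].value_counts().to_dict()
print(train_counts, test_counts)
```

This only stratifies on the target; it does not yet take the clusters into account.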

One solution I thought of would be to extract the indices of the elements for both classes, put them into lists, extract the right number of elements from them, and then recombine the dataframes:

cluster_indices_0 = zeroes.groupby(['Cluster']).apply(lambda x: x.index)
cluster_indices_1 = ones.groupby(['Cluster']).apply(lambda x: x.index)

But this way I'd have to manually compute, for each cluster, the number of elements to pop, and I was looking for a way to do this automatically.

Is there a function in sklearn or pandas to achieve what I'm looking for without getting lost in the computation of the number of elements to extract?


Solution

  • Since you have your data already split by target, you simply need to call train_test_split on each subset and use the cluster column for stratification.

    from sklearn.model_selection import train_test_split
    train_test_0, validation_0 = train_test_split(zeroes, train_size=0.8, stratify=zeroes['Cluster'])
    train_0, test_0 = train_test_split(train_test_0, train_size=0.7, stratify=train_test_0['Cluster'])

    Then do the same for target one and combine all the subsets.
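
    A minimal end-to-end sketch of this approach might look like the following. The helper name split_by_cluster, the seed argument, and the synthetic ones / zeroes frames are assumptions introduced for illustration; only the two train_test_split calls and their proportions come from the answer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_by_cluster(subset, seed=0):
    """Split one target's subset into train/test/validation, stratified by 'Cluster'.

    Hypothetical helper: wraps the two train_test_split calls from the answer.
    """
    train_test, validation = train_test_split(
        subset, train_size=0.8, stratify=subset["Cluster"], random_state=seed
    )
    train, test = train_test_split(
        train_test, train_size=0.7, stratify=train_test["Cluster"], random_state=seed
    )
    return train, test, validation


# Synthetic stand-ins for the clustered subsets (illustrative data only).
rng = np.random.default_rng(0)
ones = pd.DataFrame({"Target": 1, "Cluster": rng.integers(0, 4, size=800)})
zeroes = pd.DataFrame({"Target": 0, "Cluster": rng.integers(0, 2, size=200)})
zeroes.index = zeroes.index + len(ones)  # keep row labels unique across subsets

# Split each target separately, then recombine the matching parts.
train_1, test_1, validation_1 = split_by_cluster(ones)
train_0, test_0, validation_0 = split_by_cluster(zeroes)

train = pd.concat([train_0, train_1])
test = pd.concat([test_0, test_1])
validation = pd.concat([validation_0, validation_1])
```

    Because each target is split on its own, the class ratio is preserved automatically, and the stratify argument keeps the cluster proportions within each class. Note that train_test_split raises an error if any cluster has fewer than two members. An alternative that should behave similarly is to concatenate both subsets first and stratify a single call on a combined target-and-cluster label.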