I would like to know how to split my multi-class training
data set to a specific class ratio, such as 80% (class_2), 15% (class_1) and 5% (class_0).
I have a balanced data set. I originally split the pandas DataFrame into 80% train
and 20% test
with the command:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
However, I would also like to specify the class ratio for the test
set as 80% (class_2), 15% (class_1) and 5% (class_0). How can this be accomplished?
Here is a snippet of my dataset:
Feat1 Feat2 Feat3 Feat4 Feat5 Label
-58.422504 37.966175 -4.8636584 1.6725544 1.9571232 0
-16.001776 12.794211 -1.1406443 1.3552929 -3.1035073 1
-35.907864 19.15079 -1.4540794 4.7229285 -1.3495653 0
-40.63919 11.879825 0.26731083 4.509876 -0.3005377 1
-82.577805 38.87009 -0.6941721 0.41522327 -3.7065275 0
-91.21994 13.109437 -7.270507 2.081625 -4.206697 0
-47.69479 17.02262 -24.102415 -0.9498974 -6.126767 2
-76.956795 17.869856 -1.6058419 4.2835464 -1.3354894 0
-52.443146 46.593403 -3.4466643 1.1810641 -1.9001787 2
-67.86523 14.28042 0.71933913 2.1071763 1.3627108 1
-47.336437 9.525495 -20.755278 6.523259 -3.422134 2
-42.978676 12.458537 0.07322929 1.3635784 0.09735282 1
-24.21139 38.562397 0.042716235 6.6496754 -1.9689865 2
-48.612396 11.766575 -0.748889 3.8106124 2.109056 1
-49.890644 14.508443 0.36204648 1.7602062 -0.42747113 1
-58.165733 18.751013 -3.8809242 5.257564 -1.4671975 0
-31.926224 8.061624 -0.9180617 3.1844578 1.3856677 1
-49.51432 13.603332 1.1162373 0.88059276 0.8680044 1
-38.187065 22.042477 -9.74126 3.464233 -1.4608487 2
-36.763634 11.885029 -0.3559528 1.2861489 -0.006563603 1
-59.474194 17.596613 -13.849893 2.5668569 -7.367901 2
-20.775812 8.021951 -5.8948507 -1.76145 -3.0236924 1
-44.744774 42.550343 -2.8213162 1.496162 -5.367485 2
-59.297913 15.10593 -15.805616 -0.8902338 -2.0228894 2
-43.05664 17.326857 -21.520315 -0.544733 -5.821276 2
-113.831566 10.970723 -1.0806333 2.6965592 -0.50331205 0
-67.71741 37.033604 -7.5146904 4.7712235 -0.88289934 0
-51.200836 20.278473 -9.158655 4.746186 -5.2653203 2
-43.760933 13.239898 -5.1588607 2.5003295 -2.2052805 0
-53.52218 12.309539 -0.24887963 4.237159 0.52248794 0
How do I correctly split the training
data set by Label
to specific ratios?
Thanks for your help and time!
Sampling might be better for your purpose:
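import numpy as np

# Shuffle the rows, then cut at 5% and 20% of the length: gives 5%, 15% and 80% chunks.
# Note: the cuts are positional, so each chunk holds a random mix of labels.
class_0, class_1, class_2 = np.split(df.sample(frac=1, random_state=42),
                                     [int(.05*len(df)), int(.20*len(df))])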
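
If the goal is a subset whose Label column itself follows the 80/15/5 proportions, one option is to sample each class separately and concatenate the results. The following is only a sketch under assumptions: `sample_by_class_ratio` is a hypothetical helper (not a pandas or scikit-learn function), `demo_df` is synthetic balanced data standing in for your test set, and the column name `Label` is taken from the snippet in the question. Because the input is balanced, the most frequent target class limits the size of the result, so rows from the other classes are discarded.

```python
import numpy as np
import pandas as pd


def sample_by_class_ratio(df, label_col, ratios, random_state=42):
    """Return a shuffled subset of df whose label shares approximate `ratios`.

    `ratios` maps each label to its desired share; the shares should sum to 1.
    """
    counts = df[label_col].value_counts()
    # Largest total size for which every class still has enough rows.
    total = min(counts.loc[label] / share for label, share in ratios.items())
    parts = [
        df[df[label_col] == label].sample(n=int(total * share),
                                          random_state=random_state)
        for label, share in ratios.items()
    ]
    # Concatenate the per-class samples and shuffle the combined rows.
    return pd.concat(parts).sample(frac=1, random_state=random_state)


# Synthetic balanced data (100 rows per class), for illustration only.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "Feat1": rng.normal(size=300),
    "Label": np.repeat([0, 1, 2], 100),
})

resampled = sample_by_class_ratio(demo_df, "Label", {2: 0.80, 1: 0.15, 0: 0.05})
print(resampled["Label"].value_counts(normalize=True))
```

To apply this to the test split only, you could rebuild a DataFrame from `X_test` and `y_test` first and pass that in place of `demo_df`. Sampling without replacement keeps every row unique, at the cost of a smaller set.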