Tags: python-3.x, training-data, multilabel-classification

Split Train dataset based on labels


I would like to know how to split my multi-class training dataset so that the classes follow a specific ratio, for example 80% (class_2), 15% (class_1) and 5% (class_0).

I have a balanced data-set. I originally split the pandas data-set: 80% train and 20% test via the command:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

However, I would like to further specify the class ratio for the testing pandas dataset as 80% (class_2), 15% (class_1) and 5% (class_0). How can this be accomplished?

Here is a snippet of my dataset:

Feat1   Feat2   Feat3   Feat4   Feat5   Label
-58.422504  37.966175   -4.8636584  1.6725544   1.9571232   0
-16.001776  12.794211   -1.1406443  1.3552929   -3.1035073  1
-35.907864  19.15079    -1.4540794  4.7229285   -1.3495653  0
-40.63919   11.879825   0.26731083  4.509876    -0.3005377  1
-82.577805  38.87009    -0.6941721  0.41522327  -3.7065275  0
-91.21994   13.109437   -7.270507   2.081625    -4.206697   0
-47.69479   17.02262    -24.102415  -0.9498974  -6.126767   2
-76.956795  17.869856   -1.6058419  4.2835464   -1.3354894  0
-52.443146  46.593403   -3.4466643  1.1810641   -1.9001787  2
-67.86523   14.28042    0.71933913  2.1071763   1.3627108   1
-47.336437  9.525495    -20.755278  6.523259    -3.422134   2
-42.978676  12.458537   0.07322929  1.3635784   0.09735282  1
-24.21139   38.562397   0.042716235 6.6496754   -1.9689865  2
-48.612396  11.766575   -0.748889   3.8106124   2.109056    1
-49.890644  14.508443   0.36204648  1.7602062   -0.42747113 1
-58.165733  18.751013   -3.8809242  5.257564    -1.4671975  0
-31.926224  8.061624    -0.9180617  3.1844578   1.3856677   1
-49.51432   13.603332   1.1162373   0.88059276  0.8680044   1
-38.187065  22.042477   -9.74126    3.464233    -1.4608487  2
-36.763634  11.885029   -0.3559528  1.2861489   -0.006563603    1
-59.474194  17.596613   -13.849893  2.5668569   -7.367901   2
-20.775812  8.021951    -5.8948507  -1.76145    -3.0236924  1
-44.744774  42.550343   -2.8213162  1.496162    -5.367485   2
-59.297913  15.10593    -15.805616  -0.8902338  -2.0228894  2
-43.05664   17.326857   -21.520315  -0.544733   -5.821276   2
-113.831566 10.970723   -1.0806333  2.6965592   -0.50331205 0
-67.71741   37.033604   -7.5146904  4.7712235   -0.88289934 0
-51.200836  20.278473   -9.158655   4.746186    -5.2653203  2
-43.760933  13.239898   -5.1588607  2.5003295   -2.2052805  0
-53.52218   12.309539   -0.24887963 4.237159    0.52248794  0

How do I correctly split the train dataset by Label value according to specific ratios?

Thanks for your help and time!


Solution

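  • Sampling might be better suited to your purpose. Note that the snippet below shuffles the rows and cuts them into three chunks holding 5%, 15% and 80% of the rows; it does not filter on the Label column, so each chunk still contains a mix of classes:

    import numpy as np

    # Shuffle all rows, then cut at the 5% and 20% marks: chunks of 5%, 15% and 80%.
    class_0, class_1, class_2 = np.split(df.sample(frac=1, random_state=42),
                                        [int(.05*len(df)), int(.20*len(df))])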
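
To control the label composition itself, one option is to sample each class separately and concatenate the results. The following is only a minimal sketch, not a definitive implementation: the helper name `sample_by_label_ratio`, its parameters, and the synthetic `demo` frame are assumptions made for illustration, and it assumes the label column is called `Label` as in the question's snippet. Since the source data is balanced, reaching an 80/15/5 mix means discarding rows from the smaller target classes.

```python
import pandas as pd


def sample_by_label_ratio(df, ratios, label_col="Label", random_state=42):
    """Subsample df so each label makes up roughly the requested share.

    ratios maps label -> target fraction; the fractions should sum to 1.
    (Hypothetical helper, written for illustration only.)
    """
    counts = df[label_col].value_counts()
    # Largest total size for which every class still has enough rows.
    total = int(min(counts.loc[label] / frac for label, frac in ratios.items()))
    parts = [
        df[df[label_col] == label].sample(
            n=int(round(total * frac)), random_state=random_state
        )
        for label, frac in ratios.items()
    ]
    # Combine the per-class samples and shuffle the rows.
    return pd.concat(parts).sample(frac=1, random_state=random_state)


# Synthetic, balanced stand-in for the question's DataFrame (hypothetical data).
demo = pd.DataFrame({"Label": [0] * 1000 + [1] * 1000 + [2] * 1000})
subset = sample_by_label_ratio(demo, {2: 0.80, 1: 0.15, 0: 0.05})
print(subset["Label"].value_counts(normalize=True))
```

The same call could be applied to the train or the test portion after `train_test_split`, depending on which one should carry the skewed ratio. Checking `value_counts(normalize=True)` on the result is a quick way to confirm the shares.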