python machine-learning scikit-learn cross-validation train-test-split

How can I split data into 3 or more parts with sklearn?


I want to split data into train, test, and validation datasets that are stratified, but sklearn only provides cross_validation.train_test_split, which can only divide the data into 2 pieces. What should I do if I want to do this?


Solution

  • If you want a stratified train/test split, you can use StratifiedKFold in scikit-learn. Each fold yields a train/test pair that preserves the class proportions of y.

    Suppose X holds your features and y your labels (both as NumPy arrays, so they can be indexed by the returned index arrays), based on the example here:

    from sklearn.model_selection import StratifiedKFold

    # Each of the 3 folds keeps the class proportions of y
    skf = StratifiedKFold(n_splits=3)
    for train_index, test_index in skf.split(X, y):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
    

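    Update: To split data into, say, 3 different percentages, numpy.split() can be used like this. Note that it slices the arrays in their existing order, without shuffling or stratifying; with the cut points below, 70% goes to train, 10% to test, and 20% to validate: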

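    import numpy as np
    # Cut points at 70% and 80% of the length: train = 70%, test = 10%, validate = 20%
    X_train, X_test, X_validate  = np.split(X, [int(.7*len(X)), int(.8*len(X))])
    y_train, y_test, y_validate  = np.split(y, [int(.7*len(y)), int(.8*len(y))])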
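
    Since the question asks for all three parts to be stratified, one possible approach (not part of the original answer) is to call train_test_split from sklearn.model_selection twice with its stratify argument. The sketch below is only an illustration: the toy X and y, the 70/10/20 proportions, and random_state=0 are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, two classes in a 70/30 ratio.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

# First cut: hold out 20% as the test set, preserving class proportions.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Second cut: 0.125 of the remaining 80% equals 10% of the original data.
X_train, X_validate, y_train, y_validate = train_test_split(
    X_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_validate), len(X_test))
print(y_train.mean(), y_validate.mean(), y_test.mean())
```

    The second test_size is expressed relative to what is left after the first cut, which is why it is 0.125 rather than 0.1. Unlike numpy.split(), this shuffles the data and keeps the class ratio roughly equal across all three parts.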