Tags: python, machine-learning, classification, train-test-split

What should be passed as the input when using the train_test_split function twice in Python 3.6?


Basically, I want to split my dataset into training, testing, and validation sets, so I have used the train_test_split function twice. I have a dataset of around 10 million rows.

In the first split, I divided the data into 70% training and 30% testing. Now, to get the validation set, I am a bit confused about whether to pass the test split or the training split as the input to train_test_split. Please advise. TIA

X = features 
y = target 

# divide X, y into 70% training, 15% testing, and 15% validation data

from sklearn.model_selection import train_test_split 

# features and labels split 70-30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# the 30% test portion is then split in half: 15% test, 15% validation
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=0)

Solution

  • Don't make the test set too small; a 20% test set is fine. It is better to split your training data further into training and validation sets (80%/20% is a fair split). With that in mind, change your code as follows:

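    # hold out 20% of the data as the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # split the remaining 80% into training and validation sets;
    # 25% of that 80% is 20% of the full dataset
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)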
    

    Splitting a dataset this way is common practice.
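
To check the resulting proportions, here is a minimal, self-contained sketch of the two-step split. The synthetic arrays (1,000 rows, 5 features, binary labels) are an assumption made purely for illustration; substitute your own features and target. The second split draws the validation set from the training portion, so the test set from the first split stays untouched.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First split: hold out 20% of all rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Second split: take 25% of the remaining 80% as the validation set,
# which is 0.25 * 0.8 = 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0
)

# Report the size of each subset: training, validation, test.
print(len(X_train), len(X_val), len(X_test))
```

With 1,000 rows this gives a 60/20/20 split. Fixing random_state in both calls keeps the split reproducible between runs.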