Tags: python, pandas, machine-learning, training-data, train-test-split

Using separate test and train files with train_test_split()


I have two .csv files: test.csv and train.csv. As you might expect, the test file does not have the target column ('y' in this case), while the train file does.

What I want to do is first use the train file to train the model entirely, then use the test file just to get predictions.

I'm using train_test_split from sklearn.model_selection to create the train and test sets, but it only splits a single dataset. I want to train the model on the train file first, and when that has finished, load the test data from test.csv and make the predictions.

So first I tried the classic way, but with a very small test size so that the file is effectively used for training only:

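import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

dataset = pd.read_csv(r'path\train.csv', sep=",")
X = dataset.drop('y', axis = 1)  # feature columns
y = dataset['y']                 # target column
# Tiny test_size so that almost all rows end up in the training set
X_train, X_test, y_train, y_test = train_test_split(
       X, y, test_size = 0.001, random_state = 45)

clf = SVC(kernel = 'rbf')
clf.fit(X_train, y_train)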

But then, for the real test part (where I want to use the data in test.csv, which has no target values), how can I import test.csv so that I can use it with the model trained above?

# somehow get the data from test.csv as X_test
clfPredict = clf.predict(X_test)

If this is not possible using train_test_split(), what's the proper way to accomplish this task?


Solution

  • Load train.csv into a DataFrame (df1 below) and split it into the target and the features. No train_test_split() call is needed:

    y_train = df1['y']
    X_train = df1.drop('y', axis = 1)
    

    For the test set, load test.csv into a second DataFrame (df2) and use it as is:

    X_test = df2
    

    The predictions are then clf.predict(X_test). Because test.csv has no target column, there is no true y_test to score them against.
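
    Put together, the steps above might look like the minimal sketch below. To keep it runnable without any files, it reads two small in-memory CSV strings (TRAIN_CSV and TEST_CSV, with made-up feature columns f1 and f2) instead of the real train.csv and test.csv. Those names and values are assumptions for illustration only; with real data you would pass the file paths to pd.read_csv.

```python
import io

import pandas as pd
from sklearn.svm import SVC

# Hypothetical stand-ins for train.csv and test.csv; with real data, pass the
# file paths to pd.read_csv instead of these in-memory buffers.
TRAIN_CSV = """f1,f2,y
0.0,0.1,0
0.2,0.0,0
0.1,0.3,0
1.0,0.9,1
0.8,1.1,1
1.2,1.0,1
"""
TEST_CSV = """f1,f2
0.1,0.1
1.1,0.9
"""

train_df = pd.read_csv(io.StringIO(TRAIN_CSV))
test_df = pd.read_csv(io.StringIO(TEST_CSV))

# The whole training file is used for fitting, so train_test_split is not called.
y_train = train_df["y"]
X_train = train_df.drop(columns="y")

# Select the test columns in the same order as the training columns.
X_test = test_df[X_train.columns]

clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)

# test.csv has no target column, so these are predictions, not ground truth.
y_pred = clf.predict(X_test)
print(y_pred)
```

    If you also want an accuracy estimate, train_test_split() is still useful, but only on train.csv: hold out part of the labelled rows for validation and keep test.csv for the final predictions.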