python scikit-learn linear-regression data-science training-data

How to use separate datasets for training and testing without splitting a dataframe (Python)?


I have gone through multiple questions that explain how to divide a dataframe into train and test sets, with scikit-learn or without it.

My situation is different: I have two CSV files (two dataframes from different years), and I want to use one as the training set and the other as the test set.

How can I do this for LinearRegression, or for any other model?


Solution

    • Load the datasets individually.
    • Make sure both have the same feature columns, with the same names and data types.
    • Use the train set to fit the model.
    • Use the test set to predict the output after training.

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    
    # Load the data
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    
    # Split features and target
    # (here the column to predict is named "target")
    X_train, y_train = train.drop(columns="target"), train["target"]
    X_test, y_test = test.drop(columns="target"), test["target"]
    
    # Fit (i.e. train) the model on the training data only
    reg = LinearRegression()
    reg.fit(X_train, y_train)
    
    # Predict on the separate test data
    pred = reg.predict(X_test)
    
    # Score: for a regressor, score() returns R^2, not accuracy
    r2 = reg.score(X_test, y_test)
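
Because the two files come from different years, their columns may not arrive in the same order. The sketch below shows one way to guard against that: take the feature list from the training frame and select the same columns, in the same order, from the test frame. The data is synthetic and stands in for the two CSV files; the column names `feat_a` and `feat_b` and the helper `make_year` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-ins for the two yearly CSV files (hypothetical data).
rng = np.random.default_rng(0)

def make_year(n_rows):
    """Build a small frame with two features and a noisy linear target."""
    X = rng.normal(size=(n_rows, 2))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_rows)
    return pd.DataFrame({"feat_a": X[:, 0], "feat_b": X[:, 1], "target": y})

train = make_year(200)
# Simulate a second file whose columns come in a different order.
test = make_year(100)[["target", "feat_b", "feat_a"]]

# Take the feature list from the training frame and reuse it for the test frame.
feature_cols = [c for c in train.columns if c != "target"]
missing = set(feature_cols) - set(test.columns)
if missing:
    raise ValueError(f"Test data is missing columns: {sorted(missing)}")

X_train, y_train = train[feature_cols], train["target"]
X_test, y_test = test[feature_cols], test["target"]  # same column order as train

# Fit on one dataset, evaluate on the other.
reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)

r2 = r2_score(y_test, pred)             # same value as reg.score(X_test, y_test)
mse = mean_squared_error(y_test, pred)
print(f"R^2: {r2:.3f}, MSE: {mse:.4f}")
```

With real files, the two `make_year` calls would be replaced by `pd.read_csv` calls; the rest should carry over as long as both files contain the target column.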