python-3.x linear-regression train-test-split

Found input variables with inconsistent numbers of samples: [799996, 199999]


I am splitting a single DataFrame, so why does it report an inconsistent number of samples for X_train and X_test (if that is what the error means)?

X_train, X_test = train_test_split(df[categorical_cols+ numeric_cols], test_size=0.2, random_state=4)
regression = LinearRegression().fit(X_train, X_test)
regression.score(X)

Solution

  • In your example, train_test_split will do something roughly equivalent to the following:

    1. Generate a random number between 0 and 1 for each record

    2. Put records where the random number is below 0.2 in the test set

    3. Put the rest in the training set

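    Under this conceptual model the split sizes would vary slightly, because the share of random numbers below 0.2 will not always be exactly 20%. scikit-learn's train_test_split actually fixes the sizes up front: it shuffles the rows and assigns ceil(test_size * n_samples) of them to the test set. Either way, X_train and X_test are expected to have different numbers of rows. The error is raised by fit, which takes the features and the target as its two arguments and requires them to have the same number of rows; LinearRegression().fit(X_train, X_test) passes X_test (199999 rows) as the target for X_train (799996 rows).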
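
The three steps above can be sketched in plain Python. This is only an illustration of the conceptual model described in this answer, not scikit-learn's implementation; the function name `conceptual_split`, its parameters, and the toy `records` list are hypothetical and chosen for the example.

```python
import random


def conceptual_split(records, test_fraction=0.2, seed=4):
    """Split records by drawing one random number per record.

    Hypothetical helper that mirrors the three steps in the answer;
    it is not how scikit-learn's train_test_split is implemented.
    """
    rng = random.Random(seed)  # seeded generator so the split is repeatable
    train, test = [], []
    for record in records:
        # Step 1: draw a random number in [0, 1) for this record
        draw = rng.random()
        if draw < test_fraction:
            # Step 2: draws below the threshold go to the test set
            test.append(record)
        else:
            # Step 3: everything else goes to the training set
            train.append(record)
    return train, test


records = list(range(1000))  # toy stand-in for the rows of a DataFrame
train, test = conceptual_split(records)

# The two sizes always add up to the input size, but the test size
# is only close to 20% of it, not guaranteed to be exactly 20%.
print(len(train), len(test))
```

Every record lands in exactly one of the two lists, so the split always produces two collections of different lengths. That is normal for a train/test split and does not by itself cause an error.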