python machine-learning scikit-learn linear-regression

ValueError: Found input variables with inconsistent numbers of samples

I wrote the following code to learn how to score a model with machine learning methods, but I get the following error. What could be the reason?

ValueError: Found input variables with inconsistent numbers of samples: [6396, 1599]

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('Armenian Market Car Prices.csv')

df['Car Name'] = df['Car Name'].astype('category').cat.codes

df = df.join(pd.get_dummies(df.FuelType, dtype=int))
df = df.drop('FuelType', axis=1)

df['Region'] = df['Region'].astype('category').cat.codes

df['Price'] = df.pop('Price')

X = df.drop('Price', axis=1)
y = df['Price']

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()

model.fit(X_train, y_train)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[358], line 1
----> 1 model.fit(X_train, y_train)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1466     estimator._validate_params()
   1468 with config_context(
   1469     skip_parameter_validation=(
   1470         prefer_skip_nested_validation or global_skip_validation
   1471     )
   1472 ):
-> 1473     return fit_method(estimator, *args, **kwargs)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\linear_model\_base.py:609, in LinearRegression.fit(self, X, y, sample_weight)
    605 n_jobs_ = self.n_jobs
    607 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 609 X, y = self._validate_data(
    610     X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    611 )
    613 has_sw = sample_weight is not None
    614 if has_sw:

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py:650, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
    648         y = check_array(y, input_name="y", **check_y_params)
    649     else:
--> 650         X, y = check_X_y(X, y, **check_params)
    651     out = X, y
    653 if not no_val_X and check_params.get("ensure_2d", True):

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\validation.py:1291, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1273 X = check_array(
   1274     X,
   1275     accept_sparse=accept_sparse,
   (...)
   1286     input_name="X",
   1287 )
   1289 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
-> 1291 check_consistent_length(X, y)
   1293 return X, y

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\validation.py:460, in check_consistent_length(*arrays)
    458 uniques = np.unique(lengths)
    459 if len(uniques) > 1:
--> 460     raise ValueError(
    461         "Found input variables with inconsistent numbers of samples: %r"
    462         % [int(l) for l in lengths]
    463     )

ValueError: Found input variables with inconsistent numbers of samples: [6396, 1599]

I have tried everything I could think of, but nothing works, and I don't know how to solve the problem.

Jupyternaut:

The error message you provided indicates that there is a problem with the input data. Specifically, it appears that there are two different versions of the input data, one with 6396 samples and another with 1599 samples. This can cause issues when trying to fit a model or perform other operations on the data.

To resolve this issue, you may need to check your code and ensure that you are using the correct version of the input data for each operation. You may also want to try cleaning up the input data by removing any duplicates or inconsistencies.

Solution

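The outputs of train_test_split are unpacked in the wrong order in your code. train_test_split returns the splits of each input in turn (X_train, X_test, y_train, y_test), so your y_train is actually X_test, which has a different number of samples (1599) than the training set (6396). Here is the correct order: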

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
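
To check the fix without the original CSV, here is a minimal sketch on synthetic data. The array sizes, coefficients, and random_state are assumptions chosen for illustration (7995 rows is picked only because it is consistent with the 6396/1599 split in the error message); the real DataFrame is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car price data (assumption: 7995 rows, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(7995, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=7995)

# train_test_split returns both splits of the first array, then both splits
# of the second array: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Sanity check: features and targets of each split must have the same length.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

model = LinearRegression()
model.fit(X_train, y_train)

# R^2 on the held-out test set.
score = model.score(X_test, y_test)
print(score)
```

Printing the shapes right after the split is a cheap way to catch this kind of mix-up before calling fit.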