I wrote the following code to get the score of a machine learning model, but I get the following error. What could be the reason?
ValueError: Found input variables with inconsistent numbers of samples: [6396, 1599]
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('Armenian Market Car Prices.csv')
df['Car Name'] = df['Car Name'].astype('category').cat.codes
df = df.join(pd.get_dummies(df.FuelType, dtype=int))
df = df.drop('FuelType', axis=1)
df['Region'] = df['Region'].astype('category').cat.codes
df['Price'] = df.pop('Price')
X = df.drop('Price', axis=1)
y = df['Price']
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[358], line 1
----> 1 model.fit(X_train, y_train)
File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1466 estimator._validate_params()
1468 with config_context(
1469 skip_parameter_validation=(
1470 prefer_skip_nested_validation or global_skip_validation
1471 )
1472 ):
-> 1473 return fit_method(estimator, *args, **kwargs)
File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\linear_model\_base.py:609, in LinearRegression.fit(self, X, y, sample_weight)
605 n_jobs_ = self.n_jobs
607 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 609 X, y = self._validate_data(
610 X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
611 )
613 has_sw = sample_weight is not None
614 if has_sw:
File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py:650, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
648 y = check_array(y, input_name="y", **check_y_params)
649 else:
--> 650 X, y = check_X_y(X, y, **check_params)
651 out = X, y
653 if not no_val_X and check_params.get("ensure_2d", True):
File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\validation.py:1291, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1273 X = check_array(
1274 X,
1275 accept_sparse=accept_sparse,
(...)
1286 input_name="X",
1287 )
1289 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
-> 1291 check_consistent_length(X, y)
1293 return X, y
File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\validation.py:460, in check_consistent_length(*arrays)
458 uniques = np.unique(lengths)
459 if len(uniques) > 1:
--> 460 raise ValueError(
461 "Found input variables with inconsistent numbers of samples: %r"
462 % [int(l) for l in lengths]
463 )
ValueError: Found input variables with inconsistent numbers of samples: [6396, 1599]
I have tried everything I could think of, but nothing works and I don't know how to solve the problem.
Jupyternaut:
The error message you provided indicates that there is a problem with the input data. Specifically, it appears that there are two different versions of the input data, one with 6396 samples and another with 1599 samples. This can cause issues when trying to fit a model or perform other operations on the data.
To resolve this issue, you may need to check your code and ensure that you are using the correct version of the input data for each operation. You may also want to try cleaning up the input data by removing any duplicates or inconsistencies.
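The output order of train_test_split is not correct in your code. The function returns the train and test parts of each input in turn (X_train, X_test, y_train, y_test), so your y_train is actually X_test, which has a different number of samples (1599) than the training set (6396). Here is the correct order: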
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
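To confirm the fix, it can help to print the shapes of the four pieces before fitting. The sketch below is only an illustration: it uses synthetic data in place of the CSV, and the row count (7995, inferred from 6396 + 1599 in the error message), the five features, the coefficients, and random_state=0 are all assumptions, not values from the original dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car data (assumed: 7995 rows, 5 numeric features).
rng = np.random.default_rng(0)
X = rng.normal(size=(7995, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=7995)

# Unpack in the order the function returns: both parts of X, then both parts of y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# X_train/y_train and X_test/y_test must agree on the number of rows.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

model = LinearRegression()
model.fit(X_train, y_train)

# score() returns R^2 on the held-out data.
r2 = model.score(X_test, y_test)
print(r2)
```

If the first number in each printed pair matches, model.fit and model.score receive consistent inputs and the ValueError no longer occurs.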