I need to run predictions on several columns of a pandas DataFrame (the columns listed in ml_train_inputs). These columns can have different data types, for example str, float, int, or timestamp. In this example I prepare the data the same way as for model training, where I used scikit-learn and SVC, then apply the prediction and store the result in a new column, "Predictions".
import numpy as np
import pandas as pd
data = {'City': ['New York', 'London', 'Paris', 'Tokyo', 'Moscow', 'Los Angeles', 'Chicago', 'Houston', 'Beijing', 'Shanghai', 'Sydney', 'Melbourne', 'Dubai', 'Singapore', 'Hong Kong', 'Seoul', 'Mumbai', 'Mexico City', 'São Paulo', 'Rio de Janeiro'],
'Country': ['United States', 'United Kingdom', 'France', 'Japan', 'Russia', 'United States', 'United States', 'United States', 'China', 'China', 'Australia', 'Australia', 'United Arab Emirates', 'Singapore', 'China', 'South Korea', 'India', 'Mexico', 'Brazil', 'Brazil'],
'Population': [8175133, 8278000, 2148000, 13350000, 11920000, 3999759, 2718782, 2296193, 21500000, 24150000, 5000000, 4900000, 3320000, 5612000, 7347000, 51190000, 12690000, 21010000, 21295000, 6453000],
'GDP per Capita': [162400, 406000, 40100, 379000, 25200, 60100, 45400, 57400, 16300, 28100, 53400, 44600, 62700, 92400, 64400, 28300, 5200, 9300, 11400, 8800]
}
df_test = pd.DataFrame(data)
start_date = '2022-01-01'
end_date = '2022-12-31'
date_range = pd.date_range(start=start_date, end=end_date, freq='D')
df_test['date'] = np.random.choice(date_range, size=len(df_test))
train_x = df_test
#ml_train_inputs = "['date', 'GDP per Capita', 'Country']" # possible
ml_train_inputs = "['date']"
ml_train_labels = "['Country']"
# Parse the string representations into lists of column names
# (strip whitespace first so that entries after a comma are also unquoted)
ml_train_inputs_list = [col.strip().strip("'") for col in ml_train_inputs.strip("[]").split(",")]
ml_train_labels_list = [col.strip().strip("'") for col in ml_train_labels.strip("[]").split(",")]
# Identify columns with different data types
is_string_column = train_x[ml_train_inputs_list].dtypes.apply(lambda x: x == 'object')
is_float_column = train_x[ml_train_inputs_list].dtypes.apply(lambda x: x == 'float')
is_int_column = train_x[ml_train_inputs_list].dtypes.apply(lambda x: x == 'int')
is_date_column = train_x[ml_train_inputs_list].dtypes.apply(lambda x: x == 'datetime64[ns]')
# Create separate lists of column names for each data type
string_columns = train_x[ml_train_inputs_list].columns[is_string_column].to_list()
float_columns = train_x[ml_train_inputs_list].columns[is_float_column].to_list()
int_columns = train_x[ml_train_inputs_list].columns[is_int_column].to_list()
date_columns = train_x[ml_train_inputs_list].columns[is_date_column].to_list()
# Preprocess the string columns
if string_columns:
    from sklearn.feature_extraction.text import TfidfVectorizer
    tfidf = TfidfVectorizer(stop_words='english')
    train_x_string = train_x[string_columns]
    train_x_string_vector = tfidf.fit_transform(train_x_string.apply(lambda x: ' '.join(x), axis=1).values)
    test_x_string = test_x[string_columns]
    test_x_string_vector = tfidf.transform(test_x_string.apply(lambda x: ' '.join(x), axis=1).values)
# Preprocess the float columns
if float_columns:
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    train_x_float = train_x[float_columns]
    train_x_float = scaler.fit_transform(train_x_float)
    test_x_float = test_x[float_columns]
    test_x_float = scaler.transform(test_x_float)
# Preprocess the int columns
if int_columns:
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    train_x_int = train_x[int_columns]
    train_x_int = scaler.fit_transform(train_x_int)
    test_x_int = test_x[int_columns]
    test_x_int = scaler.transform(test_x_int)
if date_columns:
    from sklearn.preprocessing import OneHotEncoder
    ohe = OneHotEncoder(handle_unknown='ignore')
    train_x_date = ohe.fit_transform(train_x[date_columns])
    test_x_date = ohe.transform(test_x[date_columns])
if string_columns and float_columns and int_columns and date_columns:
    train_x_vector = np.concatenate((train_x_string_vector.toarray(), train_x_float, train_x_int, train_x_date), axis=1)
    test_x_vector = np.concatenate((test_x_string_vector.toarray(), test_x_float, test_x_int, test_x_date), axis=1)
elif string_columns and float_columns and int_columns:
    train_x_vector = np.concatenate((train_x_string_vector.toarray(), train_x_float, train_x_int), axis=1)
    test_x_vector = np.concatenate((test_x_string_vector.toarray(), test_x_float, test_x_int), axis=1)
elif string_columns and float_columns and date_columns:
    train_x_vector = np.concatenate((train_x_string_vector.toarray(), train_x_float, train_x_date), axis=1)
    test_x_vector = np.concatenate((test_x_string_vector.toarray(), test_x_float, test_x_date), axis=1)
elif int_columns and date_columns:
    train_x_vector = np.concatenate((train_x_int, train_x_date), axis=1)
    test_x_vector = np.concatenate((test_x_int, test_x_date), axis=1)
elif string_columns and int_columns:
    train_x_vector = np.concatenate((train_x_string_vector.toarray(), train_x_int), axis=1)
    test_x_vector = np.concatenate((test_x_string_vector.toarray(), test_x_int), axis=1)
elif string_columns and date_columns:
    train_x_vector = np.concatenate((train_x_string_vector.toarray(), train_x_date), axis=1)
    test_x_vector = np.concatenate((test_x_string_vector.toarray(), test_x_date), axis=1)
elif float_columns and int_columns:
    train_x_vector = np.concatenate((train_x_float, train_x_int), axis=1)
    test_x_vector = np.concatenate((test_x_float, test_x_int), axis=1)
elif float_columns and date_columns:
    train_x_vector = np.concatenate((train_x_float, train_x_date), axis=1)
    test_x_vector = np.concatenate((test_x_float, test_x_date), axis=1)
elif string_columns:
    train_x_vector = train_x_string_vector
    test_x_vector = test_x_string_vector
elif float_columns:
    train_x_vector = train_x_float
    test_x_vector = test_x_float
elif int_columns:
    train_x_vector = train_x_int
    test_x_vector = test_x_int
elif date_columns:
    train_x_vector = train_x_date
    test_x_vector = test_x_date
new_df = df_test
predictions = best_model.predict(train_x_vector)
new_df["Predictions"] = predictions
However, I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-54-1f6c5a970910> in <module>
99
100 new_df = df_test
--> 101 predictions = best_model.predict(train_x_vector)
102 new_df["Predictions"] = predictions
103
/usr/local/lib/python3.8/dist-packages/sklearn/base.py in _check_n_features(self, X, reset)
398
399 if n_features != self.n_features_in_:
--> 400 raise ValueError(
401 f"X has {n_features} features, but {self.__class__.__name__} "
402 f"is expecting {self.n_features_in_} features as input."
ValueError: X has 19 features, but SVC is expecting 15 features as input.
In short, I want the prediction input to be preprocessed consistently with the trained ML model, where the DataFrame may contain several columns with different data types.
I tried to make the code work with loops, but it didn't work either.
Do you have any suggestions on how to solve this?
It is a little unclear what best_model is. However, it seems that you fit your model with a dataset containing 15 features and now are trying to predict with a dataset containing 19 features. I think this is what user betelgeuse commented.
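To confirm the mismatch before calling predict, you can compare the number of columns in your input with the n_features_in_ attribute that fitted scikit-learn estimators expose. This is only a sketch: it assumes best_model is a fitted scikit-learn estimator, and X_fit, y_fit and X_new are made-up stand-ins for your real data, sized to match the numbers in your traceback.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in for best_model: an SVC fitted on 15 features.
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(20, 15))
y_fit = np.array([0, 1] * 10)
best_model = SVC(gamma="auto").fit(X_fit, y_fit)

# Hypothetical prediction input with 19 columns, as in the traceback.
X_new = rng.normal(size=(20, 19))

expected = best_model.n_features_in_  # set by scikit-learn during fit
actual = X_new.shape[1]
if actual != expected:
    print(f"Mismatch: model was fitted on {expected} features, input has {actual}")
```

If the two numbers differ, the preprocessing applied before predict is not producing the same columns as the preprocessing used before fit.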
Here is a minimal example of why the error is being generated, using the official example from sklearn's SVC documentation -
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
from sklearn.svm import SVC
clf = SVC(gamma='auto')
clf.fit(X, y)
As you can see here the dataset has 2 features. You can verify this by doing -
print("The number of features that X has = ", X.shape[1])
Now, if I do this -
print(clf.predict([[-0.8, -1, 3,4,5]]))
I get the following error -
ValueError: X has 5 features, but SVC is expecting 2 features as input.
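A likely cause in your case, assuming the preprocessing shown in the question is run again at prediction time: fit_transform is called on the new data, so the OneHotEncoder learns a different set of categories (your dates are random) and produces a different number of columns than it did during training. One common way to avoid this is to put the preprocessing and the model into a single Pipeline, so the transformers fitted on the training data are reused by predict. The sketch below uses made-up toy data and column choices, not your real training set, and leaves out the date column for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical training data.
train = pd.DataFrame({
    "City": ["New York", "London", "Paris", "Tokyo", "Chicago", "Houston"],
    "Population": [8175133, 8278000, 2148000, 13350000, 2718782, 2296193],
    "Country": ["United States", "United Kingdom", "France", "Japan",
                "United States", "United States"],
})

# Hypothetical new data; it contains cities that were not seen during training.
new = pd.DataFrame({
    "City": ["Moscow", "London", "Sydney"],
    "Population": [11920000, 8278000, 5000000],
})

# Each column group gets its own transformer; unknown categories are encoded
# as all zeros, so the number of output columns never changes after fit.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["City"]),
    ("num", StandardScaler(), ["Population"]),
])

# The pipeline fits the transformers and the SVC together on the training data.
model = make_pipeline(preprocess, SVC(gamma="auto"))
model.fit(train[["City", "Population"]], train["Country"])

# predict only calls transform (never fit) on the stored transformers.
new["Predictions"] = model.predict(new)
n_features = model.named_steps["svc"].n_features_in_
print(new)
```

With this layout you save and load the whole pipeline rather than the SVC alone, so the fitted encoder and scaler always travel with the model.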