python dataframe machine-learning scikit-learn sklearn-pandas

Problems applying a scikit-learn ML model to a pandas DataFrame with multiple columns and data types

I need to make predictions from several columns of a pandas DataFrame (ml_train_inputs), which may hold several data types, for example str, float, int, timestamp, etc. In this example, I try to prepare the data the same way it was prepared for model training (scikit-learn with SVC), then apply the prediction and store the result in a new column "Predictions".

import numpy as np
import pandas as pd
data = {'City': ['New York', 'London', 'Paris', 'Tokyo', 'Moscow', 'Los Angeles', 'Chicago', 'Houston', 'Beijing', 'Shanghai', 'Sydney', 'Melbourne', 'Dubai', 'Singapore', 'Hong Kong', 'Seoul', 'Mumbai', 'Mexico City', 'São Paulo', 'Rio de Janeiro'],
        'Country': ['United States', 'United Kingdom', 'France', 'Japan', 'Russia', 'United States', 'United States', 'United States', 'China', 'China', 'Australia', 'Australia', 'United Arab Emirates', 'Singapore', 'China', 'South Korea', 'India', 'Mexico', 'Brazil', 'Brazil'],
        'Population': [8175133, 8278000, 2148000, 13350000, 11920000, 3999759, 2718782, 2296193, 21500000, 24150000, 5000000, 4900000, 3320000, 5612000, 7347000, 51190000, 12690000, 21010000, 21295000, 6453000],
        'GDP per Capita': [162400, 406000, 40100, 379000, 25200, 60100, 45400, 57400, 16300, 28100, 53400, 44600, 62700, 92400, 64400, 28300, 5200, 9300, 11400, 8800]
       }

df_test = pd.DataFrame(data)

start_date = '2022-01-01'
end_date = '2022-12-31'
date_range = pd.date_range(start=start_date, end=end_date, freq='D')
df_test['date'] = np.random.choice(date_range, size=len(df_test))

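train_x = df_test
test_x = df_test  # assumption: test_x was never defined in the original snippet; the same frame is reused here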
#ml_train_inputs = "['date', 'GDP per Capita', 'Country']" # possible
ml_train_inputs = "['date']"
ml_train_labels = "['Country']"

# Parse the string representations into lists of column names.
# Strip spaces as well as quotes, so multi-column strings such as
# "['date', 'GDP per Capita']" do not leave a stray " '" on later names.
ml_train_inputs_list = [col.strip(" '") for col in ml_train_inputs.strip("[]").split(",")]
ml_train_labels_list = [col.strip(" '") for col in ml_train_labels.strip("[]").split(",")]


# Identify columns with different data types
is_string_column = train_x[ml_train_inputs_list].dtypes.apply(lambda x: x == 'object')
is_float_column = train_x[ml_train_inputs_list].dtypes.apply(lambda x: x == 'float')
is_int_column = train_x[ml_train_inputs_list].dtypes.apply(lambda x: x == 'int')
is_date_column = train_x[ml_train_inputs_list].dtypes.apply(lambda x: x == 'datetime64[ns]')

# Create separate lists for string and float columns
string_columns = train_x[ml_train_inputs_list].columns[is_string_column].to_list()
float_columns = train_x[ml_train_inputs_list].columns[is_float_column].to_list()
int_columns = train_x[ml_train_inputs_list].columns[is_int_column].to_list()
date_columns = train_x[ml_train_inputs_list].columns[is_date_column].to_list()

# Preprocess the string columns
if string_columns:
    from sklearn.feature_extraction.text import TfidfVectorizer
    tfidf = TfidfVectorizer(stop_words='english')
    train_x_string = train_x[string_columns]
    train_x_string_vector = tfidf.fit_transform(train_x_string.apply(lambda x: ' '.join(x), axis=1).values)
    test_x_string = test_x[string_columns]
    test_x_string_vector = tfidf.transform(test_x_string.apply(lambda x: ' '.join(x), axis=1).values)

# Preprocess the float columns
if float_columns:
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    train_x_float = train_x[float_columns]
    train_x_float = scaler.fit_transform(train_x_float)
    test_x_float = test_x[float_columns]
    test_x_float = scaler.transform(test_x_float)

# Preprocess the int columns
if int_columns:
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    train_x_int = train_x[int_columns]
    train_x_int = scaler.fit_transform(train_x_int)
    test_x_int = test_x[int_columns]
    test_x_int = scaler.transform(test_x_int)

if date_columns:
    from sklearn.preprocessing import OneHotEncoder
    ohe = OneHotEncoder(handle_unknown='ignore')
    train_x_date = ohe.fit_transform(train_x[date_columns])
    test_x_date = ohe.transform(test_x[date_columns])

if string_columns and float_columns and int_columns and date_columns:
    train_x_vector = np.concatenate((train_x_string_vector.toarray(), train_x_float, train_x_int, train_x_date), axis=1)
    test_x_vector = np.concatenate((test_x_string_vector.toarray(), test_x_float, test_x_int, test_x_date), axis=1)
elif string_columns and float_columns and int_columns:
    train_x_vector = np.concatenate((train_x_string_vector.toarray(), train_x_float, train_x_int), axis=1)
    test_x_vector = np.concatenate((test_x_string_vector.toarray(), test_x_float, test_x_int), axis=1)
elif string_columns and float_columns and date_columns:
    train_x_vector = np.concatenate((train_x_string_vector.toarray(), train_x_float, train_x_date), axis=1)
    test_x_vector = np.concatenate((test_x_string_vector.toarray(), test_x_float, test_x_date), axis=1)
elif int_columns and date_columns:
    train_x_vector = np.concatenate((train_x_int, train_x_date), axis=1)
    test_x_vector = np.concatenate((test_x_int, test_x_date), axis=1)
elif string_columns and int_columns:
    train_x_vector = np.concatenate((train_x_string_vector.toarray(), train_x_int), axis=1)
    test_x_vector = np.concatenate((test_x_string_vector.toarray(), test_x_int), axis=1)
elif string_columns and date_columns:
    train_x_vector = np.concatenate((train_x_string_vector.toarray(), train_x_date), axis=1)
    test_x_vector = np.concatenate((test_x_string_vector.toarray(), test_x_date), axis=1)
elif float_columns and int_columns:
    train_x_vector = np.concatenate((train_x_float, train_x_int), axis=1)
    test_x_vector = np.concatenate((test_x_float, test_x_int), axis=1)
elif float_columns and date_columns:
    train_x_vector = np.concatenate((train_x_float, train_x_date), axis=1)
    test_x_vector = np.concatenate((test_x_float, test_x_date), axis=1)
elif string_columns:
    train_x_vector = train_x_string_vector
    test_x_vector = test_x_string_vector
elif float_columns:
    train_x_vector = train_x_float
    test_x_vector = test_x_float
elif int_columns:
    train_x_vector = train_x_int
    test_x_vector = test_x_int
elif date_columns:
    train_x_vector = train_x_date
    test_x_vector = test_x_date

new_df = df_test
predictions = best_model.predict(train_x_vector)
new_df["Predictions"] = predictions

But I got the following error:

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-54-1f6c5a970910> in <module>
     99 
    100 new_df = df_test
--> 101 predictions = best_model.predict(train_x_vector)
    102 new_df["Predictions"] = predictions
    103 

4 frames

/usr/local/lib/python3.8/dist-packages/sklearn/base.py in _check_n_features(self, X, reset)
    398 
    399         if n_features != self.n_features_in_:
--> 400             raise ValueError(
    401                 f"X has {n_features} features, but {self.__class__.__name__} "
    402                 f"is expecting {self.n_features_in_} features as input."

ValueError: X has 19 features, but SVC is expecting 15 features as input.

In short, I am trying to preprocess the dataframe consistently with the trained ML model, given that the input columns can be of several different data types.

I tried to make the code work with loops, but it didn't work either.

Do you have any suggestions on how to solve this?

Solution

It is not clear from the question what best_model is. However, the error indicates that the model was fitted on a dataset with 15 features and is now being asked to predict on a dataset with 19 features. I think this is what user betelgeuse pointed out in the comments.
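One plausible way to end up with such a mismatch, given the code in the question, is re-fitting an encoder at prediction time: `fit_transform` learns a fresh set of categories, so the output width follows the new data rather than the training data. The sketch below uses made-up dates (kept as plain strings for simplicity) and is only an illustration of the effect, not a reconstruction of the asker's actual training run.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 3 distinct dates at training time, 4 at prediction time
train_dates = pd.DataFrame({"date": ["2022-01-01", "2022-01-02", "2022-01-03"]})
new_dates = pd.DataFrame({"date": ["2022-01-01", "2022-02-01", "2022-03-01", "2022-04-01"]})

# Encoder fitted at training time: one output column per distinct training date
train_encoder = OneHotEncoder(handle_unknown="ignore")
n_train_features = train_encoder.fit_transform(train_dates).shape[1]

# A fresh encoder fitted on the new data learns a different set of categories
refit_encoder = OneHotEncoder(handle_unknown="ignore")
n_refit_features = refit_encoder.fit_transform(new_dates).shape[1]

# Reusing the training-time encoder keeps the width the model was trained on
n_reused_features = train_encoder.transform(new_dates).shape[1]

print(n_train_features, n_refit_features, n_reused_features)
```

Only the reused encoder produces the same number of columns as at training time; unseen dates are encoded as all-zero rows because of `handle_unknown="ignore"`.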

Here is a minimal example of why the error is being generated, using the official example from sklearn's SVC documentation -

import numpy as np
from sklearn.svm import SVC

# Toy training set: 4 samples with 2 features each
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])

clf = SVC(gamma='auto')
clf.fit(X, y)

As you can see, the dataset here has 2 features. You can verify this by doing -

print("The number of features that X has = ", X.shape[1])
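The fitted estimator records this number as well, in the `n_features_in_` attribute that appears in the traceback from the question. A small sketch of checking it before calling `predict`; the helper `has_expected_width` is a hypothetical name introduced here for illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
clf = SVC(gamma="auto").fit(X, y)

# n_features_in_ is set during fit; predict() validates its input against it
expected = clf.n_features_in_
print("Features seen during fit:", expected)


def has_expected_width(model, data):
    """Hypothetical helper: True if `data` has as many columns as `model` was fitted on."""
    return np.asarray(data).shape[1] == model.n_features_in_


print(has_expected_width(clf, [[0.5, 0.5]]))
print(has_expected_width(clf, [[-0.8, -1, 3, 4, 5]]))
```

Running the same check against `train_x_vector` and `best_model` from the question should show which side of the mismatch is unexpected.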

Now, if I do this -

print(clf.predict([[-0.8, -1, 3,4,5]]))

I get the following error -

ValueError: X has 5 features, but SVC is expecting 2 features as input.
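One common way to avoid this kind of mismatch is to bundle the preprocessing and the classifier into a single scikit-learn `Pipeline`, so the transformers are fitted once during training and only applied (never re-fitted) at prediction time. The following is a minimal sketch under assumptions: the data, the column names in `feature_cols`, and the choice of one-hot encoding for the text column are all illustrative and are not taken from the asker's real training setup.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical training frame with mixed column types (not the asker's real data)
train_df = pd.DataFrame({
    "City": ["New York", "London", "Paris", "Tokyo", "Chicago", "Lyon"],
    "Population": [8175133, 8278000, 2148000, 13350000, 2718782, 516000],
    "Country": ["United States", "United Kingdom", "France", "Japan", "United States", "France"],
})
feature_cols = ["City", "Population"]

# Each column type gets its own transformer; sparse_threshold=0 forces dense output
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["City"]),
        ("num", StandardScaler(), ["Population"]),
    ],
    sparse_threshold=0,
)

# Fitting the pipeline fits the transformers and the SVC together, exactly once
model = Pipeline([("preprocess", preprocess), ("svc", SVC(gamma="auto"))])
model.fit(train_df[feature_cols], train_df["Country"])

# New data, including cities never seen during training
new_df = pd.DataFrame({
    "City": ["Osaka", "Houston", "Paris"],
    "Population": [2750000, 2296193, 2148000],
})

# predict() reuses the fitted transformers, so the feature count always matches
new_df["Predictions"] = model.predict(new_df[feature_cols])
print(new_df)
```

With this layout, the fitted `model` object is the only thing that needs to be saved and reloaded; the per-type branching and manual concatenation are handled by `ColumnTransformer`.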