I am working on the Housing Prices problem hosted on Kaggle. While building my model, I figured it makes sense to reuse on the test set some of the code I had been using for the train dataset, so I moved the code performing the shared operations into one function definition. In this function, I handle missing values, and I use its return value to perform one-hot encoding before fitting a Random Forest Regression. However, it's throwing the following error:
Traceback (most recent call last):
File "C:/Users/security/Downloads/AP/Boston-Kaggle/Model.py", line 56, in <module>
sel.fit(x_train, y_train)
File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\sklearn\feature_selection\from_model.py", line 196, in fit
self.estimator_.fit(X, y, **fit_params)
File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\sklearn\ensemble\forest.py", line 249, in fit
X = check_array(X, accept_sparse="csc", dtype=DTYPE)
File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\validation.py", line 542, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\validation.py", line 56, in _assert_all_finite
raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I did not have this problem while using the same code without organizing it into a function. `def feature_selection_and_engineering(df)` is the function in question. The following is my entire code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")
def feature_selection_and_engineering(df):
    # Creating a series of how many NaN's are in each column
    nan_counts = df.isna().sum()
    # Creating a template list
    nan_columns = []
    # Iterating over the series and if the value is more than 0 (i.e there are some NaN's present)
    for i in range(0, len(nan_counts)):
        if nan_counts[i] > 0:
            nan_columns.append(df.columns[i])
    # Iterating through all the columns which are known to have NaN's
    for i in nan_columns:
        if df[nan_columns][i].dtypes == 'float64':
            df[i] = df[i].fillna(df[i].mean())
        elif df[nan_columns][i].dtypes == 'object':
            df[i] = df[i].fillna('XX')
    # Creating a template list
    categorical_columns = []
    # Iterating across all the columns,
    # checking if they're of the object datatype and if they are, appending them to the categorical list
    for i in range(0, len(df.dtypes)):
        if df.dtypes[i] == 'object':
            categorical_columns.append(df.columns[i])
    return categorical_columns
# take one-hot encoding
OHE_sdf = pd.get_dummies(feature_selection_and_engineering(train))
# drop the old categorical column from original df
train.drop(columns = feature_selection_and_engineering(train), axis = 1, inplace = True)
# attach one-hot encoded columns to original data frame
train = pd.concat([train, OHE_sdf], axis = 1, ignore_index = False)
# Dividing the training dataset into train/test sets with the test size being 20% of the overall dataset.
x_train, x_test, y_train, y_test = train_test_split(train, train['SalePrice'], test_size = 0.2, random_state = 42)
randomForestRegressor = RandomForestRegressor(n_estimators=1000)
# Invoking the Random Forest Classifier with a 1.25x the mean threshold to select correlating features
sel = SelectFromModel(RandomForestClassifier(n_estimators = 100), threshold = '1.25*mean')
sel.fit(x_train, y_train)
selected = sel.get_support()
# linearRegression.fit(x_train, y_train)
randomForestRegressor.fit(x_train, y_train)
# Assigning the accuracy of the model to the variable "accuracy"
accuracy = randomForestRegressor.score(x_train, y_train)
# Predicting for the data in the test set
predictions = randomForestRegressor.predict(feature_selection_and_engineering(test))
# Writing the predictions to a new CSV file
submission = pd.DataFrame({'Id': test['PassengerId'], 'SalePrice': predictions})
filename = 'Boston-Submission.csv'
submission.to_csv(filename, index=False)
print(accuracy*100, "%")
new error:
Traceback (most recent call last):
File "/home/onur/Documents/Boston-Kaggle/Model.py", line 76, in <module>
x_train, encoder = feature_selection_and_engineering(x_train)
File "/home/onur/Documents/Boston-Kaggle/Model.py", line 57, in feature_selection_and_engineering
encoder = train_one_hot_encoder(df, categorical_columns)
File "/home/onur/Documents/Boston-Kaggle/Model.py", line 30, in train_one_hot_encoder
return enc.fit(categorical_df)
File "/opt/anaconda/envs/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 493, in fit
self._fit(X, handle_unknown=self.handle_unknown)
File "/opt/anaconda/envs/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 80, in _fit
X_list, n_samples, n_features = self._check_X(X)
File "/opt/anaconda/envs/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 67, in _check_X
force_all_finite=needs_validation)
File "/opt/anaconda/envs/lib/python3.7/site-packages/sklearn/utils/validation.py", line 542, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "/opt/anaconda/envs/lib/python3.7/site-packages/sklearn/utils/validation.py", line 60, in _assert_all_finite
raise ValueError("Input contains NaN")
ValueError: Input contains NaN
Reusing code is a good idea, but beware of how the scope of variables changes when you put code into a function.
The error you are getting occurs because there are `NaN` values in the array you pass to the random forest. Your `feature_selection_and_engineering()` function fills the `NaN` values, but it returns only the list of categorical column names, never `df` itself, so the cleaned, encoded data is not what ends up in the model.
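A quick way to confirm where the `NaN` values are is to count them per column just before calling `fit`. The sketch below is only an illustration: `columns_with_nan` is a hypothetical helper name, and the small DataFrame is a stand-in for the real data.

```python
import pandas as pd


def columns_with_nan(df):
    """Return the number of missing values for each column that has any."""
    nan_counts = df.isna().sum()
    return nan_counts[nan_counts > 0]


# Tiny stand-in for the real data; the values are made up for illustration
sample = pd.DataFrame({
    "LotFrontage": [65.0, None, 80.0],
    "LotArea": [8450, 9600, 11250],
    "Alley": ["Grvl", None, None],
})

# Only the columns that still contain missing values are listed
print(columns_with_nan(sample))
```

Running the same check on `x_train` right before `sel.fit(x_train, y_train)` should show which columns still carry missing values.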
I suggest splitting up your `feature_selection_and_engineering()` function into different components. Here I made a function that just removes `NaN`s.
# Iterates through the columns and fixes any NaNs
def remove_nan(df):
    replace_dict = {}
    for col in df.columns:
        # If there are any NaN values in this column
        if pd.isna(df[col]).any():
            # Replace NaN in object columns with 'N/A'
            if df[col].dtypes == 'object':
                replace_dict[col] = 'N/A'
            # Replace NaN in float columns with 0
            elif df[col].dtypes == 'float64':
                replace_dict[col] = 0
    df = df.fillna(replace_dict)
    return df
I suggest filling `NaN` numerical values with 0 instead of the mean. For this data, there are 3 numerical columns with `NaN` values: `LotFrontage` (feet of street connected to property), `MasVnrArea` (masonry veneer area), and `GarageYrBlt` (garage year built). If there is no garage, then there is no garage year built, so it makes sense to have the year as 0 instead of the average year, and similarly for the other two columns.
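As a small illustration of this fill strategy, the sketch below fills numeric gaps with 0 and text gaps with a placeholder, then checks that nothing is left missing. `fill_missing` is a hypothetical, dtype-driven variant of the loop above (not code from the gist), and it assumes every non-numeric column holds text.

```python
import pandas as pd


def fill_missing(df, numeric_fill=0, text_fill="N/A"):
    """Return a copy of df with NaNs filled according to column type."""
    numeric_cols = set(df.select_dtypes(include="number").columns)
    # One fill value per column: 0 for numbers, a placeholder for everything else
    fill_values = {
        col: numeric_fill if col in numeric_cols else text_fill
        for col in df.columns
    }
    # fillna returns a new DataFrame, so the caller must use the return value
    return df.fillna(value=fill_values)


# Made-up rows: the second house has no garage
sample = pd.DataFrame({
    "GarageYrBlt": [2003.0, None],
    "GarageType": ["Attchd", None],
})

cleaned = fill_missing(sample)
print(cleaned)
print("Missing values left:", int(cleaned.isna().sum().sum()))
```

Because the function returns a new DataFrame instead of changing its argument, `sample` still contains its `NaN` values afterwards; only `cleaned` is safe to pass on.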
There is also some work that needs to be done with the one-hot encoder you have set up. Creating a one-hot encoding can be tricky, because the training data and the test data need to have the same columns. Suppose you have the following training and test data:
Train
| House Type |
| ---------- |
| Mansion |
| Ranch |
Test
| House Type |
| ---------- |
| Mansion |
| Duplex |
Then, if you use `pd.get_dummies()`, the train columns will be `[house_type_mansion, house_type_ranch]` and the test columns will be `[house_type_mansion, house_type_duplex]`, which won't work because the model expects the same columns it was trained on. However, using sklearn, you can fit a one-hot encoder to your train data. When transforming the test dataset, it will create the same columns as the train dataset. The `handle_unknown` parameter tells the encoder what to do with `duplex` in the test set: either `ignore` or `error`.
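The mismatch can be reproduced with the toy tables above. This is a minimal sketch that assumes scikit-learn is installed; it keeps the encoder's default sparse output and converts it with `toarray()`, so it does not depend on version-specific keyword arguments.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"House Type": ["Mansion", "Ranch"]})
test = pd.DataFrame({"House Type": ["Mansion", "Duplex"]})

# get_dummies derives its columns from the values it sees, so the two sets differ
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)
print(list(train_dummies.columns))
print(list(test_dummies.columns))

# An encoder fitted on the train data always emits the train categories
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# transform() returns a sparse matrix by default; toarray() makes it dense
test_encoded = encoder.transform(test).toarray()
print(encoder.categories_)
print(test_encoded)
```

With `handle_unknown="ignore"`, the unseen `Duplex` row is encoded as all zeros, so the test matrix keeps the same two columns as the train matrix.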
from sklearn.preprocessing import OneHotEncoder

# Fits an sklearn one hot encoder
def train_one_hot_encoder(df, categorical_columns):
    # take one-hot encoding of categorical columns
    categorical_df = df[categorical_columns]
    # Dense output; newer scikit-learn versions rename `sparse` to `sparse_output`
    enc = OneHotEncoder(sparse=False, handle_unknown='ignore')
    return enc.fit(categorical_df)
To combine the categorical and non-categorical data, I again suggest making a separate function:
# One hot encodes the given dataframe
def one_hot_encode(df, categorical_columns, encoder):
    # Get dataframe with only categorical columns
    categorical_df = df[categorical_columns]
    # Get one hot encoding, keeping df's index so the concat below aligns row by row
    # (newer scikit-learn versions rename get_feature_names() to get_feature_names_out())
    ohe_df = pd.DataFrame(encoder.transform(categorical_df),
                          columns=encoder.get_feature_names(),
                          index=df.index)
    # Get the remaining (non-categorical) columns
    float_df = df.drop(categorical_columns, axis=1)
    # Return the combined dataframe
    return pd.concat([float_df, ohe_df], axis=1)
Finally, your `feature_selection_and_engineering()` function can call all of those functions.
def feature_selection_and_engineering(df, encoder=None):
    df = remove_nan(df)
    categorical_columns = get_categorical_columns(df)
    # If there is no encoder, train one
    if encoder is None:
        encoder = train_one_hot_encoder(df, categorical_columns)
    # Encode Data
    df = one_hot_encode(df, categorical_columns, encoder)
    # Return the encoded data AND encoder
    return df, encoder
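The function above calls `get_categorical_columns()`, which is not shown in this answer. A minimal sketch of what it might look like follows; it assumes that every non-numeric column should be treated as categorical, which fits data whose columns are either numbers or strings. The version in the gist may differ.

```python
import pandas as pd


def get_categorical_columns(df):
    """Return the names of all non-numeric columns (assumed to be categorical)."""
    return df.select_dtypes(exclude="number").columns.tolist()


# Made-up rows for illustration
sample = pd.DataFrame({
    "MSZoning": ["RL", "RM"],
    "LotArea": [8450, 9600],
    "LotFrontage": [65.0, 80.0],
})
print(get_categorical_columns(sample))
```

The intended calling pattern is to run `feature_selection_and_engineering()` on the train data first, keep the encoder it returns, and pass that same encoder in when processing the test data so both end up with identical columns.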
There were a few other things I had to fix to make the code run. I have included the entire modified script in a gist here: https://gist.github.com/kylelrichards11/6be90d92a7dd6a5cc9a5290dae3ff94e