Tags: python, pandas, machine-learning, scikit-learn, kaggle

Python can't take input while using functions


I am working on the Housing Prices problem hosted on Kaggle. While building my model, I figured it made sense to reuse some of the code I had written for the train dataset on the test set as well, so I moved the operations common to both into a single function definition. In this function I handle missing values, and I use its return value to perform one-hot encoding and to fit a Random Forest regression. However, it's throwing the following error:

Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Boston-Kaggle/Model.py", line 56, in <module>
    sel.fit(x_train, y_train)
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\sklearn\feature_selection\from_model.py", line 196, in fit
    self.estimator_.fit(X, y, **fit_params)
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\sklearn\ensemble\forest.py", line 249, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\validation.py", line 542, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I did not have this problem when running the same code without organizing it into a function. The function in question is feature_selection_and_engineering(df). The following is my entire code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

def feature_selection_and_engineering(df):
    # Creating a series of how many NaN's are in each column
    nan_counts = df.isna().sum()

    # Creating a template list
    nan_columns = []

    # Iterating over the series; if the value is more than 0 (i.e. there are some NaNs present)
    for i in range(0, len(nan_counts)):
        if nan_counts[i] > 0:
            nan_columns.append(df.columns[i])

    # Iterating through all the columns which are known to have NaN's
    for i in nan_columns:
        if df[nan_columns][i].dtypes == 'float64':
            df[i] = df[i].fillna(df[i].mean())
        elif df[nan_columns][i].dtypes == 'object':
            df[i] = df[i].fillna('XX')

    # Creating a template list
    categorical_columns = []

    # Iterating across all the columns,
    # checking if they're of the object datatype and if they are, appending them to the categorical list
    for i in range(0, len(df.dtypes)):
        if df.dtypes[i] == 'object':
            categorical_columns.append(df.columns[i])

    return categorical_columns

# take one-hot encoding
OHE_sdf = pd.get_dummies(feature_selection_and_engineering(train))

# drop the old categorical column from original df
train.drop(columns = feature_selection_and_engineering(train), axis = 1, inplace = True)

# attach one-hot encoded columns to original data frame
train = pd.concat([train, OHE_sdf], axis = 1, ignore_index = False)

# Dividing the training dataset into train/test sets with the test size being 20% of the overall dataset.
x_train, x_test, y_train, y_test = train_test_split(train, train['SalePrice'], test_size = 0.2, random_state = 42)

randomForestRegressor = RandomForestRegressor(n_estimators=1000)

# Invoking the Random Forest Classifier with a threshold of 1.25x the mean to select correlated features
sel = SelectFromModel(RandomForestClassifier(n_estimators = 100), threshold = '1.25*mean')
sel.fit(x_train, y_train)

selected = sel.get_support()

# linearRegression.fit(x_train, y_train)
randomForestRegressor.fit(x_train, y_train)

# Assigning the accuracy of the model to the variable "accuracy"
accuracy = randomForestRegressor.score(x_train, y_train)

# Predicting for the data in the test set
predictions = randomForestRegressor.predict(feature_selection_and_engineering(test))

# Writing the predictions to a new CSV file
submission = pd.DataFrame({'Id': test['PassengerId'], 'SalePrice': predictions})
filename = 'Boston-Submission.csv'
submission.to_csv(filename, index=False)

print(accuracy*100, "%")

New error:

Traceback (most recent call last):
  File "/home/onur/Documents/Boston-Kaggle/Model.py", line 76, in <module>
    x_train, encoder = feature_selection_and_engineering(x_train)
  File "/home/onur/Documents/Boston-Kaggle/Model.py", line 57, in feature_selection_and_engineering
    encoder = train_one_hot_encoder(df, categorical_columns)
  File "/home/onur/Documents/Boston-Kaggle/Model.py", line 30, in train_one_hot_encoder
    return enc.fit(categorical_df)
  File "/opt/anaconda/envs/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 493, in fit
    self._fit(X, handle_unknown=self.handle_unknown)
  File "/opt/anaconda/envs/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 80, in _fit
    X_list, n_samples, n_features = self._check_X(X)
  File "/opt/anaconda/envs/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 67, in _check_X
    force_all_finite=needs_validation)
  File "/opt/anaconda/envs/lib/python3.7/site-packages/sklearn/utils/validation.py", line 542, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/opt/anaconda/envs/lib/python3.7/site-packages/sklearn/utils/validation.py", line 60, in _assert_all_finite
    raise ValueError("Input contains NaN")
ValueError: Input contains NaN

Solution

  • Reusing code is a good idea, but beware of how variable scope changes when you move code into a function.

    The error is caused by NaN values in the array you pass into the random forest. In your feature_selection_and_engineering() function you fill in the NaN values, but df is never returned from the function, so the original, unmodified df is used in the model.
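    As a minimal sketch of the pitfall, with a toy frame and a hypothetical fill_missing() helper (not the Kaggle data): assigning df = df.fillna(...) inside a function rebinds the local name only.

    import pandas as pd

    def fill_missing(df):
        # Rebinds the local name only; the caller's frame is untouched
        df = df.fillna(0)
        # No return, so the filled frame is thrown away

    df = pd.DataFrame({'a': [1.0, None]})
    fill_missing(df)
    print(df['a'].isna().sum())  # 1 -- the NaN is still there

    df = df.fillna(0)  # keeping the returned frame is what actually fixes it
    print(df['a'].isna().sum())  # 0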

    I suggest splitting your feature_selection_and_engineering() function into separate components. Here is a function that just removes the NaNs.

    # Iterates through the columns and fixes any NaNs
    def remove_nan(df):
        replace_dict = {}
    
        for col in df.columns:
    
            # If there are any NaN values in this column
            if pd.isna(df[col]).any():
    
                # Replace NaN in object columns with 'N/A'
                if df[col].dtypes == 'object':
                    replace_dict[col] = 'N/A'
    
                # Replace NaN in float columns with 0
                elif df[col].dtypes == 'float64':
                    replace_dict[col] = 0
    
        df = df.fillna(replace_dict)
    
        return df
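    Note that the caller has to keep the returned frame, e.g. train = remove_nan(train); calling remove_nan(train) on its own would repeat the original bug.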
    

    I suggest filling NaN numerical values with 0 instead of the mean. For this data there are three numerical columns with NaN values: LotFrontage (feet of street connected to the property), MasVnrArea (masonry veneer area), and GarageYrBlt (garage year built). If there is no garage, there is no garage year built, so it makes sense to record the year as 0 rather than the average year, and so on.
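    To see which columns those are, a quick check (assuming train is loaded as in the question):

    # Columns that contain at least one NaN, with their dtypes
    nan_cols = train.columns[train.isna().any()]
    print(train[nan_cols].dtypes)
    # LotFrontage, MasVnrArea and GarageYrBlt are float64; the rest are object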

    There is also some work to be done on the one-hot encoder you have set up. One-hot encoding can be tricky because the training data and the test data need to end up with the same columns. Suppose you have the following training and test data:

    Train

    | House Type |
    | ---------- |
    | Mansion    |
    | Ranch      |
    

    Test

    | House Type |
    | ---------- |
    | Mansion    |
    | Duplex     |
    

    Then with pd.get_dummies() the train columns will be [house_type_mansion, house_type_ranch] and the test columns will be [house_type_mansion, house_type_duplex], which won't work. With sklearn, however, you can fit a one-hot encoder to your train data; when transforming the test dataset, it will produce the same columns as the train set. The handle_unknown parameter tells the encoder what to do with Duplex in the test set: either ignore it or raise an error.
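    Here is that mismatch on the toy data above, and how a fitted encoder avoids it (a minimal sketch; the frame names are mine):

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    train_toy = pd.DataFrame({'House Type': ['Mansion', 'Ranch']})
    test_toy = pd.DataFrame({'House Type': ['Mansion', 'Duplex']})

    # get_dummies derives columns from whatever categories each frame happens to contain
    print(pd.get_dummies(train_toy).columns.tolist())  # ['House Type_Mansion', 'House Type_Ranch']
    print(pd.get_dummies(test_toy).columns.tolist())   # ['House Type_Mansion', 'House Type_Duplex']

    # An encoder fitted on train always emits the train columns;
    # with handle_unknown='ignore', Duplex encodes as all zeros
    enc = OneHotEncoder(sparse=False, handle_unknown='ignore').fit(train_toy)
    print(enc.transform(test_toy))  # [[1. 0.] [0. 0.]]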

    # Fits an sklearn one hot encoder
    def train_one_hot_encoder(df, categorical_columns):
        # take one-hot encoding of categorical columns
        categorical_df = df[categorical_columns]
        enc = OneHotEncoder(sparse=False, handle_unknown='ignore')
        return enc.fit(categorical_df)
    

    To combine the categorical and non-categorical data, I again suggest a separate function:

    # One hot encodes the given dataframe
    def one_hot_encode(df, categorical_columns, encoder):
        # Get dataframe with only categorical columns
        categorical_df = df[categorical_columns]
        # Get one hot encoding, keeping df's index so the concat below aligns rows correctly
        ohe_df = pd.DataFrame(encoder.transform(categorical_df),
                              columns=encoder.get_feature_names(),
                              index=df.index)
        # Get the non-categorical columns
        float_df = df.drop(categorical_columns, axis=1)
        # Return the combined dataframe
        return pd.concat([float_df, ohe_df], axis=1)
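    The index=df.index argument matters: encoder.transform() returns a plain array, so without re-attaching the original index the concat would misalign rows whenever df no longer has a default index (e.g. after train_test_split), silently padding with NaNs.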
    

    Finally, your feature_selection_and_engineering() function can call each of these functions.

    def feature_selection_and_engineering(df, encoder=None):
        df = remove_nan(df)
        categorical_columns = get_categorical_columns(df)
        # If there is no encoder, train one
        if encoder is None:
            encoder = train_one_hot_encoder(df, categorical_columns)
        # Encode Data
        df = one_hot_encode(df, categorical_columns, encoder)
        # Return the encoded data AND encoder
        return df, encoder
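    get_categorical_columns() is not shown here; a minimal version, mirroring the dtype loop from the original question, could be:

    # Returns the names of all object (string) columns
    def get_categorical_columns(df):
        return [col for col in df.columns if df[col].dtypes == 'object']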
    

    There were a few things I had to fix to make the code run; I have included the entire modified script in a gist here: https://gist.github.com/kylelrichards11/6be90d92a7dd6a5cc9a5290dae3ff94e
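    Putting it together, the call site fits the encoder once on the training split and reuses it for every other frame, roughly (mirroring the calls visible in the second traceback):

    x_train, encoder = feature_selection_and_engineering(x_train)
    x_test, _ = feature_selection_and_engineering(x_test, encoder)
    test, _ = feature_selection_and_engineering(test, encoder)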