python, pandas, data-conversion, categorical-data

Encoding categorical data to numerical


I'm using this Kaggle dataset, and I'm trying to convert the categorical values to numerical, so I can apply regression.

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Here's an example of what I have tried so far.

import pandas as pd

train_data = pd.read_csv('train.csv')

column_contents = []
for row in train_data['Street']:
    if type(row) not in (int, float):
        column_contents.append(row)
        unique_contents = set(column_contents)

ds = {}
for i, j in enumerate(unique_contents):
    ds[j] = i

train_data['Street'] = train_data['Street'].replace(ds.keys(), list(map(str, ds.values())), regex=True)

Thereafter, I created the following function to apply it to all the columns of the df:

def calculation(df, column):
    column_contents = []
    for row in df[column]:
        if type(row) not in (int, float):
            column_contents.append(row)
            unique_contents = set(column_contents)

    ds = {}
    for i, j in enumerate(unique_contents):
        ds[j] = i

    df[column] = df[column].replace(ds.keys(), list(map(str, ds.values())), regex=True)

    return df[column]

for column in train_data:
    train_data[column] = calculation(train_data, column)

However, this function does not work, and I think it is wrong on many levels. Any help will be appreciated. I am also aware that this can be done with other modules (numpy), but I'd rather do it this way for practice.


Solution

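  • Your code is mostly correct, except for passing regex=True to replace. With regex=True each key is treated as a regular-expression pattern and matched anywhere inside a cell; since you want to swap whole values that equal a key, you should not use regex. NaNs also have to be handled separately: NaN is a float, so the type check skips it and it never receives a code.

    Also, the calculation function already assigns the encoded column back to the dataframe, so you don't have to return it and assign it again.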
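The difference between regex and literal replacement can be seen in a small sketch. The Series and its labels below are made up for illustration (they are not read from train.csv); the only assumption is that one label happens to be a prefix of another, which is where pattern matching and whole-value matching diverge.

```python
import pandas as pd

# Hypothetical labels: "Gd" is a prefix of "GdPrv".
s = pd.Series(["Gd", "GdPrv", "Gd"], dtype=object)
keys = ["Gd", "GdPrv"]
codes = ["0", "1"]

# regex=True: every key is a pattern that is substituted inside the cell,
# so the "Gd" pattern also rewrites the beginning of "GdPrv".
with_regex = s.replace(keys, codes, regex=True)

# Default (regex=False): only cells that are equal to a key are replaced.
literal = s.replace(keys, codes)

print(with_regex.tolist())
print(literal.tolist())
```

With literal replacement every cell becomes a clean code string that astype(float) can parse; with regex=True the "GdPrv" cell is left partially rewritten, which is one way the later conversion can fail.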

    Code:

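    import pandas as pd

    train_data = pd.read_csv('train.csv')
    # Replace all NaNs with -1
    train_data = train_data.fillna(-1)

    def calculation(df, column):
        # Collect the non-numeric (string) values of the column
        column_contents = []
        for row in df[column]:
            if type(row) not in (int, float):
                column_contents.append(row)

        # Build the set once, after the loop, so it exists even for all-numeric columns
        unique_contents = set(column_contents)
        # Map each distinct category to an integer code
        ds = {}
        for i, j in enumerate(unique_contents):
            ds[j] = i

        # Literal (non-regex) replacement, then cast the whole column to float
        df[column] = df[column].replace(list(ds.keys()), list(map(str, ds.values()))).astype(float)

    # calculation modifies train_data in place, so nothing is returned or reassigned
    for column in train_data:
        calculation(train_data, column)

    print(train_data.dtypes)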
    

    Output:

    Id               float64
    MSSubClass       float64
    MSZoning         float64
    LotFrontage      float64
    LotArea          float64
                      ...   
    MoSold           float64
    YrSold           float64
    SaleType         float64
    SaleCondition    float64
    SalePrice        float64
    Length: 81, dtype: object
    

    As you can see, all the columns are converted to float.
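For comparison, once the hand-written version works, the same integer encoding can be sketched with pandas' built-in pd.factorize. This is a swapped-in technique, not the method used in the answer above. The helper name encode_categoricals and the toy frame are hypothetical; only the column names are borrowed from the dataset. pd.factorize numbers categories in order of first appearance and marks missing values with -1, which happens to match the -1 convention used above.

```python
import pandas as pd


def encode_categoricals(df):
    """Return a copy of df with every non-numeric column replaced by integer codes.

    Hypothetical helper for illustration; not part of pandas.
    """
    encoded = df.copy()
    for column in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[column]):
            # codes: one integer per row; uniques: the distinct labels (unused here)
            codes, uniques = pd.factorize(encoded[column])
            encoded[column] = codes
    return encoded


# Toy stand-in for train.csv with made-up values.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", None],
    "LotArea": [8450, 9600, 11250, 9550],
})

encoded = encode_categoricals(toy)
print(encoded)
```

Unlike enumerating a set, whose iteration order for strings can change between runs, factorize assigns codes deterministically, so the encoding is reproducible. Numeric columns are left untouched, and the input frame is not modified.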