python pandas data-conversion categorical-data

Encoding categorical data to numerical

I'm using this Kaggle dataset, and I'm trying to convert the categorical values to numerical, so I can apply regression.

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Here's an example of what I have tried so far.

import pandas as pd

train_data = pd.read_csv('train.csv')

column_contents = []
for row in train_data['Street']:
    if type(row) not in (int, float):
        column_contents.append(row)
        unique_contents = set(column_contents)

ds = {}
for i, j in enumerate(unique_contents):
    ds[j] = i

train_data['Street'] = train_data['Street'].replace(ds.keys(), list(map(str, ds.values())), regex=True)

Thereafter, I created the following function to apply it to all the columns of the df:

def calculation(df, column):
    column_contents = []
    for row in df[column]:
        if type(row) not in (int, float):
            column_contents.append(row)
            unique_contents = set(column_contents)

    ds = {}
    for i, j in enumerate(unique_contents):
        ds[j] = i

    df[column] = df[column].replace(ds.keys(), list(map(str, ds.values())), regex=True)

    return df[column]

for column in train_data:
    train_data[column] = calculation(train_data, column)

However, this function does not work, and I think it is wrong on many levels. Any help will be appreciated. I am aware that this can be done with other modules (numpy), but I'd rather do it this way for practice.

Solution

You have coded it correctly, except for passing regex=True to replace. Since you want to replace whole values that exactly match the keys, you should not use regex matching. NaNs also have to be handled separately, because they are floats and never end up in the mapping.
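To see why regex matching is a problem here, consider a small sketch on a toy Series (the values are only illustrative and are not read from train.csv). With regex=True the key is treated as a pattern, so it can match inside longer strings, and keys that contain regex metacharacters such as parentheses may not match their own literal text at all.

```python
import pandas as pd

# Toy values chosen for illustration only
fence = pd.Series(["Gd", "GdPrv", "C (all)"])

# Exact matching: only the cell equal to "Gd" is replaced.
exact = fence.replace("Gd", "1")

# Regex matching: "Gd" is also replaced inside "GdPrv".
as_regex = fence.replace("Gd", "1", regex=True)

# Parentheses form a regex group, so the pattern "C (all)"
# does not match the literal text "C (all)" and nothing changes.
unmatched = fence.replace("C (all)", "0", regex=True)

print(exact.tolist())
print(as_regex.tolist())
print(unmatched.tolist())
```

A leftover such as "1Prv" or an unreplaced "C (all)" would then make a later astype(float) fail.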

Also, the calculation function already replaces the column in the dataframe it receives, so you don't have to return the column and assign it again.
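A minimal sketch of that point, using a hypothetical helper add_one and a toy dataframe: assigning to df[column] inside a function changes the same DataFrame object the caller holds, so nothing needs to be returned.

```python
import pandas as pd

def add_one(df, column):
    # Column assignment modifies the DataFrame object that was passed in.
    df[column] = df[column] + 1

frame = pd.DataFrame({"a": [1, 2, 3]})
result = add_one(frame, "a")

print(result)              # the function returns nothing
print(frame["a"].tolist()) # the caller's dataframe has changed
```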

Code:

import pandas as pd

train_data = pd.read_csv('train.csv')
# Replace all NaNs with -1
train_data = train_data.fillna(-1)

def calculation(df, column):
  # Collect every value that is not already numeric
  column_contents = []
  for row in df[column]:
    if type(row) not in (int, float):
      column_contents.append(row)

  # Map each distinct category to an integer code
  unique_contents = set(column_contents)
  ds = {}
  for i, j in enumerate(unique_contents):
    ds[j] = i

  # Exact-match replacement (no regex), then cast the whole column to float
  df[column] = df[column].replace(list(ds.keys()), list(map(str, ds.values()))).astype(float)

for column in train_data:
  calculation(train_data, column)

print(train_data.dtypes)

Output:

Id               float64
MSSubClass       float64
MSZoning         float64
LotFrontage      float64
LotArea          float64
                  ...   
MoSold           float64
YrSold           float64
SaleType         float64
SaleCondition    float64
SalePrice        float64
Length: 81, dtype: object

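As you can see, all the columns are converted to float. Note that the codes come from iterating over a set, so the number assigned to each category can differ between runs.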
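If you later want to compare your hand-written mapping against a built-in, pandas offers pd.factorize. The sketch below uses a toy Series standing in for a column such as 'Street'; it is an alternative to the loop above, not the method used in this answer. Codes are assigned in order of first appearance, and missing values get -1 by default, which matches the fillna(-1) convention used here.

```python
import pandas as pd

# Toy column standing in for a categorical column such as 'Street'
street = pd.Series(["Pave", "Grvl", None, "Pave"])

# codes: one integer per row; uniques: the categories in order of first appearance
codes, uniques = pd.factorize(street)

# Wrap the codes in a Series and cast to float, as in the answer above
encoded = pd.Series(codes, index=street.index).astype(float)

print(list(uniques))
print(encoded.tolist())
```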