I'm using this Kaggle dataset, and I'm trying to convert the categorical values to numerical, so I can apply regression.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
Here's an example of what I have tried so far.
train_data = pd.read_csv('train.csv')
column_contents = []
for row in train_data['Street']:
    if type(row) not in (int,float):
        column_contents.append(row)
unique_contents = set(column_contents)
ds = {}
for i,j in enumerate(unique_contents):
    ds[j] = i
train_data['Street'] = train_data['Street'].replace(ds.keys(), list(map(str, ds.values())), regex=True)
Thereafter, I created the following function to apply it to all the columns of the df:
def calculation(df,column):
    column_contents = []
    for row in df[column]:
        if type(row) not in (int,float):
            column_contents.append(row)
    unique_contents = set(column_contents)
    ds = {}
    for i,j in enumerate(unique_contents):
        ds[j] = i
    df[column] = df[column].replace(ds.keys(), list(map(str, ds.values())), regex=True)
    return df[column]

for column in train_data:
    train_data[column] = calculation(train_data,column)
However, this function does not work, and I think it is wrong on many levels. Any help will be appreciated. Also, I am aware that this can be done using other modules (numpy), but I'd rather do it this way to practice.
You have coded it correctly, except for passing regex=True to replace. With regex=True, each key is treated as a regular-expression pattern rather than a literal value. Since you want to replace only the cells that match a key exactly, you should not use regex. NaNs also have to be handled separately.
Also, the method calculation already replaces the column in the dataframe, so you don't have to return it and assign it again.
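To make the regex point concrete, here is a minimal sketch on a toy column. The values are made up for illustration (they are not read from train.csv); the only thing that matters is that one category is a prefix of another.

```python
import pandas as pd

# Toy column (hypothetical values): "Gd" is a prefix of "GdPrv".
quality = pd.Series(["Gd", "GdPrv", "Gd"], dtype=object)
keys = ["Gd", "GdPrv"]
codes = ["0", "1"]

# regex=True: each key is a pattern, so "Gd" can also match inside "GdPrv"
# and only part of that cell gets rewritten.
with_regex = quality.replace(keys, codes, regex=True)

# Default (regex=False): a cell is replaced only if it equals a key exactly.
literal = quality.replace(keys, codes)

print(with_regex.tolist())
print(literal.tolist())
```

Only the literal version leaves every cell as a clean numeric string that can later be cast to float.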
import pandas as pd

train_data = pd.read_csv('train.csv')
# Replace all NaNs with -1
train_data = train_data.fillna(-1)

def calculation(df,column):
    # Collect the non-numeric (categorical) values of the column
    column_contents = []
    for row in df[column]:
        if type(row) not in (int,float):
            column_contents.append(row)
    unique_contents = set(column_contents)
    # Map each distinct category to an integer code
    ds = {}
    for i,j in enumerate(unique_contents):
        ds[j] = i
    # Literal (non-regex) replacement, then cast the whole column to float
    df[column] = df[column].replace(ds.keys(), list(map(str, ds.values()))).astype(float)

for column in train_data:
    calculation(train_data,column)
print(train_data.dtypes)
Output:
Id               float64
MSSubClass       float64
MSZoning         float64
LotFrontage      float64
LotArea          float64
                  ...
MoSold           float64
YrSold           float64
SaleType         float64
SaleCondition    float64
SalePrice        float64
Length: 81, dtype: object
As you can see, all the columns are converted to float.
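One caveat with building the codes from a set: the iteration order of a set of strings can differ between Python runs, so the same category may get a different number each time. Below is a rough sketch of a variant that sorts the categories first and also hands back the mapping. The helper name encode_column and the toy values are assumptions for illustration, not part of the original code.

```python
import pandas as pd

def encode_column(series):
    """Hypothetical helper: return (float-encoded Series, category -> code mapping)."""
    # Sort the distinct string values so the codes are stable across runs.
    labels = sorted({v for v in series if isinstance(v, str)})
    mapping = {label: code for code, label in enumerate(labels)}
    # Translate strings to their code; leave numbers and NaN untouched.
    encoded = series.map(lambda v: mapping[v] if isinstance(v, str) else v)
    # Same convention as above: missing values become -1.
    return encoded.fillna(-1).astype(float), mapping

# Toy column standing in for something like train_data['Street'].
street = pd.Series(["Pave", "Grvl", float("nan"), "Pave"], dtype=object)
encoded, mapping = encode_column(street)
print(mapping)
print(encoded.tolist())
```

Keeping the mapping around also makes it possible to encode another file (for example the test set) with the same codes.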