I was trying to encode the data in train.csv, the dataset
provided in this GitHub repository. I used the following code to do so.
import pandas as pd
from sklearn import preprocessing
df = pd.read_csv(r'train.csv',index_col='Id')
df.head()
df['MSSubClass'].fillna(df['MSSubClass'].mean()//1)
df['MSZoning'].fillna(df['MSZoning'].mode())
label_encoder = preprocessing.LabelEncoder()
for col in df.columns:
    if df[col].dtype == 'O':
        print(df[col])
        df[col] = label_encoder.fit_transform(df[col])
print(df)
While encoding, the following output was printed:
MSSubClass
MSZoning
LotFrontage
LotArea
Street
Alley
TypeError: '<' not supported between instances of 'str' and 'float'
But when I looked at the dataset, there wasn't any '<' in the Alley column.
The previous columns were encoded, but the Alley column causes an error. Please help me!
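The problem is that your missing values are not replaced: fillna returns a new object, so the result needs to be assigned back, and your code only handles two columns. The '<' in the message is the comparison operator, not a character in the data: Alley contains strings plus NaN (a float) for the missing values, and sorting the two together fails. I also added .iloc[0] to mode() to select the first value if there are 2 or more modes: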
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv(r'train.csv',index_col='Id')
print (df)
# split the columns into numeric and non-numeric (object) ones
colsNum = df.select_dtypes(np.number).columns
colsObj = df.columns.difference(colsNum)
# fill numeric NaNs with the floored mean, object NaNs with the first mode
df[colsNum] = df[colsNum].fillna(df[colsNum].mean()//1)
df[colsObj] = df[colsObj].fillna(df[colsObj].mode().iloc[0])
label_encoder = preprocessing.LabelEncoder()
for col in colsObj:
    print(df[col])
    df[col] = label_encoder.fit_transform(df[col])
print (df)
      MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  \
Id
1             60         3         65.0     8450       1      0         3
2             20         3         80.0     9600       1      0         3
3             60         3         68.0    11250       1      0         0
4             70         3         60.0     9550       1      0         0
5             60         3         84.0    14260       1      0         0
             ...       ...          ...      ...     ...    ...       ...
1456          60         3         62.0     7917       1      0         3
1457          20         3         85.0    13175       1      0         3
1458          70         3         66.0     9042       1      0         3
1459          20         3         68.0     9717       1      0         3
1460          20         3         75.0     9937       1      0         3

      LandContour  Utilities  LotConfig  ...  PoolArea  PoolQC  Fence  \
Id                                       ...
1               3          0          4  ...         0       2      2
2               3          0          2  ...         0       2      2
3               3          0          4  ...         0       2      2
4               3          0          0  ...         0       2      2
5               3          0          2  ...         0       2      2
              ...        ...        ...  ...       ...     ...    ...
1456            3          0          4  ...         0       2      2
1457            3          0          4  ...         0       2      2
1458            3          0          4  ...         0       2      0
1459            3          0          4  ...         0       2      2
1460            3          0          4  ...         0       2      2

      MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition  SalePrice
Id
1               2        0       2    2008         8              4     208500
2               2        0       5    2007         8              4     181500
3               2        0       9    2008         8              4     223500
4               2        0       2    2006         8              0     140000
5               2        0      12    2008         8              4     250000
              ...      ...     ...     ...       ...            ...        ...
1456            2        0       8    2007         8              4     175000
1457            2        0       2    2010         8              4     210000
1458            2     2500       5    2010         8              4     266500
1459            2        0       4    2010         8              4     142125
1460            2        0       6    2008         8              4     147500

[1460 rows x 80 columns]
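
If it helps to see the individual pieces in isolation, here is a minimal sketch on a tiny made-up DataFrame (the values in `toy` are invented for illustration and are not taken from train.csv). It shows that an unassigned fillna leaves the NaN in place, that sorting strings together with a float NaN raises a TypeError, and that .iloc[0] picks one value when mode() returns several. The `encoders` dict is a hypothetical addition of mine: it keeps one LabelEncoder per column so the mapping of each column stays available, whereas reusing a single encoder only remembers the last column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for train.csv: the values below are made up for illustration.
toy = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
    "Alley": ["Grvl", np.nan, "Grvl", "Pave"],
    "Street": ["Pave", "Grvl", np.nan, np.nan],
})

# fillna returns a new Series; without assigning it back the NaN stays.
toy["Alley"].fillna("Grvl")
still_missing = int(toy["Alley"].isna().sum())

# NaN is a float, so sorting it together with strings raises a TypeError.
try:
    sorted(["Grvl", float("nan"), "Pave"])
    sort_failed = False
except TypeError:
    sort_failed = True

# Same split and fill as in the code above.
cols_num = toy.select_dtypes(np.number).columns
cols_obj = toy.columns.difference(cols_num)
toy[cols_num] = toy[cols_num].fillna(toy[cols_num].mean() // 1)
# Street has two modes (Grvl and Pave, one each); .iloc[0] keeps the first row.
toy[cols_obj] = toy[cols_obj].fillna(toy[cols_obj].mode().iloc[0])

# One encoder per column (hypothetical dict) keeps every mapping available.
encoders = {}
for col in cols_obj:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(still_missing, sort_failed)
print(toy)
```

With the encoders kept, something like encoders["Alley"].inverse_transform([0, 1]) should give back the original labels. The plain sorted() call is used for the demonstration because how LabelEncoder itself reacts to NaN may depend on the scikit-learn version.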