I want to understand how LabelEncoder() works. This is part of my code:
    for att in all_features_test:
        if (str(test_home_data[att].dtypes) == 'object'):
            test_home_data[att].fillna( 'Nothing', inplace = True)
            train_home_data[att].fillna( 'Nothing', inplace = True)
            train_home_data[att] = LabelEncoder().fit_transform(train_home_data[att])
            test_home_data[att] = LabelEncoder().fit_transform(test_home_data[att])
        else:
            test_home_data[att].fillna( 0, inplace = True)
            train_home_data[att].fillna( 0, inplace = True)
Both the train and test data sets have an attribute 'Condition', which can hold the values Bad, Average and Good.
Let's say LabelEncoder() encodes Bad as 0, Average as 2, and Good as 1 in train_home_data. Would the mapping be the same for test_home_data?
If not, then what should I do?
I think I found the answer to this.
Code
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    data1 = [('A', 1), ('B', 2), ('C', 3), ('D', 4)]
    data2 = [('D', 1), ('A', 2), ('A', 3), ('B', 4)]
    df1 = pd.DataFrame(data1, columns=['col1', 'col2'])
    df2 = pd.DataFrame(data2, columns=['col1', 'col2'])
    print(df1['col1'])
    print(df2['col1'])
    # Each column is encoded by its own, separately fitted encoder
    df1['col1'] = LabelEncoder().fit_transform(df1['col1'])
    df2['col1'] = LabelEncoder().fit_transform(df2['col1'])
    print(df1['col1'])
    print(df2['col1'])
Output
    0    A
    1    B
    2    C
    3    D
    Name: col1, dtype: object   # df1
    0    D
    1    A
    2    A
    3    B
    Name: col1, dtype: object   # df2
    0    0
    1    1
    2    2
    3    3
    Name: col1, dtype: int64    # df1 encoded
    0    2
    1    0
    2    0
    3    1
    Name: col1, dtype: int64    # df2 encoded
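To see which integer each label received in the output above, one option is to keep the fitted encoders and look at their `classes_` attribute, which holds the sorted unique labels (the position of a label is its code). This is only a minimal sketch using the same two columns; the names `enc1`, `enc2`, `map1` and `map2` are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Same 'col1' values as in df1 and df2
col1_df1 = ['A', 'B', 'C', 'D']
col1_df2 = ['D', 'A', 'A', 'B']

# Keep the fitted encoders around instead of discarding them
enc1 = LabelEncoder().fit(col1_df1)
enc2 = LabelEncoder().fit(col1_df2)

# classes_ is sorted, so enumerate() recovers the label -> code mapping
map1 = {str(label): code for code, label in enumerate(enc1.classes_)}
map2 = {str(label): code for code, label in enumerate(enc2.classes_)}

print(map1)  # {'A': 0, 'B': 1, 'C': 2, 'D': 3}
print(map2)  # {'A': 0, 'B': 1, 'D': 2}
```

Printing the two mappings side by side makes it easy to check, label by label, whether the two encoders agree.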
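B is encoded as 1 in both df1 and df2, and A is 0 in both. D, however, is 3 in df1 but 2 in df2, because df2 has no C.
So LabelEncoder() simply numbers the sorted unique values of whatever column it is fitted on. If I encode the training and testing data sets separately, the codes only match when both columns contain exactly the same set of values. To guarantee the same mapping, I should fit the encoder on the training column and reuse that fitted encoder to transform the testing column.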
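A way to get the same mapping in both data sets, as far as I can tell, is to fit one encoder on the training column and reuse it for the test column. The sketch below uses small made-up frames named `train` and `test` with a 'Condition' column (not the real home data), and assumes every value in the test column also appears in the training column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; note that 'Average' is missing from the test set
train = pd.DataFrame({'Condition': ['Bad', 'Average', 'Good', 'Bad']})
test = pd.DataFrame({'Condition': ['Good', 'Bad', 'Good']})

# For comparison: a separately fitted encoder on the test column
separate_codes = LabelEncoder().fit_transform(test['Condition']).tolist()

# Fit once on the training column, then reuse the same encoder on test
encoder = LabelEncoder()
train['Condition'] = encoder.fit_transform(train['Condition'])
test['Condition'] = encoder.transform(test['Condition'])

print(list(encoder.classes_))       # ['Average', 'Bad', 'Good']
print(train['Condition'].tolist())  # [1, 0, 2, 1]
print(test['Condition'].tolist())   # [2, 1, 2]
print(separate_codes)               # [1, 0, 1]  (different mapping)
```

Note that `transform` raises a `ValueError` if the test column contains a label the encoder never saw during `fit`, so in the loop from the question the encoder would need to be fitted on data that covers all categories.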