How does LabelEncoder() encode values?


I want to know how LabelEncoder() works. This is part of my code:

for att in all_features_test:
    if str(test_home_data[att].dtypes) == 'object':
        test_home_data[att].fillna('Nothing', inplace=True)
        train_home_data[att].fillna('Nothing', inplace=True)

        train_home_data[att] = LabelEncoder().fit_transform(train_home_data[att])
        test_home_data[att] = LabelEncoder().fit_transform(test_home_data[att])
    else:
        test_home_data[att].fillna(0, inplace=True)
        train_home_data[att].fillna(0, inplace=True)

Both the train and test data sets have an attribute 'Condition', which can hold the values Bad, Average and Good.

Let's say LabelEncoder() encodes Bad as 0, Average as 2, and Good as 1 in train_home_data. Would the encoding be the same for test_home_data?

If not, then what should I do?


Solution

  • I think I found the answer to this.

    Code

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    data1 = [('A', 1), ('B', 2),('C', 3) ,('D', 4)]
    data2 = [('D', 1), ('A', 2),('A', 3) ,('B', 4)]
    
    df1 = pd.DataFrame(data1, columns = ['col1', 'col2'])
    df2 = pd.DataFrame(data2, columns = ['col1', 'col2'])
    
    print(df1['col1'])
    print(df2['col1'])
    
    df1['col1'] = LabelEncoder().fit_transform(df1['col1'])
    df2['col1'] = LabelEncoder().fit_transform(df2['col1'])
    
    print(df1['col1'])
    print(df2['col1'])
    

    Output

    0    A
    1    B
    2    C
    3    D
    Name: col1, dtype: object # df1
    0    D
    1    A
    2    A
    3    B
    Name: col1, dtype: object # df2
    0    0
    1    1
    2    2
    3    3
    Name: col1, dtype: int64 #df1 encoded
    0    2
    1    0
    2    0
    3    1
    Name: col1, dtype: int64 #df2 encoded
    

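    B is encoded to 1 in both df1 and df2, but only because A and B are the two smallest labels in both columns. D is encoded to 3 in df1 and to 2 in df2, so the two encodings do not match in general.

    LabelEncoder() sorts the unique values it sees during fit and numbers them in that order, so two separately fitted encoders agree only when both columns contain exactly the same set of values. For the 'Condition' example, if Bad, Average and Good all appear in both the training and testing sets, both get Average = 0, Bad = 1, Good = 2. To guarantee the same mapping, fit one encoder on the training column and call transform() on the test column with that same encoder.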
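A minimal sketch of fitting one encoder and reusing it, based on the same toy columns as above. The variable names (`train`, `test`, `encoder`, `mapping`) are only illustrative and are not from the original question; this is one way to keep the mapping consistent, not the only one.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same toy values as df1 and df2 above (illustrative frames).
train = pd.DataFrame({"col1": ["A", "B", "C", "D"]})
test = pd.DataFrame({"col1": ["D", "A", "A", "B"]})

# Fit once on the training column, then reuse the fitted encoder.
encoder = LabelEncoder()
train_codes = encoder.fit_transform(train["col1"])
test_codes = encoder.transform(test["col1"])  # no refit, same mapping

# classes_ holds the sorted unique labels; the position is the code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(mapping)               # {'A': 0, 'B': 1, 'C': 2, 'D': 3}
print(train_codes.tolist())  # [0, 1, 2, 3]
print(test_codes.tolist())   # [3, 0, 0, 1]
```

D is now 3 in both columns. Note that `transform()` raises a `ValueError` if the test column contains a label that was not present when the encoder was fitted, so any fill value such as 'Nothing' needs to appear in the training column as well.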