Tags: python-3.x, dataframe, scikit-learn, sklearn-pandas, label-encoding

How to encode a dataset having multiple datatypes?


I have a dataset like:

e = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'W', 'F', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})

I encoded the data with sklearn.preprocessing.LabelEncoder, using the following code:

x = list(e.columns)
# Import label encoder
from sklearn import preprocessing

# A single LabelEncoder object, refitted on each column in turn.
label_encoder = preprocessing.LabelEncoder()
for i in x:
    # Encode the labels in column i (this runs for every column, numeric or not).
    e[i] = label_encoder.fit_transform(e[i])
print(e)

But this also encodes the numeric columns (col2 and col3, of int type), which I don't want.

Encoded dataset:

   col1  col2  col3  col4
0     0     1     0     3
1     0     0     1     0
2     1     5     5     4
3     4     4     4     1
4     3     3     2     5
5     2     2     3     2

How can I rectify this?


Solution

  • One really simple possibility would be to encode only the columns that hold string values, i.e. those with the object dtype ('O'). E.g., tweaking your code to be:

    import pandas as pd
    from sklearn import preprocessing 
    
    
    e = pd.DataFrame({
        'col1': ['A', 'A', 'B', 'W', 'F', 'C'],
        'col2': [2, 1, 9, 8, 7, 4],
        'col3': [0, 1, 9, 4, 2, 3],
        'col4': ['a', 'B', 'c', 'D', 'e', 'F']
    })
    
    
    label_encoder = preprocessing.LabelEncoder()
    for col in e.columns:
        # Encode only object-dtype (string) columns; numeric columns stay as they are.
        if e[col].dtype == 'O':
            e[col] = label_encoder.fit_transform(e[col])
    
    print(e) 
    
    

    Or, better yet, move the dtype check into a function and apply it to every column:

    import pandas as pd
    from sklearn import preprocessing 
    
    
    label_encoder = preprocessing.LabelEncoder()
    
    
    def encode_labels(ser):
        # Encode object-dtype (string) columns; return numeric columns unchanged.
        if ser.dtype == 'O':
            return label_encoder.fit_transform(ser)
        return ser
    
    
    e = pd.DataFrame({
        'col1': ['A', 'A', 'B', 'W', 'F', 'C'],
        'col2': [2, 1, 9, 8, 7, 4],
        'col3': [0, 1, 9, 4, 2, 3],
        'col4': ['a', 'B', 'c', 'D', 'e', 'F']
    })
    
    
    e_encoded = e.apply(encode_labels)
    print(e_encoded)
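
Both versions above reuse a single LabelEncoder, so once they finish it only remembers the classes of the last column it was fitted on. If you ever need to map the codes back to the original labels, one option is to keep a separate encoder per column. The following is a minimal sketch under that assumption, not part of the original answer: it swaps the `dtype == 'O'` check for `select_dtypes(exclude='number')`, which selects every non-numeric column however pandas stores the strings, and the names `text_cols`, `encoders` and `restored` are purely illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

e = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'W', 'F', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})

# Select every non-numeric column (here: col1 and col4).
text_cols = e.select_dtypes(exclude='number').columns

# Illustrative helper dict: one fitted LabelEncoder per encoded column.
encoders = {}
e_encoded = e.copy()  # leave the original DataFrame untouched
for col in text_cols:
    encoders[col] = LabelEncoder()
    e_encoded[col] = encoders[col].fit_transform(e[col])

print(e_encoded)

# Each column has its own encoder, so the codes can be mapped back to labels.
restored = encoders['col1'].inverse_transform(e_encoded['col1'])
print(restored.tolist())  # ['A', 'A', 'B', 'W', 'F', 'C']
```

Here col2 and col3 keep their original integer values, and `encoders['col4']` can be used in the same way to recover the labels of col4.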