python, dataframe, sklearn-pandas

Decode a pandas DataFrame with sklearn


I have a DataFrame with many columns. Some of them hold strings and others hold integers. I used this code to encode it:

le = LabelEncoder()
for col in df.columns:
    df_encoded[col] = df.apply(le.fit_transform)

It worked! But when I try to decode it with this code:

for col in df.columns:
    df_decoded[col] = df_encoded.apply(le.inverse_transform)

I receive this error:

ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index MYCOLUMNNAME')

Solution

  • The values (and types) differ from column to column, so a single LabelEncoder passed through apply with fit_transform won't work here. The encoder is refitted on every column, and each fit overwrites the previous one. The encoding will seem to work properly, but once the operation finishes the LabelEncoder only remembers the classes of the rightmost column. When you then apply inverse_transform, it replaces the codes in every other column with the values it saw in the rightmost column. E.g.:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    df = pd.DataFrame([{'A': 1, 'B': 'p'}, {'A': 1, 'B': 'q'}, {'A': 2, 'B': 'o'}, {'A': 3, 'B': 'p'}])
    df
       A  B
    0  1  p
    1  1  q
    2  2  o
    3  3  p
    
    df_encoded = df.apply(le.fit_transform)  # keep the original df intact
    df_encoded
       A  B
    0  0  1
    1  0  2
    2  1  0
    3  2  1   # Looks fine
    
    df_encoded.apply(le.inverse_transform)
       A  B
    0  o  p
    1  o  q
    2  p  o
    3  q  p   # Whoops
    

    You will see the same result if you reuse one encoder and run fit_transform over all the columns in one loop, then inverse_transform in a second loop: after the first loop the encoder still only knows the last column.

    You need to refit the encoder on the matching original column before inverting:

    le = LabelEncoder()
    df_encoded = pd.DataFrame(columns=df.columns)
    df_decoded = pd.DataFrame(columns=df.columns)
    
    for col in df.columns:
        df_encoded[col] = le.fit_transform(df[col])
    
    df_encoded
       A  B
    0  0  1
    1  0  2
    2  1  0
    3  2  1
    
    for col in df.columns:
        # Refit on the original column so the encoder's classes match it
        le.fit(df[col])
        df_decoded[col] = le.inverse_transform(df_encoded[col])
    
    df_decoded
    
       A  B
    0  1  p
    1  1  q
    2  2  o
    3  3  p   # Yay
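
An alternative that avoids refitting is to keep one fitted encoder per column. The following is a minimal sketch, assuming the same example frame as above. The `encoders` dictionary is a name introduced here for illustration and is not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 1, 2, 3], 'B': ['p', 'q', 'o', 'p']})

# One fitted encoder per column, keyed by column name
# ("encoders" is a hypothetical name chosen for this sketch).
encoders = {col: LabelEncoder().fit(df[col]) for col in df.columns}

# Encode each column with its own encoder.
df_encoded = pd.DataFrame(
    {col: encoders[col].transform(df[col]) for col in df.columns},
    index=df.index,
)

# Decode each column with the encoder that was fitted on that column.
df_decoded = pd.DataFrame(
    {col: encoders[col].inverse_transform(df_encoded[col]) for col in df.columns},
    index=df.index,
)
```

Because each column keeps its own encoder, decoding only needs `df_encoded` and `encoders`. The original `df` does not have to be available at that point.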