Tags: python, pandas, scikit-learn, permutation, sklearn-pandas

Label encoding every combination of two columns


I'd like to create class labels for every combination of values from two columns using sklearn's LabelEncoder(). How do I achieve the following behavior?

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("data.csv", sep=",")
df
#    A    B    
# 0  1  Yes 
# 1  2   No 
# 2  3  Yes 
# 3  4  Yes

I'd like to encode each (A, B) combination as a single class rather than encoding the two columns separately:

df['A'].astype('category')
#Categories (4, int64): [1, 2, 3, 4]

df['B'].astype('category')
#Categories (2, object): ['Yes','No']

# Column C should have 4 * 2 = 8 classes:
# (1,Yes)=1  (1,No)=5
# (2,Yes)=2  (2,No)=6
# (3,Yes)=3  (3,No)=7
# (4,Yes)=4  (4,No)=8

# Desired result
#    A    B  C    
# 0  1  Yes  1
# 1  2   No  6
# 2  3  Yes  3
# 3  4  Yes  4

Solution

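  • We can build a mapping DataFrame of every (B, A) pair with a cross merge (how='cross', available in pandas 1.2 and later), number the pairs, and merge the result back onto df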

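    mapping = (
        df[['B']].drop_duplicates()
        .merge(df['A'].drop_duplicates(), how='cross')  # every (B, A) pair
        .assign(C=lambda x: x.index + 1)                # number the pairs from 1
    )
    out = df.merge(mapping)  # joins on the shared columns A and B
    out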
    Out[415]: 
       A    B  C
    0  1  Yes  1
    1  2   No  6
    2  3  Yes  3
    3  4  Yes  4
    

    More info: the intermediate mapping frame produced by the cross merge

    df[['B']].drop_duplicates().merge(df['A'].drop_duplicates(),how='cross').assign(C=lambda x : x.index+1)
    Out[417]: 
         B  A  C
    0  Yes  1  1
    1  Yes  2  2
    2  Yes  3  3
    3  Yes  4  4
    4   No  1  5
    5   No  2  6
    6   No  3  7
    7   No  4  8
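
As a possible alternative that skips the merge, the same labels can be computed arithmetically from the position of each value among its column's unique values. This is only a minimal sketch: it builds the sample frame from the question inline instead of reading data.csv, and it uses pd.factorize, which is not part of the original answer. factorize numbers values in order of first appearance, which matches the drop_duplicates() ordering used above.

```python
import pandas as pd

# Sample data from the question, built inline (assumption: stands in for data.csv)
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["Yes", "No", "Yes", "Yes"]})

# factorize returns integer codes (0-based, in order of first appearance)
# together with the unique values those codes refer to
a_codes, a_uniques = pd.factorize(df["A"])
b_codes, b_uniques = pd.factorize(df["B"])

# Same numbering as the cross-merge mapping:
# label = (position of B) * (number of distinct A values) + (position of A) + 1
df["C"] = b_codes * len(a_uniques) + a_codes + 1

print(df)
```

On the sample data this assigns C = 1, 6, 3, 4, the same labels as the merge-based answer. Note that the numbering depends on the order in which values first appear, so a differently ordered input would label the same pairs differently.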