python scikit-learn sklearn-pandas

How to binary encode two mixed features?


I have a dataset looking like this one:

import pandas as pd

pd.DataFrame({"A": [2, 2, 1, 0, 5, 3, 0, 4, 5], "B": [1, 0, 0, 0, 1, 1, 1, 0, 0]})

   A  B
0  2  1
1  2  0
2  1  0
3  0  0
4  5  1
5  3  1
6  0  1
7  4  0
8  5  0

(I know that A is between 0 and 5; B is only 0 or 1)

I would like to transform it and get:

    A0_B0 A1_B0 A2_B0 A3_B0 ...  A5_B1
0   0     0     0     0     ...
1   0     0     1     0     ...
2   0     1     0     0     ...
...

(knowing which column corresponds to which combination is important)

with a method that can be integrated with sklearn Pipeline and/or sklearn_pandas DataFrameMapper (it needs to be reproducible on a test sample).

So far, I have tried OneHotEncoder and LabelBinarizer, but they encode the A and B columns separately rather than combining them.

I have also tried to do it manually with a custom transformer, but DataFrameMapper loses the column names:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn_pandas import DataFrameMapper

class ABTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        A = x.A
        B = x.B

        # One boolean column per (A, B) combination
        A0_B0 = np.logical_and((A == 0), (B == 0))
        A1_B0 = np.logical_and((A == 1), (B == 0))
        ...

        data = pd.DataFrame(np.stack((A0_B0, A1_B0, ...), axis=1),
                            columns=["A0_B0", "A1_B0", ...])
        return data


mapper = DataFrameMapper([
        (["A", "B"], [ABTransformer()], {'input_df': True, "alias": None}),
    ],
    df_out=True, sparse=False)

In the end, the columns I get are labelled "A_B_0", "A_B_1", etc.

Is there a way to achieve the desired output?


Solution

  • Assuming columns A and B take n_A and n_B distinct values respectively, and all values are zero-based integers (n_A = 6 and n_B = 2 for the data in the question), each (A, B) pair maps to the unique column index B * n_A + A, so you can use the following transform function:

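    def transform(self, x):
        # n_A and n_B must be defined beforehand (6 and 2 for the data in the question)
        indices = x.B * n_A + x.A  # unique index per (A, B) pair
        # Column order matches the index: A varies fastest, then B
        columns = ["A%d_B%d" % (j, i) for i in range(n_B) for j in range(n_A)]
        # Row k of the identity matrix is the one-hot vector for index k
        onehot = np.eye(n_A * n_B)[indices]
        data = pd.DataFrame(data=onehot, columns=columns)
        return data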
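
For reference, here is a minimal self-contained sketch of how that transform function could sit inside a complete transformer. The class name ABOneHot and the constructor parameters n_a and n_b are hypothetical names chosen for illustration, and the sketch assumes the value counts are known in advance (6 and 2 for the data in the question). Using integer dtype and keeping the input index are choices made here, not part of the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ABOneHot(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one-hot encode each (A, B) pair.

    Assumes A is in 0..n_a-1 and B is in 0..n_b-1 (zero-based integers).
    """

    def __init__(self, n_a=6, n_b=2):
        self.n_a = n_a
        self.n_b = n_b

    def fit(self, X, y=None):
        # Nothing to learn: the value ranges are given up front.
        return self

    def transform(self, X):
        a = X["A"].to_numpy()
        b = X["B"].to_numpy()
        # Unique index per (A, B) pair: A varies fastest, then B.
        indices = b * self.n_a + a
        # Row k of the identity matrix is the one-hot vector for index k.
        onehot = np.eye(self.n_a * self.n_b, dtype=int)[indices]
        columns = [f"A{j}_B{i}" for i in range(self.n_b) for j in range(self.n_a)]
        return pd.DataFrame(onehot, columns=columns, index=X.index)


df = pd.DataFrame({"A": [2, 2, 1, 0, 5, 3, 0, 4, 5],
                   "B": [1, 0, 0, 0, 1, 1, 1, 0, 0]})

encoded = ABOneHot().fit_transform(df)
# Row 0 has A=2, B=1, so the only 1 in that row is in column "A2_B1".

# The same transformer as the single step of a Pipeline.
piped = Pipeline([("ab", ABOneHot())]).fit_transform(df)
```

Because the column names are built from fixed ranges rather than from the values seen during fitting, the same 12 columns come out in the same order on a test sample, even if some combinations never occur in it.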