Tags: python, scikit-learn, linear-regression, sparse-matrix, categorical-data

Using categorical encoding across multiple DataFrames in Python


I have a DataFrame X_Train with two categorical columns and a numerical column, for example:

A B N
'a1' 'b1' 0.5
'a1' 'b2' -0.8
'a2' 'b2' 0.1
'a2' 'b3' -0.2
'a3' 'b4' 0.4

Before passing this to sklearn's linear regression, I convert it to a sparse matrix. To do that, I first need to turn the categorical data into numerical indexes, like so:

X_Train['acat'] = pd.factorize(X_Train['A'])[0]
X_Train['bcat'] = pd.factorize(X_Train['B'])[0]

Then I change it into a sparse matrix:

X_Train_Sparse = scipy.sparse.coo_matrix((X_Train.N, (X_Train.acat, X_Train.bcat)))
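
For reference, the two steps above can be put together into a self-contained sketch. It assumes the example data shown earlier, keeps the `uniques` that `pd.factorize` returns alongside the codes, and passes an explicit `shape` to `coo_matrix`; the explicit shape and the `a_codes`/`b_codes` variable names are additions for illustration, not part of the original snippet.

```python
import pandas as pd
import scipy.sparse

# Example training data from the question
X_Train = pd.DataFrame({
    "A": ["a1", "a1", "a2", "a2", "a3"],
    "B": ["b1", "b2", "b2", "b3", "b4"],
    "N": [0.5, -0.8, 0.1, -0.2, 0.4],
})

# pd.factorize returns (codes, uniques); the uniques record which
# label each integer code stands for, in order of first appearance
a_codes, a_uniques = pd.factorize(X_Train["A"])
b_codes, b_uniques = pd.factorize(X_Train["B"])

# Rows are indexed by the A codes, columns by the B codes.
# The explicit shape is an assumption added here so the column
# count does not depend on which codes happen to be present.
X_Train_Sparse = scipy.sparse.coo_matrix(
    (X_Train["N"].to_numpy(), (a_codes, b_codes)),
    shape=(len(a_uniques), len(b_uniques)),
)

print(X_Train_Sparse.toarray())
```

Keeping `b_uniques` around matters for the question that follows, because it is the only record of which column corresponds to which B label.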

I have another similar DataFrame, X_Test, for example:

A B N
'a4' 'b3' 0.6
'a5' 'b5' -0.1
'a6' 'b2' -0.1
'a6' 'b1' -0.5
'a6' 'b3' 0.3

I also need to convert this to a sparse matrix. How do I reuse the bcat categorization from X_Train for X_Test, so that the linear regression treats 'b1' in X_Train as the same variable as 'b1' in X_Test? Implicit in this is that any B value in X_Test that does not appear in X_Train should be dropped: the model learned nothing about that value, so no prediction can be made from it.


Solution

  • Apply the categorical encoding before splitting the data into train and test sets, so that both parts share the same codes:

    Sample:

    import pandas as pd
    df = pd.DataFrame({
        'A': ["'a1'", "'a1'", "'a2'", "'a2'", "'a3'", "'a4'", "'a5'", "'a6'", "'a6'", "'a6'"],
        'B': ["'b1'", "'b2'", "'b2'", "'b3'", "'b4'", "'b3'", "'b5'", "'b2'", "'b1'", "'b3'"],
        'N': [0.5, -0.8, 0.1, -0.2, 0.4, 0.6, -0.1, -0.1, -0.5, 0.3],
    })
    

    Code:

    # Encode categories
    df['B'] = pd.Categorical(df['B'])
    # Split data into train/test
    df_train, df_test = df.iloc[:5], df.iloc[5:]
    

    Result:

    df_train['B'].cat.codes
    
    Out[58]: 
    0    0
    1    1
    2    1
    3    2
    4    3
    dtype: int8
    
    df_test['B'].cat.codes
    
    Out[59]: 
    5    2
    6    4
    7    1
    8    0
    9    2
    dtype: int8
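
Note that in the result above 'b5' receives code 4 even though it only occurs in the test rows, so encoding before the split does not by itself drop unseen values. If the two frames already exist separately, one possible alternative is to take the category list from the training data only and encode both frames against it. The following is a minimal sketch under that assumption, not the answer's method: `pd.Categorical` assigns code -1 to any value outside the given `categories`, which makes the unseen rows easy to filter out. The names `b_categories`, `X_Test_Known` and `X_Test_Sparse` are hypothetical, and indexing the test rows with a fresh `pd.factorize` on A is an assumption about the intended matrix layout.

```python
import pandas as pd
import scipy.sparse

X_Train = pd.DataFrame({
    "A": ["a1", "a1", "a2", "a2", "a3"],
    "B": ["b1", "b2", "b2", "b3", "b4"],
    "N": [0.5, -0.8, 0.1, -0.2, 0.4],
})
X_Test = pd.DataFrame({
    "A": ["a4", "a5", "a6", "a6", "a6"],
    "B": ["b3", "b5", "b2", "b1", "b3"],
    "N": [0.6, -0.1, -0.1, -0.5, 0.3],
})

# Categories learned from the training data only,
# in order of first appearance (same order pd.factorize would use)
b_categories = pd.unique(X_Train["B"])

# Encode both frames against the same category list;
# values that are not in the list get code -1
X_Train["bcat"] = pd.Categorical(X_Train["B"], categories=b_categories).codes
X_Test["bcat"] = pd.Categorical(X_Test["B"], categories=b_categories).codes

# Drop test rows whose B value never appeared in training ('b5' here)
X_Test_Known = X_Test[X_Test["bcat"] != -1]

# Assumption: test rows are indexed by their own A values, while the
# columns reuse the training B encoding so the features line up
test_rows, test_a_uniques = pd.factorize(X_Test_Known["A"])
X_Test_Sparse = scipy.sparse.coo_matrix(
    (
        X_Test_Known["N"].to_numpy(),
        (test_rows, X_Test_Known["bcat"].to_numpy().astype("int64")),
    ),
    shape=(len(test_a_uniques), len(b_categories)),
)

print(X_Test[["B", "bcat"]])
print(X_Test_Sparse.toarray())
```

The key design choice is the explicit `shape`: fixing the number of columns to `len(b_categories)` keeps the test matrix compatible with a model fitted on the training matrix, even when some training categories do not occur in the test data.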