I have a DataFrame X_Train with two categorical columns and a numerical column, for example:
A | B | N |
---|---|---|
'a1' | 'b1' | 0.5 |
'a1' | 'b2' | -0.8 |
'a2' | 'b2' | 0.1 |
'a2' | 'b3' | -0.2 |
'a3' | 'b4' | 0.4 |
Before passing this to sklearn's linear regression, I convert it into a sparse matrix. To do that, I first need to convert the categorical data into numerical indexes, like so:
X_Train['acat'] = pd.factorize(X_Train['A'])[0]
X_Train['bcat'] = pd.factorize(X_Train['B'])[0]
Then I change it into a sparse matrix:
X_Train_Sparse = scipy.sparse.coo_matrix((X_Train.N, (X_Train.acat, X_Train.bcat)))
I have another similar DataFrame, X_Test, for example:
A | B | N |
---|---|---|
'a4' | 'b3' | 0.6 |
'a5' | 'b5' | -0.1 |
'a6' | 'b2' | -0.1 |
'a6' | 'b1' | -0.5 |
'a6' | 'b3' | 0.3 |
I also need to convert this to a sparse matrix. How do I reuse the bcat categorization from X_Train for X_Test, so that the linear regression treats 'b1' in X_Train as the same variable as 'b1' in X_Test? Implicit in this is that any B value in X_Test that does not appear in X_Train should be dropped: the model learned nothing about that value, so no prediction can be made from it.
You have to apply the categorical encoding before splitting, so that the training and test sets share one category-to-code mapping:
Sample:
import pandas as pd
df = pd.DataFrame({'A': {0: "'a1'", 1: "'a1'", 2: "'a2'", 3: "'a2'", 4: "'a3'", 5: "'a4'", 6: "'a5'", 7: "'a6'", 8: "'a6'", 9: "'a6'"}, 'B': {0: "'b1'", 1: "'b2'", 2: "'b2'", 3: "'b3'", 4: "'b4'", 5: "'b3'", 6: "'b5'", 7: "'b2'", 8: "'b1'", 9: "'b3'"}, 'N': {0: 0.5, 1: -0.8, 2: 0.1, 3: -0.2, 4: 0.4, 5: 0.6, 6: -0.1, 7: -0.1, 8: -0.5, 9: 0.3}})
Code:
# Encode categories
df['B'] = pd.Categorical(df['B'])
# Split data into train/test
df_train, df_test = df.iloc[:5], df.iloc[5:]
Result:
df_train['B'].cat.codes
Out[58]:
0 0
1 1
2 1
3 2
4 3
dtype: int8
df_test['B'].cat.codes
Out[59]:
5 2
6 4
7 1
8 0
9 2
dtype: int8
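If the two frames already exist separately, as in the question, one possible variation is to take the category list from X_Train only and encode X_Test against it. This is a minimal sketch, not the only way to do it: pd.Categorical with an explicit categories argument gives any value outside that list the code -1, so those test rows can be filtered out. The names b_categories, seen and X_Test_seen are made up for illustration, the plain strings (without embedded quotes) are a simplification of the sample data, and encoding A separately per frame is an assumption carried over from the question.

```python
import pandas as pd
from scipy.sparse import coo_matrix

X_Train = pd.DataFrame({
    "A": ["a1", "a1", "a2", "a2", "a3"],
    "B": ["b1", "b2", "b2", "b3", "b4"],
    "N": [0.5, -0.8, 0.1, -0.2, 0.4],
})
X_Test = pd.DataFrame({
    "A": ["a4", "a5", "a6", "a6", "a6"],
    "B": ["b3", "b5", "b2", "b1", "b3"],
    "N": [0.6, -0.1, -0.1, -0.5, 0.3],
})

# Learn the B categories from the training data only.
b_categories = pd.Categorical(X_Train["B"]).categories

# Encode both frames against the same category list.
# Values that are not in the list get the code -1.
X_Train["bcat"] = pd.Categorical(X_Train["B"], categories=b_categories).codes
test_bcat = pd.Categorical(X_Test["B"], categories=b_categories).codes

# Keep only the test rows whose B value was seen during training.
seen = test_bcat != -1
X_Test_seen = X_Test[seen].copy()
X_Test_seen["bcat"] = test_bcat[seen]

# Row indexes: A is factorized per frame (assumption, as in the question).
X_Train["acat"] = pd.factorize(X_Train["A"])[0]
X_Test_seen["acat"] = pd.factorize(X_Test_seen["A"])[0]

# Pass an explicit shape so both matrices have one column per training category.
n_cols = len(b_categories)
X_Train_Sparse = coo_matrix(
    (X_Train["N"].to_numpy(), (X_Train["acat"].to_numpy(), X_Train["bcat"].to_numpy())),
    shape=(X_Train["acat"].max() + 1, n_cols),
)
X_Test_Sparse = coo_matrix(
    (X_Test_seen["N"].to_numpy(), (X_Test_seen["acat"].to_numpy(), X_Test_seen["bcat"].to_numpy())),
    shape=(X_Test_seen["acat"].max() + 1, n_cols),
)
```

Fixing the number of columns through shape matters because coo_matrix otherwise infers it from the largest column index present, which can differ between the two frames.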