Starting with a training data set for a variable var1
as:
var1
A
B
C
D
I want to create a model (let's call it dummy_model1
) that would then transform the training data set to:
var1_A var1_B var1_C var1_D
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
This functionality (or similar) exists in, among others, the dummies
package in R
and get_dummies
in Pandas
, or even case
statements in SQL
.
I'd like to then be able to apply dummy_model1
to a new data set:
var1
C
7
#
A
and get the following output:
var1_A var1_B var1_C var1_D
0 0 1 0
0 0 0 0
0 0 0 0
1 0 0 0
I know I can do this in SQL
with 'case' statements but would love to automate the process given I have ~2,000 variables. Also, the new data sets will almost always have "bad" data (e.g., 7
and #
in the above example).
Somewhat language agnostic (as long as its open source) but would prefer Python
or R
. Please note the data is over 500GB so that limits some of my options. Thanks in advance.
Assuming var1
fits in memory on its own, here is a possible solution:
First, read in var1
.
Next, use get_dummies
to get all the "training" categories encoded as dummy variables. Store the column names as a list or an array.
Then, read in the first few rows of your training dataset to get the column names and store them as a list (or if you know these already you can skip this step).
Create a new list or array containing the dummy variable column names and the relevant other columns (this could just be every column in the dataset except var1
). This will be the final columns encoding.
Then, read in your test data. Use get_dummies
to encode var1
in your test data, knowing it may be missing categories or have extraneous categories. Then reindex the data to match the final columns encoding.
After reindexing, you will end up a test dataset with var1
dummies consistent with your training var1
.
To illustrate:
import pandas as pd
import numpy as np
training = pd.DataFrame({
'var1': ['a','b','c'],
'other_var':[4,7,3],
'yet_another':[8,0,2]
})
print training
other_var var1 yet_another
0 4 a 8
1 7 b 0
2 3 c 2
test = pd.DataFrame({
'var1': ['a','b','q'],
'other_var':[9,4,2],
'yet_another':[9,1,5]
})
print test
other_var var1 yet_another
0 9 a 9
1 4 b 1
2 2 q 5
var1_dummied = pd.get_dummies(training.var1, prefix='var1')
var_dummy_columns = var1_dummied.columns.values
print var_dummy_columns
array(['var1_a', 'var1_b', 'var1_c'], dtype=object)
final_encoding_columns = np.append(training.drop(['var1'], axis = 1).columns, var_dummy_columns)
print final_encoding_columns
array(['other_var', 'yet_another', 'var1_a', 'var1_b', 'var1_c'], dtype=object)
test_encoded = pd.get_dummies(test, columns=['var1'])
print test_encoded
other_var yet_another var1_a var1_b var1_q
0 9 9 1 0 0
1 4 1 0 1 0
2 2 5 0 0 1
test_encoded_reindexed = test_encoded.reindex(columns = final_encoding_columns, fill_value=0)
print test_encoded_reindexed
other_var yet_another var1_a var1_b var1_c
0 9 9 1 0 0
1 4 1 0 1 0
2 2 5 0 0 0
This should be what you want, based on the expected output in your question and the comments.
If the test data easily fits in memory, you can easily extend this to multiple variables. Just save and then update final_encoding_columns
iteratively for each training variable you want to encode. Then pass all of those columns to the columns=
argument when reindexing the test data. Reindex with your complete final_encoding_columns
and you should be all set.