Search code examples
pythonrdummy-variable

Create a Model for Dummy Variables


Starting with a training data set for a variable var1 as:

var1
A
B
C
D

I want to create a model (let's call it dummy_model1) that would then transform the training data set to:

var1_A  var1_B  var1_C  var1_D
1       0       0       0
0       1       0       0
0       0       1       0
0       0       0       1

This functionality (or similar) exists in, among others, the dummies package in R and get_dummies in Pandas, or even case statements in SQL.

I'd like to then be able to apply dummy_model1 to a new data set:

var1
C
7
#
A

and get the following output:

var1_A  var1_B  var1_C  var1_D
0       0       1       0
0       0       0       0
0       0       0       0
1       0       0       0

I know I can do this in SQL with 'case' statements but would love to automate the process given I have ~2,000 variables. Also, the new data sets will almost always have "bad" data (e.g., 7 and # in the above example).

Somewhat language agnostic (as long as its open source) but would prefer Python or R. Please note the data is over 500GB so that limits some of my options. Thanks in advance.


Solution

  • Assuming var1 fits in memory on its own, here is a possible solution:

    First, read in var1.

    Next, use get_dummies to get all the "training" categories encoded as dummy variables. Store the column names as a list or an array.

    Then, read in the first few rows of your training dataset to get the column names and store them as a list (or if you know these already you can skip this step).

    Create a new list or array containing the dummy variable column names and the relevant other columns (this could just be every column in the dataset except var1). This will be the final columns encoding.

    Then, read in your test data. Use get_dummies to encode var1 in your test data, knowing it may be missing categories or have extraneous categories. Then reindex the data to match the final columns encoding.

    After reindexing, you will end up a test dataset with var1 dummies consistent with your training var1.

    To illustrate:

    import pandas as pd
    import numpy as np
    
    training = pd.DataFrame({
            'var1': ['a','b','c'],
            'other_var':[4,7,3],
            'yet_another':[8,0,2]
        })
    
    print training
       other_var var1  yet_another
    0          4    a            8
    1          7    b            0
    2          3    c            2
    
    test = pd.DataFrame({
            'var1': ['a','b','q'],
            'other_var':[9,4,2],
            'yet_another':[9,1,5]
        })
    
    print test
       other_var var1  yet_another
    0          9    a            9
    1          4    b            1
    2          2    q            5
    
    
    var1_dummied = pd.get_dummies(training.var1, prefix='var1')
    var_dummy_columns =  var1_dummied.columns.values
    
    print var_dummy_columns
    array(['var1_a', 'var1_b', 'var1_c'], dtype=object)
    
    final_encoding_columns = np.append(training.drop(['var1'], axis = 1).columns, var_dummy_columns)
    
    print final_encoding_columns
    array(['other_var', 'yet_another', 'var1_a', 'var1_b', 'var1_c'], dtype=object)
    
    
    test_encoded = pd.get_dummies(test, columns=['var1'])
    
    print test_encoded
       other_var  yet_another  var1_a  var1_b  var1_q
    0          9            9       1       0       0
    1          4            1       0       1       0
    2          2            5       0       0       1
    
    test_encoded_reindexed = test_encoded.reindex(columns = final_encoding_columns, fill_value=0)
    
    print test_encoded_reindexed
       other_var  yet_another  var1_a  var1_b  var1_c
    0          9            9       1       0       0
    1          4            1       0       1       0
    2          2            5       0       0       0
    

    This should be what you want, based on the expected output in your question and the comments.

    If the test data easily fits in memory, you can easily extend this to multiple variables. Just save and then update final_encoding_columns iteratively for each training variable you want to encode. Then pass all of those columns to the columns= argument when reindexing the test data. Reindex with your complete final_encoding_columns and you should be all set.