Search code examples
pythonpandasmachine-learningdummy-variable

Dummy variables when not all categories are present across multiple features & data sets


I want to ask an extension of this question, which talks about adding a label to missing classes to make sure the dummies are encoded as blanks correctly.

Is there a way to do this automatically across multiple sets of data and have the labels automatically synched between the two? (I.e. for Test & Training sets). I.e. the same columns but different classes of data represented in each?

E.g.: Suppose I have the following two dataframes:

df1 = pd.DataFrame.from_items([('col1', list('abc')), ('col2', list('123'))])
df2 = pd.DataFrame.from_items([('col1', list('bcd')), ('col2', list('234'))])   

df1
   col1   col2 
1   a     1
2   b     2    
3   c     3
df2
   col1    col2
1   b      2
2   c      3
3   d      4

I want to have:

df1
    col1_a col1_b col1_c col1_d col2_1 col2_2 col2_3 col2_4
1   1      0      0      0      1      0      0      0      
2   0      1      0      0      0      1      0      0
3   0      0      1      0      0      0      1      0

df2
    col1_a col1_b col1_c col1_d col2_1 col2_2 col2_3 col2_4
1   0      1      0      0      0      1      0      0
2   0      0      1      0      0      0      1      0
3   0      0      0      1      0      0      0      1

WITHOUT having to specify in advance that
col1_labels = ['a', 'b', 'c', 'd'], col2_labels = ['1', '2', '3', '4']

And can I do this systematically for many columns all at once? I'm imagining a fuction that when passed in two or more dataframes (assuming columns are the same for all):

  1. reads which columns in the pandas dataframe are categories
  2. figures out what that overall labels are
  3. and then provides the category labels to each column Does that seem right? Is there a better way?

Solution

  • I think you need reindex by union of all columns if same categorical columns names in both Dataframes:

    print (df1)  
      df1
    1   a
    2   b
    3   c
    
    print (df2)
      df1
    1   b
    2   c
    3   d
    
    df1 = pd.get_dummies(df1)
    df2 = pd.get_dummies(df2)
    
    union = df1.columns | df2.columns
    df1 = df1.reindex(columns=union, fill_value=0)
    df2 = df2.reindex(columns=union, fill_value=0)
    print (df1)
       df1_a  df1_b  df1_c  df1_d
    1      1      0      0      0
    2      0      1      0      0
    3      0      0      1      0
    print (df2)
       df1_a  df1_b  df1_c  df1_d
    1      0      1      0      0
    2      0      0      1      0
    3      0      0      0      1