
How to get the same dummy variables when test_set and train_set have different unique values?


The train_set is:

  type
0    a
1    b
2    c
3    d
4    e

If I use pd.get_dummies, I will get 5 columns:

   type_a  type_b  type_c  type_d  type_e
0       1       0       0       0       0
1       0       1       0       0       0
2       0       0       1       0       0
3       0       0       0       1       0
4       0       0       0       0       1

The test_set is:

  type
0    a
1    b
2    c
3    d

If I use pd.get_dummies, I will get only 4 columns:

   type_a  type_b  type_c  type_d
0       1       0       0       0
1       0       1       0       0
2       0       0       1       0
3       0       0       0       1

I want it to be:

   type_a  type_b  type_c  type_d  type_e
0       1       0       0       0       0
1       0       1       0       0       0
2       0       0       1       0       0
3       0       0       0       1       0

Solution

  • You can call reindex with the full list of desired columns and fill_value=0, so that any column missing from the test set is added and filled with zeros:

    pd.get_dummies(test_set).reindex(
        ["type_a", "type_b", "type_c", "type_d", "type_e"], axis=1, fill_value=0)
    

    Output:

    #    type_a  type_b  type_c  type_d  type_e
    # 0       1       0       0       0       0
    # 1       0       1       0       0       0
    # 2       0       0       1       0       0
    # 3       0       0       0       1       0
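Hardcoding the column list becomes tedious when there are many categories. A minimal sketch of one alternative, assuming both frames are available as `train_set` and `test_set` as in the question: take the column list from the encoded training set instead. The names `train_dummies` and `test_dummies` are only illustrative, and `dtype=int` is passed so the result contains 0/1 integers regardless of pandas version.

```python
import pandas as pd

# Example frames matching the question.
train_set = pd.DataFrame({"type": ["a", "b", "c", "d", "e"]})
test_set = pd.DataFrame({"type": ["a", "b", "c", "d"]})

# Encode the training set first; its columns define the target layout.
train_dummies = pd.get_dummies(train_set, dtype=int)

# Encode the test set, then align it to the training columns.
# Columns missing from the test set (here type_e) are added and filled with 0.
test_dummies = pd.get_dummies(test_set, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(test_dummies)
```

Note that reindex also drops any column that is not in the requested list, so a category that appears only in the test set would be silently discarded rather than raising an error.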