Tags: python, sequential, one-hot-encoding, dummy-variable

How to encode dummy variables in Python for sequential data such that the same order is maintained always?


A simple issue really: I have a dataset that is too large to hold in memory, so I must load it and perform machine learning on it sequentially, one slice at a time. One of my features is categorical and I would like to convert it to dummy variables, but I have two issues:

1) Not all of the categories are present in every slice, so I would like to add the missing categories even if they do not appear in the current slice.

2) The dummy columns must keep the same order from one slice to the next.



This is an example of the problem:

In[1]: import pandas as pd
        splice1 = pd.Series(list('bdcccb'))
Out[1]: 0    b
        1    d
        2    c
        3    c
        4    c
        5    b 
        dtype: object

In[2]: splice2 = pd.Series(list('accd'))
Out[2]: 0    a
        1    c
        2    c
        3    d
        dtype: object

In[3]: splice1_dummy = pd.get_dummies(splice1)
Out[3]:     b   c   d
          0 1   0   0
          1 0   0   1
          2 0   1   0
          3 0   1   0
          4 0   1   0
          5 1   0   0

In[4]: splice2_dummy = pd.get_dummies(splice2)
Out[4]:     a   c   d
          0 1   0   0
          1 0   1   0
          2 0   1   0
          3 0   0   1

Edit: How should I deal with the N-1 rule? One dummy variable has to be dropped, but which one? Each new slice may contain a different set of categories.


Solution

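  • If you cast each slice to a categorical dtype whose categories are listed in the exact order you want, get_dummies keeps that order and creates a column for every listed category, whether or not it occurs in the slice. The code below shows how it's done.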

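    In[1]: import pandas as pd
           from pandas.api.types import CategoricalDtype
    
           # Cast every slice to the same categorical dtype; the category list
           # fixes both the set of dummy columns and their order.
           splice1 = pd.Series(list('bdcccb'))
           splice1 = splice1.astype(CategoricalDtype(categories=['a','c','b','d']))
    
           splice2 = pd.Series(list('accd'))
           splice2 = splice2.astype(CategoricalDtype(categories=['a','c','b','d']))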
    
    In[2]: splice1_dummy = pd.get_dummies(splice1)
    Out[2]:     a   c   b   d
            0   0   0   1   0
            1   0   0   0   1
            2   0   1   0   0
            3   0   1   0   0
            4   0   1   0   0
            5   0   0   1   0
    
    In[3]:  splice2_dummy = pd.get_dummies(splice2)
    Out[3]:     a   c   b   d
            0   1   0   0   0
            1   0   1   0   0
            2   0   1   0   0
            3   0   0   0   1
    

    Although, I still haven't solved the issue of which variable to drop.
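On the open question of which variable to drop, one possible approach (a sketch, not something I have tested on a large dataset) is to rely on the same fixed category list: with a shared categorical dtype, `drop_first=True` should remove the first listed category in every slice, so the baseline never changes. The helper name `encode_slice`, the `CATEGORIES` constant, the `feature` column name and the in-memory CSV are all made up for illustration, and the sketch assumes the full category list is known before streaming starts.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumption: every category that can occur is known up front.
CATEGORIES = ['a', 'c', 'b', 'd']
CAT_TYPE = CategoricalDtype(categories=CATEGORIES)


def encode_slice(values, drop_first=True):
    """One-hot encode one slice with a fixed column set and order (hypothetical helper)."""
    fixed = pd.Series(values).astype(CAT_TYPE)
    # Because the dtype is shared, drop_first removes CATEGORIES[0] ('a')
    # in every slice, whether or not 'a' occurs in that slice.
    return pd.get_dummies(fixed, drop_first=drop_first, dtype=int)


splice1_dummy = encode_slice(list('bdcccb'))
splice2_dummy = encode_slice(list('accd'))
print(list(splice1_dummy.columns))
print(list(splice2_dummy.columns))

# The same helper applied to a file read in chunks; the CSV text stands in
# for a dataset that is too large to load at once.
csv_text = "feature\nb\nd\nc\na\nc\nd\n"
chunk_columns = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    dummies = encode_slice(chunk['feature'])
    chunk_columns.append(list(dummies.columns))
```

Rows belonging to the dropped category come out as all zeros, which makes it the reference level. Passing `dtype=int` keeps the 0/1 output shown above, since recent pandas versions default to boolean columns. Note that a value outside `CATEGORIES` becomes NaN when cast and would also encode as an all-zero row, so the category list has to be complete.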