python pandas scikit-learn data-analysis categorical-data

How to transform ordinal values into categorical ones?


In a Pandas DataFrame, a column represents a categorical feature (e.g. whether the day is a working day or a weekend) in ordinal numerical form (say, 1 for a working day, 2 for a weekend). How can it be transformed so that it represents the values in a categorical way, something like (0, 1) for working days and (1, 0) for weekends, so that the values are not comparable?

There is the alternative of using pd.get_dummies (or the OneHotEncoder), which would create two columns of 0s and 1s, and then merging the two columns into tuples, but is there no direct way of doing that?

Example: I have:

    datetime    temp    daytype
0   2011-01-01  9.84    2
1   2011-01-02  9.02    2
2   2011-01-03  9.02    1
3   2011-01-04  9.84    1
4   2011-01-05  9.84    1
5   2011-01-06  9.84    1

I would like:

    datetime    temp    daytype
0   2011-01-01  9.84    (1, 0)
1   2011-01-02  9.02    (1, 0)
2   2011-01-03  9.02    (0, 1)
3   2011-01-04  9.84    (0, 1)
4   2011-01-05  9.84    (0, 1)
5   2011-01-06  9.84    (0, 1)

(I'm starting to think that maybe I'm getting it wrong - is this not the default way of representing categorical values?)


Solution

  • You can create the dummy (one-hot) columns and then combine them into tuples:

    Your original data looks something like this:

    import pandas as pd
    df = pd.DataFrame({"daytype": [2, 2, 1, 1, 1, 2]})
    print(df)
    
       daytype
    0        2
    1        2
    2        1
    3        1
    4        1
    5        2
    

    We can create dummy variables, which, as you correctly pointed out, results in separate columns:

    dummies = pd.get_dummies(df["daytype"]).astype(int)
    print(dummies)
    
       1  2
    0  0  1
    1  0  1
    2  1  0
    3  1  0
    4  1  0
    5  0  1
    

    You can then zip those separate columns together and assign the result back as a new column in the original DataFrame:

    df["combined"] = list(zip(dummies[1], dummies[2]))
    

    Giving you:

    print(df)
    
       daytype combined
    0        2   (0, 1)
    1        2   (0, 1)
    2        1   (1, 0)
    3        1   (1, 0)
    4        1   (1, 0)
    5        2   (0, 1)
    

    Of course, you can replace the original column entirely with the combined one if you want; I kept them separate for clarity.
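
As a rough sketch of that replacement, the snippet below overwrites daytype in place and builds each tuple from a whole row of the dummy table, so it should also work with more than two categories. The descending column order (2 before 1) is an assumption, chosen only so that weekends come out as (1, 0) as in the question; the name `ordered` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"daytype": [2, 2, 1, 1, 1, 2]})

# One indicator column per distinct value, as 0/1 integers.
dummies = pd.get_dummies(df["daytype"]).astype(int)

# Put the columns in descending label order (2, then 1) so that
# weekends (2) become (1, 0) and working days (1) become (0, 1).
ordered = dummies[sorted(dummies.columns, reverse=True)]

# Turn each row of indicators into a plain tuple and overwrite the column.
df["daytype"] = [tuple(row) for row in ordered.to_numpy().tolist()]

print(df)
```

Note that a column of tuples has object dtype, so most numerical libraries will not accept it directly; the separate dummy columns are usually the more practical input for a model.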

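    The above gives tuples of the kind you posted in the original question, although with the elements in the opposite order: here 2 maps to (0, 1) and 1 maps to (1, 0). Swap the arguments, i.e. zip(dummies[2], dummies[1]), to match your example exactly. Alternatively, you can set the dtype of that specific column directly, as mentioned in the comments: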

    df['daytype'] = df['daytype'].astype('category')
    

    Ultimately, it comes down to what you want to use the column for.
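
To illustrate what the category dtype does for the "not comparable" requirement, here is a minimal sketch on the same toy data. The flag `comparison_allowed` is only an illustrative name; the behaviour shown is that of an unordered categorical, which is what astype('category') produces by default.

```python
import pandas as pd

df = pd.DataFrame({"daytype": [2, 2, 1, 1, 1, 2]})
df["daytype"] = df["daytype"].astype("category")

# The distinct values become categories, unordered by default.
print(df["daytype"].cat.categories.tolist())  # [1, 2]
print(df["daytype"].cat.ordered)              # False

# Equality tests still work on an unordered categorical...
is_weekend = df["daytype"] == 2

# ...but ordering comparisons are rejected with a TypeError.
try:
    df["daytype"] < 2
    comparison_allowed = True
except TypeError:
    comparison_allowed = False

print(comparison_allowed)
```

The stored values are still 1 and 2, so this changes how pandas treats the column rather than how a downstream model sees it; for model input, one-hot columns are typically still needed.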