Search code examples
pythonpandasnumpydataframecategorization

How to change multiple Pandas DF columns to categorical without a loop


I have a DataFrame where I want to change several columns from type 'object' to 'category'.

I can change several columns at the same time for float,

dftest[['col3', 'col4', 'col5', 'col6']] = \
    dftest[['col3', 'col4', 'col5', 'col6']].astype(float)

For 'category' I can not do it the same, I need to do one by one (or in a loop like here).

for col in ['col1', 'col2']:
    dftest[col] = dftest[col].astype('category')

Question: Is there any way of doing the change for all wanted columns at once like in the 'float' example?

If I try to do several columns at the same time I have:

dftest[['col1','col2']] = dftest[['col1','col2']].astype('category')
## NotImplementedError: > 1 ndim Categorical are not supported at this time

My current working test code:

import numpy as np
import pandas as pd 

factors= np.array([
        ['a', 'xx'],
        ['a', 'xx'],
        ['ab', 'xx'],
        ['ab', 'xx'],
        ['ab', 'yy'],
        ['cc', 'yy'],
        ['cc', 'zz'],
        ['d', 'zz'],
        ['d', 'zz'],
        ['g', 'zz'] 
        ])

values = np.random.randn(10,4).round(2)

dftest = pd.DataFrame(np.hstack([factors,values]), 
                  columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])

#dftest[['col1','col2']] = dftest[['col1','col2']].astype('category')
## NotImplementedError: > 1 ndim Categorical are not supported at this time

## it works with individual astype
#dftest['col2'] = dftest['col2'].astype('category')
#dftest['col1'] = dftest['col1'].astype('category')

print(dftest)

## doing a loop
for col in ['col1', 'col2']:
    dftest[col] = dftest[col].astype('category')


dftest[['col3', 'col4', 'col5', 'col6']] = \
    dftest[['col3', 'col4', 'col5', 'col6']].astype(float)

dftest.dtypes

output:

col1    category
col2    category
col3     float64
col4     float64
col5     float64
col6     float64
dtype: object

== [update] ==

I don't have a problem using the loop now that I know the trick, but I asked the question because I wanted to learn/understand WHY I need to do a loop for 'category' and not for float, if there is no other way of doing it.


Solution

  • It's not immediately clear what the result of dftest[['col1','col2']].astype('category') should be, i.e. whether the resulting columns should share the same categories or not.

    Looping over columns make each column have a separate set of categories. (I believe this is a desired result in your example.)

    On the other hand, .astype(float) works differently: it ravels the underlying values to a 1d array, casts it to floats, and then reshape it back to the original shape. This way it may be faster than just iterating over columns. You can emulate this behaviour for category with higher level functions:

    result = dftest[['col1', 'col2']].stack().astype('category').unstack()
    

    But then you get a single set of categories shared by the both columns:

    result['col1']
    Out[36]: 
    0     a
    1     a
    2    ab
    3    ab
    4    ab
    5    cc
    6    cc
    7     d
    8     d
    9     g
    Name: col1, dtype: category
    Categories (8, object): [a < ab < cc < d < g < xx < yy < zz]