I have a DataFrame where I want to change several columns from type 'object' to 'category'.
I can change several columns at the same time for float,
dftest[['col3', 'col4', 'col5', 'col6']] = \
dftest[['col3', 'col4', 'col5', 'col6']].astype(float)
For 'category' I can not do it the same, I need to do one by one (or in a loop like here).
for col in ['col1', 'col2']:
dftest[col] = dftest[col].astype('category')
Question: Is there any way of doing the change for all wanted columns at once like in the 'float' example?
If I try to do several columns at the same time I have:
dftest[['col1','col2']] = dftest[['col1','col2']].astype('category')
## NotImplementedError: > 1 ndim Categorical are not supported at this time
My current working test code:
import numpy as np
import pandas as pd
factors= np.array([
['a', 'xx'],
['a', 'xx'],
['ab', 'xx'],
['ab', 'xx'],
['ab', 'yy'],
['cc', 'yy'],
['cc', 'zz'],
['d', 'zz'],
['d', 'zz'],
['g', 'zz']
])
values = np.random.randn(10,4).round(2)
dftest = pd.DataFrame(np.hstack([factors,values]),
columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])
#dftest[['col1','col2']] = dftest[['col1','col2']].astype('category')
## NotImplementedError: > 1 ndim Categorical are not supported at this time
## it works with individual astype
#dftest['col2'] = dftest['col2'].astype('category')
#dftest['col1'] = dftest['col1'].astype('category')
print(dftest)
## doing a loop
for col in ['col1', 'col2']:
dftest[col] = dftest[col].astype('category')
dftest[['col3', 'col4', 'col5', 'col6']] = \
dftest[['col3', 'col4', 'col5', 'col6']].astype(float)
dftest.dtypes
output:
col1 category
col2 category
col3 float64
col4 float64
col5 float64
col6 float64
dtype: object
== [update] ==
I don't have a problem using the loop now that I know the trick, but I asked the question because I wanted to learn/understand WHY I need to do a loop for 'category' and not for float, if there is no other way of doing it.
It's not immediately clear what the result of dftest[['col1','col2']].astype('category')
should be, i.e. whether the resulting columns should share the same categories or not.
Looping over columns make each column have a separate set of categories. (I believe this is a desired result in your example.)
On the other hand, .astype(float)
works differently: it ravels the underlying values to a 1d array, casts it to floats, and then reshape it back to the original shape. This way it may be faster than just iterating over columns. You can emulate this behaviour for category
with higher level functions:
result = dftest[['col1', 'col2']].stack().astype('category').unstack()
But then you get a single set of categories shared by the both columns:
result['col1']
Out[36]:
0 a
1 a
2 ab
3 ab
4 ab
5 cc
6 cc
7 d
8 d
9 g
Name: col1, dtype: category
Categories (8, object): [a < ab < cc < d < g < xx < yy < zz]