Tags: python, pandas, categorical-data

Cannot interpret 'CategoricalDtype'


I would like to group the columns of my DataFrame by their dtype. For example, the columns and dtypes are:

Tweets  ID     Registration Date  num_unique_words  photo_profile  range
object  int64  object             float64           int64          category

What I tried is:

type_dct = {str(k): list(v) for k, v in df.groupby(df.dtypes, axis=1)}

but I got a TypeError:

TypeError: Cannot interpret 'CategoricalDtype(categories=['<5', '>=5'], ordered=True)' as a data type

range can take two values: '<5' and '>=5'.

How can I handle this error? Sample data:

import pandas as pd

df = pd.DataFrame({'Tweets': ['Tweet 1 from user 1', 'Tweet 2 from user 1', 
                              'Tweet 1 from user 3', 'Tweet 10 from user 1'], 
                   'ID': [124, 124, 12, 124], 
                   'Registration Date': ['2020-12-02', '2020-11-21', 
                                         '2020-12-02', '2020-12-02'], 
                   'num_unique_words': [41, 42, 12, 69], 
                   'photo_profile': [1, 0, 1, 1], 
                   'range': ['<5', '<5', '>=5', '<5']}, 
                  index=['falcon', 'dog', 'spider', 'fish'])
# Make 'range' an ordered categorical, matching the dtype in the error message
df['range'] = pd.Categorical(df['range'], categories=['<5', '>=5'], ordered=True)

Solution

  • Update:

    That was surprisingly more complicated than I thought it would be, but here is a workaround using a list comprehension:

    type_dct = {str(k): list(v) for k, v in df.groupby([i.name for i in df.dtypes], axis=1)}
    

    Output:

    {'category': ['range'],
     'int64': ['ID', 'num_unique_words', 'photo_profile'],
     'object': ['Tweets', 'Registration Date']}
    

    A pd.CategoricalDtype object by itself doesn't work well as a groupby key, so we must use the name attribute of each dtype object (a plain string such as 'category') instead.
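The same idea, grouping on the name attribute of each dtype, can also be written without groupby at all. This may be useful because, as far as I know, newer pandas releases deprecate the axis=1 argument of groupby. The following is only a sketch under that assumption; it rebuilds the DataFrame from the question, and the use of defaultdict is my own choice, not part of the original answer.

```python
from collections import defaultdict

import pandas as pd

# Sample data from the question, with 'range' as an ordered categorical
df = pd.DataFrame({'Tweets': ['Tweet 1 from user 1', 'Tweet 2 from user 1',
                              'Tweet 1 from user 3', 'Tweet 10 from user 1'],
                   'ID': [124, 124, 12, 124],
                   'Registration Date': ['2020-12-02', '2020-11-21',
                                         '2020-12-02', '2020-12-02'],
                   'num_unique_words': [41, 42, 12, 69],
                   'photo_profile': [1, 0, 1, 1],
                   'range': ['<5', '<5', '>=5', '<5']},
                  index=['falcon', 'dog', 'spider', 'fish'])
df['range'] = pd.Categorical(df['range'], categories=['<5', '>=5'], ordered=True)

# Collect column names under the string name of their dtype
grouped = defaultdict(list)
for column, dtype in df.dtypes.items():
    grouped[dtype.name].append(column)

type_dct = dict(grouped)
print(type_dct)
```

The categorical column ends up under the key 'category' and the integer columns under 'int64'. The key for the two text columns depends on how the installed pandas version stores strings.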


    Use pd.DataFrame.select_dtypes.

    Example from the pandas docs:

    >>> df = pd.DataFrame({'a': [1, 2] * 3,
    ...                    'b': [True, False] * 3,
    ...                    'c': [1.0, 2.0] * 3})
    >>> df
       a      b    c
    0  1   True  1.0
    1  2  False  2.0
    2  1   True  1.0
    3  2  False  2.0
    4  1   True  1.0
    5  2  False  2.0

    >>> df.select_dtypes(include='bool')
           b
    0   True
    1  False
    2   True
    3  False
    4   True
    5  False

    >>> df.select_dtypes(include=['float64'])
         c
    0  1.0
    1  2.0
    2  1.0
    3  2.0
    4  1.0
    5  2.0

    >>> df.select_dtypes(exclude=['int64'])
           b    c
    0   True  1.0
    1  False  2.0
    2   True  1.0
    3  False  2.0
    4   True  1.0
    5  False  2.0
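
Applied to the DataFrame from the question, select_dtypes might be used as in the sketch below. The three variable names are made up for illustration, and the split into "categorical", "numeric" and "everything else" is just one possible grouping, not something the original answer prescribes.

```python
import pandas as pd

# Sample data from the question, with 'range' as an ordered categorical
df = pd.DataFrame({'Tweets': ['Tweet 1 from user 1', 'Tweet 2 from user 1',
                              'Tweet 1 from user 3', 'Tweet 10 from user 1'],
                   'ID': [124, 124, 12, 124],
                   'Registration Date': ['2020-12-02', '2020-11-21',
                                         '2020-12-02', '2020-12-02'],
                   'num_unique_words': [41, 42, 12, 69],
                   'photo_profile': [1, 0, 1, 1],
                   'range': ['<5', '<5', '>=5', '<5']},
                  index=['falcon', 'dog', 'spider', 'fish'])
df['range'] = pd.Categorical(df['range'], categories=['<5', '>=5'], ordered=True)

# Columns with a categorical dtype
categorical_cols = df.select_dtypes(include='category').columns.tolist()

# Columns with any numeric dtype (integers and floats)
numeric_cols = df.select_dtypes(include='number').columns.tolist()

# Everything that is neither numeric nor categorical (here: the text columns)
other_cols = df.select_dtypes(exclude=['number', 'category']).columns.tolist()

print(categorical_cols)
print(numeric_cols)
print(other_cols)
```

Unlike the groupby approach, this requires naming the dtype families up front, but it never has to interpret a CategoricalDtype object as a grouping key.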