
Model Pandas Data Frame column as List of Category


I would like to reduce the memory footprint of my Pandas dataframe. I am parsing JSON in which some of the columns are multi-valued lists of categories, such as:

"querySelectedBrands":["b1","b2","b3"]

This is automatically inferred as an 'object' column, but ideally it would be a list of categories. When a column is single-valued and categorical, the conversion is simple:

interactions[col] = interactions[col].astype('category')

But what about a column whose type I want to set to a list of categories? Later on I will encode this column by transforming it into a series of Boolean columns, so I am not sure whether the initial memory saving from converting to a list of 'category' would be worthwhile. Thanks!


Solution

  • No, this isn't possible

    Using a Pandas series to hold lists is inadvisable: it will always have dtype object, and each element is just a pointer to an arbitrary Python object. As a result, operations on such a series cannot be vectorised and carry a memory overhead.
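
To illustrate the point, here is a minimal sketch. The sample values are hypothetical, modelled on the querySelectedBrands field from the question. It shows that a series of lists is stored as object, while a single-valued series converts to category without trouble:

```python
import pandas as pd

# Hypothetical sample data: each row holds a Python list of brand labels
list_series = pd.Series([["b1", "b2", "b3"], ["b1"], ["b2", "b3"]])

# The series only stores references to list objects, so the dtype is object
print(list_series.dtype)
print(type(list_series.iloc[0]))

# A single-valued column, by contrast, converts to a categorical dtype
single_series = pd.Series(["b1", "b2", "b1"]).astype("category")
print(single_series.dtype)
```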

    An alternative

    If every list holds the same fixed number of items, you can split the series of lists into multiple series (see "Pandas split column of lists into multiple columns"), then make each resulting series categorical:

    for col in ['col1', 'col2', 'col3']:
        df[col] = df[col].astype('category')
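
Putting the two steps together, a rough end-to-end sketch might look like the following. The sample data, the brands column name, and the col1/col2/col3 names are assumptions for illustration, and it only works when every list has exactly three items:

```python
import pandas as pd

# Hypothetical data: every list has exactly three items
df = pd.DataFrame({
    "brands": [["b1", "b2", "b3"], ["b2", "b3", "b1"], ["b1", "b1", "b2"]],
})

cols = ["col1", "col2", "col3"]

# Expand each list into its own column, keeping the original index
split = pd.DataFrame(df["brands"].tolist(), index=df.index, columns=cols)

# Convert each of the new columns to a categorical dtype
for col in cols:
    split[col] = split[col].astype("category")

print(split.dtypes)
```

If the lists vary in length, this approach does not apply directly, since the shorter rows would be padded with missing values.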