I would like to reduce the memory footprint of my Pandas dataframe.
I am parsing JSON in which some of the columns are multi-valued lists of categories, such as:
"querySelectedBrands":["b1","b2","b3"]
This is automatically inferred as an 'object' column, but ideally it would be a list of categories. When a column is a single-valued categorical, the conversion is quite simple:
interactions[col] = interactions[col].astype('category')
But what about a column whose type I want to set as a list of categories? Later on I will encode this column by transforming it into a series of Boolean columns, so I am not sure whether the initial memory saving from converting it to a list of 'category' values would be worthwhile. Thanks!
Using a Pandas series to hold lists is inadvisable because it will always be of dtype object and will hold pointers to arbitrary Python objects. As such, operations on such a series will not be vectorisable and will carry a memory overhead.
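As a rough illustration of the point above, here is a minimal sketch that inspects a list-valued column. The dataframe contents are made up for the example; only the column name comes from the question.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's list-valued column.
interactions = pd.DataFrame({
    "querySelectedBrands": [["b1", "b2", "b3"], ["b2"], ["b1", "b3"]]
})

col = interactions["querySelectedBrands"]
print(col.dtype)  # object

# Shallow usage counts only the pointers stored in the series;
# deep usage also adds the size of each Python list object they point to.
shallow = col.memory_usage(index=False, deep=False)
deep = col.memory_usage(index=False, deep=True)
print(shallow, deep)
```

The gap between the two figures is the per-row cost of the list objects themselves, which is the overhead a categorical column avoids.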
If you have a fixed number of items in each list, you can split your series of lists into multiple series; see "Pandas split column of lists into multiple columns". Then make each series a categorical:
for col in ['col1', 'col2', 'col3']:
    df[col] = df[col].astype('category')
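Putting the two steps together, the following is a sketch under the assumption that every list has exactly three items. The `brands` column and its sample values are hypothetical, and the `col1`..`col3` names simply follow the loop above.

```python
import pandas as pd

# Hypothetical data: every row holds a list of exactly three brands.
df = pd.DataFrame({
    "brands": [["b1", "b2", "b3"], ["b2", "b3", "b1"], ["b1", "b1", "b2"]]
})

cols = ["col1", "col2", "col3"]

# Expand each list into its own row of a new dataframe, one column per position.
split = pd.DataFrame(df["brands"].tolist(), index=df.index, columns=cols)

# Convert each resulting single-valued column to a categorical.
for col in cols:
    split[col] = split[col].astype("category")

print(split.dtypes)
```

If the lists vary in length this approach pads the shorter rows with missing values, so it fits best when the number of items per list really is fixed.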