I'm working with the categorical columns of the Kaggle Housing Prices dataset, and I want to sort the columns by their number of unique categories. To avoid cluttering the question, I've included a small slice of the DataFrame (80 columns in total) to demonstrate the problem.
  LotConfig Street MSZoning
0    Inside   Pave       RL
1       FR2   Pave       RL
2    Corner   Grvl       RL
I'm aiming to sort the columns to achieve the following result.
  MSZoning Street LotConfig
0       RL   Pave    Inside
1       RL   Pave       FR2
2       RL   Grvl    Corner
I'm currently using the following code.
sorted_categories = sorted(df.columns, key=lambda col: df[col].nunique())
df = df[sorted_categories]
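For reference, here is a self-contained sketch of that approach, rebuilding the three sample columns shown above:

```python
import pandas as pd

# The small sample frame from the question
df = pd.DataFrame({
    "LotConfig": ["Inside", "FR2", "Corner"],  # 3 unique values
    "Street": ["Pave", "Pave", "Grvl"],        # 2 unique values
    "MSZoning": ["RL", "RL", "RL"],            # 1 unique value
})

# Sort column names by the number of unique values (ascending)
sorted_categories = sorted(df.columns, key=lambda col: df[col].nunique())
df = df[sorted_categories]
print(list(df.columns))  # ['MSZoning', 'Street', 'LotConfig']
```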
Is there a better way to solve this problem? Am I missing a pandas method to achieve this?
You can call nunique directly on the DataFrame and index it with the sorted result:
u = df.nunique().sort_values().index
df[u]
  MSZoning Street LotConfig
0       RL   Pave    Inside
1       RL   Pave       FR2
2       RL   Grvl    Corner
df.nunique()
returns a Series holding the number of unique values in each column.
>>> df.nunique()
LotConfig 3
Street 2
MSZoning 1
dtype: int64
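Putting it together as a runnable snippet (the three columns are reconstructed from the sample in the question):

```python
import pandas as pd

df = pd.DataFrame({
    "LotConfig": ["Inside", "FR2", "Corner"],
    "Street": ["Pave", "Pave", "Grvl"],
    "MSZoning": ["RL", "RL", "RL"],
})

# One nunique() call over the whole frame, sorted ascending,
# then use the resulting index to reorder the columns
u = df.nunique().sort_values().index
result = df[u]
print(list(result.columns))  # ['MSZoning', 'Street', 'LotConfig']
```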
However, on your small example, your approach is actually faster than mine, albeit a bit more verbose. Your method isn't incorrect, so I would benchmark both on your actual DataFrame and keep whichever performs better. On a large frame, mine should win, since it calls nunique once on the whole DataFrame rather than once per column.
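A quick way to benchmark the two approaches yourself, sketched here on a synthetic frame (the column names, category counts, and 1460-row size are made-up stand-ins, chosen only to roughly match the shape of the real dataset):

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in: 80 categorical columns, 1460 rows
rng = np.random.default_rng(0)
df = pd.DataFrame({
    f"col{i}": rng.choice([f"cat{j}" for j in range(i % 10 + 2)], size=1460)
    for i in range(80)
})

# Approach 1: per-column nunique() inside a sort key
t_sorted = timeit.timeit(
    lambda: df[sorted(df.columns, key=lambda c: df[c].nunique())], number=100
)

# Approach 2: single frame-wide nunique(), then reindex
t_vector = timeit.timeit(
    lambda: df[df.nunique().sort_values().index], number=100
)

print(f"per-column: {t_sorted:.3f}s  frame-wide: {t_vector:.3f}s")
```

Note that with ties in cardinality the two approaches may order tied columns differently (sorted() is stable; Series.sort_values uses quicksort by default), but both produce a valid ascending order.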