
Sort Pandas columns by number of unique groups


I'm trying to sort a section of the categorical data from the Kaggle Housing Prices dataset, ordering the columns by the number of unique categories each contains. To avoid cluttering the question, I've included a small slice of the DataFrame (80 columns in total) to demonstrate the problem.

    LotConfig    Street    MSZoning
0   Inside       Pave      RL
1   FR2          Pave      RL
2   Corner       Grvl      RL

I'm aiming to sort the columns to achieve the following result.

    MSZoning    Street    LotConfig    
0   RL          Pave      Inside       
1   RL          Pave      FR2          
2   RL          Grvl      Corner      

I'm currently using the following code.

sorted_categories = sorted(df.columns, key=lambda col: df[col].nunique())
df = df[sorted_categories]

Is there a better way to solve this problem? Am I missing a pandas method to achieve this?


Solution

  • You can call nunique directly and index your initial DataFrame using the result:

    u = df.nunique().sort_values().index
    df[u]
    

      MSZoning Street LotConfig
    0       RL   Pave    Inside
    1       RL   Pave       FR2
    2       RL   Grvl    Corner
    

    df.nunique() will return a Series with the number of unique values in each column.

    >>> df.nunique()
    LotConfig    3
    Street       2
    MSZoning     1
    dtype: int64
    

    However, on your small example your approach is actually faster than mine, albeit a bit more verbose. I would benchmark both on your actual DataFrame: your method isn't incorrect, and if it performs better there, go with it. That said, I'd expect my method to be faster on a large frame, since it computes all the counts in a single nunique call rather than invoking nunique separately for each column.
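    Putting it together, here is a minimal self-contained sketch using a frame mirroring the question's excerpt (the data is reconstructed from the sample above):

    ```python
    import pandas as pd

    # Small frame matching the question's excerpt
    df = pd.DataFrame({
        "LotConfig": ["Inside", "FR2", "Corner"],   # 3 unique values
        "Street":    ["Pave", "Pave", "Grvl"],      # 2 unique values
        "MSZoning":  ["RL", "RL", "RL"],            # 1 unique value
    })

    # Count unique values per column, sort ascending, and reindex the columns
    order = df.nunique().sort_values().index
    sorted_df = df[order]

    print(list(sorted_df.columns))  # ['MSZoning', 'Street', 'LotConfig']
    ```

    Passing `ascending=False` to `sort_values` would instead put the column with the most categories first.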