
Handling columns with several levels in Python


I have a dataframe with several columns, each containing many factors/levels (10+). In every column, 3-4 factors make up 85-90% of the values. Going through each column and making dummy variables for the top 3-4 by hand would take a lot of time, and simply calling get_dummies on everything would greatly increase the size of the data. Is there a way to automatically turn the top 3-4 factors in each column into dummy variables and push the rest into an 'Others' category? I am using Python.


Solution

  • You could find the most frequent values in each column with value_counts().nlargest(3), then replace every value not in that top 3 with 'other' as you create your dummies.

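    import pandas as pd

    df = pd.DataFrame({'type': ['a','a','a','b','b','b','c','d','e'],
                       'size': ['s','s','s','m','m','s','l','l','xl']})

    for col in ['type', 'size']:
        # The three most frequent values in this column
        top = df[col].value_counts().nlargest(3).index
        # Keep the top values and relabel everything else as 'other'
        collapsed = df[col].where(df[col].isin(top), 'other')
        # Append the dummies; dtype=int gives 0/1 (pandas 2.x defaults to bool)
        df = pd.concat([df, pd.get_dummies(collapsed, prefix=col, dtype=int)],
                       axis=1)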
    

    Output

      type size  type_a  type_b  type_c  type_other  size_l  size_m  size_other  size_s
    0    a    s       1       0       0           0       0       0           0       1
    1    a    s       1       0       0           0       0       0           0       1
    2    a    s       1       0       0           0       0       0           0       1
    3    b    m       0       1       0           0       0       1           0       0
    4    b    m       0       1       0           0       0       1           0       0
    5    b    s       0       1       0           0       0       0           0       1
    6    c    l       0       0       1           0       1       0           0       0
    7    d    l       0       0       0           1       1       0           0       0
    8    e   xl       0       0       0           1       0       0           1       0
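
If you need this for many columns, the loop above could be wrapped in a small helper. The following is only a sketch: the names `collapse_rare` and `top_n_dummies`, and the `top_n` and `other_label` parameters, are made up for illustration and are not part of pandas. It assumes the columns hold string-like values, and that ties at the cutoff can be broken in whatever order `value_counts()` returns them.

```python
import pandas as pd


def collapse_rare(series, top_n=3, other_label="other"):
    """Keep the top_n most frequent values; relabel the rest (hypothetical helper)."""
    # Cast to object so a new label can be written even into categorical data
    series = series.astype(object)
    top = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top), other_label)


def top_n_dummies(df, columns, top_n=3, other_label="other"):
    """Return df with dummy columns for the collapsed categories appended."""
    columns = list(columns)
    # Collapse each requested column independently
    collapsed = pd.DataFrame(
        {col: collapse_rare(df[col], top_n, other_label) for col in columns}
    )
    # One dummy column per surviving level, prefixed with the column name
    dummies = pd.get_dummies(collapsed, columns=columns, dtype=int)
    return pd.concat([df, dummies], axis=1)


# Usage with the example frame from the answer
example = pd.DataFrame({'type': ['a','a','a','b','b','b','c','d','e'],
                        'size': ['s','s','s','m','m','s','l','l','xl']})
encoded = top_n_dummies(example, ['type', 'size'], top_n=3)
print(encoded.columns.tolist())
```

Each column then yields at most `top_n + 1` dummy columns, so the width of the result grows with the number of columns you encode rather than with the number of distinct levels.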