Tags: python, r, pandas, forcats

Equivalent of fct_lump in pandas


Is there a function in Python that does what R's fct_lump does, i.e. lumps all groups that are too small into a single 'Other' group?

Example below:

library(dplyr)
library(forcats)

> x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

> x
 [1] A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A B B B B B B B B
[49] B B C C C C C D D D D D D D D D D D D D D D D D D D D D D D D D D D E F G H I
Levels: A B C D E F G H I

> x %>% fct_lump_n(3)
 [1] A     A     A     A     A     A     A     A     A     A     A     A     A     A     A     A    
[17] A     A     A     A     A     A     A     A     A     A     A     A     A     A     A     A    
[33] A     A     A     A     A     A     A     A     B     B     B     B     B     B     B     B    
[49] B     B     Other Other Other Other Other D     D     D     D     D     D     D     D     D    
[65] D     D     D     D     D     D     D     D     D     D     D     D     D     D     D     D    
[81] D     D     Other Other Other Other Other
Levels: A B D Other

Solution

  • pip install siuba
    # (run this in a Python or Anaconda Prompt shell)
    
    # import the forcats ports from siuba:
    from siuba.dply.forcats import fct_lump, fct_reorder
    
    # fct_lump works like R's fct_lump:
    
    df['Your_column'] = fct_lump(df['Your_column'], n=10)
    
    df['Your_column'].value_counts()  # check the resulting levels
    
    # this keeps the 10 most frequent levels and lumps all the others into 'Other'
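
If installing siuba is not an option, the same lumping can be approximated with pandas alone. The sketch below is not part of the answer above: `lump_n` is a hypothetical helper name, it assumes a plain string/object column, and unlike forcats it applies no special handling for ties at the cutoff. It rebuilds the question's example and keeps the 3 most frequent values.

```python
import pandas as pd


def lump_n(s: pd.Series, n: int, other_level: str = "Other") -> pd.Series:
    """Keep the n most frequent values of s; replace the rest with other_level.

    Hypothetical helper, not a pandas or siuba function.
    """
    s = s.astype(object)                 # avoid categorical "unknown category" errors
    counts = s.value_counts()            # sorted by frequency, descending
    keep = counts.index[:n]              # the n most frequent values
    # Leave kept values and missing values untouched, relabel everything else
    return s.where(s.isin(keep) | s.isna(), other_level)


# Same data as the R example: A x40, B x10, C x5, D x27, E..I x1 each
letters = list("ABCDEFGHI")
times = [40, 10, 5, 27, 1, 1, 1, 1, 1]
x = pd.Series([letter for letter, t in zip(letters, times) for _ in range(t)])

lumped = lump_n(x, 3)
print(lumped.value_counts())  # A, D and B are kept; C and E..I become 'Other'
```

To get a categorical result comparable to an R factor, the returned Series can be converted with `.astype("category")`.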