R: drop factors with certain values


I have a data.frame containing a factor column. I want to (a) drop from the data.frame any rows where the value in that column does not appear in at least 8 rows and (b) drop those levels from the factor.

In the case below, that would be the levels C, D, and G.

> table(x.train$oilType)

 A  B  C  D  E  F  G 
30 21  3  6  9  8  2 

From what I can tell, 'droplevels' only drops levels that are not used at all. I gave this a shot with no success.

> droplevels(x.train$oilType[-c(C,D,G)])
Error in NextMethod("[") : object 'G' not found

Any guidance?


Solution

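  • You can use add_count() to add a column n holding the number of rows for each value of the factor, then filter() to keep only rows where that count is >= 8. Finally, drop the now-unused levels with droplevels() inside mutate(). The helper column n stays in the result; remove it with select(-n) if you do not need it.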

    library(dplyr)
    
    # Example factor
    df <- data.frame(fac = as.factor(c(rep("a", 3), rep("b", 8), rep("c", 9))))
    df$fac %>% table()
    #> .
    #> a b c 
    #> 3 8 9
    
    # Keep only rows where the value of `fac` for that row is observed in at least
    # 8 rows and drop unused levels
    result <- df %>%
      add_count(fac) %>%
      filter(n >= 8) %>%
      mutate(fac = droplevels(fac))
    
    print(result)
    #>    fac n
    #> 1    b 8
    #> 2    b 8
    #> 3    b 8
    #> 4    b 8
    #> 5    b 8
    #> 6    b 8
    #> 7    b 8
    #> 8    b 8
    #> 9    c 9
    #> 10   c 9
    #> 11   c 9
    #> 12   c 9
    #> 13   c 9
    #> 14   c 9
    #> 15   c 9
    #> 16   c 9
    #> 17   c 9
    
    levels(result$fac)
    #> [1] "b" "c"
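
The same idea can be sketched in base R, without dplyr. This is only a minimal sketch: x.train below is a hypothetical stand-in rebuilt from the counts shown in the question, and the names counts, level_counts, keep and x.kept are made up for illustration.

```r
# Hypothetical stand-in for x.train, rebuilt from the counts in the question
counts <- c(A = 30, B = 21, C = 3, D = 6, E = 9, F = 8, G = 2)
x.train <- data.frame(oilType = factor(rep(names(counts), times = counts)))

# Count rows per level, then pick the levels seen in at least 8 rows
level_counts <- table(x.train$oilType)
keep <- names(level_counts)[level_counts >= 8]

# Keep the matching rows; droplevels() then removes the now-unused levels
x.kept <- droplevels(x.train[x.train$oilType %in% keep, , drop = FALSE])

table(x.kept$oilType)
#> 
#>  A  B  E  F 
#> 30 21  9  8
```

Subsetting first matters here: droplevels() only removes levels that no longer occur in the data, so the rows for C, D and G have to be filtered out before it has anything to drop.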