Remove columns with factors that have fewer than 5 observations per level


I have a dataset with more than 100 columns, all of type factor. For example:

          animal               fruit               vehicle              color 
             cat              orange                   car               blue 
             dog               apple                   bus              green 
             dog               apple                   car              green 
             dog              orange                   bus              green

In my dataset I need to remove every column that contains a factor level with fewer than 5 observations. In this small example, if I wanted to remove every column that has a level with 1 observation or fewer, like blue or cat, the algorithm would remove the columns animal and color. What is the most elegant way to do this?


Solution

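  • We can use Filter with table. table(x) counts the observations per level, and Filter keeps only the columns where no level has fewer than 2 observations (for the real data, replace 2 with 5)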

    Filter(function(x) !any(table(x) < 2), df1)
    #   fruit vehicle
    #1 orange     car
    #2  apple     bus
    #3  apple     car
    #4 orange     bus
    

    data

    df1 <- structure(list(animal = structure(c(1L, 2L, 2L, 2L), .Label = c("cat", 
    "dog"), class = "factor"), fruit = structure(c(2L, 1L, 1L, 2L
    ), .Label = c("apple", "orange"), class = "factor"), vehicle = structure(c(2L, 
    1L, 2L, 1L), .Label = c("bus", "car"), class = "factor"), color = structure(c(1L, 
    2L, 2L, 2L), .Label = c("blue", "green"), class = "factor")),
    row.names = c(NA, 
    -4L), class = "data.frame")
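
The same idea can be wrapped in a small function with the threshold as a parameter, so the cutoff of 5 from the question is not hard-coded. This is only a sketch: the name keep_dense_columns and the argument min_n are made up for illustration, and the call to droplevels is an added assumption that unused factor levels (which table would count as 0) should not cause a column to be dropped.

```r
# Example data from the question, rebuilt with data.frame();
# stringsAsFactors = TRUE makes every column a factor.
df1 <- data.frame(
  animal  = c("cat", "dog", "dog", "dog"),
  fruit   = c("orange", "apple", "apple", "orange"),
  vehicle = c("car", "bus", "car", "bus"),
  color   = c("blue", "green", "green", "green"),
  stringsAsFactors = TRUE
)

# Hypothetical helper (not part of base R): keep only the columns in which
# every observed level occurs at least `min_n` times.
keep_dense_columns <- function(df, min_n = 5) {
  has_enough <- function(x) {
    # droplevels() removes unused levels so they are not counted as 0
    counts <- table(droplevels(as.factor(x)))
    all(counts >= min_n)
  }
  # Filter() on a data frame returns the columns for which has_enough is TRUE
  Filter(has_enough, df)
}

# Same threshold as the answer above: every level must appear at least twice.
res <- keep_dense_columns(df1, min_n = 2)
names(res)
```

With min_n = 2 this keeps fruit and vehicle, matching the output above. With the default min_n = 5 the four-row example loses all of its columns, since no level can reach 5 observations.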