Tags: r, dataframe, list, dplyr, lapply

How to filter a list of dataframes based on a unique count of categorical factors in each dataframe?


I have a dataframe that I split into a list of dataframes based on a categorical variable in the dataframe:

list <- split(mpg, mpg$manufacturer)

I want to filter the list so that it keeps only the dataframes in which a given categorical column contains at least 5 unique values, and drops those with fewer than 5. I have tried lapply with filter, but that filters the rows of each dataframe rather than the list itself. I have also tried filteredlist <- lapply(list, function(x) length(unique(x$class) >= 5)) and am stumped.

Thanks, any help would be appreciated!


Solution

  • First let's take a look at how many unique classes there are:

    sapply(list, \(x) length(unique(x$class)))
       #    audi  chevrolet      dodge       ford      honda    hyundai       jeep land rover    lincoln 
       #       2          3          3          3          1          2          1          1          1 
       # mercury     nissan    pontiac     subaru     toyota volkswagen 
       #       1          3          1          3          4          3 
    

    So, with this data, >= 5 isn't a great example because no manufacturer has 5 or more classes, so the result would be empty. Let's use >= 3 so we can expect a non-empty result.

    ## with Filter
    filteredlist <- Filter(list, f = function(x) length(unique(x$class)) >= 3)
    length(filteredlist)
    # [1] 7
    
    ## or with sapply and `[`
    sapply_filter = list[sapply(list, \(x) length(unique(x$class))) >= 3]
    length(sapply_filter)
    # [1] 7
    

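    Note that your attempt, lapply(list, function(x) length(unique(x$class) >= 5)), has a misplaced parenthesis: you want length(unique(x$class)) >= 5, not length(unique(x$class) >= 5). The second form takes the length of the comparison result, so it returns the number of unique classes rather than TRUE or FALSE. Also, lapply on its own only computes one value per dataframe; to drop elements from the list you still need Filter or `[` as shown above.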
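
To make the effect of the parenthesis placement concrete, here is a minimal sketch on a toy data frame. The data frame `cars`, the list `dfs`, and the threshold of 2 are made up for illustration and stand in for `mpg`, the split list, and your real cutoff; only base R is used.

```r
# Toy stand-in for mpg (hypothetical values), so the example runs without ggplot2
cars <- data.frame(
  manufacturer = c("a", "a", "a", "b", "b", "c"),
  class        = c("suv", "compact", "midsize", "suv", "suv", "compact"),
  stringsAsFactors = FALSE
)
dfs <- split(cars, cars$manufacturer)

# Misplaced parenthesis: the comparison happens inside length(), so this
# returns the length of a logical vector, i.e. the number of unique classes
wrong <- sapply(dfs, function(x) length(unique(x$class) >= 2))

# Correct placement: count the unique classes first, then compare the count
right <- sapply(dfs, function(x) length(unique(x$class)) >= 2)

# Use the logical vector to subset the list itself
kept <- dfs[right]
names(kept)
```

With this toy data, `wrong` is a vector of counts (3, 1, 1) and `right` is a logical vector, which is what `[` needs in order to keep or drop whole dataframes.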