
How to use lapply to count unique values from a list in R


I asked a similar question here before about counting unique values in a data frame, but the approach I used then doesn't work with a list (or I can't get it to work), so I need to use "lapply" instead. I have also been told that using one of the apply functions would be better.

This represents my data:

species1 <- data.frame(var_1 = c("a","a","a","b", "b", "b"), var_2 = c("c","c","d", "d", "e", "e"))

species2 <- data.frame(var_1 = c("f","f","f","g", "g", "g"), var_2 = c("h","h","i", "i", "j", "j"))

all_species <- list()

all_species[["species1"]] <- species1
all_species[["species2"]] <- species2

I want to use lapply to get the number of unique rows per var_1 value for each data frame in my list. For example, I need output like:

count_all_species <- list()
count_all_species[["species1"]] <- data.frame(var_1 = c("a", "b"), unique_number = c("2", "2"))

Then the same for the second data frame, using the "lapply" function.


Solution

  • Here is an option with the tidyverse. We loop over the list of data frames with map, group by 'var_1', and summarise to get the number of distinct elements in 'var_2' with n_distinct.

    library(dplyr)
    library(purrr)
    map(all_species, ~ .x %>%
                         group_by(var_1) %>% 
                         summarise(unique_number = n_distinct(var_2)))
    

    Alternatively, keep only the distinct rows of each data frame in the list and then count the rows per 'var_1'.

    map(all_species, ~ .x %>% 
                         distinct() %>% 
                         dplyr::count(var_1))
    

    Update

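    If the variable name changes between data frames, we can select the column by position in summarise_at. Note that the result column then keeps the original column name rather than 'unique_number'.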

    map(all_species, ~ .x %>%
                         group_by(var_1) %>% 
                         summarise_at(1, n_distinct))
    

    Another option is to convert the column name string to a symbol with rlang::sym and then unquote it with !!.

    map(all_species, ~ .x %>%
                         group_by(var_1) %>%
                         summarise(unique_number = n_distinct(!! rlang::sym(names(.x)[2]))))
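
Since the question asks specifically for lapply, the same per-group count can also be sketched in base R. This is a minimal sketch, not part of the original answer: it assumes the columns are named var_1 and var_2 as in the sample data, and it uses aggregate to count the distinct var_2 values within each var_1 group.

```r
# Sample data from the question
species1 <- data.frame(var_1 = c("a", "a", "a", "b", "b", "b"),
                       var_2 = c("c", "c", "d", "d", "e", "e"))
species2 <- data.frame(var_1 = c("f", "f", "f", "g", "g", "g"),
                       var_2 = c("h", "h", "i", "i", "j", "j"))
all_species <- list(species1 = species1, species2 = species2)

# For each data frame in the list, count the distinct var_2 values
# within each var_1 group (assumes the columns are named var_1 and var_2)
count_all_species <- lapply(all_species, function(df) {
  out <- aggregate(var_2 ~ var_1, data = df,
                   FUN = function(x) length(unique(x)))
  # Rename the count column to match the output requested in the question
  names(out)[2] <- "unique_number"
  out
})

count_all_species[["species1"]]
```

With the sample data, every var_1 group has two distinct var_2 values, so both data frames in the result contain a count of 2 per group. Unlike the expected output in the question, the counts are stored as numbers rather than as character strings.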