Obtaining a vector with sapply and using it to remove rows from dataframes in a list with lapply


I have a list with dataframes:

df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)

I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:

ids_to_remove <- c() 

Then I apply my function:

sapply(mylist, function(df) {
  
  rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
  a <- rows_above_th$id # obtain the ids of the rows above the threshold 
  ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
  
},

simplify = T

) 

However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:

ids_to_remove <- c(9,10,9,10)

Finally, I would use it like this on a single dataframe:

for(i in 1:length(ids_to_remove)){
  mylist[[1]] <- mylist[[1]] %>%
    filter(!id == ids_to_remove[i])
}

And like this on the whole list (which is not working and I don't understand why):

i = 1
lapply(mylist,
       function(df) {
         for(i in 1:length(ids_to_remove)){
           df <- df %>%
             filter(!id == ids_to_remove[i])
           i = i + 1
         }
       } )

I suspect the errors are in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).

EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows


Solution

  • If you are using sapply/lapply, avoid trying to change the values of global variables from inside the function. Instead, return the values you want. For example, generate a vector of IDs to remove for each item in the list, collected as a list:

    ids_to_remove <- lapply(mylist, function(df) {
      rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
      rows_above_th$id # obtain the ids of the rows above the threshold
    }) 
    

    Then you can pass that list together with your data list to mapply, which iterates over the two lists in parallel:

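    library(dplyr) # provides %>% and filter()

    mapply(function(data, ids) {
      data %>% dplyr::filter(!id %in% ids) # keep rows whose id is not in this data frame's removal vector
    }, mylist, ids_to_remove, SIMPLIFY=FALSE)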
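
For reference, here is a minimal base R sketch of why the original attempts behaved as they did and how to get the flat vector the question asks for. The assignment to ids_to_remove inside the anonymous function only changes a local copy, and sapply collects the two length-2 results into a matrix. In the lapply version the last expression of the function is the for loop, which evaluates to NULL, so each list element comes back as NULL. The sketch assumes the same toy data as the question (written as 1:10, which is what seq(1:10) and seq(11:20) both produce) and pools all IDs into one vector, so any pooled ID is removed from every data frame, as the question's loop intended. The name cleaned is illustrative only.

```r
# Toy data equivalent to the question's: both data frames get ids 1..10.
df1 <- data.frame(id = 1:10, name = LETTERS[1:10])
df2 <- data.frame(id = 1:10, name = LETTERS[11:20])
mylist <- list(df1, df2)

# Return the ids from the function; lapply() gives one vector per data frame
# and unlist() flattens them into a single vector (here 9, 10, 9, 10).
ids_to_remove <- unlist(lapply(mylist, function(df) df$id[df$id > 8]))

# Return the filtered data frame as the function's value; %in% replaces the loop.
cleaned <- lapply(mylist, function(df) df[!df$id %in% ids_to_remove, , drop = FALSE])
```

Because the filtering is a single vectorised comparison per data frame, this avoids looping over every ID, which may matter with 70 data frames and about 2 million rows. If each data frame should only lose its own IDs, the mapply version above is the one to use.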