Need to run a function over list of dataframes in R


I have around 30 data frames with varying numbers of samples but the same metadata columns, for example Sample ID, Date of collection, Place of collection, and Days since sample collection, to mention a few.

I want to summarize them by "Place of collection" and "Days since sample collection". For this I'm using the function below:

    check_summary_df <- function(x) {
        summarized_data <- x %>% group_by(place_of_collection, day) %>% summarize(count = n())
        summarized_data$df_name <- deparse(substitute(x)) # adding this as a column so I can track the df_name
        return(summarized_data)
    }

It gives me a data frame with the required summary. My data frame names are non-standard, so I have put them in a list using input_df_list <- c('df1','collected_by_x','collected_by_y'), and now I want to loop the function over that list. I tried a simple for loop:

    for (i in 1:length(input_df_list)) { check_summary_df(input_df_list[i])}

And got the error below:

    Error in UseMethod("group_by") : 
      no applicable method for 'group_by' applied to an object of class "character"

From what I can see, input_df_list[i] in the loop is treated as a character string rather than as a data frame. How can I change this behaviour? Or is there another way to loop over a list of data frames?


Solution

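  • The error occurs because input_df_list[i] is only a string such as "df1", not the data frame with that name. The idiomatic way to do this in R is to create a list of data frames, rather than a vector of names, and then iterate over that. As you already have input_df_list, a character vector of names, you can build the list with get(), which looks up an object by its name. Here's an example: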

    # Vector of names
    input_df_list <- c("iris", "mtcars", "cars")
    
    # Create a list of data frames
    df_list <- lapply(input_df_list, \(nm) get(nm)) |>
        setNames(input_df_list)
    
    # Simple function we can apply to all data frames
    check_summary_df  <- function(dat) {
        names(dat)
    }
    
    # Apply function to each data frame
    lapply(df_list, check_summary_df)
    
    # $iris
    # [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
    
    # $mtcars
    #  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
    
    # $cars
    # [1] "speed" "dist" 
    

    You could also add x <- get(x) as the first line of your function, but your R code will be more readable if you work with lists of data frames.
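
    One thing to watch for: once the function is called through lapply(), deparse(substitute(x)) no longer returns the original data frame name (it typically returns the internal expression, such as X[[i]]). A minimal sketch of one way around this is to pass the name in explicitly from the names of the list. The two small data frames below are hypothetical stand-ins for your data, the df_name argument is an addition that is not in your original function, and aggregate() from base R is swapped in for the dplyr pipeline only so that the example runs without extra packages:

```r
# Hypothetical stand-ins for two of the real data frames
df1 <- data.frame(place_of_collection = c("A", "A", "B"), day = c(1, 1, 2))
collected_by_x <- data.frame(place_of_collection = c("B", "C"), day = c(3, 3))

# Vector of names, as in the question
input_df_list <- c("df1", "collected_by_x")

# mget() looks up several objects by name and returns a named list
df_list <- mget(input_df_list, inherits = TRUE)

# Take the name as an explicit argument instead of deparse(substitute(x))
check_summary_df <- function(dat, df_name) {
    # Count rows per (place_of_collection, day) combination
    summarized_data <- aggregate(
        data.frame(count = rep(1L, nrow(dat))),
        by = dat[c("place_of_collection", "day")],
        FUN = sum
    )
    summarized_data$df_name <- df_name
    summarized_data
}

# Map() walks over the data frames and their names in parallel
summaries <- Map(check_summary_df, df_list, names(df_list))

# Optionally stack all summaries into one data frame
all_summaries <- do.call(rbind, summaries)
```

    If you prefer to keep your dplyr version, the same idea should apply: keep the group_by() and summarize() body and only replace the deparse(substitute(x)) line with the df_name argument.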