Search code examples
rdataframecsvimportmean

How to calculate the average of the same column (with same name) in 100s different csv files with part of file name in common?


I have a bunch of csv files that are structured like this:

df <- data.frame (first_column  = c(3, 2, 6, 7),
                  second_column = c(7, 5, 1, 8))

All the csv files have a name like

"type1_1.csv"
"type1_2.csv"
...
"type2_1.csv"
"type2_2.csv"
...

Each of these csv's have first_column and second_column. What I want is to create a new dataframe that looks like this:

# name        meanofsecond_column
# type1_1     5.25
# ...

What I started doing already, was to individually write out each:

type1_1 <- read_csv("type1_1.csv")
type1_1mean <- mean(type1_1$second_column)
...
df <- data.frame (name  = c(type1_1, type1_2...),
                  meanofsecondcolumn = c(type1_1mean, type1_2mean...))

However, since there are more than 100 csv files, this method is not very efficient or clean. How can I make it more condensed?


Solution

  • # path where your csv files are (here current working directory)
    CSV_FOLDER <- "."
    
    # list all csv files in given directory
    # second parameter is a regex meaning ends with .csv
    # third parameter make function return file names with path
    csv_files <- list.files(CSV_FOLDER, "\\.csv$", full.names=TRUE)
    
    # apply given function on each file and collect results in a list
    res <- lapply(csv_files, function(csv_file) {
      # read current file
      tmp <- readr::read_csv(csv_file)
    
      # build a data.frame from filename (without path) and mean of second column
      return(data.frame(
        name = basename(csv_file),
        meanofsecondcolumn = mean(tmp$second_column)
      ))
    })
    
    # rbind all single line data.frames in a single data.frame
    res <- do.call("rbind", res)