
RMSE on dataframe of multiple files in R


My goal is to read many files into R, and ultimately, run a Root Mean Square Error (rmse) function on each pair of columns within each file. I have this code:

    library(readxl)      # read_excel()
    library(data.table)  # rbindlist()

    # List the files whose names match the pattern
    filnames <- dir("~/Desktop/LGsampleHUCsWgraphs/testRSMEs", pattern = "*_45Fall_*")
    # Read a single file
    read_data <- function(z){
        dat <- read_excel(z, skip = 0)
        return(dat)
    }
    # Read every file, combine them into one table, and split it by the names in the first column
    datalist <- lapply(filnames, read_data)
    bigdata <- rbindlist(datalist, use.names = TRUE)
    splitByHUCs <- split(bigdata, f = bigdata$HUC...1 , sep = "\n", lex.order = TRUE)

So far, all is working well. Now I want to apply an rmse [library(Metrics)] analysis on each of the "splits" created above. I don't know what to call the "splits". Here I have used names but that is an R reserved word and won't work. I tried the bigdata object but that didn't work either. I also tried to use splitByHUCs, and rMSEs.

    rMSEs <- sapply(splitByHUCs, function(x) rmse(names$Predicted, names$Actual))
    write.csv(rMSEs, file = "~/Desktop/testRMSEs.csv")

The rmse code works fine when I run it on a single file and create a name for the dataframe:

    read_excel("bcc1_45Fall_1010002.xlsm")
    bcc1F1010002 <- read_excel("bcc1_45Fall_1010002.xlsm")
    rmse(bcc1F1010002$Predicted, bcc1F1010002$Actual)

The "splits" are named by the "splitByHUCs" script, like this: [image: sample of split results]

They are named for the file they came from, appropriately. I need some kind of reference name for the rmse formula and I don't know what it would be. Any ideas? Thanks. I made some small versions of the files, but I don't know how to add them here.


Solution

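  • splitByHUCs is a list, so looping over it with sapply/lapply, as in the OP's code, is the right approach. The problem is names$: inside the anonymous function, each element of the list (i.e. a data.frame holding the rows for one HUC) is available as the argument x, whereas names is a base R function (not a reserved word) and cannot be subset with $. Therefore, instead of names$, use x$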

    sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))
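
To see the pattern end to end without the original Excel files, here is a minimal sketch on made-up data. The values in the toy bigdata, the column name HUC, and the hand-written rmse() are all assumptions for illustration; the local rmse() stands in for Metrics::rmse so that the example runs with base R only.

```r
# Stand-in for Metrics::rmse (assumption: defined here so no package is needed)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical data shaped like the combined table in the question
bigdata <- data.frame(
  HUC       = c("1010002", "1010002", "1010003", "1010003"),
  Predicted = c(1, 2, 3, 5),
  Actual    = c(1, 4, 3, 1)
)

# One data.frame per HUC; the list names are the HUC values
splitByHUCs <- split(bigdata, f = bigdata$HUC)

# x is the data.frame for one HUC on each iteration
rMSEs <- sapply(splitByHUCs, function(x) rmse(x$Actual, x$Predicted))

# Turn the named vector into a two-column table before writing it out
result <- data.frame(HUC = names(rMSEs), RMSE = unname(rMSEs))
out_file <- tempfile(fileext = ".csv")  # temporary path, used only for this sketch
write.csv(result, file = out_file, row.names = FALSE)
```

Because sapply keeps the names of the list, rMSEs is a named numeric vector with one entry per HUC, which is why names(rMSEs) can be used as the identifier column in the CSV.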