Merging data from many files and plotting them

I have written an application that analyzes data and writes the results to a CSV file. The file contains three columns: id, diff and count.
1. id is the id of the cycle; in theory, the greater the id, the lower the diff should be
2. diff is the sum of

(Estimator - RealValue)^2
for each observation in the cycle

3. count is the number of observations during the cycle

For each of 15 different values of a parameter K, I generate a CSV file named %K%.csv, where %K% is the value used, so I have 15 files in total.

What I would like to do is write a simple loop in R that plots the content of my files, so that I can decide which value of K is best (that is, for which K the diff is generally the lowest).

For a single file I am doing something like:

 ggplot(data = data) + geom_point(aes(x= id, y=sqrt(diff/count)))

Does what I am trying to do make sense? Please note that statistics is not my domain at all, nor is R (but you have probably figured that out already).

Is there a better approach I could choose? And from a theoretical point of view, am I doing what I expect to be doing?

I would be very grateful for any comments, hints, criticism and answers.


  • Edited to clean up some typos and address the multiple K value issue.

    I'm going to assume that you've placed all your .csv files in a single directory (and there's nothing else in this directory). I will also assume that each .csv really does have the same structure (same number of columns, in the same order). I would begin by generating a list of the file names:

    myCSVs <- list.files("path/to/directory")

    Then I would 'loop' over the list of file names using lapply, reading each file into a data frame using read.csv:

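    # This function reads in one file and appends a column
    # with the K value taken from the file name.
    # You may need to tinker with the particulars here.
    # Note: read.csv() resolves fn relative to the working directory, so
    # either setwd() to the directory or use list.files(..., full.names = TRUE).
    myFun <- function(fn){
         tmp <- read.csv(fn)
         # basename() drops any directory part before the name is split on "."
         tmp$K <- strsplit(basename(fn), ".", fixed = TRUE)[[1]][1]
         tmp
    }
    dataList <- lapply(myCSVs, FUN = myFun)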

    Depending on the structure of your .csv's you may need to pass some additional arguments to read.csv. Finally, I would combine this list of data frames into a single data frame:

    myData <- do.call(rbind, dataList)
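
    The steps so far can be tried end to end with a minimal sketch like the one below. It is only an illustration under stated assumptions: the directory k_results, the K values 1, 2 and 5, the numbers in the files and the helper name readWithK are all made up for the demo, and your real files may need extra read.csv arguments.

    ```r
    # Made-up demo data: three tiny files named after hypothetical K values
    csvDir <- file.path(tempdir(), "k_results")
    dir.create(csvDir, showWarnings = FALSE)
    for (k in c(1, 2, 5)) {
      demo <- data.frame(id = 1:3, diff = c(9, 4, 1) * k, count = c(10, 10, 10))
      write.csv(demo, file.path(csvDir, paste0(k, ".csv")), row.names = FALSE)
    }

    # full.names = TRUE returns paths that read.csv can open from any working directory
    myCSVs <- list.files(csvDir, pattern = "\\.csv$", full.names = TRUE)

    # Hypothetical helper: read one file and tag its rows with K
    readWithK <- function(fn) {
      tmp <- read.csv(fn)
      # Drop the directory and the extension, so ".../5.csv" gives K = 5
      tmp$K <- as.numeric(tools::file_path_sans_ext(basename(fn)))
      tmp
    }

    # Stack the per-file data frames into one
    myData <- do.call(rbind, lapply(myCSVs, readWithK))
    str(myData)
    ```

    Stripping only the extension (rather than splitting the name on ".") keeps a decimal K such as 0.5 intact, which may matter if your K values are not whole numbers.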

    Then you should have all your data in a single data frame, myData, that you can pass to ggplot.
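
    As a rough sketch of that last step, the combined data can be drawn on shared axes with one colour per K. The data frame below is a made-up stand-in with two hypothetical K values, and the column name rmse is my own label: assuming diff really is the sum of squared errors over a cycle, sqrt(diff/count) is the root-mean-square error for that cycle.

    ```r
    library(ggplot2)

    # Made-up stand-in for the combined data frame (two hypothetical K values)
    myData <- data.frame(
      id    = rep(1:4, times = 2),
      diff  = c(16, 9, 4, 1, 25, 16, 9, 4),
      count = 10,
      K     = rep(c(1, 2), each = 4)
    )

    # Same quantity as sqrt(diff/count) in the question
    myData$rmse <- sqrt(myData$diff / myData$count)

    # One colour per K; factor() makes K a discrete legend instead of a gradient
    p <- ggplot(myData, aes(x = id, y = rmse, colour = factor(K))) +
      geom_point() +
      geom_line() +
      labs(colour = "K", y = "sqrt(diff / count)")
    print(p)

    # Alternative layout: one panel per K
    # p + facet_wrap(~ K)
    ```

    With 15 values of K a single panel may get crowded, in which case the facet_wrap variant could be easier to read.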

    As for the statistical aspect of your question, it's a little difficult to offer an opinion without concrete examples of your data. Once you've figured the programming part out, you could ask a separate question that provides some sample data (either here or on another site), and folks will be able to suggest some visualization or analysis techniques that may help.