Search code examples
pythonrfinancial

extract the same column (different length) from multiple .csv and cbind in one data frame - R


So I have 1300 csv-files of financial data with the following format:

          Date     Open     High      Low    Close
1 Nov 28, 2017 0.233394 0.234871 0.223832 0.225542 
2 Nov 27, 2017 0.225910 0.234219 0.212298 0.233113
3 Nov 26, 2017 0.229367 0.235126 0.215153 0.226367
4 Nov 25, 2017 0.234212 0.239257 0.223383 0.228617
5 Nov 24, 2017 0.215836 0.236280 0.209834 0.234195
6 Nov 23, 2017 0.228887 0.232974 0.214334 0.216585

My goal is to extract the column "Open" from each file and bind the columns together so it will look something like this:

     Date     "File1"   "File2"  "File3"  ... "File 1300"         
1 Nov 28, 2017 0.233394 0.234871 0.223832 ... 0.225542 
2 Nov 27, 2017 0.225910 0.234219 0.212298 ... 0.233117 
3 Nov 26, 2017 0.229367 0.235126 0.215153 ... 0.226367 
4 Nov 25, 2017 0.234212 NA       0.223383 ... 0.228617 
5 Nov 24, 2017 0.215836 NA       0.209834 ... 0.234195 
6 Nov 23, 2017 0.228887 NA       0.214334 ... NA

I know there has been a lot of questions here already on how to extract columns from multiple csv-files and how to bind them together, and I am almost there, but the problem is that the length of the files are different as I have observations based on dates. For example, one file can have observations from Des 3, 2014, while another can have from nov 23, 2017. Otherwise, the files are identical in terms of format, and they all have their last observation on Nov 28, 2017.

Below you can see my code so far

# Get a List of all files in directory 
filenames <- list.files(".../path, pattern="*.csv", full.names=F)

# Loading column "Open" from each file 
for(i in filenames){   
    filepath <- file.path(".../path", paste(i,sep=""))   
    assign(i, read.csv(filepath, header=T, sep = ";", 
        colClasses=c(NA, NA, "NULL", "NULL", "NULL"))) 
}

#making a list of all data frames
df_list <- lapply(ls(), function(x) if (class(get(x)) == "data.frame") get(x)) 

#merging  
library(dplyr) 
res2 <- Reduce(function(...) left_join(..., by=c("Date")), df_list)

As you can see I load all the csv-files into the global environment and make it to a list, where I then want to merge the columns into one data frame. The problem seems to be the merging part, where the Reduce or dplyr-package doesn't seem to solve the problem.

So my question is if you have any solutions to column bind all the Date-columns in one data frame and sorting by date? And also, are there any quick fix to make the filenames as headers for the columns?

I am also unsure if R is the best way to solve this. I am not that familiar with Python, but if you think it is easier, I can try to use that instead.


Solution

  • This is untested. Two approaches on how to import. One is iterative with a for loop while the second one imports all the data and then uses Reduce to merge 2. object to first, third to merge of 1. and 2., forth to merge of 1., 2. and 3. and so on.

    # approach #1
    result <- read.csv(filenames[1], header = TRUE, sep = ";")[, c("Date", "Open")]
    
    for (i in filenames[-1]) {
      out <- read.csv(i, header = TRUE, sep = ";")[, c("Date", "Open")]
      colnames(out) <- c("Date", basename(filenames[i]))
      result <- merge(result, by = "Date")
    }
    
    
    # approach #2
    alldata <- lapply(filenames, FUN = function(x) {
      out <- read.csv(i, header = TRUE, sep = ";")[, c("Date", "Open")]
      colnames(out) <- c("Date", basename(x))
      out
    })
    
    result <- Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "Date", all = TRUE), alldata)