So I have 1300 csv-files of financial data with the following format:
Date Open High Low Close
1 Nov 28, 2017 0.233394 0.234871 0.223832 0.225542
2 Nov 27, 2017 0.225910 0.234219 0.212298 0.233113
3 Nov 26, 2017 0.229367 0.235126 0.215153 0.226367
4 Nov 25, 2017 0.234212 0.239257 0.223383 0.228617
5 Nov 24, 2017 0.215836 0.236280 0.209834 0.234195
6 Nov 23, 2017 0.228887 0.232974 0.214334 0.216585
My goal is to extract the column "Open" from each file and bind the columns together so it will look something like this:
Date "File1" "File2" "File3" ... "File 1300"
1 Nov 28, 2017 0.233394 0.234871 0.223832 ... 0.225542
2 Nov 27, 2017 0.225910 0.234219 0.212298 ... 0.233117
3 Nov 26, 2017 0.229367 0.235126 0.215153 ... 0.226367
4 Nov 25, 2017 0.234212 NA 0.223383 ... 0.228617
5 Nov 24, 2017 0.215836 NA 0.209834 ... 0.234195
6 Nov 23, 2017 0.228887 NA 0.214334 ... NA
I know there has been a lot of questions here already on how to extract columns from multiple csv-files and how to bind them together, and I am almost there, but the problem is that the length of the files are different as I have observations based on dates. For example, one file can have observations from Des 3, 2014, while another can have from nov 23, 2017. Otherwise, the files are identical in terms of format, and they all have their last observation on Nov 28, 2017.
Below you can see my code so far
# Get a List of all files in directory
filenames <- list.files(".../path, pattern="*.csv", full.names=F)
# Loading column "Open" from each file
for(i in filenames){
filepath <- file.path(".../path", paste(i,sep=""))
assign(i, read.csv(filepath, header=T, sep = ";",
colClasses=c(NA, NA, "NULL", "NULL", "NULL")))
}
#making a list of all data frames
df_list <- lapply(ls(), function(x) if (class(get(x)) == "data.frame") get(x))
#merging
library(dplyr)
res2 <- Reduce(function(...) left_join(..., by=c("Date")), df_list)
As you can see I load all the csv-files into the global environment and make it to a list, where I then want to merge the columns into one data frame. The problem seems to be the merging part, where the Reduce or dplyr-package doesn't seem to solve the problem.
So my question is if you have any solutions to column bind all the Date-columns in one data frame and sorting by date? And also, are there any quick fix to make the filenames as headers for the columns?
I am also unsure if R is the best way to solve this. I am not that familiar with Python, but if you think it is easier, I can try to use that instead.
This is untested. Two approaches on how to import. One is iterative with a for loop while the second one imports all the data and then uses Reduce
to merge 2. object to first, third to merge of 1. and 2., forth to merge of 1., 2. and 3. and so on.
# approach #1
result <- read.csv(filenames[1], header = TRUE, sep = ";")[, c("Date", "Open")]
for (i in filenames[-1]) {
out <- read.csv(i, header = TRUE, sep = ";")[, c("Date", "Open")]
colnames(out) <- c("Date", basename(filenames[i]))
result <- merge(result, by = "Date")
}
# approach #2
alldata <- lapply(filenames, FUN = function(x) {
out <- read.csv(i, header = TRUE, sep = ";")[, c("Date", "Open")]
colnames(out) <- c("Date", basename(x))
out
})
result <- Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "Date", all = TRUE), alldata)