I used the following function to merge all .csv files in my directory into one dataframe:
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  rbindlist(lapply(filenames, fread), fill = TRUE)
}
dataframe = multmerge(path)
This code produces this error:
Error in rbindlist(lapply(filenames, fread), fill = TRUE) : Internal error: column 25 of result is determined to be integer64 but maxType=='character' != REALSXP
The code has worked on the same .csv files before... I'm not sure what's changed or what the error message means.
Looking at the documentation for fread, I just noticed there is an integer64 option; are you dealing with integers greater than 2^31?
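If so, one option (a sketch of fread's own integer64 argument, not something from the question's code) is to tell fread up front how to read those columns so the types line up across files:

library(data.table)
#Read 64-bit integer columns as character (or "double") instead of integer64
dt = fread("somefile.csv", integer64 = "character") #"somefile.csv" is just a placeholder name

In the merging function the same argument could be passed through lapply, e.g. lapply(filenames, fread, integer64 = "character").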
EDIT: I added a tryCatch that prints a formatted message to the console identifying which file(s) caused the error, along with the actual error message. For rbindlist to still execute over the files that read in normally, the error handler has to return a dummy list; this produces an extra column called ERROR that holds NA in every row except the bottom one(s), where its value is the name of the problem file.
I suggest that after you run this code through once, you delete the ERROR column and the extra row(s) from the data.table and then save this combined file as a .csv. I would then move all the files that combined properly into a different folder, so that only the current combined file and the ones that didn't load properly are left in the path. Then rerun the function with the colClasses specified. I combined everything into one script so it's hopefully less confusing:
#First run, without colClasses
library(data.table)

multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  rbindlist(lapply(filenames, function(i) tryCatch(fread(i),
              error = function(e) {
                cat("\nError reading in file:", i, "\t") #Identifies problem files by name
                message(e)                               #Prints error message without stopping loop
                list(ERROR = i)                          #Adds a placeholder column so rbindlist will execute
              })),                                       #End of tryCatch and lapply
            fill = TRUE)                                 #rbindlist arguments
}                                                        #End of function
#You should get the original error message and identify the filename.
dataframe = multmerge(path)
#Delete placeholder column and extra rows
#You will get as many extra rows as you have problem files -
#most likely just the one with the column 25 issue, or any others with the same problem.
#Note: the out-of-bounds error message will probably go away with the colClasses argument pulled out.
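#A sketch of that clean-up in data.table syntax (assuming the combined object is `dataframe` from above):
if ("ERROR" %in% names(dataframe)) {
  dataframe = dataframe[is.na(ERROR)]  #drop the extra row(s) added for the problem file(s)
  dataframe[, ERROR := NULL]           #drop the placeholder column
}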
#Save this cleaned file to something like: fwrite(dataframe,"CurrentCombinedData.csv")
#Move all files except the problem file(s) into a new folder
#Now you should only have the big combined file and the problem file(s) in your path.
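#That move can also be scripted; this is only a sketch, and "archive", problem_files and the
#combined file name are placeholders rather than part of the original answer:
problem_files = character(0)                                          #fill in from the error messages printed above
archive = file.path(path, "archive")
dir.create(archive, showWarnings = FALSE)
keep = c(problem_files, file.path(path, "CurrentCombinedData.csv"))   #these stay in the active path
to_move = setdiff(list.files(path, pattern = "\\.csv$", full.names = TRUE), keep)
file.rename(to_move, file.path(archive, basename(to_move)))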
#Rerun the function but add the colClasses argument this time
#Second run to accommodate the problem file(s) - we know it's the column 25 error this time,
#but in the future you may have to adapt this by adding the appropriate column number.
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  rbindlist(lapply(filenames, function(i) tryCatch(fread(i, colClasses = list(character = c(25))),
              error = function(e) {
                cat("\nError reading in file:", i, "\t") #Identifies problem files by name
                message(e)                               #Prints error message without stopping loop
                list(ERROR = i)                          #Adds a placeholder column so rbindlist will execute
              })),                                       #End of tryCatch and lapply
            fill = TRUE)                                 #rbindlist arguments
}                                                        #End of function
dataframe2 = multmerge(path)
Now we know the source of the error is column 25, which we can specify in colClasses. If you run the code and get the same error message for a different column, simply add the number of that column after the 25 (see the example below). Once you have the data read in, I would check what is going on in that column (or any others you had to add). Maybe there was a data entry error in one of the files, or a different encoding of an NA value. That's why I say to convert that column to character first: you will lose less information than by converting to numeric first.
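For instance, if a later run reported the same problem for column 32 (a hypothetical column number, just for illustration), the fread call inside the function would become:

fread(i, colClasses = list(character = c(25, 32)))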
Once you have no errors, always write the cleaned combined data.table to a .csv kept in your folder, and always move the individual files that have been combined into the other folder. That way, when you add new files you will only be combining the big one and a few others, so in the future you can see what is going on more easily. Just keep notes as to which files gave you trouble and which columns. Does that make sense?
Because files are often so idiosyncratic you will have to be flexible, but this approach to the workflow should make it easy to identify problem files and add whatever you need to the fread call to make it work. Basically, archive the files that have been processed, keep track of exceptions like the column 25 one, and keep the most current combined file together with the ones that haven't been processed yet in the active path. Hope that helps and good luck!