Search code examples
rcsvread.csv

R Read-Functions for CSV-files


I have an issue, where I'm reading in big (+500mb) CSV-files and then want to verify that all data has been read in correctly. To do so, I have been using a comparison between length() of readLines() and nrow() of read.csv2.

The following is my R-code:

df <- readFileFromServer(HOST, KEY,
       paste0(SERVER_PATH, SERVER_FOLDER),
       FILENAME, 
       FUN = read.csv2,
       sep = ";", 
       quote = "", encoding = "UTF-8", skipNul = TRUE)

df_check <- readFileFromServer(HOST, KEY,
               paste0(SERVER_PATH, SERVER_FOLDER),
               FILENAME, 
               FUN = readLines,skipNul = TRUE)`

Then I verify that all data was loaded, by checking:

if(nrow(df) != (length(df_check) - dif)){
  stop("some error msg")
}

dif is set to 1, to account for header in the CSV-files. This check is the part that fails for a given CSV-file. This has been working as intended up until this point, but now this check is causing issues, but I cannot fully understand why.

The one CSV-file that fails the check has "NULL" in the data, which I believe readLines interprets as a delimiter, thus causing a new line, and then the check fails, but I'm really not sure. I tried parsing different parameters to my readfunctions, but issue still persists.

I expect readlines and read.csv2 to result in equal the same length()-1 and nrow() respectively, as shown in my code-snippet.


Solution

  • This is not a proper answer, but it was too long for a comment. This would be my debug strategy here.

    • Pick a file that fails. Slurp it with readLines.
    • Save the file locally using writeLines.
    • Your first job is to make sure that the check fails also when the file is loaded from the disk. My first thought would be that the file transfer the first time you have run readFilesFromServer and the second time were not precisely identical.

    Now. If your problem persists for the given file when you read it locally with read.csv (different number of rows than number of lines in the readLine output), your job becomes much easier (and faster, probably) to solve.

    First, take a look at the beginning of the CSV file and at its end. Are they as they should be? Do they match the data in the head and tail of your data frame? If yes, then you need to find the missing lines systematically.

    Since CSV is just comma separated files, you can compare each line read from the CSV file with readLines with the line as it should be based on the table you have read using read.csv. How this should be done, depends on how your original csv file looks like (whether you need to insert quotes etc.). Basically, you need to figure out a way of restoring the lines of the CSV file from the data in your data frame, and then looking for the first line that is different.

    Here is some code to give you an idea what I mean:

    ## first, prepare data – for this example only!
    f <- file("test.csv", "w")
    writeLines(c("a,b,c", "1,what ever,42", "12,89,one"), f)
    close(f)
    
    ## actual test
    ## first, read the file with readlines
    f <- file("test.csv", "r")
    rl <- readLines(f)
    close(f)
    
    ## then, read it with test.csv
    csv <- read.csv("test.csv")
    
    ## third, prepare the lines as they should look based on the CSV
    rl_sim <- do.call(paste, c(csv, sep=","))
    
    ## find the first mismatch
    for(i in 1:length(rl_sim)) {
      if(rl_sim[i] != rl[i + 1]) {
        message("Problems start at line ", i, "\n", rl_sim[i], rl[i + 1])
        break
      }
    }