
How to manage rogue data rows while reading fixed width files using laf_open_fwf in R


I am trying to read a large file using the code below:

    laf <- laf_open_fwf(file.path(input$dir, filename),
                        column_widths = col_width,
                        column_types = rep("character", length(col_width)),
                        column_names = column_names)

The performance is good, but I have one problem. Say the file has about 100,000 lines of data that all conform to the fixed-width definition. In some cases a few lines are "rogue": they don't conform to the fixed widths of the columns, because the data in one or more columns is longer or shorter than defined. When this happens, the output of the reader is completely broken.

What I have found is that every line after the first rogue line the parser encounters is parsed incorrectly. This happens especially when the last column of the rogue row has excess data (is longer than its defined width).

So any ideas on how to work around this would be much appreciated.


Solution

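  • Unfortunately, LaF assumes that all lines have equal length. It uses the width of the lines to skip quickly to the requested lines: to go to line X, it jumps to byte (X - 1) * (sum(column_widths) + k) from the beginning of the lines, where k is 1 or 2 depending on the end-of-line sequence used (\n or \r\n). A single line of a different length therefore shifts every line that follows it, which is why everything after the first rogue line is parsed incorrectly.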
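
    The offset arithmetic can be sketched in a few lines of R. This is only an illustration of the formula above, not LaF's actual code: `line_offset` and `eol_bytes` are hypothetical names introduced here.

```r
# Hypothetical helper: byte offset at which a line is ASSUMED to start
# in a strictly fixed-width file. eol_bytes is 1 for "\n", 2 for "\r\n".
line_offset <- function(line, column_widths, eol_bytes = 1) {
  (line - 1) * (sum(column_widths) + eol_bytes)
}

widths <- c(5, 4)
line_offset(1, widths)                 # 0
line_offset(3, widths)                 # 20
line_offset(3, widths, eol_bytes = 2)  # 22

# With a rogue 10-character line in position 2, line 3 really starts
# one byte later than the formula assumes.
lines <- c("abcde3.14", "abcdef2.11", "efghi-123")
actual_start <- cumsum(c(0, nchar(lines) + 1))[3]  # 21
actual_start - line_offset(3, widths)              # 1
```

    Every later line inherits that one-byte shift, so each field is cut at the wrong position from the rogue line onward.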

    The only solution is to remove the 'rogue' lines from the file. Below I give a pure R example of how to do this. It is reasonably fast.

    Generate an example file with ~2% 'rogue' lines:

    lines <- c("abcde3.14", "efghi-123", "abcdef2.11")
    lines <- sample(lines, 1E6, prob = c(0.44, 0.44, 0.02), replace=TRUE)
    writeLines(lines, "test.dat")
    

    Read the file in chunks, writing lines with the correct length to one connection and all other lines to another. Opening the connections outside the loop and keeping them open makes this reasonably fast:

    widths <- c(5,4)
    types <- c("string", "numeric")
    names <- c("a", "b")
    library(LaF)
    
    
    con <- file("test.dat", "rt")      # input, read in chunks
    ok <- file("ok.dat", "wt")         # lines with the expected length
    notok <- file("notok.dat", "wt")   # rogue lines, kept for inspection
    while (TRUE) {
      l <- readLines(con, n = 1E5) # increase n for faster reading; used 1E5 as example
      if (length(l) == 0) break    # end of file reached
      s <- nchar(l) == sum(widths) # TRUE for lines with exactly the expected width
      writeLines(l[s], con = ok)
      writeLines(l[!s], con = notok)
    }
    close(notok)
    close(ok)
    close(con)
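
    One caveat, offered as an assumption rather than something tested against LaF: since the skipping described above is done in bytes, a file containing multibyte characters (for example UTF-8 text) may need the length check done in bytes instead of characters. The small sketch below only shows how the two counts can differ; the string is made up for illustration.

```r
# A made-up value containing one non-ASCII character (e with acute accent).
x <- "caf\u00e9"

n_chars <- nchar(x, type = "chars")  # 4 characters
n_bytes <- nchar(x, type = "bytes")  # 5 bytes when encoded as UTF-8

# In the filtering loop, the byte-based variant of the check would be:
#   s <- nchar(l, type = "bytes") == sum(widths)
```

    For plain ASCII files both counts are the same, so the loop above works unchanged.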
    

    The file with the correct lines can then be parsed by LaF:

    laf <- laf_open_fwf("ok.dat", column_types = types, column_names = names, 
      column_widths = widths)
    laf[,]
    

    And you can inspect the other file to see what the errors are.
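
    As a rough sketch of such an inspection, the rejected lines can be tabulated by how far their length deviates from the expected width. To keep the snippet self-contained it writes a small temporary stand-in for notok.dat; the file contents and the variable names are assumptions made for the example.

```r
# Temporary stand-in for notok.dat with three made-up rogue lines.
bad_file <- tempfile(fileext = ".dat")
writeLines(c("abcdef2.11", "abcdef2.11", "abc1.0"), bad_file)

expected <- 9  # sum(widths) in the example above
bad <- readLines(bad_file)

# Positive values are lines that are too long, negative ones too short.
len_counts <- table(nchar(bad, type = "bytes") - expected)
len_counts

# Look at a few of the lines that are too long.
head(bad[nchar(bad, type = "bytes") > expected])
```

    If nearly all rogue lines deviate by the same amount, that usually points to a single overflowing column rather than random corruption.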