Tags: r, dataframe, data.table, fwrite, fread

Simple fread operation with fill=TRUE fails


The following code generates data files in which each row has a different number of columns. The option fill=TRUE appears to work only until a certain character limit is reached. For instance, compare lines 1-3 with lines 9-11 of the code below: both of these examples work as expected, while lines 5-7 do not. How can I read the entirety of notworking1.dat with fill=TRUE enabled, and not just the first 100 rows?

for (i in seq(1000,1099,by=1)) 
    cat(file="working1.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "working1.dat", fill=TRUE)

for (i in seq(1000,1101,by=1)) 
    cat(file="notworking1.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "notworking1.dat", fill=TRUE)

for (i in seq(1,101,by=1)) 
    cat(file="working2.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "working2.dat", fill=TRUE)

The following attempt also fails:

df <- fread(input = "notworking1.dat", fill=TRUE, col.names=paste0("V", seq_len(1101)))

Warning Message received:

Warning message: In data.table::fread(input = "notworking1.dat", fill = TRUE) : Stopped early on line 101. Expected 1099 fields but found 1100. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<1 2 3 4 ...
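
To see where the field counts change, the per-line counts can be inspected directly. The sketch below is a minimal check using base R's count.fields. It rebuilds the same ragged data in a temporary file so that it does not depend on the files created above; tmp and n_fields are illustrative names, not part of the original code. Assuming fread estimates the column count from the rows it samples at the start of the file, the widest of the first 100 lines should correspond to the "Expected 1099 fields" in the warning.

```r
# Rebuild the ragged data in a temporary file (illustrative copy only)
tmp <- tempfile(fileext = ".dat")
for (i in 1000:1101) {
  cat(1:i, "\n", file = tmp, append = TRUE)
}

# Count whitespace-separated fields on every line (base R, utils package)
n_fields <- count.fields(tmp, sep = "")

length(n_fields)        # number of lines in the file
range(n_fields)         # narrowest and widest row
max(n_fields[1:100])    # widest row among the first 100 lines
which.max(n_fields)     # line number of the widest row
```

If the widest row lies beyond the first 100 lines, as it does here, that would be consistent with reading stopping at line 101.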


Solution

  • We could find the maximum number of columns, write a header line with that many column names, and then fread:

    x <- readLines("notworking1.dat")
    myHeader <- paste(paste0("V", seq(max(lengths(strsplit(x, " ", fixed = TRUE))))), collapse = " ")
    
    # write with headers
    write(myHeader, "tmp_file.txt")
    write(x, "tmp_file.txt", append = TRUE)
    # read as usual with fill
    d1 <- fread("tmp_file.txt", fill = TRUE)
    
    # check output
    dim(d1)
    # [1]  102 1101
    d1[100:102, 1101]
    #    V1101
    # 1:    NA
    # 2:    NA
    # 3:  1101
    

    But as we already have the data imported with readLines, we could just parse it:

    x <- readLines("notworking1.dat")
    xSplit <- strsplit(x, " ", fixed = TRUE)
    
    # rowbind unequal length list, and convert to data.table
    d2 <- data.table(t(sapply(xSplit, '[', seq(max(lengths(xSplit))))))
    
    # check output
    dim(d2)
    # [1]  102 1101
    d2[100:102, 1101]
    #    V1101
    # 1:  <NA>
    # 2:  <NA>
    # 3:  1101
    

    This is a known issue, GitHub issue 5119. It is not implemented yet, but the suggestion there is that fill should also accept an integer. The solution would then be something like:

    d <- fread(input = "notworking1.dat", fill = 1101)
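
Independently of that, note that the readLines-based parse above yields character columns (the <NA> entries in the d2 output are how data.table prints missing character values). A minimal sketch of a typed variant follows, assuming every field is an integer. The in-memory vector x stands in for the readLines() result, and nMax, m, and d3 are illustrative names.

```r
library(data.table)

# Small in-memory stand-in for readLines("notworking1.dat") (hypothetical data)
x <- c("1 2 3", "1 2 3 4", "1 2")
xSplit <- strsplit(x, " ", fixed = TRUE)
nMax <- max(lengths(xSplit))

# Pad each row to nMax entries (indexing past the end gives NA) and
# convert to integer; vapply returns an nMax x nrow matrix, so transpose it
m <- t(vapply(xSplit, function(r) as.integer(r[seq_len(nMax)]), integer(nMax)))

# Matrix columns become V1, V2, ... as in the fread-based result
d3 <- as.data.table(m)

dim(d3)
sapply(d3, class)
```

If the file may contain non-integer values, as.numeric or type.convert would be a safer choice than as.integer.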