Search code examples
rdatasetlarge-data

r data.table readcsv file increases column amount


I have the issue, that I am trying to read immense amounts of data from csv files (Probably around 80 million rows separated into around 200 files)

Some of the files are not well structured. After a few hundred thousand rows, for some reason, the rows are ending with a comma (","), but no additional information behind this comma. A short example to illustrate this behaviour:

a,b,c
1,2,3
d,e,f,
4,5,6,

The rows have 19 columns. I tried manually telling readcsv to read it as 20 columns, using colClasses and col.names and fill=TRUE

all.files <- list.files(getwd(), full.names=T, recursive=T)

lapply(all.files, fread,
  select=c(5,6,9),
  col.names=paste0("V",seq_len(20)),
  #colClasses=c("V1"="character","V2"="character","V3"="integer"),
  colClasses=c(<all 20 data types, 20th arbitrarily as integer>),
  fill=T)

Another workaround I tried was to not use fread at all, by doing

data <- lapply(all.files, readLines)
data <- unlist(data)
data <- as.data.table(tstrsplit(data,","))
data <- data[, c("V5","V6","V9"), with=F]

However, this approach leads to "Error: memory exhausted", which I believe might be solved by actually only reading the required 3 columns, instead of all 19.

Any hints on how to use fread for this scenario is greatly appreciated.


Solution

  • You can try using readr::read_csv as follows:

    library(readr)
    
    txt <- "a,b,c
    1,2,3
    d,e,f,
    4,5,6,"
    
    read_csv(txt)
    

    results in the expected result:

    # A tibble: 3 × 3
          a     b     c
      <chr> <chr> <chr>
    1     1     2     3
    2     d     e     f
    3     4     5     6
    

    And the following warning

    Warning: 2 parsing failures.
    row col  expected    actual
      2  -- 3 columns 4 columns
      3  -- 3 columns 4 columns
    

    To only read specific columns use cols_only as follows:

    read_csv(txt, 
             col_types = cols_only(a = col_character(),
                                   c = col_character()))