r data.table readcsv file increases column amount

I have the issue, that I am trying to read immense amounts of data from csv files (Probably around 80 million rows separated into around 200 files)

Some of the files are not well structured. After a few hundred thousand rows, for some reason, the rows are ending with a comma (","), but no additional information behind this comma. A short example to illustrate this behaviour:

a,b,c
1,2,3
d,e,f,
4,5,6,

The rows have 19 columns. I tried manually telling readcsv to read it as 20 columns, using colClasses and col.names and fill=TRUE

all.files <- list.files(getwd(), full.names=T, recursive=T)

lapply(all.files, fread,
  select=c(5,6,9),
  col.names=paste0("V",seq_len(20)),
  #colClasses=c("V1"="character","V2"="character","V3"="integer"),
  colClasses=c(<all 20 data types, 20th arbitrarily as integer>),
  fill=T)

Another workaround I tried was to not use fread at all, by doing

data <- lapply(all.files, readLines)
data <- unlist(data)
data <- as.data.table(tstrsplit(data,","))
data <- data[, c("V5","V6","V9"), with=F]

However, this approach leads to "Error: memory exhausted", which I believe might be solved by actually only reading the required 3 columns, instead of all 19.

Any hints on how to use fread for this scenario is greatly appreciated.

Solution

You can try using readr::read_csv as follows:

library(readr)

txt <- "a,b,c
1,2,3
d,e,f,
4,5,6,"

read_csv(txt)

results in the expected result:

# A tibble: 3 × 3
      a     b     c
  <chr> <chr> <chr>
1     1     2     3
2     d     e     f
3     4     5     6

And the following warning

Warning: 2 parsing failures.
row col  expected    actual
  2  -- 3 columns 4 columns
  3  -- 3 columns 4 columns

To only read specific columns use cols_only as follows:

read_csv(txt, 
         col_types = cols_only(a = col_character(),
                               c = col_character()))