I have an issue: I am trying to read an immense amount of data from CSV files (probably around 80 million rows split across around 200 files).
Some of the files are not well structured. After a few hundred thousand rows, for some reason, the rows end with a comma (","), but with no additional information after the comma. A short example to illustrate this behaviour:
a,b,c
1,2,3
d,e,f,
4,5,6,
The rows have 19 columns. I tried manually telling fread to read the files as 20 columns, using colClasses, col.names, and fill=TRUE:
library(data.table)

all.files <- list.files(getwd(), full.names = TRUE, recursive = TRUE)
lapply(all.files, fread,
       select = c(5, 6, 9),                  # only the three columns I need
       col.names = paste0("V", seq_len(20)), # treat the files as 20 columns wide
       # colClasses = c("V1" = "character", "V2" = "character", "V3" = "integer"),
       colClasses = c(<all 20 data types, 20th arbitrarily as integer>),
       fill = TRUE)
Another workaround I tried was to avoid fread altogether:
data <- lapply(all.files, readLines)         # read every file as raw lines
data <- unlist(data)
data <- as.data.table(tstrsplit(data, ","))  # split each line on commas
data <- data[, c("V5", "V6", "V9"), with = FALSE]
However, this approach leads to "Error: memory exhausted", which I believe might be solved by reading only the required 3 columns instead of all 19.
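A sketch of what I mean, assuming tstrsplit's keep argument can restrict the split to those three positions (untested on the full data):

library(data.table)

data <- lapply(all.files, readLines)   # still loads the raw lines into memory
data <- unlist(data)
# keep = c(5, 6, 9) returns only those split elements, so only
# 3 columns are materialised instead of all 19
data <- as.data.table(tstrsplit(data, ",", keep = c(5, 6, 9)))
setnames(data, c("V5", "V6", "V9"))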
Any hints on how to use fread for this scenario are greatly appreciated.
You can try using readr::read_csv as follows:
library(readr)
txt <- "a,b,c
1,2,3
d,e,f,
4,5,6,"
read_csv(txt)
which yields the expected result:
# A tibble: 3 × 3
a b c
<chr> <chr> <chr>
1 1 2 3
2 d e f
3 4 5 6
along with the following warning:
Warning: 2 parsing failures.
row col expected actual
2 -- 3 columns 4 columns
3 -- 3 columns 4 columns
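If you want to inspect those failures programmatically rather than reading the warning, readr keeps them attached to the result and exposes them via problems():

df <- read_csv(txt)
problems(df)   # a tibble with one row per parsing failure: row, col, expected, actual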
To read only specific columns, use cols_only as follows:
read_csv(txt,
         col_types = cols_only(a = col_character(),
                               c = col_character()))
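To scale this up to the ~200 files from the question, one possible sketch, assuming every file has a header row and that V5, V6 and V9 stand in for the real names of the three wanted columns:

library(readr)
library(dplyr)

all.files <- list.files(getwd(), full.names = TRUE, recursive = TRUE)

# cols_only() skips every column not listed, so the spurious
# 20th column created by the trailing commas is never read
data <- bind_rows(lapply(all.files, read_csv,
                         col_types = cols_only(V5 = col_character(),
                                               V6 = col_character(),
                                               V9 = col_integer())))

bind_rows() then stacks the per-file tibbles into one table.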