I have a large text file (57 columns, ~3 million rows, about 1.2 GB) from which I want to remove the quotes (either double or single) around the cell values, e.g.:
Column1 Column2
"1" "2" ...
"3" "4" ...
... ...
Currently I have two implementations.
Implementation No 1 - WORKING:
Read the file using the base readLines function and strip the quotes with gsub:
system.time(example1 <- gsub("\"", "", readLines("large-file.txt"), perl = TRUE))
user system elapsed
88.756 1.172 90.226
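Note that this gives me only a character vector; to get a table I would still have to parse it afterwards, for example (a rough sketch, assuming a data.table version that supports the text argument of fread):
library(data.table)
# strip the quotes line by line, then let fread parse the cleaned lines
cleaned <- gsub("\"", "", readLines("large-file.txt"), perl = TRUE)
example1_dt <- fread(text = cleaned)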
Implementation No 2 - NOT WORKING:
Read the file with the fread function from the data.table package and then remove the quotes with gsub:
library(data.table)
system.time(example2 <- fread("Dropbox/Review/case studies/evidence.txt"))
user system elapsed
38.052 6.176 59.764
example2 <- gsub("\"", "", example2)
That gsub call usually either freezes my 4 GB laptop, or I end up killing the R session after more than 10 minutes.
I have also tried different data.table-based solutions, but without success, e.g. the following to at least flag the cells containing quotes:
example3 <- data[, lapply(.SD, function(x) x %like% "\"")]
How can I combine fread's reading speed with fast gsub-ing inside a data.table?
I would delete the quotes outside R, e.g. using tr with the -d option. Use fread to read the output of that tr command:
data.table::fread("tr -d '\"' < INPUT.txt")
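fread treats the string as a shell command here, so tr strips the quotes before fread ever parses the data (newer data.table versions let you make this explicit via fread(cmd = ...)). If the quotes have to be removed inside R instead (say, tr is not available), keep in mind that gsub works on character vectors, so calling it on a whole data.table coerces the entire object at once; a column-wise update by reference avoids that. A minimal sketch, assuming the quoted columns come in as character:
library(data.table)
dt <- fread("large-file.txt")
# remove literal quote characters column by column, updating by reference with set()
for (col in names(dt)) {
  if (is.character(dt[[col]])) {
    set(dt, j = col, value = gsub("\"", "", dt[[col]], fixed = TRUE))
  }
}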