
Remove Quotes from large text file/data.table


I have a large text file (57 columns, ~3 million rows, about 1.2 GB) from which I want to remove the quotes (double or single) around the cell values, e.g.:

Column1 Column2

"1" "2" ...

"3" "4" ...

... ...

So far I have tried two implementations.

Implementation No 1 - WORKING:

Read the file using the readLines base function

system.time(example1 <- gsub("\"", "", readLines("large-file.txt"), perl = TRUE))

user  system elapsed 

88.756   1.172  90.226 
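
If the end goal is an unquoted copy of the file on disk, the cleaned lines can then be written straight back out (the output file name below is just a placeholder):

# write the cleaned vector of lines to a new file
writeLines(example1, "large-file-noquotes.txt")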

Implementation No 2 - NOT WORKING:

Read the file with the fread function from the data.table package and then remove the quotes with gsub.

library(data.table)

system.time(example2 <- fread("large-file.txt"))

user  system elapsed 

38.052   6.176  59.764

example2 <- gsub("\"", "", example2)

This usually either leaves my 4 GB laptop unresponsive, or I end up killing the R session after more than 10 minutes.

I have also tried different approaches with data.table, but without success.

e.g.

example3 <- example2[, lapply(.SD, function(x) x %like% "\"")]

in order to at least flag the cells that still contain quotes.

How can I combine fread's reading speed with fast gsub-ing in a data.table?


Solution

  • I would delete the quotes outside R, e.g. with tr and its -d option, and let fread read the cleaned output of that command:

    data.table::fread("tr -d '\"' < INPUT.txt")
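
    Recent data.table versions also accept the shell command via the cmd argument; a minimal sketch of the same idea (INPUT.txt is just a placeholder name):

    library(data.table)
    # tr strips the double quotes before the text reaches R,
    # so fread only has to parse clean, unquoted fields
    example4 <- fread(cmd = "tr -d '\"' < INPUT.txt")

  • If the quotes have to be removed inside R, one option is a column-by-column replacement by reference. This avoids what gsub() on the whole data.table does, namely coercing every column into one giant string before replacing. A sketch, assuming example2 is the fread result from above and the affected columns were read as character:

    library(data.table)
    example2 <- fread("large-file.txt")
    # overwrite each character column in place instead of copying the whole table
    for (col in names(example2)) {
      if (is.character(example2[[col]])) {
        set(example2, j = col, value = gsub("\"", "", example2[[col]], fixed = TRUE))
      }
    }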