I am currently attempting to use the read_table() function from the readr package on a few large data files. I only want the second column, so I skip all of the other columns with this argument to the function:
col_types = c(paste("_", "c", paste(rep("_", 20000), sep = "", collapse = ""), sep = "", collapse = ""))
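For reference, the spec above evaluates to "_c" followed by 20,000 underscores. A roughly equivalent, self-contained sketch (the file name is a placeholder) would be:

library(readr)

# Placeholder file name; the compact spec skips column 1 ("_"), reads column 2
# as character ("c"), and skips the remaining 20,000 columns.
spec <- paste0("_c", strrep("_", 20000))
dat <- read_table("big_file.txt", col_types = spec)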
However, read_table seems to insist on reading in the entire data file (using up excessive memory and causing a crash) instead of just reading in column 2.
With read.table(), I have tried a similar argument:
colClasses = c("NULL", "character", rep("NULL", 20000))
which works perfectly without taking up excess memory, but I would like to use read_table since it is supposedly faster. Any ideas on why read_table is taking up so much memory even though I am including an argument to keep only one column?
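For reference, a self-contained sketch of the base-R call described above, with a placeholder file name and assuming no header row:

# Placeholder file name; "NULL" drops a column, "character" keeps it, so only
# the second column is actually read into memory.
dat <- read.table("big_file.txt",
                  colClasses = c("NULL", "character", rep("NULL", 20000)),
                  header = FALSE)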
If you only want to read the second column of a large data file, you can also use the fread function from the data.table package. fread was likewise developed for (very) fast file reading.

fread has a select argument with which you can specify which columns to load. In your case it would be something like:
dt <- fread("name_of_file.csv", select=2)
This selects only the second column. You can also give it a vector of columns:
dt <- fread("name_of_file.csv", select=c(2,5,10))
or a vector of column names:
dt <- fread("name_of_file.csv", select=c("id","time"))
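fread returns a data.table, so if you ultimately want the column as a plain vector, you can extract it afterwards. A minimal sketch (the file name and column index are placeholders):

library(data.table)

# Read only the second column, then pull it out of the one-column data.table.
dt <- fread("name_of_file.csv", select = 2)
second_col <- dt[[1]]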