
read_table() from readr package in R


I am trying to use the read_table() function from the readr package on a few large data files. I only want the second column, so I skip all the other columns by passing this argument to the function:

col_types = c(paste("_", "c", paste(rep("_", 20000), sep = "", collapse = ""), sep = "", collapse  = ""))

EDIT: The 1st and 3rd pairs of quotes in the code above should each contain an underscore.

However, read_table seems to insist on reading in the entire data file (using up excessive memory and causing a crash) instead of reading in only column 2.

With read.table(), I have tried a similar argument: colClasses = c("NULL", "character", rep("NULL", 20000)). This works perfectly without taking up excess memory, but I would like to use read_table since it is supposedly faster. Any ideas on why read_table is taking up so much memory even though I am including an argument to keep only one column?


Solution

  • If you only want to read the second column of a large data file, you can also use the fread function from the data.table package. Like read_table, fread was developed for (very) fast file reading.

    fread has a select argument that specifies which columns to load. In your case it would be something like:

    dt <- fread("name_of_file.csv", select=2)
    

    This selects only the second column. You can also give it a vector of column indices:

    dt <- fread("name_of_file.csv", select=c(2,5,10))
    

    or a vector of column names:

    dt <- fread("name_of_file.csv", select=c("id","time"))
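
As a minimal, self-contained sketch of the select approach, the snippet below writes a tiny whitespace-separated file and reads back only its second column. The sample file and its contents are made up for illustration, and header = FALSE is an assumption about the file layout (no header row); adjust both to match your real data.

```r
library(data.table)

# Hypothetical sample file: three whitespace-separated columns, no header row.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a1 b1 c1", "a2 b2 c2", "a3 b3 c3"), tmp)

# Load only the second column. With header = FALSE, fread assigns
# default column names (V1, V2, ...), so the result has one column.
dt <- fread(tmp, select = 2, header = FALSE)

print(dt)

# Remove the temporary file.
unlink(tmp)
```

If your real file has a header row, drop header = FALSE and select the column by name instead, as in the last example above.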