Tags: r, csv, load, readr, rdata

What are the file formats that read into R the fastest?


It seems intuitive that .RData files would be the fastest file format for R to load, but from scanning Stack Overflow posts, more attention seems to have gone to improving load times for .csv and other formats. Is there a definitive answer?


Solution

  • Not a definitive answer, but below are the times it took to load the same data frame, each measured with system.time(): once from a .tab file with each of utils::read.delim(), readr::read_tsv() and data.table::fread(), and once from a binary .RData file.

    .tab with utils::read.delim

    system.time(
      read.delim("file.tab")
    )
    #   user  system elapsed 
    # 52.279   0.146  52.465
    

    .tab with readr::read_tsv

    system.time(
      readr::read_tsv("file.tab")
    )
    #   user  system elapsed 
    # 23.417   0.839  24.275
    

    .tab with data.table::fread

    At @Roman's request, the same ~500MB file loaded in a blistering 3 seconds:

    system.time(
      data.table::fread("file.tab")
    )
    # Read 49739 rows and 3005 (of 3005) columns from 0.400 GB file in 00:00:04
    #    user  system elapsed 
    #   3.078   0.092   3.172 
    

    .RData binary file of the same dataframe

    system.time(
      load("file.RData")
    )
    #    user  system elapsed 
    #   2.181   0.028   2.210
    

    Clearly not definitive (sample size = 1!), but in my case with a ~500MB data frame:

    1. Binary .RData is quickest (about 2.2 s elapsed)
    2. data.table::fread() is a close second (about 3.2 s)
    3. readr::read_tsv() is roughly an order of magnitude slower (about 24.3 s)
    4. utils::read.delim() is slowest, taking about twice as long as readr (about 52.5 s)
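
    To repeat this kind of comparison on your own machine, something like the following sketch may help. It is not the code used for the timings above: the data frame, its size, and the temporary file names are made up for illustration, and the readr and data.table readers are only timed if those packages happen to be installed. With a data set this small the absolute numbers will be tiny and noisy, so treat the ordering as indicative only.

    ```r
    # Illustrative benchmark sketch; the data and file names are hypothetical.
    set.seed(1)
    df <- as.data.frame(matrix(rnorm(2000 * 50), nrow = 2000))

    tab_file   <- tempfile(fileext = ".tab")
    rdata_file <- tempfile(fileext = ".RData")

    # Write the same data frame as tab-delimited text and as a binary .RData file.
    write.table(df, tab_file, sep = "\t", quote = FALSE, row.names = FALSE)
    save(df, file = rdata_file)

    # Load the .RData file into its own environment so nothing is overwritten.
    e <- new.env()

    # Elapsed seconds for the two base R readers.
    timings <- c(
      read.delim = system.time(from_delim <- utils::read.delim(tab_file))[["elapsed"]],
      load       = system.time(load(rdata_file, envir = e))[["elapsed"]]
    )

    # Time the optional readers only when the packages are available.
    if (requireNamespace("readr", quietly = TRUE)) {
      timings[["read_tsv"]] <- system.time(
        suppressMessages(readr::read_tsv(tab_file))
      )[["elapsed"]]
    }
    if (requireNamespace("data.table", quietly = TRUE)) {
      timings[["fread"]] <- system.time(data.table::fread(tab_file))[["elapsed"]]
    }

    # Fastest first.
    print(sort(timings))
    ```

    Running each reader several times and taking the median would give more stable numbers than a single system.time() call, especially since the first read of a file may be slowed by disk caching.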