
How to save a csv as bzip2 in R, either within fwrite or after saving the csv using fwrite


I have code which uses write.csv to save a large number of files in bzip2 format. Here's a small reproducible example:

df <- data.frame(A = rnorm(100000), B = rnorm(100000), C = rnorm(100000))
write.csv(df, file = bzfile('df.csv.bzip2'))

I want to speed up the code. I know data.table::fwrite is much faster than write.csv, but I don't know how to get fwrite to save to csv.bzip2. I've optimistically tried the below, but the compression doesn't appear to be working, e.g. the file size is 5.4MB vs. 2.5MB from the write.csv version saved above.

data.table::fwrite(df, 'df2.csv.bzip2') 

Can anyone advise if it's possible to use fwrite to save a compressed csv in bzip2 format? If not, can anyone advise on an alternative way to save a csv via fwrite and then convert to bzip2 format? E.g. something like the below. It's not essential to do the compression within fwrite, I just want to use fwrite to speed up the saving process and for the end product to be a properly-compressed csv.bzip2 file.

data.table::fwrite(df, 'df2.csv') #saves a normal csv
# (add code here which converts the output of fwrite to a properly-compressed csv.bzip2 file)

NB I'm aware I can save as gzip through fwrite, but I want the file to be in bzip2 format.


Solution

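  • You can use R.utils::bzip2 to compress the file after fwrite has written it. By default it writes df2.csv.bz2 and removes the uncompressed df2.csv. In the timings below this was about twice as fast as write.csv with bzfile, and it produced the smallest file.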

    df <- data.frame(A = rnorm(100000), B = rnorm(100000), C = rnorm(100000))
    
    system.time(write.csv(df, file = bzfile("df.csv.bz2")))
    #       user      system     elapsed 
    #      0.912       0.005       0.917 
    
    system.time({data.table::fwrite(df, "df2.csv"); R.utils::bzip2("df2.csv")})
    #       user      system     elapsed 
    #      0.487       0.011       0.473 
    
    system.time(readr::write_csv(df, "df3.csv.bz2")) # suggested in a comment by @Ritchie Sacramento
    #       user      system     elapsed 
    #      0.743       0.042       0.988 
    
    file.size("df.csv.bz2")
    #[1] 2511607
    
    file.size("df2.csv.bz2")
    #[1] 2232901
    
    file.size("df3.csv.bz2")
    #[1] 2431997
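
If you would rather not depend on R.utils, the same "write first, compress afterwards" idea can be sketched with base R connections alone. This is a minimal sketch, not something from the answer above: bzip2_file is a hypothetical helper name, and the chunk size and the temporary file paths are arbitrary assumptions. It copies the csv through bzfile() in raw chunks, optionally removes the uncompressed file, and lets you choose the destination name, so the .bzip2 extension from the question can be kept.

```r
# Hypothetical helper (not part of data.table or R.utils): compress an
# existing file to bzip2 using only base R connections.
bzip2_file <- function(src, dest = paste0(src, ".bz2"),
                       chunk_size = 1024L * 1024L, remove = TRUE) {
  input <- file(src, open = "rb")
  output <- bzfile(dest, open = "wb")
  tryCatch({
    repeat {
      # Copy raw bytes in chunks so large files never sit fully in memory.
      chunk <- readBin(input, what = "raw", n = chunk_size)
      if (length(chunk) == 0L) break
      writeBin(chunk, output)
    }
  }, finally = {
    # Close both connections before touching the source file.
    close(input)
    close(output)
  })
  if (remove) file.remove(src)
  invisible(dest)
}

# Usage: write the csv quickly, then compress it.
df <- data.frame(A = rnorm(1000), B = rnorm(1000))
csv_path <- file.path(tempdir(), "df2.csv")
if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::fwrite(df, csv_path)
} else {
  # Fallback only so the sketch still runs without data.table installed.
  write.csv(df, csv_path, row.names = FALSE)
}
bz_path <- bzip2_file(csv_path, dest = paste0(csv_path, ".bzip2"))

# Read the compressed file back to confirm it is a valid bzip2 csv.
round_trip <- read.csv(bzfile(bz_path))
```

Reading the file back through bzfile() is a cheap way to confirm the result is really compressed rather than a plain csv with a misleading extension, which is what happened with the original fwrite(df, 'df2.csv.bzip2') attempt. I have not benchmarked this helper against R.utils::bzip2, so treat any speed difference as unknown.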