I am importing many (> 300) .csv files in a project, and I stumbled upon a very strange occurrence.
There is a noticeable difference in size when comparing the results of read_csv and read.csv.
Windows lists the file size of all files to be ~442 MB.
Using readr
library(tidyverse)
datadir <- "Z:\\data\\attachments"
list_of_files <- list.files(path = datadir, full.names = TRUE)
readr_data <- lapply(list_of_files, function(x) {
read_csv(x, col_types = cols())
})
object.size(readr_data)
#> 416698080 bytes
str(readr_data[1])
#> List of 1
#> $ : tibble [2,123 x 80] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
Using base methods
base_data <- lapply(list_of_files, function(x) {
read.csv(x)
})
object.size(base_data)
#> 393094616 bytes
str(base_data[1])
#> List of 1
#> $ :'data.frame': 2123 obs. of 80 variables:
# Compare size
object.size(readr_data) / object.size(base_data) * 100
#> 106 bytes
Now 6% may not sound like much, but that is still 23 MB, and I am still interested in why they differ. Additionally, both of these are smaller than the size reported by Windows.
Why are the lists of different size, and is that important?
EDIT: Apparently some of the classes are different. I used this method:
readr_class <- sapply(readr_data[[1]], class)
base_class <- sapply(base_data[[1]], class)
result <- data.frame(readr_class, base_class)
And these are the differences:
     readr_class base_class
var1     numeric    integer
var2     numeric    integer
var3     numeric    integer
var4   character    integer
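For a rough sense of how much the storage type alone matters, here is a hypothetical illustration (not taken from the actual files): an integer column uses 4 bytes per element while a numeric (double) column uses 8, so a column read as numeric takes roughly twice the space of the same column read as integer.
x_int <- sample.int(1e6)       # materialised integer vector
x_dbl <- as.numeric(x_int)     # same values stored as doubles
object.size(x_int)             # roughly 4 MB
object.size(x_dbl)             # roughly 8 MB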
Selecting the right functions is of course very important for writing efficient code. The degree of optimization in different functions and packages affects how objects are stored, how large they are, and how fast operations on them run. Consider the following.
library(data.table)
a <- c(1:1000000)
b <- rnorm(1000000)
mat <- as.matrix(cbind(a, b))
df <- data.frame(a, b)
dt <- data.table::as.data.table(mat)
cat(paste0("Matrix size: ",object.size(mat), "\ndf size: ", object.size(df), " (",round(object.size(df)/object.size(mat),2) ,")\ndt size: ", object.size(dt), " (",round(object.size(dt)/object.size(mat),2),")" ))
Matrix size: 16000568
df size: 12000848 (0.75)
dt size: 4001152 (0.25)
So here you can already see that data.table stores the same data in about a quarter of the space the matrix uses, and about a third of the space of the data.frame. Now for operation speed:
library(microbenchmark)
microbenchmark(df[df$a*df$b>500,], mat[mat[,1]*mat[,2]>500,], dt[a*b>500])
Unit: milliseconds
expr min lq mean median uq max neval
df[df$a * df$b > 500, ] 23.766201 24.136201 26.49715 24.34380 30.243300 32.7245 100
mat[mat[, 1] * mat[, 2] > 500, ] 13.010000 13.146301 17.18246 13.41555 20.105450 117.9497 100
dt[a * b > 500] 8.502102 8.644001 10.90873 8.72690 8.879352 112.7840 100
Based on the medians above, data.table does the filtering about 2.8 times faster than base subsetting on the data.frame, and about 1.5 times faster than subsetting the matrix.
And that's not all: for almost any CSV import, using data.table::fread will change your life. Give it a try instead of read.csv or read_csv.
IMHO data.table doesn't get half the love it deserves; it is the best all-round package for performance, with a very concise syntax. The data.table vignettes should put you on your way quickly, and that is worth the effort, trust me.
For further performance improvements, Rfast contains many Rcpp implementations of popular functions and problems, such as rowSort().
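A quick sketch of what that looks like (assuming Rfast is installed; rowSort() sorts each row of a numeric matrix, which in base R takes an apply() + sort() round trip):
library(Rfast)
m <- matrix(rnorm(1e6), nrow = 1000)
sorted_fast <- Rfast::rowSort(m)      # row-wise sort implemented in C++
sorted_base <- t(apply(m, 1, sort))   # base-R equivalent
all.equal(sorted_fast, sorted_base)   # check that the results agree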
EDIT: fread's speed is due to optimizations at the C-code level, involving the use of pointers for memory mapping and coerce-as-you-go techniques, which frankly are beyond my knowledge to explain. This post contains some explanations by the author, Matt Dowle, as well as an interesting, if short, discussion between him and the author of dplyr, Hadley Wickham.
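If you want to see the difference on your own data, here is a small benchmark sketch using one of the files from the question (hypothetical; timings depend on the file and the machine):
library(microbenchmark)
f <- list_of_files[[1]]
microbenchmark(
  base  = read.csv(f),
  readr = readr::read_csv(f, col_types = readr::cols()),
  fread = data.table::fread(f),
  times = 10
)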