I want to read a .sav file into R. However, it is much too large (>11GB). If I could read in only portions of the data, that would be fine, though not ideal. I have tried the haven package, but I can't seem to get the col_select argument to work. Thanks for your help!
As far as getting the whole dataset into R goes, I do not think reading it in chunks (or any similar workaround) would be more memory efficient for the entire dataset. However, there are easy, more memory-efficient ways to get the column names and/or specific variables:
Getting column names can be done by using the n_max
argument to read in an empty data frame (with variable and value labels preserved):
# using n_max = 0 is much more memory efficient
bench::mark(read_sav(temp)[0,],
read_sav(temp, n_max = 0))[1:9]
# A tibble: 2 x 9
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 read_sav(temp)[0, ] 1.11s 1.11s 0.902 76.5MB 6.31 1 7 1.11s
2 read_sav(temp, n_max = 0) 5.86ms 6.13ms 154. 97.2KB 1.98 78 1 505.6ms
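To make the idea above concrete, here is a self-contained sketch: it writes a small demo .sav file (mirroring the data set-up at the bottom of this answer), then uses n_max = 0 to read only the header and pull column names and variable labels without loading any rows. The demo dimensions are arbitrary.

```r
library(haven)

# small demo file (stand-in for your large .sav file)
test <- as.data.frame(matrix(1:1e4, nrow = 100, ncol = 100))
temp <- tempfile(fileext = ".sav")
write_sav(test, temp)

# header only: zero rows, all columns
empty <- read_sav(temp, n_max = 0)
head(names(empty))  # column names without loading the data

# variable labels (if any) live in the "label" attribute of each column
labels <- vapply(empty, function(x) {
  lb <- attr(x, "label")
  if (is.null(lb)) NA_character_ else lb
}, character(1))

file.remove(temp)
```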
Getting specific columns can be done with col_select using tidyselect helpers (or indices, or names, etc.) and is also more memory efficient:
bench::mark(read_sav(temp)[c("V1", "V5")],
read_sav(temp, col_select = matches("V(1|5)$")))[1:9]
# A tibble: 2 x 9
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 read_sav(temp)[c("V1", "V5")] 1.06s 1.06s 0.939 76.5MB 5.64 1 6 1.06s
2 read_sav(temp, col_select = matches("V(1|5)$")) 186.45ms 187.89ms 5.32 528.7KB 0 3 0 563.5ms
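Since the question also asks about reading in only portions of the data, the skip and n_max arguments can be combined to pull rows in chunks. A sketch (the chunk size is an arbitrary choice, and note that each call still re-parses the file from the start, so this saves memory, not time):

```r
library(haven)

# small demo file (stand-in for your large .sav file)
test <- as.data.frame(matrix(1:1e4, nrow = 100, ncol = 100))
temp <- tempfile(fileext = ".sav")
write_sav(test, temp)

chunk_size <- 40
# skip past already-read rows, then read the next chunk_size rows
chunk1 <- read_sav(temp, skip = 0,          n_max = chunk_size)  # rows 1-40
chunk2 <- read_sav(temp, skip = chunk_size, n_max = chunk_size)  # rows 41-80

file.remove(temp)
```

Chunked reads like this can be combined with col_select to keep each piece small enough to process and discard before moving on.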
Data set-up:
library(haven)

test <- as.data.frame(matrix(1:1e7, nrow = 1e4, ncol = 1e3))
temp <- tempfile()
write_sav(test, temp)
# file.remove(temp)