
Read first column of .sav file in R


I want to read a .sav file into R. However, it is much too large (>11GB). If I could read in only portions of the data, that should be fine though not ideal. So, is there a way to do any of the following:

  • Read just the header (for column names)
  • Import only certain columns (rather than the entire dataset) - I've tried the functions from haven but can't seem to get the col_select argument to work.
  • Read in the entire dataset - I'm aware of tools for .csv files but not for .sav files.

Thanks for your help!


Solution

  • I don't think the whole dataset can be read into R in chunks, or with any similar workaround, in a way that is more memory efficient overall, because the complete data still has to fit in memory. There are, however, easy and memory-efficient ways to get the column names or specific variables:

    Getting column names can be done using the n_max argument to read in an empty dataframe (with variable and value labels):

    # using n_max = 0 is much more memory efficient
    bench::mark(read_sav(temp)[0,],
                read_sav(temp, n_max = 0))[1:9]
    # A tibble: 2 x 9
      expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
      <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
    1 read_sav(temp)[0, ]          1.11s    1.11s     0.902    76.5MB     6.31     1     7      1.11s
    2 read_sav(temp, n_max = 0)   5.86ms   6.13ms   154.       97.2KB     1.98    78     1    505.6ms
    
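As a follow-up to the benchmark above, here is a minimal sketch of pulling the column names and variable labels out of a zero-row read. The small data frame, the `path` variable, and the "Test score" label are made up for illustration; in practice `path` would point at the large .sav file.

```r
library(haven)

# Small stand-in file (hypothetical data); replace `path` with the real .sav file.
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))
attr(df$score, "label") <- "Test score"
path <- tempfile(fileext = ".sav")
write_sav(df, path)

# Read zero rows: the result keeps the column structure but holds no data.
header <- read_sav(path, n_max = 0)

# Column names of the file.
col_names <- names(header)

# Variable labels live in each column's "label" attribute (NULL when absent).
# exact = TRUE avoids partially matching the "labels" (value labels) attribute.
var_labels <- lapply(header, function(x) attr(x, "label", exact = TRUE))

print(col_names)
```

The names can then be inspected to decide which variables are worth importing.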

    Getting specific columns can be done with select-helpers (or indices, or names, etc.) and is more memory efficient:

    bench::mark(read_sav(temp)[c("V1", "V5")],
                read_sav(temp, col_select = matches("V(1|5)$")))[1:9]
    # A tibble: 2 x 9
      expression                                           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
      <bch:expr>                                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
    1 read_sav(temp)[c("V1", "V5")]                      1.06s    1.06s     0.939    76.5MB     5.64     1     6      1.06s
    2 read_sav(temp, col_select = matches("V(1|5)$")) 186.45ms 187.89ms     5.32    528.7KB     0        3     0    563.5ms
    
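Since `col_select` accepts names and indices as well as select-helpers, the three forms below should return the same two columns. This is only a sketch on a small made-up file; `path` and `wanted` are illustrative names, and `all_of()` is called through the tidyselect namespace so the example does not depend on which packages are attached.

```r
library(haven)

# Small stand-in file with columns V1 ... V5 (hypothetical data).
test_small <- as.data.frame(matrix(1:20, nrow = 4, ncol = 5))
path <- tempfile(fileext = ".sav")
write_sav(test_small, path)

# 1. Bare column names
by_name <- read_sav(path, col_select = c(V1, V5))

# 2. Column positions
by_index <- read_sav(path, col_select = c(1, 5))

# 3. A character vector of names held in a variable
wanted <- c("V1", "V5")
by_vector <- read_sav(path, col_select = tidyselect::all_of(wanted))

print(names(by_name))
```

If `col_select` fails with an "unused argument" error, the installed haven version may predate the argument, so updating the package is worth trying first.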

    Data set-up:

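    library(haven)  # provides read_sav() and write_sav()

    # 1e4 rows x 1e3 integer columns, named V1 ... V1000
    test <- as.data.frame(matrix(1:1e7, nrow = 1e4, ncol = 1e3))
    temp <- tempfile()
    write_sav(test, temp)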
    
    # file.remove(temp)
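
Reading in chunks does not make the full dataset fit in memory, but it can still help when each chunk can be reduced to something small, such as a running total. The sketch below combines the `skip` and `n_max` arguments of `read_sav()` for that purpose; the file, `chunk_size`, and the summed column are all made up for illustration, and each call probably has to scan the file from the start again, so this is likely slow on a very large file.

```r
library(haven)

# Small stand-in file (hypothetical data).
df <- data.frame(V1 = 1:10, V2 = 11:20)
path <- tempfile(fileext = ".sav")
write_sav(df, path)

chunk_size <- 4  # illustrative; would be far larger in practice
offset <- 0      # number of rows already processed
total <- 0       # running sum of V1

repeat {
  # Read at most chunk_size rows of one column, starting after `offset` rows.
  chunk <- read_sav(path, col_select = V1, skip = offset, n_max = chunk_size)

  # Reduce the chunk to a summary, then let the chunk itself be discarded.
  total <- total + sum(chunk$V1)
  offset <- offset + nrow(chunk)

  # A short chunk means the end of the file has been reached.
  if (nrow(chunk) < chunk_size) break
}

print(total)
```

Only one chunk is held in memory at a time, so peak usage depends on `chunk_size` and the number of selected columns rather than on the file size.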