Why does haven::write_dta() inflate file size and can it be changed?


Sometimes I need to convert SPSS files to DTA files. Usually I use Stat/Transfer, but I thought perhaps I could use R to save money.

When I convert files using the haven package, however, the resulting file is dramatically larger than when I use Stat/Transfer.

For example, here's a .sav file I found on the internet. It is 85kb.

Using Stat/Transfer to convert it yields an even smaller 47kb .dta file.

However, when I run this code I get a .dta file which is 118kb. That's 2.5 times as large as the Stat/Transfer product.

from.sav <- haven::read_sav("PsychBike.sav")
haven::write_dta(from.sav, "PsychBikeFromHaven.dta")

Is there anything I can do to make the output of haven::write_dta() smaller?


Solution

  • This is because write_dta() doesn't compress the data: it often picks a much larger storage type than the values need. Below is an extreme, yet real-world, example from my work. (The file name and variable names are redacted.)

    Notice the file size. After running Stata's compress, it dropped from about 1 MB (1,032,900 bytes) to about 6 KB (6,200 bytes), a 99.4% reduction. The real dataset actually has millions of observations, so I'm having a hard time converting it to .dta using write_dta(). Something probably needs to be tuned at the ReadStat level.

    . desc, size
    
    Contains data from v1.dta
      obs:           100
     vars:            22                          04 Sep 2019 10:19
     size:     1,032,900
    -------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    -------------------------------------------------------------------------------
    var1            double  %10.0g
    var2            str1    %-9s
    var3            double  %td
    var4            double  %td
    var5            str4    %-9s
    var6            str1    %-9s
    var7            str2045 %-9s
    var8            str2045 %-9s
    var9            str2045 %-9s
    var10           str2045 %-9s
    var11           str2045 %-9s
    var12           str5    %-9s
    var13           double  %10.0g
    var14           double  %td
    var15           double  %10.0g
    var16           str3    %-9s
    var17           double  %10.0g
    var18           double  %10.0g
    var19           double  %10.0g
    var20           double  %10.0g
    var21           double  %10.0g
    var22           str2    %-9s
    -------------------------------------------------------------------------------
    Sorted by:
         Note: Dataset has changed since last saved.
    r; t=0.00 10:27:24
    
    . compress
      variable var1 was double now long
      variable var3 was double now int
      variable var4 was double now int
      variable var14 was double now int
      variable var17 was double now byte
      variable var18 was double now long
      variable var19 was double now byte
      variable var20 was double now byte
      variable var7 was str2045 now str1
      variable var8 was str2045 now str1
      variable var9 was str2045 now str1
      variable var10 was str2045 now str1
      variable var11 was str2045 now str1
      (1,026,700 bytes saved)
    r; t=0.00 10:27:34
    
    . desc, size
    
    Contains data from v2.dta
      obs:           100
     vars:            22                          04 Sep 2019 10:19
     size:         6,200
    -------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    -------------------------------------------------------------------------------
    var1            long    %10.0g
    var2            str1    %-9s
    var3            int     %td
    var4            int     %td
    var5            str4    %-9s
    var6            str1    %-9s
    var7            str1    %-9s
    var8            str1    %-9s
    var9            str1    %-9s
    var10           str1    %-9s
    var11           str1    %-9s
    var12           str5    %-9s
    var13           double  %10.0g
    var14           int     %td
    var15           double  %10.0g
    var16           str3    %-9s
    var17           byte    %10.0g
    var18           long    %10.0g
    var19           byte    %10.0g
    var20           byte    %10.0g
    var21           double  %10.0g
    var22           str2    %-9s
    -------------------------------------------------------------------------------
    Sorted by:
         Note: Dataset has changed since last saved.
    r; t=0.00 10:27:37
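
If Stata is not available to run compress, part of the same effect can be approximated on the R side before writing. The following is only a sketch: shrink_doubles() is a hypothetical helper, not part of haven, and it rests on the assumption that write_dta() stores R integer vectors in a narrower Stata type than R doubles. It converts plain double columns to integer when every non-missing value is a whole number that fits in R's integer range, and leaves dates, labelled vectors, and strings untouched.

```r
# Hypothetical helper (not part of haven): downcast whole-number double
# columns to integer so they need not be written as 8-byte doubles.
shrink_doubles <- function(df) {
  int_max <- .Machine$integer.max
  for (nm in names(df)) {
    x <- df[[nm]]
    # Only touch plain double vectors; skip Date, labelled, factor, etc.
    if (is.double(x) && is.null(attr(x, "class"))) {
      ok <- x[!is.na(x)]
      # Convert only if every value is whole and fits in a 32-bit integer
      if (all(ok == trunc(ok)) && all(abs(ok) <= int_max)) {
        storage.mode(x) <- "integer"  # changes the type, keeps attributes
        df[[nm]] <- x
      }
    }
  }
  df
}

# Small demonstration on made-up data
demo <- data.frame(
  id    = c(1, 2, NA),     # whole numbers: becomes integer
  score = c(1.5, 2, 3),    # has a fraction: stays double
  big   = c(1e10, 1, 2),   # too large for integer: stays double
  name  = c("x", "y", "z"),
  stringsAsFactors = FALSE
)
str(shrink_doubles(demo))
```

In the question's workflow this would sit between the two haven calls, for example `haven::write_dta(shrink_doubles(from.sav), "PsychBikeFromHaven.dta")`. It does nothing about over-wide string columns such as the str2045 variables in the log above, so running compress in Stata remains the more complete fix.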