Sometimes I need to convert SPSS files to DTA files. Usually I use Stat/Transfer, but I thought perhaps I could use R to save money.
When I convert files using the haven package, however, the resulting file is dramatically larger than when I use Stat/Transfer.
For example, here's a .sav file I found on the internet. It is 85 KB.
Using Stat/Transfer to convert it yields an even smaller 47 KB .dta file.
However, when I run this code I get a .dta file which is 118 KB. That's 2.5 times as large as the Stat/Transfer product.
from.sav <- haven::read_sav("PsychBike.sav")
haven::write_dta(from.sav, "PsychBikeFromHaven.dta")
Is there anything I can do to make the output of haven::write_dta() smaller?
This is because write_dta() doesn't compress, i.e., it often picks an excessively large storage type. Below is an extreme, yet real-world, example from my work. (File name and variable names are redacted.)
Notice the file size: running Stata's compress shrank it from about 1 MB to about 6 KB, a 99.4% size reduction. The real dataset actually has millions of observations, so I'm having a hard time converting it to dta using write_dta(). Probably something needs to be tuned at the ReadStat level.
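Before the Stata log, here is a possible workaround on the R side. It is only a sketch: shrink_whole_doubles() is a hypothetical helper, not part of haven. It relies on my understanding that haven writes R integer vectors with a 4-byte Stata type rather than an 8-byte double, so treat that as an assumption and verify the result with desc in Stata. It leaves classed columns (dates, labelled values) alone and does nothing about the str2045 columns.

```r
# Hypothetical helper (not part of haven): store whole-number double
# columns as integer before handing the data to write_dta().
shrink_whole_doubles <- function(df) {
  for (nm in names(df)) {
    x <- df[[nm]]
    # Only plain double vectors; classed columns (Date, labelled, ...) are skipped.
    if (!is.double(x) || is.object(x)) next
    ok <- x[!is.na(x)]
    # Convert only if every non-missing value is a whole number that fits
    # in R's 32-bit integer range.
    if (all(is.finite(ok)) && all(ok == round(ok)) &&
        all(abs(ok) <= .Machine$integer.max)) {
      storage.mode(x) <- "integer"  # keeps attributes such as variable labels
      df[[nm]] <- x
    }
  }
  df
}

# Small made-up data frame to show the effect on column types.
demo <- data.frame(
  id = c(1, 2, NA),
  score = c(0.5, 1.25, 3),
  name = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
small <- shrink_whole_doubles(demo)
str(small)

# Usage with the question's file (requires the haven package):
# from.sav <- haven::read_sav("PsychBike.sav")
# haven::write_dta(shrink_whole_doubles(from.sav), "PsychBikeFromHaven.dta")
```

Running compress in Stata afterwards is still the more thorough fix, since it can also pick byte and int and shorten string columns.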
. desc, size
Contains data from v1.dta
obs: 100
vars: 22 04 Sep 2019 10:19
size: 1,032,900
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
var1            double  %10.0g
var2            str1    %-9s
var3            double  %td
var4            double  %td
var5            str4    %-9s
var6            str1    %-9s
var7            str2045 %-9s
var8            str2045 %-9s
var9            str2045 %-9s
var10           str2045 %-9s
var11           str2045 %-9s
var12           str5    %-9s
var13           double  %10.0g
var14           double  %td
var15           double  %10.0g
var16           str3    %-9s
var17           double  %10.0g
var18           double  %10.0g
var19           double  %10.0g
var20           double  %10.0g
var21           double  %10.0g
var22           str2    %-9s
-------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
r; t=0.00 10:27:24
. compress
variable var1 was double now long
variable var3 was double now int
variable var4 was double now int
variable var14 was double now int
variable var17 was double now byte
variable var18 was double now long
variable var19 was double now byte
variable var20 was double now byte
variable var7 was str2045 now str1
variable var8 was str2045 now str1
variable var9 was str2045 now str1
variable var10 was str2045 now str1
variable var11 was str2045 now str1
(1,026,700 bytes saved)
r; t=0.00 10:27:34
. desc, size
Contains data from v2.dta
obs: 100
vars: 22 04 Sep 2019 10:19
size: 6,200
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
var1            long    %10.0g
var2            str1    %-9s
var3            int     %td
var4            int     %td
var5            str4    %-9s
var6            str1    %-9s
var7            str1    %-9s
var8            str1    %-9s
var9            str1    %-9s
var10           str1    %-9s
var11           str1    %-9s
var12           str5    %-9s
var13           double  %10.0g
var14           int     %td
var15           double  %10.0g
var16           str3    %-9s
var17           byte    %10.0g
var18           long    %10.0g
var19           byte    %10.0g
var20           byte    %10.0g
var21           double  %10.0g
var22           str2    %-9s
-------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
r; t=0.00 10:27:37
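The sizes reported by desc can be reproduced by adding up the bytes each storage type takes per observation. The sketch below does that arithmetic in R for the two type listings above. The byte widths are Stata's standard ones (byte 1, int 2, long 4, float 4, double 8, strN N), and type_bytes() is a hypothetical helper written just for this illustration.

```r
# Bytes per observation for a Stata storage type; strN takes N bytes.
type_bytes <- function(type) {
  fixed <- c(byte = 1, int = 2, long = 4, float = 4, double = 8)
  if (type %in% names(fixed)) return(fixed[[type]])
  as.numeric(sub("^str", "", type))
}

# Storage types of var1..var22 as written by write_dta() (first desc output).
before <- c("double", "str1", "double", "double", "str4", "str1",
            rep("str2045", 5),
            "str5", "double", "double", "double", "str3",
            rep("double", 5), "str2")

# Storage types of var1..var22 after compress (second desc output).
after <- c("long", "str1", "int", "int", "str4", "str1",
           rep("str1", 5),
           "str5", "double", "int", "double", "str3",
           "byte", "long", "byte", "byte", "double", "str2")

n_obs <- 100
row_before <- sum(vapply(before, type_bytes, numeric(1)))
row_after  <- sum(vapply(after, type_bytes, numeric(1)))

row_before * n_obs                  # 1032900 bytes, the size before compress
row_after * n_obs                   # 6200 bytes, the size after compress
(row_before - row_after) * n_obs    # 1026700 bytes saved
```

Almost all of the saving comes from the five str2045 columns: they account for 10,225 of the 10,329 bytes per observation, which is why the problem gets so much worse with millions of rows.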