
How to load .dta files (preserving labels) most conveniently in R?


I work with .dta files and want to make loading data as convenient as possible. As I see it, I need a combination of haven and readstata13.

  • haven looks perfect. It provides the best "sub-labels". But it has no option for selecting columns, and I cannot use read_dta for large files (~1 GB, on a machine with 64 GB RAM and an Intel Xeon E5).

    Question: Is there a way to select/load a subset of the data?

  • read.dta13 is my best workaround. It has the select.cols argument, but I then have to extract the labels with attr afterwards, save them, and merge them (for about 10 files).

    Question: How can I manually add the second labels that the haven package creates? (What are they called?)


Here is the MWE:

library(foreign)
write.dta(mtcars, "mtcars.dta")  # create a small example .dta file

library(haven)
mtcars <- read_dta("mtcars.dta")  # reads the whole file

library(readstata13)
# read only the selected columns
mtcars2 <- read.dta13("mtcars.dta", convert.factors = FALSE,
                      select.cols = c("mpg", "cyl", "vs"))
# variable labels are stored as an attribute of the data frame
var.labels <- attr(mtcars2, "var.labels")
data.key.mtcars2 <- data.frame(var.name = names(mtcars2), var.labels)

Solution

  • haven's development version supports selecting columns with the col_select argument:

    library(haven) # devtools::install_github("tidyverse/haven")
    mtcars <- read_dta("mtcars.dta", col_select = c(mpg, cyl, vs))
    

    Alternatively, the column labels in RStudio's viewer are taken from the "label" attribute of each column of the data frame. You can use a simple loop to assign them from the labels read by readstata13:

    for (i in seq_along(mtcars2)) {
      attr(mtcars2[[i]], "label") <- var.labels[i]
    }
    
    View(mtcars2)
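
Since the question mentions about ten files, the loop above could be wrapped in a small helper. The following is only a sketch under stated assumptions: apply_var_labels is a hypothetical name (it is not part of haven or readstata13), and it assumes the data frame carries a "var.labels" attribute with one entry per column, in column order, as in the MWE. The demo uses a hand-built data frame with made-up labels so that it runs without a .dta file.

```r
# Hypothetical helper (not from any package): copy the data-frame-level
# "var.labels" attribute onto each column's "label" attribute, which is
# the attribute RStudio's viewer displays under the column name.
apply_var_labels <- function(df, labels = attr(df, "var.labels")) {
  if (is.null(labels)) {
    return(df)  # no labels available, return the data unchanged
  }
  # Assumption: exactly one label per column, in column order
  stopifnot(length(labels) == ncol(df))
  for (i in seq_along(df)) {
    attr(df[[i]], "label") <- labels[[i]]
  }
  df
}

# Stand-in for a data frame returned by read.dta13 (labels are made up)
demo <- data.frame(mpg = c(21, 22.8), cyl = c(6, 4))
attr(demo, "var.labels") <- c("Miles per gallon", "Number of cylinders")

demo <- apply_var_labels(demo)
attr(demo$mpg, "label")
```

With real files, the helper could be applied to each result of read.dta13, for example inside lapply over your own vector of file paths.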