I work with .dta files and try to make loading data as comfortable as possible. In my view, I need a combination of haven and readstata13.
haven looks perfect. It provides the best "sub-labels". But it does not provide a way to select columns, so I cannot use read_dta for large files (~1 GB, on a machine with 64 GB RAM and an Intel Xeon E5).
Question: Is there a way to select/load a subset of data?
read.dta13 is my best workaround. It has a select.cols argument. But I have to extract the labels with attr afterwards, then save and merge them (for about 10 files).
Question: How can I manually add these second labels which the haven package creates? (What are they called?)
Here is the MWE:
library(foreign)
write.dta(mtcars, "mtcars.dta")

# haven: reads the whole file
library(haven)
mtcars <- read_dta("mtcars.dta")

# readstata13: reads only the selected columns
library(readstata13)
mtcars2 <- read.dta13("mtcars.dta", convert.factors = FALSE, select.cols = c("mpg", "cyl", "vs"))
var.labels <- attr(mtcars2, "var.labels")
data.key.mtcars2 <- data.frame(var.name = names(mtcars2), var.labels)
haven's development version supports selecting columns with the col_select argument:
library(haven) # devtools::install_github("tidyverse/haven")
mtcars <- read_dta("mtcars.dta", col_select = c(mpg, cyl, vs))
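If the same columns are needed from several files, it may be handier to keep their names in a character vector. The following is a minimal sketch, assuming a haven version that has col_select and that the tidyselect package is installed; the temporary file and the name `wanted` are only for illustration.

```r
library(haven)

# Write a small example file so the sketch is self-contained
path <- tempfile(fileext = ".dta")
write_dta(mtcars, path)

# Column names kept as strings, e.g. to reuse them across several files
wanted <- c("mpg", "cyl", "vs")

# all_of() turns the character vector into a selection for col_select
subset_df <- read_dta(path, col_select = tidyselect::all_of(wanted))
names(subset_df)
```

The same call could be wrapped in lapply() over a vector of file paths to load all files with one column selection.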
Alternatively, the column labels in RStudio's viewer are taken from the "label" attribute of each column in the data frame. You can use a simple loop to assign them from the labels read by readstata13:
for (i in seq_along(mtcars2)) {
  attr(mtcars2[[i]], "label") <- var.labels[i]
}
View(mtcars2)
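Since this has to be repeated for about 10 files, the loop could be wrapped in a small function. This is only a sketch in base R: `set_var_labels` is a hypothetical helper, not part of haven or readstata13, and the label texts are made up for the example.

```r
# Hypothetical helper: copy a vector of variable labels onto the
# "label" attribute of each column, in column order
set_var_labels <- function(df, labels) {
  # One label per column is assumed
  stopifnot(length(labels) == ncol(df))
  for (i in seq_along(df)) {
    attr(df[[i]], "label") <- labels[[i]]
  }
  df
}

# Small made-up example
df <- data.frame(mpg = c(21, 22.8), cyl = c(6, 4))
df <- set_var_labels(df, c("Miles per gallon", "Number of cylinders"))
attr(df$mpg, "label")
```

With the objects from the question, the call would be mtcars2 <- set_var_labels(mtcars2, var.labels), provided var.labels has one entry per selected column.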