Search code examples
rdataframetidyversepurrrr-haven

Combine data file and label file together to have one single labelled dataframe in R


I have two data frames, one is survey data (data.csv) and another one is label data (label.csv). Here is the sample data (My original data has about 150 variables)

#sample data

df <- tibble::tribble(
  ~id, ~House_member, ~dob, ~age_quota, ~work, ~sex, ~pss,
  1L,            4L,  1983L,  2L,        2L,     1,      1,
  2L,            1L,  1940L,  7L,        2L,     1,      2,
  3L,            2L,  1951L,  5L,        6L,     1,      1,
  4L,            4L,  1965L,  2L,        2L,     1,      4,
  5L,            3L,  1965L,  2L,        3L,     1,      1,
  6L,            1L,  1951L,  3L,        1L,     1,      3,
  7L,            1L,  1955L,  1L,        1L,     1,      3,
  8L,            4L,  1982L,  2L,        2L,     2,      5,
  9L,            2L,  1990L,  2L,        4L,     2,      3,
  10L,            2L,  1953L, 3L,        2L,     2,      4
)


#sample label data
label <- tibble::tribble(
                ~variable, ~value,                           ~label,
           "House_member",     NA, "How many people live with you?",
           "House_member",     1L,                       "1 person",
           "House_member",     2L,                      "2 persons",
           "House_member",     3L,                      "3 persons",
           "House_member",     4L,                      "4 persons",
           "House_member",     5L,                      "5 persons",
           "House_member",     6L,                      "6 persons",
           "House_member",     7L,                      "7 persons",
           "House_member",     8L,                      "8 persons",
           "House_member",     9L,                      "9 persons",
           "House_member",    10L,                     "10 or more",
                    "dob",     NA,                  "date of brith",
              "age_quota",     NA,                      "age_quota",
              "age_quota",     1L,                          "10-14",
              "age_quota",     2L,                          "15-19",
              "age_quota",     3L,                          "20-29",
              "age_quota",     4L,                          "30-39",
              "age_quota",     5L,                          "40-49",
              "age_quota",     6L,                          "50-70",
              "age_quota",     7L,                           "70 +",
                   "work",     NA,        "what is your occupation?",
                   "work",     1L,                      "full time",
                   "work",     2L,                      "part time",
                   "work",     3L,                        "retired",
                   "work",     4L,                        "student",
                   "work",     5L,                      "housewife",
                   "work",     6L,                     "unemployed",
                   "work",     7L,                          "other",
                   "work",     8L,                   "kid under 15",
                    "sex",     NA,                        "gender?",
                    "sex",     1L,                            "Man",
                    "sex",     2L,                          "Woman",
                    "pss",     NA,       "How often do you use PS?",
                    "pss",     1L,                          "Daily",
                    "pss",     2L,         "several times per week",
                    "pss",     3L,                  "once per week",
                    "pss",     4L,         "several time per month",
                    "pss",     5L,                          "Rarly"
           )

I am wondering is there any way that I can combine these file together to have a one labelled dataframe like SPSS’s style format (dbl+lbl format). I know labelled package which can add a value label to a non labelled vector, like this example:

v <- labelled::labelled(c(1,2,2,2,3,9,1,3,2,NA), c(yes = 1, maybe = 2, no = 3))

I was hoping there is a better/faster way than adding label to each variable one by one.


Solution

  • Another imap_dfc solution:

    library(tidyverse)
    
    df %>% imap_dfc(~{ 
                      label[label$variable==.y,c('label','value')] %>%
                      deframe() %>% # to named vector
                      haven::labelled(.x,.)
                     })
    
    # A tibble: 10 x 7
              id  House_member       dob age_quota           work       sex                        pss
       <int+lbl>     <int+lbl> <int+lbl> <int+lbl>      <int+lbl> <dbl+lbl>                  <dbl+lbl>
     1         1 4 [4 persons]      1983 2 [15-19] 2 [part time]  1 [Man]   1 [Daily]                 
     2         2 1 [1 person]       1940 7 [70 +]  2 [part time]  1 [Man]   2 [several times per week]
     3         3 2 [2 persons]      1951 5 [40-49] 6 [unemployed] 1 [Man]   1 [Daily]                 
     4         4 4 [4 persons]      1965 2 [15-19] 2 [part time]  1 [Man]   4 [several time per month]
     5         5 3 [3 persons]      1965 2 [15-19] 3 [retired]    1 [Man]   1 [Daily]                 
     6         6 1 [1 person]       1951 3 [20-29] 1 [full time]  1 [Man]   3 [once per week]         
     7         7 1 [1 person]       1955 1 [10-14] 1 [full time]  1 [Man]   3 [once per week]         
     8         8 4 [4 persons]      1982 2 [15-19] 2 [part time]  2 [Woman] 5 [Rarly]                 
     9         9 2 [2 persons]      1990 2 [15-19] 4 [student]    2 [Woman] 3 [once per week]         
    10        10 2 [2 persons]      1953 3 [20-29] 2 [part time]  2 [Woman] 4 [several time per month]
    

    Used tibble::deframe and haven::labelled which are included in tidyverse

    Speed comparison after replacing filter/select by direct access to label:

    Waldi <- function() {
    df %>% imap_dfc(~{ 
        label[label$variable==.y,c('label','value')] %>%
          deframe() %>% # to named vector
          haven::labelled(.x,.)})}
    
    Waldi_old <- function() {   
        df %>% imap_dfc(~{ 
          label %>% filter(variable==.y) %>%
            select(label, value) %>%
            deframe() %>% # to named vector
            haven::labelled(.x,.)
        })}
    
    #EDIT : Included TIC33() for-loop solution
    
    microbenchmark::microbenchmark(TIC3(),Waldi(),Anil(),TIC1(),Waldi_old(),Sinh())
    Unit: microseconds
            expr     min       lq      mean   median       uq     max neval   cld
          TIC3()   688.0   871.80   982.280   920.95  1005.55  1801.6   100 a    
         Waldi()  1345.5  1543.60  1804.758  1635.45  1893.75  4306.8   100  b   
          Anil()  4006.8  4476.65  5188.519  4862.95  5439.10 10163.6   100   c  
          TIC1()  3898.2  4278.80  5009.927  4774.95  5277.05 12916.2   100   c  
     Waldi_old() 18712.3 20091.75 21756.140 20609.35 22169.75 33359.8   100    d 
          Sinh() 22730.9 24093.45 25931.412 24946.00 26614.00 38735.3   100     e