Search code examples
rfunctionpanel-data

Better way to combine variables in panel dataset in R?


I'm working with panel data where some of the questionnaire items get asked only once while most others are asked several times across different waves. So let's assume there I have survey responses on a categorical variable.

Even though it's likely panelists will be stable in their reponse on a given question, it may occur that they change their opinion, or even drop out of the panel. Others might enter the panel at a later point in time (in a later wave).

So let's say this leads to a dataset which looks like this, where var_1a and var_1b are two variables that measure the same thing just in two different waves:

mydata<-data.frame(
  var_1a = c(1,2,3,4,5,NA),
  var_1b = c(1,2,5,5,NA,NA)
)

Now I would like to automatically combine these variables into one single variable across the two waves. As a condition, I would like to take the mean if respondents have changed their opinion from one wave to the other. Furthermore, I would like to keep the reponse from wave 1 even if the respondent has droped out of the panel in wave 2.

The way I have been doing this in R is by using this function:

combo<-function(x,...){
  V<-rowMeans(x[, c(...) ], na.rm=T) * 
    ifelse(rowMeans(is.na(x[, c(...) ])) == ncol(x[, c(...) ]), NA, 1)
  ifelse(is.nan(V),NA,V)}

For some reason, this introduces "NaN" if all variables are consistently NA across all waves, hence the last line of code in the function.

This works just fine and I get what I wanted:

> mydata$combo_var<-combo(mydata, c("var_1a","var_1b"))
> mydata

   var_1a var_1b combo_var
1       1      1       1.0
2       2      2       2.0
3       3     -3       0.0
4       4      9       6.5
5       5     NA       5.0
6      NA     NA        NA
7      NA     NA        NA
8      NA     NA        NA
9      NA     NA        NA
10     NA     NA        NA

However, I was wondering if there's a better and easier way to do this. Is there a way I could use dplyr to do the same? Any tidyverse solution?


Solution

  • Some tidyverse options:

    Data

    mydata <- data.frame(
      var_1a = c(1,2,3,4,5,NA),
      var_1b = c(1,2,5,5,NA,NA)
    )
    

    Option 1

    library(dplyr)
    library(purrr)
    
    mydata <- mydata %>% 
      mutate(
        combo_var = pmap_dbl(select(., contains("var")), 
                         ~mean(c(...), na.rm = T))
        )
    

    Result:

     var_1a var_1b combo_var
    1      1      1       1.0
    2      2      2       2.0
    3      3      5       4.0
    4      4      5       4.5
    5      5     NA       5.0
    6     NA     NA       NaN
    

    Option 2

    mydata <- mydata %>% 
      mutate(
        combo_var = rowMeans(select(., contains("var")), na.rm = T)
        )
    

    Result:

     var_1a var_1b combo_var
    1      1      1       1.0
    2      2      2       2.0
    3      3      5       4.0
    4      4      5       4.5
    5      5     NA       5.0
    6     NA     NA       NaN