Search code examples
rdplyrmissing-datamagrittrmethod-missing

Sum NA across specific columns in R


I have data such as this:

  data_in <- read_table2("Id    Q62_1   Q62_2   Q3_1    Q3_2    Q3_3    Q3_4    Q3_5
1   Yes Sometimes                   
2       Always                  
3                           
4   No  Always  Yes             
5                           
6       Always  No  Likely  Yes Always  Always
7   Yes Sometimes   Maybe   Unlikely        Sometimes   Sometimes
8       Always  Yes Likely  No  Always  Always
9       Sometimes       Unlikely        Sometimes   Sometimes
10  No      No  Likely  Maybe       
11      Sometimes   Maybe   Unlikely        Sometimes   Sometimes
12      Always  Yes Likely      Always  Always
")

I would like to calculate the number of missing response within columns that start with Q62 and then from columns Q3_1 to Q3_5 separately.

I know that rowSums is handy to sum numeric variables, but is there a dplyr/piped equivalent to sum na's?

For example, if this were numeric data and I wanted to sum the q62 series, I could use the following:

data_in %>% 
  mutate(Q62_NA = rowSums(select(.,"Q62_1", "Q62_2"))

But how do I sum NAs?

My output should look something like this:

data_out <- read_table2("Id Q62_1   Q62_2   Q3_1    Q3_2    Q3_3    Q3_4    Q3_5    Q62_NA  Q3_NA
1   Yes Sometimes                       0   5
2       Always                      1   5
3                               2   5
4   No  Always  Yes                 0   5
5                               2   5
6       Always  No  Likely  Yes Always  Always  1   
7   Yes Sometimes   Maybe   Unlikely        Sometimes   Sometimes   0   1
8       Always  Yes Likely  No  Always  Always  1   0
9       Sometimes       Unlikely        Sometimes   Sometimes   1   1
10  No      No  Likely  Maybe           1   2
11      Sometimes   Maybe   Unlikely        Sometimes   Sometimes   1   1
12      Always  Yes Likely      Always  Always  1   1
")

Thank you!!


Solution

  • We can wrap the select with is.na to convert it to a logical matrix and then do the rowSums on that matrix to sum the number of TRUE elements per row

    library(dplyr)
    data_in %>% 
       mutate(Q62_NA = rowSums(is.na(select(.,"Q62_1", "Q62_2"))))
    

    Or an option with c_across and rowwise

     data_in %>%
        rowwise %>%
        mutate(Q62_NA = sum(is.na(c_across(starts_with('Q6')))))