Search code examples
rdplyr

How to count how many 1's there are within each row over a set number of columns depending on number of missing/NA's present in each row in R?


I want to create a new column "X11" that sums up all the 1's conditionally based on how many NA's there are within a select number of columns. In this case, I am looking at 4 variables: X1, X2, X3, and X4.

Ex: if there is 1 NA, then I want to look at the remaining 3 variables that have values and count how many 1's there are. If there are 2 NA's, then I want to look at the remaining 2 variables and count how many 1's there are. If I have 3 NA's, then I want to look at the remaining 1 variable and determine if it is a 1. If I have all 4 NA's, then this would give me 0.

I have this data:

df <- data.frame(replicate(10,sample(0:2, 10, rep=TRUE)))
df <- replace(df, df == 0, NA)

My data frame looks like this:

   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1   1  1 NA  1 NA NA NA  1  1   2
2  NA  1  1 NA  2 NA  2  2 NA   1
3   1 NA  1  1 NA NA  1  2 NA   1
4   2  2  2  1  1  2  1 NA  2   2
5  NA  2 NA  2 NA  2  1 NA  1   1
6   2  2  1  1  2 NA  1  2  1   1
7   1  2 NA NA  2  1  1 NA NA   1
8   2  2 NA NA  1 NA NA  2 NA   1
9   1 NA  1  2  2  1  2 NA NA   1
10 NA  2  1 NA NA NA NA  2  2  NA

I want my output to look like this:

   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
1   1  1 NA  1 NA NA NA  1  1   2   3
2  NA  1  1 NA  2 NA  2  2 NA   1   2
3   1 NA  1  1 NA NA  1  2 NA   1   3
4   2  2  2  1  1  2  1 NA  2   2   1
5  NA  2 NA  2 NA  2  1 NA  1   1   0
6   2  2  1  1  2 NA  1  2  1   1   2
7   1  2 NA NA  2  1  1 NA NA   1   1
8   2  2 NA NA  1 NA NA  2 NA   1   0
9   1 NA  1  2  2  1  2 NA NA   1   2
10 NA  2  1 NA NA NA NA  2  2  NA   1

Here is an example of my current code:

vars <- c("X1", "X2", "X3", "X4")
df <- df %>%
   mutate(missing_vars = rowSums(across(vars, ~is.na(.))),
          nonmissing_vars = 7-vars)

df <- df %>%
  mutate(zero_na = case_when(missing_vars == 0 & (X1 == 2 & X2 == 2 & X3 == 2 & X4 == 2) ~ 1,
                                (missing_vars == 0 & (X1 == 1 & X2 == 2 & X3 == 2 & X4 == 2) |
                                   (X1 == 2 & X2 == 1 & X3 == 2 & X4 == 2) |
                                   (X1 == 2 & X2 == 2 & X3 == 1 & X4 == 2) |
                                   (X1 == 2 & X2 == 2 & X3 == 2 & X4 == 1)) ~ 2,
                                (missing_vars == 0 & (X1 == 1 & X2 == 1 & X3 == 2 & X4 == 2) |
                                   (X1 == 1 & X2 == 2 & X3 == 1 & X4 == 2) |
                                   (X1 == 1 & X2 == 2 & X3 == 2 & X4 == 1) |
                                   (X1 == 2 & X2 == 1 & X3 == 1 & X4 == 2) |
                                   (X1 == 2 & X2 == 2 & X3 == 1 & X4 == 1) |
                                   (X1 == 2 & X2 == 1 & X3 == 2 & X4 == 1)) ~ 3,
                                (missing_vars == 0 & (X1 == 1 & X2 == 1 & X3 == 1 & X4 == 2) |
                                   (X1 == 1 & X2 == 1 & X3 == 2 & X4 == 1) |
                                   (X1 == 1 & X2 == 2 & X3 == 1 & X4 == 1) |
                                   (X1 == 2 & X2 == 1 & X3 == 1 & X4 == 1)) ~ 4,
                                missing_vars == 0 & (X1 == 1 & X2 == 1 & X3 == 1 & X4 == 1) ~ 5))

brfss <- brfss %>%
  mutate(one_na = case_when(missing_vars == 1 & (is.na(X1) & X2 == 2 & X3 == 2 & X4 == 2) ~ 1,
                                       missing_vars == 1 & (X1 == 2 & is.na(X2) & X3 == 2 & X4 == 2) ~ 1,
                                       missing_vars == 1 & (X1 == 2 & X2 == 2 & is.na(X3) & X4 == 2) ~ 1,
                                       missing_vars == 1 & (X1 == 2 & X2 == 2 & X3 == 2 & is.na(X4)) ~ 1,
                                       missing_vars == 1 & (is.na(X1) & X2 == 1 & X3 == 2 & X4 == 2) ~ 2,
                                       missing_vars == 1 & (X1 == 1 & is.na(X2) & X3 == 2 & X4 == 2) ~ 2,
                                       missing_vars == 1 & (X1 == 1 & X2 == 2 & is.na(X3) & X4 == 2) ~ 2,
                                       missing_vars == 1 & (X1 == 1 & X2 == 2 & X3 == 2 & is.na(X4)) ~ 2,
                                       missing_vars == 1 & (is.na(X1) & X2 == 2 & X3 == 1 & X4 == 2) ~ 2,
                                       missing_vars == 1 & (X1 == 2 & is.na(X2) & X3 == 1 & X4 == 2) ~ 2,
                                       missing_vars == 1 & (X1 == 2 & X2 == 1 & is.na(X3) & X4 == 2) ~ 2,
                                       missing_vars == 1 & (X1 == 2 & X2 == 1 & X3 == 2 & is.na(X4)) ~ 2,
                                       missing_vars == 1 & (is.na(X1) & X2 == 2 & X3 == 2 & X4 == 1) ~ 2,
                                       missing_vars == 1 & (X1 == 2 & is.na(X2) & X3 == 2 & X4 == 1) ~ 2,
                                       missing_vars == 1 & (X1 == 2 & X2 == 2 & is.na(X3) & X4 == 1) ~ 2,
                                       missing_vars == 1 & (X1 == 2 & X2 == 2 & X3 == 1 & is.na(X4)) ~ 2,
                                       missing_vars == 1 & (is.na(X1) & X2 == 1 & X3 == 1 & X4 == 2) ~ 3,
                                       missing_vars == 1 & (X1 == 1 & is.na(X2) & X3 == 1 & X4 == 2) ~ 3,
                                       missing_vars == 1 & (X1 == 1 & X2 == 1 & is.na(X3) & X4 == 2) ~ 3,
                                       missing_vars == 1 & (X1 == 1 & X2 == 1 & X3 == 2 & is.na(X4)) ~ 3,
                                       missing_vars == 1 & (is.na(X1) & X2 == 2 & X3 == 1 & X4 == 1) ~ 3,
                                       missing_vars == 1 & (X1 == 2 & is.na(X2) & X3 == 1 & X4 == 1) ~ 3,
                                       missing_vars == 1 & (X1 == 2 & X2 == 1 & is.na(X3) & X4 == 1) ~ 3,
                                       missing_vars == 1 & (X1 == 2 & X2 == 1 & X3 == 1 & is.na(X4)) ~ 3,
                                       missing_vars == 1 & (is.na(X1) & X2 == 1 & X3 == 2 & X4 == 1) ~ 3,
                                       missing_vars == 1 & (X1 == 1 & is.na(X2) & X3 == 2 & X4 == 1) ~ 3,
                                       missing_vars == 1 & (X1 == 1 & X2 == 2 & is.na(X3) & X4 == 1) ~ 3,
                                       missing_vars == 1 & (X1 == 1 & X2 == 2 & X3 == 1 & is.na(X4)) ~ 3,
                                       missing_vars == 1 & (is.na(X1) & X2 == 1 & X3 == 1 & X4 == 1) ~ 4,
                                       missing_vars == 1 & (X1 == 1 & is.na(X2) & X3 == 1 & X4 == 1) ~ 4,
                                       missing_vars == 1 & (X1 == 1 & X2 == 1 & is.na(X3) & X4 == 1) ~ 4,
                                       missing_vars == 1 & (X1 == 1 & X2 == 1 & X3 == 1 & is.na(X4)) ~ 4))

I repeat this with each combination for 2 NA's, 3 NA's and then 4 NA's and then sum "zero_na", "one_na", etc. to get the final count of the value under X11.

However, I currently have about 300,000 observations and need to do this across 7 different variables with varying amounts of NA's, 1's and 2's. This will be a ridiculous amount of combinations I will have to write and I just wanted to know if there is a more efficient way of writing this code?

Thank you so much in advance!


Solution

  • Try this:

    df["X11"] = apply(df[,c(1:4)],1,\(s) sum(s==1,na.rm=T))
    

    You can achieve the same result using rowwise() and c_across() if you are partial to a dplyr solution

    df %>% rowwise() %>% mutate(x11 = sum(c_across(X1:X4)==1, na.rm=T))
    

    Output:

       X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
    1   1  1 NA  1 NA NA NA  1  1   2   3
    2  NA  1  1 NA  2 NA  2  2 NA   1   2
    3   1 NA  1  1 NA NA  1  2 NA   1   3
    4   2  2  2  1  1  2  1 NA  2   2   1
    5  NA  2 NA  2 NA  2  1 NA  1   1   0
    6   2  2  1  1  2 NA  1  2  1   1   2
    7   1  2 NA NA  2  1  1 NA NA   1   1
    8   2  2 NA NA  1 NA NA  2 NA   1   0
    9   1 NA  1  2  2  1  2 NA NA   1   2
    10 NA  2  1 NA NA NA NA  2  2  NA   1