Search code examples
rtidyversedata-cleaningdata-wrangling

Create a Grouping Column/Variable from other Columns in R


I'm trying to group data into a grouping variable based on whether or not there is data in specific columns. In other words, if there is data in the same row for V1 & V2 below, then I want to put them in the same group (same for V3 & V4, V5 & V6).

For example, I want this:

#Creating dataframe
V1 <- c(1, NA, 6, 3, NA, NA, NA)
V2 <- c(1, NA, 6, 3, NA, NA, NA)
V3 <- c(NA, 3.5, NA, NA, 5, NA, NA)
V4 <- c(NA, 3.5, NA, NA, 5, NA, NA)
V5 <- c(NA, NA, NA, NA, NA, 5, 2)
V6 <- c(NA, NA, NA, NA, NA, 6, 1)


data <- data.frame(V1, V2, V3, V4, V5, V6)

data

  V1 V2  V3  V4 V5 V6
1  1  1  NA  NA NA NA
2 NA NA 3.5 3.5 NA NA
3  6  6  NA  NA NA NA
4  3  3  NA  NA NA NA
5 NA NA 5.0 5.0 NA NA
6 NA NA  NA  NA  5  6
7 NA NA  NA  NA  2  1

To look like this:

  V1 V2  V3  V4 V5 V6 Group
1  1  1  NA  NA NA NA    G1
2 NA NA 3.5 3.5 NA NA    G2
3  6  6  NA  NA NA NA    G1
4  3  3  NA  NA NA NA    G1
5 NA NA 5.0 5.0 NA NA    G2
6 NA NA  NA  NA  5  6    G3
7 NA NA  NA  NA  2  1    G3

I've seen similar posts, but I cannot seem to get them to cooperate the same way with the case_when() function.


Solution

  • 1) Replace each non-NA element with its column number and then sum the rows. Then convert those numbers to 1, 2 and 3 and paste on "G".

    library(dplyr)
    
    data %>%
      mutate(group = (col(.) + 0 * .) %>%
        rowSums(na.rm = TRUE) %>%
        factor %>%
        as.numeric %>%
        paste0("G", .)
      )
    

    giving

      V1 V2  V3  V4 V5 V6 group
    1  1  1  NA  NA NA NA    G1
    2 NA NA 3.5 3.5 NA NA    G2
    3  6  6  NA  NA NA NA    G1
    4  3  3  NA  NA NA NA    G1
    5 NA NA 5.0 5.0 NA NA    G2
    6 NA NA  NA  NA  5  6    G3
    7 NA NA  NA  NA  2  1    G3
    

    2) Even shorter although the formula is specific to the number of columns

    data %>%
      mutate(group = paste0("G", (!is.na(V1)) + 2 * (!is.na(V3)) + 3 * (!is.na(V5))))
    

    3) With case_when

    data %>%
      mutate(group = paste0("G", case_when(
        !is.na(V1) ~ 1,
        !is.na(V3) ~ 2,
        !is.na(V5) ~ 3)))
    

    4) A variation of (1)

    nc <- ncol(data)
    data %>%
      mutate(group = paste0("G", ((!is.na(data)) %*% (1:nc) + 1) / 4))
    

    Note

    Data from question

    data <- data.frame(
      V1 = c(1, NA, 6, 3, NA, NA, NA),
      V2 = c(1, NA, 6, 3, NA, NA, NA),
      V3 = c(NA, 3.5, NA, NA, 5, NA, NA),
      V4 = c(NA, 3.5, NA, NA, 5, NA, NA),
      V5 = rep(c(NA, 5, 2), c(5L, 1L, 1L)),
      V6 = rep(c(NA, 6, 1), c(5L, 1L, 1L))
    )