I'm trying to group data into a grouping variable based on whether or not there is data in specific columns. In other words, if there is data in the same row for V1 & V2 below, then I want to put them in the same group (same for V3 & V4, V5 & V6).
For example, I want this:
#Creating dataframe
V1 <- c(1, NA, 6, 3, NA, NA, NA)
V2 <- c(1, NA, 6, 3, NA, NA, NA)
V3 <- c(NA, 3.5, NA, NA, 5, NA, NA)
V4 <- c(NA, 3.5, NA, NA, 5, NA, NA)
V5 <- c(NA, NA, NA, NA, NA, 5, 2)
V6 <- c(NA, NA, NA, NA, NA, 6, 1)
data <- data.frame(V1, V2, V3, V4, V5, V6)
data
V1 V2 V3 V4 V5 V6
1 1 1 NA NA NA NA
2 NA NA 3.5 3.5 NA NA
3 6 6 NA NA NA NA
4 3 3 NA NA NA NA
5 NA NA 5.0 5.0 NA NA
6 NA NA NA NA 5 6
7 NA NA NA NA 2 1
To look like this:
V1 V2 V3 V4 V5 V6 Group
1 1 1 NA NA NA NA G1
2 NA NA 3.5 3.5 NA NA G2
3 6 6 NA NA NA NA G1
4 3 3 NA NA NA NA G1
5 NA NA 5.0 5.0 NA NA G2
6 NA NA NA NA 5 6 G3
7 NA NA NA NA 2 1 G3
I've seen similar posts, but I cannot seem to get them to cooperate the same way with the case_when()
function.
1) Replace each non-NA element with its column number and then sum the rows. Then convert those numbers to 1, 2 and 3 and paste on "G".
library(dplyr)
data %>%
mutate(group = (col(.) + 0 * .) %>%
rowSums(na.rm = TRUE) %>%
factor %>%
as.numeric %>%
paste0("G", .)
)
giving
V1 V2 V3 V4 V5 V6 group
1 1 1 NA NA NA NA G1
2 NA NA 3.5 3.5 NA NA G2
3 6 6 NA NA NA NA G1
4 3 3 NA NA NA NA G1
5 NA NA 5.0 5.0 NA NA G2
6 NA NA NA NA 5 6 G3
7 NA NA NA NA 2 1 G3
2) Even shorter although the formula is specific to the number of columns
data %>%
mutate(group = paste0("G", (!is.na(V1)) + 2 * (!is.na(V3)) + 3 * (!is.na(V5))))
3) With case_when
data %>%
mutate(group = paste0("G", case_when(
!is.na(V1) ~ 1,
!is.na(V3) ~ 2,
!is.na(V5) ~ 3)))
4) A variation of (1)
nc <- ncol(data)
data %>%
mutate(group = paste0("G", ((!is.na(data)) %*% (1:nc) + 1) / 4))
Data from question
data <- data.frame(
V1 = c(1, NA, 6, 3, NA, NA, NA),
V2 = c(1, NA, 6, 3, NA, NA, NA),
V3 = c(NA, 3.5, NA, NA, 5, NA, NA),
V4 = c(NA, 3.5, NA, NA, 5, NA, NA),
V5 = rep(c(NA, 5, 2), c(5L, 1L, 1L)),
V6 = rep(c(NA, 6, 1), c(5L, 1L, 1L))
)