I wanted to create a group variable with specific conditions based on the values of GT. If GT is 'I', it should generate a different group number. If GT is 'G', the following rules should apply:
My Attempt:
library(tidyverse)
df1 <-
data.frame(
GT = c(rep("G", 9), rep("I", 2), rep("G", 2), rep("I", 1))
)
df2 <-
df1 %>%
mutate(
grp1 = case_when(
GT == "I" ~ row_number()
, .default = consecutive_id(GT)
)
) %>%
group_by(GT, grp1) %>%
mutate(count = n()) %>%
ungroup() %>%
mutate(
grp2 = case_when(
GT == "G" & count %/% 4 > 0 ~ row_number() %/% 5 + row_number(1)
, .default = grp1
)
, grp3 = case_when(
GT == "G"~ (cumsum(GT == "G") - 1) %/% 4 + 1
, .default = grp1
)
)
df2
#> # A tibble: 14 × 5
#> GT grp1 count grp2 grp3
#> <chr> <int> <int> <dbl> <dbl>
#> 1 G 1 9 1 1
#> 2 G 1 9 1 1
#> 3 G 1 9 1 1
#> 4 G 1 9 1 1
#> 5 G 1 9 2 2
#> 6 G 1 9 2 2
#> 7 G 1 9 2 2
#> 8 G 1 9 2 2
#> 9 G 1 9 2 3
#> 10 I 10 1 10 10
#> 11 I 11 1 11 11
#> 12 G 3 2 3 3
#> 13 G 3 2 3 3
#> 14 I 14 1 14 14
The group variable grp2
is very close to requirements. However, I want df2[8:9, 4]
should have different group number than df2[5:7, 4]
. Any hint please!
This is similar to your code above, with just a couple of changes.
The grp1
column is a grouping column where:
GT == "I"
then each row is a groupGT == "G"
then each set of consecutive G
s is a groupThe grp2
column is to get the correct grouping within the sets of consecutive G
s. In general, (rownum() - 1) %/% 4
gives the correct groups except where the last group has only one row. In that case the second from last group value (where row_number() == n() - 1
) needs to have + 1 added.
(In your code row_number() %/% 5
gives correct values for series of up to length 8, (row_number() - 1) %/% 4
produces the correct groupings for larger series).
Finally I use the unique combinations of grp1
and grp2
to create a new id variable to ensure uniqueness.
library(tidyverse)
df1 <- data.frame(GT = c(rep("G", 9), rep("I", 2), rep("G", 2), rep("I", 1), rep("G", 5), "I", rep("G", 10)) )
want <- df1 %>%
mutate(
grp1 = consecutive_id(GT, if_else(GT == "I", row_number(), -1))
) %>%
mutate(
grp2 = case_when(
n() %% 4 != 1 & GT == "G" ~ (row_number() - 1) %/% 4,
GT == "G" ~ if_else(row_number() == n() - 1, (row_number() - 1) %/% 4 + 1, (row_number() - 1) %/% 4),
.default = grp1 # This is where GT == "I"
),
.by = grp1
) %>%
mutate(group = consecutive_id(grp1, grp2)) %>%
select(-grp1, -grp2)
want
#> GT group
#> 1 G 1
#> 2 G 1
#> 3 G 1
#> 4 G 1
#> 5 G 2
#> 6 G 2
#> 7 G 2
#> 8 G 3
#> 9 G 3
#> 10 I 4
#> 11 I 5
#> 12 G 6
#> 13 G 6
#> 14 I 7
#> 15 G 8
#> 16 G 8
#> 17 G 8
#> 18 G 9
#> 19 G 9
#> 20 I 10
#> 21 G 11
#> 22 G 11
#> 23 G 11
#> 24 G 11
#> 25 G 12
#> 26 G 12
#> 27 G 12
#> 28 G 12
#> 29 G 13
#> 30 G 13
Created on 2024-05-25 with reprex v2.1.0
I followed your approach of creating the grp1
column, but I used rep()
along with some additional modulo arithmetic to calculate the grp2
column which allows us to make the final "3+2" group.
Because of this I had to switch from case_when()
to if...else
and add some calls to unique()
to avoid recycling the values in the if
condition.
Edit: Updated as original solution failed for any group of 5 consecutuve 'G's
library(tidyverse)
df1 <- data.frame(GT = c(rep("G", 9), rep("I", 2), rep("G", 2), rep("I", 1), rep("G", 5), "I", rep("G", 10)) )
want <- df1 %>%
mutate(
grp1 = case_when(
GT == "I" ~ row_number(),
.default = consecutive_id(GT)
)
) %>%
mutate(
grp2 = if(unique(GT) == "I" | unique(n() %/% 4) < 1) {
grp1
} else if(unique(n() %% 4) == 1) {
c(rep(1:(n() %/% 4 - 1), each = 4 * (unique(n())>5)), rep(n() %/% 4, 3), rep(n() %/% 4 + 1, 2))
} else {
c(rep(1:(n() %/% 4), each = 4), rep(n() %/% 4 + 1, each = n() %% 4))
} ,
.by = c(GT, grp1)
) %>%
mutate(group = consecutive_id(GT, grp2)) %>%
select(-grp1, -grp2)
want
#> GT group
#> 1 G 1
#> 2 G 1
#> 3 G 1
#> 4 G 1
#> 5 G 2
#> 6 G 2
#> 7 G 2
#> 8 G 3
#> 9 G 3
#> 10 I 4
#> 11 I 5
#> 12 G 6
#> 13 G 6
#> 14 I 7
#> 15 G 8
#> 16 G 8
#> 17 G 8
#> 18 G 9
#> 19 G 9
#> 20 I 10
#> 21 G 11
#> 22 G 11
#> 23 G 11
#> 24 G 11
#> 25 G 12
#> 26 G 12
#> 27 G 12
#> 28 G 12
#> 29 G 13
#> 30 G 13
Created on 2024-05-25 with reprex v2.1.0
Further expalanation of this approach:
This uses a slightly different way of calculating the grp2
column. Whereas the tidyverse solution looks at row_number()
to decide what the value should be at each row, this approach looks at the total number of G
s (i.e. n()
) and calculates a new vector of the same length as n()
.
Think of it as creating incrementing sets of 4 numbers (1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, ...) until it gets within 4 places of n()
. Then it adds a shorter vector (of length n() %% 4
) to fill in the remaining values. This is what c(rep(1:(n() %/% 4), each = 4), rep(n() %/% 4 + 1, each = n() %% 4))
is calculating.
For the exception, where n() %% 4 == 1
we would finish with a group of length 1. Instead we stop our groups of 4 repeated numbers one step earlier and add 2 new sets with length=3 and length=2 respectivley. This is done in c(rep(1:(n() %/% 4 - 1), each = 4 * (unique(n())>5)), rep(n() %/% 4, 3), rep(n() %/% 4 + 1, 2))
.
(The 4 * (unique(n())>5))
resolves to 0 in the case of a run of exactly 5 G
s so it doesn't include any sets of 4 identical numbers, just the 3 + 2 that we want. Neglecting this case was why my original solution failed).