Search code examples
rdplyrflags

How to flag the first occurrence of a character string in a data frame and all rows after?


Suppose we start with a data frame that looks like the below:

   ID STATE_1 STATE_2
1   1    NULL    NULL
2   1     FRY    NULL
3   1    NULL     CRY
4   1     FRY     CRY
5   1    NULL    NULL
6   1    NULL    NULL
7   1     FRY     CRY
8   1    NULL    NULL
9   5    NULL    NULL
10  5    NULL     CRY
11  5     FRY    NULL
12  5    NULL    NULL

Generated as follows for those who would like to replicate:

ID <- c(1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5)
STATE_1 <- c("NULL", "FRY", "NULL", "FRY", "NULL", "NULL", "FRY", "NULL", "NULL", "NULL", "FRY", "NULL")
STATE_2 <- c("NULL", "NULL", "CRY", "CRY", "NULL", "NULL", "CRY", "NULL", "NULL", "CRY", "NULL", "NULL")
df <- data.frame(ID, STATE_1, STATE_2)

How would one add a FLAG column, whereby for each ID, once one of first of multiple states is triggered by an element that is not NULL, then all subsequent rows for that ID are flagged with that first non-NULL status? So for example, with the FLAG column added, the output would look like this:

   ID STATE_1 STATE_2   FLAG   [Explanations for FLAG column elements]
1   1    NULL    NULL   NULL    No flag tripped for ID 1 (yet) so value is NULL
2   1     FRY    NULL    FRY    FRY tripped in STATE_1 so FRY applies for all subsequent rows for this ID
3   1    NULL     CRY    FRY    Ignore CRY since FRY happened first
4   1     FRY     CRY    FRY
5   1    NULL    NULL    FRY
6   1    NULL    NULL    FRY
7   1     FRY     CRY    FRY
8   1    NULL    NULL    FRY
9   5    NULL    NULL   NULL    No flag tripped for ID 5 (yet) so value is NULL
10  5    NULL     CRY    CRY    CRY tripped in STATE_2 so CRY applies for all subsequent rows for this ID
11  5     FRY    NULL    CRY    Ignore FRY since CRY happened first
12  5    NULL    NULL    CRY    But please read the following paragraph, because another possible scenario is omitted for sake of brevity

If in row 10 above, for ID 5, there had both appeared FRY and CRY in that same row, then FRY would have governed for all subsequent ID 5 rows. I didn't include this scenario in my example above. So there is a priority where STATE_1 trumps STATE_2, etc.

In my actual data there are 6 states, compared with the 2 states (STATE_1 and STATE_2) in this example.

I have a strong preference for working with dplyr().

As I researched this, there are several posts addressing flagging data based on a numeric trigger, but I couldn´t find any solutions using character string triggers like in my data. I am hoping there is a character string solution so I can avoid the added steps of converting from character strings to numeric variables.


Solution

  • Yes, this can be done with dplyr and character strings. Here's one way:

    df %>%
      mutate(across(.cols = c("STATE_1", "STATE_2"), ~ na_if(., "NULL"))) %>%
      group_by(ID) %>%
      fill(STATE_1, STATE_2, .direction = "down") %>%
      mutate(flag = if_else(is.na(lag(coalesce(STATE_1, STATE_2))),
                            coalesce(STATE_1, STATE_2),
                            NA_character_)) %>%
      fill(flag, .direction = "down")
    
       ID STATE_1 STATE_2 flag
    1   1    <NA>    <NA> <NA>
    2   1     FRY    <NA>  FRY
    3   1     FRY     CRY  FRY
    4   1     FRY     CRY  FRY
    5   1     FRY     CRY  FRY
    6   1     FRY     CRY  FRY
    7   1     FRY     CRY  FRY
    8   1     FRY     CRY  FRY
    9   5    <NA>    <NA> <NA>
    10  5    <NA>     CRY  CRY
    11  5     FRY     CRY  CRY
    12  5     FRY     CRY  CRY
    

    Step by step, here's what this solution does:

    • (Convert the character "NULL" values to NA.)
    • Group by ID so that we operate within each ID separately.
    • Use fill() to use each non-null value in STATE_1 and STATE_2 to populate all the non-null values below it, until we reach the next non-null value. (You can also fill "up" instead of down; down is the default direction, but I've included it explicitly for the sake of clarity.)
    • Create the flag field. If both states in the immediately preceding row have null values (as identified by lag()), then we want to trigger a new flag; use coalesce() to prioritize STATE_1 over STATE_2. Otherwise, we don't want a new flag; populate with NA for now.
    • Use fill() again to populate flags downward: each new flag populates the rows below it until we get to a new flag.

    This procedure also works for the scenario you described where row 10 has "FRY" for STATE_1:

    df2 = df
    df2$STATE_1[10] = "FRY"
    df2 %>%
      mutate(across(.cols = -c("ID"), ~ na_if(., "NULL"))) %>%
      group_by(ID) %>%
      fill(STATE_1, STATE_2, .direction = "down") %>%
      mutate(flag = if_else(is.na(lag(coalesce(STATE_1, STATE_2))),
                            coalesce(STATE_1, STATE_2),
                            NA_character_)) %>%
      fill(flag, .direction = "down")
    
       ID STATE_1 STATE_2 flag
    1   1    <NA>    <NA> <NA>
    2   1     FRY    <NA>  FRY
    3   1     FRY     CRY  FRY
    4   1     FRY     CRY  FRY
    5   1     FRY     CRY  FRY
    6   1     FRY     CRY  FRY
    7   1     FRY     CRY  FRY
    8   1     FRY     CRY  FRY
    9   5    <NA>    <NA> <NA>
    10  5     FRY     CRY  FRY
    11  5     FRY     CRY  FRY
    12  5     FRY     CRY  FRY