Flagging an id when similar columns have different values in R


I need to flag an id when it has different grade values across the grade columns. Here is what my sample dataset looks like:

df <- data.frame(id = c(11,22,33,44,55),
                 grade.1 = c(3,4,5,6,7),
                 grade.2 = c(3,4,5,NA,7),
                 grade.3 = c(4,4,6,5,7),
                 grade.4 = c(NA,NA,NA, 5, 7 ))

df$Grade <- paste0(df$grade.1, df$grade.2, df$grade.3, df$grade.4)

> df
  id grade.1 grade.2 grade.3 grade.4 Grade
1 11       3       3       4      NA 334NA
2 22       4       4       4      NA 444NA
3 33       5       5       6      NA 556NA
4 44       6      NA       5       5 6NA55
5 55       7       7       7       7  7777

When an id has different grade values in grade.1, grade.2, grade.3, and grade.4, that row needs to be flagged. NA values in those columns should not affect the flagging.

In other words, if the Grade column at the end contains differing numbers, that id needs to be flagged.

My desired output should look like this:

> df
  id grade.1 grade.2 grade.3 grade.4        flag
1 11       3       3       4      NA     flagged
2 22       4       4       4      NA Not_flagged
3 33       5       5       6      NA     flagged
4 44       6      NA       5       5     flagged
5 55       7       7       7       7 Not_flagged

Any ideas? Thanks!


Solution

  • A base R solution using rle, with NA values dropped first: a row whose remaining grades form a single run of equal values is not flagged.

    # For each row, drop NAs and count runs of equal values; exactly one run means all grades agree
    df$flag <- apply(df[, 2:5], 1, function(x)
      ifelse(length(rle(x[!is.na(x)])$lengths) == 1, "not_flagged", "flagged"))
    
    df
      id grade.1 grade.2 grade.3 grade.4        flag
    1 11       3       3       4      NA     flagged
    2 22       4       4       4      NA not_flagged
    3 33       5       5       6      NA     flagged
    4 44       6      NA       5       5     flagged
    5 55       7       7       7       7 not_flagged
    

    Data

    df <- structure(list(id = c(11, 22, 33, 44, 55), grade.1 = c(3, 4, 
    5, 6, 7), grade.2 = c(3, 4, 5, NA, 7), grade.3 = c(4, 4, 6, 5, 
    7), grade.4 = c(NA, NA, NA, 5, 7)), class = "data.frame", row.names = c(NA, 
    -5L))
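
As a possible variation on the answer above, the same check can be sketched by counting distinct non-NA values per row instead of counting runs. This is only a sketch under a few assumptions: the grade columns are picked by the name prefix `grade.` rather than by position, the helper names `grade_cols` and `n_distinct` are made up for illustration, and a row with no grades at all (all NA) is treated as not flagged. Note that the rle version would label such a row "flagged", because an empty vector has zero runs rather than one.

```r
# Sample data from the question
df <- data.frame(id = c(11, 22, 33, 44, 55),
                 grade.1 = c(3, 4, 5, 6, 7),
                 grade.2 = c(3, 4, 5, NA, 7),
                 grade.3 = c(4, 4, 6, 5, 7),
                 grade.4 = c(NA, NA, NA, 5, 7))

# Select the grade columns by name prefix (hypothetical helper name)
grade_cols <- grep("^grade\\.", names(df), value = TRUE)

# Count the distinct non-NA grades in each row (hypothetical helper name)
n_distinct <- apply(df[grade_cols], 1, function(x) length(unique(x[!is.na(x)])))

# More than one distinct value means the grades disagree.
# Assumption: a row with only NA grades counts as "not_flagged".
df$flag <- ifelse(n_distinct > 1, "flagged", "not_flagged")

df
```

On the sample data this should give the same labels as the rle approach; the two only differ for rows where every grade is NA.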