Search code examples
rdataframeconditional-statementsdata-management

How to leave only rows that meet a specific condition in R


I have a data frame that contains around 700 cases with 1800 examinations. Some cases underwent several different modalities. I want to leave only one examination result based on the specific condition of the modality.

Here is a dummy data frame:

df <- data.frame (ID = c("1", "1", "1", "2", "2", "3", "4", "4", "5", "5"),
                  c1 = c("A", "B", "C", "A", "C", "A", "A", "B", "B", "C"),
                  x1 = c(5, 4, 5, 3, 1, 3, 4, 2, 3, 5),
                  x2 = c(4, 3, 7, 9, 1, 2, 4, 7, 5, 0))

There are five cases with 10 exams. [c1] is the exam modality (condition), and the results are x1 and x2.

I want to leave only one row based on the following condition:

C > B > A

I want to leave the row with C first; if not, leave the row with B; If C and B are absent, leave the row with A.

Desired output:

output <- data.frame (ID = c("1", "2", "3", "4", "5"),
                      c1 = c("C", "C", "A", "B", "C"),
                      x1 = c(5, 1, 3, 2, 5),
                      x2 = c(7, 1, 2, 7, 0))

Solution

  • You can arrange the data based on required correct order and for each ID select it's 1st row.

    library(dplyr)
    
    req_order <- c('C', 'B', 'A')
    
    df %>%
      arrange(ID, match(c1, req_order)) %>%
      distinct(ID, .keep_all = TRUE)
    
    #  ID    c1       x1    x2
    #  <chr> <chr> <dbl> <dbl>
    #1 1     C         5     7
    #2 2     C         1     1
    #3 3     A         3     2
    #4 4     B         2     7
    #5 5     C         5     0
    

    In base R, this can be written as :

    df1 <- df[order(match(df$c1, req_order)), ]
    df1[!duplicated(df1$ID), ]