I have a data frame that contains around 700 cases with 1800 examinations. Some cases underwent several different modalities. I want to leave only one examination result based on the specific condition of the modality.
Here is a dummy data frame:
df <- data.frame (ID = c("1", "1", "1", "2", "2", "3", "4", "4", "5", "5"),
c1 = c("A", "B", "C", "A", "C", "A", "A", "B", "B", "C"),
x1 = c(5, 4, 5, 3, 1, 3, 4, 2, 3, 5),
x2 = c(4, 3, 7, 9, 1, 2, 4, 7, 5, 0))
There are five cases with 10 exams. [c1] is the exam modality (condition), and the results are x1 and x2.
I want to leave only one row based on the following condition:
C > B > A
I want to leave the row with C first; if not, leave the row with B; If C and B are absent, leave the row with A.
Desired output:
output <- data.frame (ID = c("1", "2", "3", "4", "5"),
c1 = c("C", "C", "A", "B", "C"),
x1 = c(5, 1, 3, 2, 5),
x2 = c(7, 1, 2, 7, 0))
You can arrange
the data based on required correct order and for each ID
select it's 1st row.
library(dplyr)
req_order <- c('C', 'B', 'A')
df %>%
arrange(ID, match(c1, req_order)) %>%
distinct(ID, .keep_all = TRUE)
# ID c1 x1 x2
# <chr> <chr> <dbl> <dbl>
#1 1 C 5 7
#2 2 C 1 1
#3 3 A 3 2
#4 4 B 2 7
#5 5 C 5 0
In base R, this can be written as :
df1 <- df[order(match(df$c1, req_order)), ]
df1[!duplicated(df1$ID), ]