I have a flagging rule need to apply.
Here is how my dataset looks like:
df <- data.frame(id = c(1,1,1,1, 2,2,2,2, 3,3,3,3),
key = c("a","a","b","c", "a","b","c","d", "a","b","c","c"),
form = c("A","B","A","A", "A","A","A","A", "B","B","B","A"))
> df
id key form
1 1 a A
2 1 a B
3 1 b A
4 1 c A
5 2 a A
6 2 b A
7 2 c A
8 2 d A
9 3 a B
10 3 b B
11 3 c B
12 3 c A
I would like to flag id
s based on a key
columns that has duplicates, a third column of form
shows different forms for each key
. The idea is to understand if an id has taken any items from multiple forms. I need to add a filtering column as below:
> df.1
id key form type
1 1 a A multiple
2 1 a B multiple
3 1 b A multiple
4 1 c A multiple
5 2 a A single
6 2 b A single
7 2 c A single
8 2 d A single
9 3 a B multiple
10 3 b B multiple
11 3 c B multiple
12 3 c A multiple
And eventually I need to get rid off the extra duplicated row which has different form
. To decide which of the duplicated one drops, I pick whichever the form
type has more items.
In a final separate dataset, I would like to have something like below:
> df.2
id key form type
1 1 a A multiple
3 1 b A multiple
4 1 c A multiple
5 2 a A single
6 2 b A single
7 2 c A single
8 2 d A single
9 3 a B multiple
10 3 b B multiple
11 3 c B multiple
So first id
has form A
dominant so kept the A
, and the third id
has form B
dominant so kept the B
.
Any ideas? Thanks!
We can check number of distinct elements to create the new column by group and then filter
based on the highest frequency (Mode
)
library(dplyr)
df.2 <- df %>%
group_by(id) %>%
mutate(type = if(n_distinct(form) > 1) 'multiple' else 'single') %>%
filter(form == Mode(form)) %>%
ungroup
-output
> df.2
# A tibble: 10 × 4
id key form type
<dbl> <chr> <chr> <chr>
1 1 a A multiple
2 1 b A multiple
3 1 c A multiple
4 2 a A single
5 2 b A single
6 2 c A single
7 2 d A single
8 3 a B multiple
9 3 b B multiple
10 3 c B multiple
where
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}