I need help with R. Similar to the question filtering-a-dataframe-showing-only-duplicates, I wish to extract duplicates from a data frame with over 2,000 entries.
The first 15 rows of data look like this:
run | id | Diff |
---|---|---|
1 | 20 | 0 |
1 | 4 | 1024 |
1 | 4 | 1 |
1 | 4 | 1 |
1 | 4 | 65 |
1 | 4 | 1 |
1 | 4 | 1 |
1 | 11 | 475 |
1 | 11 | 1 |
1 | 11 | 1 |
2 | 25 | 0 |
2 | 18 | 0 |
2 | 18 | 1 |
2 | 18 | 1 |
2 | 18 | 1 |
I wish to extract only the duplicates, i.e.
run | id | Diff |
---|---|---|
1 | 4 | 1024 |
1 | 4 | 1 |
1 | 4 | 1 |
1 | 4 | 65 |
1 | 4 | 1 |
1 | 4 | 1 |
1 | 11 | 475 |
1 | 11 | 1 |
1 | 11 | 1 |
2 | 18 | 0 |
2 | 18 | 1 |
2 | 18 | 1 |
2 | 18 | 1 |
Using the command
mydata_extract %>% group_by(id) %>% filter(n() > 1)
does not extract the duplicates; in fact, I get the complete data set returned. Is there something about "filter(n() > 1)" that I need to change? I'm a beginner with R.
Sorry, my data table is not formatting correctly; it looks okay in the preview!
I will also want to group my data first by "run".
Maybe add both run and id in the group_by()?
library(dplyr)
df <- tibble::tribble(
~"run", ~"id", ~"Diff",
1, 20, 0,
1, 4, 1024,
1, 4, 1,
1, 4, 1,
1, 4, 65,
1, 4, 1,
1, 4, 1,
1, 11, 4,
1, 11, 1,
1, 11, 1,
2, 25, 0,
2, 18, 0,
2, 18, 1,
2, 18, 1,
2, 18, 1
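) %>%
  # group by both columns so that ids are counted within each run
  group_by(run, id) %>%
  # keep only the groups that contain more than one row
  filter(n() > 1)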
# A tibble: 13 x 3
# Groups: run, id [3]
run id Diff
<dbl> <dbl> <dbl>
1 1 4 1024
2 1 4 1
3 1 4 1
4 1 4 65
5 1 4 1
6 1 4 1
7 1 11 4
8 1 11 1
9 1 11 1
10 2 18 0
11 2 18 1
12 2 18 1
13 2 18 1
You can add a mutate() to see how n() works (it counts the number of rows in each group), e.g.
df %>%
group_by(run, id) %>%
mutate(n = n())
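One possible reason the original command returned the complete data set is that the same id shows up in more than one run somewhere in the full data. That is an assumption, since the question only shows the first 15 rows. The sketch below uses a small made-up table (`toy`, with hypothetical values) to show the difference between grouping by id alone and grouping by run and id:

```r
library(dplyr)

# Hypothetical toy data (not from the question): id 20 occurs once in run 1
# and once in run 2, so it is not a duplicate within either run.
toy <- tibble::tribble(
  ~run, ~id, ~Diff,
  1, 20, 0,
  1,  4, 1024,
  1,  4, 1,
  2, 20, 0,
  2, 18, 0,
  2, 18, 1
)

# Grouping by id alone counts rows across all runs, so id 20 has n() == 2
# and every row passes the filter.
by_id <- toy %>%
  group_by(id) %>%
  filter(n() > 1) %>%
  ungroup()

# Grouping by run and id counts rows within each run, so the two single
# id-20 rows are dropped.
by_run_id <- toy %>%
  group_by(run, id) %>%
  filter(n() > 1) %>%
  ungroup()

nrow(by_id)      # 6: nothing was filtered out
nrow(by_run_id)  # 4: only the within-run duplicates remain
```

If your real data behaves like `toy`, grouping by both columns should give you the result you are after.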