I am trying to solve a problem in R that seems simple, but I can't work it out.
I have the following data:
tmp
town_id city_ id flag
1 10500 111 1
2 15300 1110 1
3 9400 11100 1
4 9400 11101 0
5 9600 11102 0
6 9800 11103 0
There is a duplicate town_id, and I would like to remove it while assigning the highest value in flag. That is, I would like to have:
town_id city_ id flag
1 10500 111 1
2 15300 1110 1
3 9400 11100 1
4 9600 11102 0
5 9800 11103 0
I tried to use the following dplyr code, but it assigns a 1 to everything:
tmp_2<-tmp %>% group_by(town_id) %>% mutate(flag=max(flag))
tmp_2
town_id city_ id flag
1 10500 111 1
2 15300 1110 1
3 9400 11100 1
4 9400 11101 1
5 9600 11102 1
6 9800 11103 1
Could someone please tell me what I am doing wrong?
Thank you.
I suppose this is what you want - filter out the duplicate:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
tmp <- structure(list(town_id = c(10500L, 15300L, 9400L, 9400L, 9600L, 9800L),
city_id = c(111L, 1110L, 11100L, 11101L, 11102L, 11103L),
flag = c(1L, 1L, 1L, 0L, 0L, 0L)),
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6"))
tmp_2 <- tmp %>% group_by(town_id) %>% dplyr::filter(flag==max(flag)) %>% ungroup()
tmp_2
#> # A tibble: 5 x 3
#> town_id city_id flag
#> <int> <int> <int>
#> 1 10500 111 1
#> 2 15300 1110 1
#> 3 9400 11100 1
#> 4 9600 11102 0
#> 5 9800 11103 0
Edit:
In case there are more than one per group with the maximum value of flag
and you only want to retain the first one, you could do:
tmp_2 <- tmp %>% group_by(town_id) %>%
dplyr::filter(flag==max(flag)) %>%
filter(row_number()==1) %>% ungroup()
Created on 2020-05-22 by the reprex package (v0.3.0)