Search code examples
rdplyrdata-cleaning

Replacing value in a row based on value in another row with same identifier using R


I am trying to solve a problem in R that seems simple, but I can't work it out.

I have the following data:

tmp

  town_id city_ id flag
1   10500      111    1           
2   15300     1110    1           
3    9400    11100    1           
4    9400    11101    0           
5    9600    11102    0          
6    9800    11103    0           

There is a duplicate town_id, and I would like to remove it while assigning the highest value in flag. That is, I would like to have:

  town_id city_ id flag
1   10500      111    1           
2   15300     1110    1           
3    9400    11100    1                     
4    9600    11102    0          
5    9800    11103    0           

I tried to use the following dplyr code, but it assigns a 1 to everything:

tmp_2<-tmp %>% group_by(town_id) %>% mutate(flag=max(flag))

tmp_2

  town_id city_ id  flag 
1 10500   111          1           
2 15300   1110         1           
3 9400    11100        1           
4 9400    11101        1           
5 9600    11102        1           
6 9800    11103        1           

Could someone please tell me what I am doing wrong?

Thank you.


Solution

  • I suppose this is what you want - filter out the duplicate:

    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    tmp <- structure(list(town_id = c(10500L, 15300L, 9400L, 9400L, 9600L, 9800L), 
                          city_id = c(111L, 1110L, 11100L, 11101L, 11102L, 11103L), 
                          flag = c(1L, 1L, 1L, 0L, 0L, 0L)), 
                     class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6"))
    tmp_2 <- tmp %>% group_by(town_id) %>% dplyr::filter(flag==max(flag)) %>% ungroup()
    tmp_2
    #> # A tibble: 5 x 3
    #>   town_id city_id  flag
    #>     <int>   <int> <int>
    #> 1   10500     111     1
    #> 2   15300    1110     1
    #> 3    9400   11100     1
    #> 4    9600   11102     0
    #> 5    9800   11103     0
    

    Edit: In case there are more than one per group with the maximum value of flag and you only want to retain the first one, you could do:

    tmp_2 <- tmp %>% group_by(town_id) %>%  
        dplyr::filter(flag==max(flag)) %>% 
        filter(row_number()==1) %>% ungroup()
    

    Created on 2020-05-22 by the reprex package (v0.3.0)