Search code examples
rduplicatesuniquedata-cleaning

Find rows that are identical in one column but not another


There should be a fairly simple solution to this but it's giving me trouble. I have a DF similar to this:

> df <- data.frame(name = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
                  id_num = c(1, 1, 2, 3, 3, 4, 5, 5))
> df
    name id_num
1 george      1
2 george      1
3 george      2
4   sara      3
5   sara      3
6    sam      4
7   bill      5
8   bill      5

I'm looking for a way to find rows where the name and ID numbers are inconsistent in a very large dataset. I.e., George should always be "1" but in row three there is a mistake and he has also been assigned ID number "2".


Solution

  • I think the easiest way will be to use dplyr::count twice, hence for your example:

    df %>% 
       count(name, id) %>% 
       count(name) 
    

    The first count will give:

    name   id   n
    george  1   2
    george  2   1
    sara    3   2
    sam     4   1
    bill    5   2
    

    Then the second count will give:

    name    n
    george  2
    sara    1 
    sam     1 
    bill    1
    

    Of course, you could add filter(n > 1) to the end of your pipe, too, or arrange(desc(n))

    df %>% 
       count(name, id) %>% 
       count(name) %>% 
       arrange(desc(n)) %>% 
       filter(n > 1)