Search code examples
rdataframedata-cleaningmismatch

How to identify mismatch between two data sets in R?


I have two data set. Data set 1 and Data set 2 which is as follow:

Dataset1:-

family_id     house_id   number_family_member
1             1052        2
2             5042        3
3             1111        2

Dataset2:-

family_id     house_id   age   gender
1             1052       24    male
1             1052       25    female
2             5042       23    male
2             5042       20    female
3             1111       1     male
3             1111       20    female
3             1111       21    female

Here is the mismatch between the number of member entered in dataset1 and details of individual entered in dataset2. Like For family id 2, the number of member in family is 3 in dataset1 but the in dataset2 there is entry of only 2 member. How to identify these types of mismatch between two data sets????


Solution

  • both of these views might be helpful for you :

    dataset2 %>%
      add_count(family_id) %>%
      inner_join(dataset1) %>%
      mutate(match= n ==number_family_member)
    
    # # A tibble: 7 x 7
    #   family_id house_id   age gender     n number_family_member match
    #       <int>    <int> <int> <fctr> <int>                <int> <lgl>
    # 1         1     1052    24   male     2                    2  TRUE
    # 2         1     1052    25 female     2                    2  TRUE
    # 3         2     5042    23   male     2                    3 FALSE
    # 4         2     5042    20 female     2                    3 FALSE
    # 5         3     1111     1   male     3                    2 FALSE
    # 6         3     1111    20 female     3                    2 FALSE
    # 7         3     1111    21 female     3                    2 FALSE
    
    dataset2 %>%
      count(family_id) %>%
      inner_join(dataset1) %>%
      mutate(match= n ==number_family_member)
    
    # # A tibble: 3 x 5
    #   family_id     n house_id number_family_member match
    #       <int> <int>    <int>                <int> <lgl>
    # 1         1     2     1052                    2  TRUE
    # 2         2     2     5042                    3 FALSE
    # 3         3     3     1111                    2 FALSE