Search code examples
rdatabaseduplicatespanellocate

How can we detect observations having different outcomes in R?


I have a huge data in this form and with more other columns. So I have a list of people working in a country in 2011 and moved to another one in 2012.

Name  Work_{2011}     Work_{2012}     Wage_{2011}    Wage_{2012} 
  
Jack     US              UK            5387           35353
Bill     US              UK            43534          5343
Emma     US              FRANCE        34534          53455
Brand    US              FRANCE        64545          1343
Luigui   US              FRANCE        15343          3144
Ella     US              FRANCE        64545          1343       
Lucie    France          SPAIN         84545          1343
Maria    France          SPAIN         984545         1343
Grec     Italy           US            4545           1343

I want to keep the observations having the biggest share of destination for each departure. I want:

Name  Work_{2011}     Work_{2012}     Wage_{2011}    Wage_{2012} 
  
Emma     US              FRANCE        34534          53455
Brand    US              FRANCE        64545          1343
Luigui   US              FRANCE        15343          3144
Ella     US              FRANCE        64545          1343       
Lucie    France          SPAIN         84545          1343
Maria    France          SPAIN         984545         1343
Grec     Italy           US            4545           1343

Solution

  • I'm not 100% sure this will meet your needs, but perhaps it will be helpful for you. It might help to know more details about your data, including how large your dataset is, how your columns are organized by year, etc.

    In this example, you can use dplyr from tidyverse. First, you can group_by Work_2011 (I removed the braces from column names), and filter where the number of distinct values for Work_2012 is greater than 1. This would imply multiple destinations.

    Second, you can group_by both Work_2011 and Work_2012 to determine the number of countries for each destination. This will be helpful in a second filter.

    Again, please let me know if this is the direction you were interested in.

    library(dplyr)
    
    df %>%
      group_by(Work_2011) %>%
      filter(n_distinct(Work_2012) > 1) %>%
      group_by(Work_2011, Work_2012) %>%
      mutate(numctry = n()) %>%
      group_by(Work_2011) %>%
      filter(numctry == max(numctry))
    

    Output

      Name   Work_2011 Work_2012 numctry
      <chr>  <chr>     <chr>       <int>
    1 Emma   US        FRANCE          4
    2 Brand  US        FRANCE          4
    3 Luigui US        FRANCE          4
    4 Ella   US        FRANCE          4
    

    Edit (1/13/21): Based on edited question, we can simplify the code further.

    Start by calculating the number of destinations per country, we'll call this dest_per_cntry. This will be a new column. For Jack and Bill, it will be 2. For Emma, Brand, Luigui, and Ella, it will be 4.

    Then, you can group_by to consider the 2011 country only. For each country in the Work_2011 column, keep (or filter) only those where the dest_per_cntry is the same as the maximum number of destinations for that country. Note that if there are "ties", all rows with the maximum destinations will still be kept.

    library(tidyverse)
    
    df %>%
      group_by(Work_2011, Work_2012) %>%
      mutate(dest_per_cntry = n()) %>%
      group_by(Work_2011) %>%
      filter(dest_per_cntry == max(dest_per_cntry))
    

    Output

      Name   Work_2011 Work_2012 Wage_2011 Wage_2012 dest_per_cntry
      <chr>  <chr>     <chr>         <int>     <int>          <int>
    1 Emma   US        FRANCE        34534     53455              4
    2 Brand  US        FRANCE        64545      1343              4
    3 Luigui US        FRANCE        15343      3144              4
    4 Ella   US        FRANCE        64545      1343              4
    5 Lucie  France    SPAIN         84545      1343              2
    6 Maria  France    SPAIN        984545      1343              2
    7 Grec   Italy     US             4545      1343              1
    

    Data

    df <- structure(list(Name = c("Jack", "Bill", "Emma", "Brand", "Luigui", 
    "Ella", "Lucie", "Maria", "Grec"), Work_2011 = c("US", "US", 
    "US", "US", "US", "US", "France", "France", "Italy"), Work_2012 = c("UK", 
    "UK", "FRANCE", "FRANCE", "FRANCE", "FRANCE", "SPAIN", "SPAIN", 
    "US"), Wage_2011 = c(5387L, 43534L, 34534L, 64545L, 15343L, 64545L, 
    84545L, 984545L, 4545L), Wage_2012 = c(35353L, 5343L, 53455L, 
    1343L, 3144L, 1343L, 1343L, 1343L, 1343L)), class = "data.frame", row.names = c(NA, 
    -9L))