Search code examples
rtidyversetidyrcase-whengrepl

How do I count values in a column and match them with a specific row?


I have the dataset that looks like this, where ID and emails correspond to a unique person. The remaining columns represent people named by that person/row. For example, a person with ID 1 and email address alex@gmail.com named Pete, Jane, and Tim when asked a question.

id email          john_b alex_a pete jane tim
1  alex@gmail.com NA     NA     1    1    1
2  pete@yahoo.com NA     1      1    NA   NA
3  jane@q.com     NA     NA     1    NA   1
4  bea@mail.co    NA     1      1    NA   NA
5  tim@q.com      NA     NA     1    NA   1

I need the new dataset to look like this, where a new column nomination represents the number of times that person/row was named in the rest of the dataset. For example, Pete was named by 5 people and gets 5 in the nomination column, on the row with the relevant email address. Jane was named once (by alex@gmail.com) and gets 1 in the nomination column, on the row with Jane's email address.

id email          john_b alex_a pete jane tim nomination
1  alex@gmail.com NA     NA     1    1    1   0 
2  pete@yahoo.com NA     1      1    NA   NA  5
3  jane@q.com     NA     NA     1    NA   1   1
4  bea@mail.co    NA     1      1    NA   NA  0
5  tim@q.com      NA     NA     1    NA   1   3

I have a sense that I need a combination of case-when and grepl here, but can't wrap my head around it.

Thanks for any help!


Solution

  • Hi I finally came up with a code that I hope to get you to what you expect. However, I could not think of any way to match bea@mail.co to john_b. It takes a mind far more brighter than mine for sure but if I could think of anything, I would update my codes here:

    library(dplyr)
    library(tidyr)
    library(stringr)
    
    df <- tribble(
     ~email,          ~john_b, ~alex_a, ~pete, ~jane, ~tim,
      "alex@gmail.com", NA,     NA,     1,    1,    1,
      "pete@yahoo.com", NA ,    1,      1,    NA,   NA,
      "jane@q.com",     NA  ,   NA,     1,    NA,   1,
      "bea@mail.co",    NA,     1,      1,    NA,   NA,
      "tim@q.com",      NA ,    NA,     1,    NA,   1
    )
    
    # First we count the number of times each person is named
    nm <- df %>%
      summarise(across(john_b:tim, ~ sum(.x, na.rm = TRUE))) %>%
      pivot_longer(everything(), names_to = "names", values_to = "nominations")
    nm
    
    # A tibble: 5 x 2
      names  nominations
      <chr>        <dbl>
    1 john_b           0
    2 alex_a           2
    3 pete             5
    4 jane             1
    5 tim              3
    
    

    Then we try to partially match every names with their corresponding emails. Here the only problem is john_b as I mentioned before.

    nm2 <- nm %>%
      rowwise() %>%
      mutate(emails = map(names, ~ df$email[str_detect(df$email, str_sub(.x, 1L, 4L))])) %>%
      unnest(cols = c(emails))
    
    nm2
    
    # A tibble: 4 x 3
      names  nominations emails        
      <chr>        <dbl> <chr>         
    1 alex_a           2 alex@gmail.com
    2 pete             5 pete@yahoo.com
    3 jane             1 jane@q.com    
    4 tim              3 tim@q.com  
    
    

    And in the end we join these two data frames by emails:

    df %>%
      full_join(nm2, by = c("email" = "emails"))
    
    # A tibble: 5 x 8
      email          john_b alex_a  pete  jane   tim names  nominations
      <chr>          <lgl>   <dbl> <dbl> <dbl> <dbl> <chr>        <dbl>
    1 alex@gmail.com NA         NA     1     1     1 alex_a           2
    2 pete@yahoo.com NA          1     1    NA    NA pete             5
    3 jane@q.com     NA         NA     1    NA     1 jane             1
    4 bea@mail.co    NA          1     1    NA    NA NA              NA
    5 tim@q.com      NA         NA     1    NA     1 tim              3
    
    

    You can also omit the column names if you like. I just leave it their so that you can compare them together. If you could make some modification on john's email they would have perfectly matched.