I have the dataset that looks like this, where ID and emails correspond to a unique person. The remaining columns represent people named by that person/row. For example, a person with ID 1 and email address alex@gmail.com named Pete, Jane, and Tim when asked a question.
id email john_b alex_a pete jane tim
1 alex@gmail.com NA NA 1 1 1
2 pete@yahoo.com NA 1 1 NA NA
3 jane@q.com NA NA 1 NA 1
4 bea@mail.co NA 1 1 NA NA
5 tim@q.com NA NA 1 NA 1
I need the new dataset to look like this, where a new column nomination represents the number of times that person/row was named in the rest of the dataset. For example, Pete was named by 5 people and gets 5 in the nomination column, on the row with the relevant email address. Jane was named once (by alex@gmail.com) and gets 1 in the nomination column, on the row with Jane's email address.
id email john_b alex_a pete jane tim nomination
1 alex@gmail.com NA NA 1 1 1 0
2 pete@yahoo.com NA 1 1 NA NA 5
3 jane@q.com NA NA 1 NA 1 1
4 bea@mail.co NA 1 1 NA NA 0
5 tim@q.com NA NA 1 NA 1 3
I have a sense that I need a combination of case-when and grepl here, but can't wrap my head around it.
Thanks for any help!
Hi I finally came up with a code that I hope to get you to what you expect. However, I could not think of any way to match bea@mail.co
to john_b
. It takes a mind far more brighter than mine for sure but if I could think of anything, I would update my codes here:
library(dplyr)
library(tidyr)
library(stringr)
df <- tribble(
~email, ~john_b, ~alex_a, ~pete, ~jane, ~tim,
"alex@gmail.com", NA, NA, 1, 1, 1,
"pete@yahoo.com", NA , 1, 1, NA, NA,
"jane@q.com", NA , NA, 1, NA, 1,
"bea@mail.co", NA, 1, 1, NA, NA,
"tim@q.com", NA , NA, 1, NA, 1
)
# First we count the number of times each person is named
nm <- df %>%
summarise(across(john_b:tim, ~ sum(.x, na.rm = TRUE))) %>%
pivot_longer(everything(), names_to = "names", values_to = "nominations")
nm
# A tibble: 5 x 2
names nominations
<chr> <dbl>
1 john_b 0
2 alex_a 2
3 pete 5
4 jane 1
5 tim 3
Then we try to partially match every names with their corresponding emails. Here the only problem is john_b
as I mentioned before.
nm2 <- nm %>%
rowwise() %>%
mutate(emails = map(names, ~ df$email[str_detect(df$email, str_sub(.x, 1L, 4L))])) %>%
unnest(cols = c(emails))
nm2
# A tibble: 4 x 3
names nominations emails
<chr> <dbl> <chr>
1 alex_a 2 alex@gmail.com
2 pete 5 pete@yahoo.com
3 jane 1 jane@q.com
4 tim 3 tim@q.com
And in the end we join these two data frames by emails
:
df %>%
full_join(nm2, by = c("email" = "emails"))
# A tibble: 5 x 8
email john_b alex_a pete jane tim names nominations
<chr> <lgl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 alex@gmail.com NA NA 1 1 1 alex_a 2
2 pete@yahoo.com NA 1 1 NA NA pete 5
3 jane@q.com NA NA 1 NA 1 jane 1
4 bea@mail.co NA 1 1 NA NA NA NA
5 tim@q.com NA NA 1 NA 1 tim 3
You can also omit the column names
if you like. I just leave it their so that you can compare them together. If you could make some modification on john's email they would have perfectly matched.