I am having a dataset that has a variable called individuals with many options and it comes like that. I have observations for a given Day on different individuals (Individual_ID)
The different options of individuals look like this: Individual_ID("Adele", "Fitz", "Abba").... these would belong to a group that is Group=A Individual_ID("Noir", "Rouge", "Bleue").... these would belong to a group called Group=B
In some instances, the individuals from different groups, can get mixed, so we have something like this Individual_ID("Adele", "Rouge", "Bleue")... so this would represent a mixed-group,
I would like to create a variable called GroupingID that can be either GroupA, GroupB, or MixedGroup For that I do not precise that all individuals of the group are present, but instead, that the representation of the individuals is neat or not neat with respect to their group.
In order to consider a mixed grouping, any combination involving at least two individuals from different groups is sufficient.
Could someone explain me how I could apply a condition AND/OR in mutate to create a variable Grouping?
Here how my data looks like
Date IndividualsObserved
1/1/2016 Abba,Adele
2/1/2016 Adele,Fitz
3/1/2016 Fitz,Rouge,Noir
4/1/2016 Fitz,Adele,Abba
5/1/2016 Rouge,Noir,Bleue
6/1/2016 Rouge,Abba,Fitz
(the different individuals appear separated by commas in each entry cell of the column IndividualsObserved)
So I would like to have a grouping category that is able to discern whether the grouping is neat (only one group identity, or whether the grouping is composed by a mixed of individuals from different groups). It would be something like this (GroupingID)
Date IndividualsObserved GroupingID
1/1/2016 Abba,Adele GroupA
2/1/2016 Adele,Fitz GroupA
3/1/2016 Fitz,Rouge,Noir MixedGrouping
4/1/2016 Fitz,Adele,Abba GroupA
5/1/2016 Rouge,Noir,Bleue GroupB
6/1/2016 Rouge,Abba,Fitz MixedGrouping
7/1/2016 Noir,Bleue,Abba MixedGrouping
I tried this but did not work:
mutate(GroupingID = case_when(IndividualsObserved %in% c("Adele","Abba", "Fitz") ~ "GroupA",
IndividualsObserved %in% c("Noir","Bleue", "Rouge") ~ "GroupB",
TRUE ~ ToCheck))
I would appreciate any insights you may have about how to approach this using the mutate option,
I tried using dplyr function mutate
Many thanks Mark, r2evans, and hello_friend for your helpful suggestions, Indeed, it works out in the different ways you propose!
Now that I have applied this to my extensive dataset, I realise I have a few challenging cases. Perhaps you have some ideas about how to:
-consider specific individuals as "ambiguous", meaning they do not belong to any group, so they cannot be considered group A or B as they are outsiders visiting the two. Could it be possible to assign these individuals a status that does not affect the MixedGroup? If they were there, but their presence or absence did not change the overall group composition, could they have a neutral status?
-create an additional column that says GroupDetails that could be GroupA or GroupB or GroupA+GroupB attending to the list provided with the individuals
-finally, because the list has some 30000 entries, would it be possible to request with an R function to obtain all the names of IndividualsObserved (the complete list is more extensive than the one I provided as an example)?
Thanks a lot
Similar to Mark's answer, but after creating a list-column, we can look for all(.. %in% ..)
membership to define the groups.
quux %>%
mutate(IndividualsObserved = strsplit(IndividualsObserved, ",")) %>%
rowwise() %>%
mutate(
GroupingID = case_when(
all(IndividualsObserved %in% c("Adele","Abba", "Fitz")) ~ "GroupA",
all(IndividualsObserved %in% c("Noir","Bleue", "Rouge")) ~ "GroupB",
TRUE ~ "MixedGroup")
) %>%
ungroup()
# # A tibble: 6 × 3
# Date IndividualsObserved GroupingID
# <chr> <list> <chr>
# 1 1/1/2016 <chr [2]> GroupA
# 2 2/1/2016 <chr [2]> GroupA
# 3 3/1/2016 <chr [3]> MixedGroup
# 4 4/1/2016 <chr [3]> GroupA
# 5 5/1/2016 <chr [3]> GroupB
# 6 6/1/2016 <chr [3]> MixedGroup
I'm generally not a fan of doing things rowwise()
, but it works well-enough here and is unlikely to be a performance problem unless your real data is fairly large.
Data
quux <- structure(list(Date = c("1/1/2016", "2/1/2016", "3/1/2016", "4/1/2016", "5/1/2016", "6/1/2016"), IndividualsObserved = c("Abba,Adele", "Adele,Fitz", "Fitz,Rouge,Noir", "Fitz,Adele,Abba", "Rouge,Noir,Bleue", "Rouge,Abba,Fitz")), class = "data.frame", row.names = c(NA, -6L))