Search code examples
rdplyrcasecategorical-datamutate

dplyr mutate with conditional values AND OR to create a group category


I am having a dataset that has a variable called individuals with many options and it comes like that. I have observations for a given Day on different individuals (Individual_ID)

The different options of individuals look like this: Individual_ID("Adele", "Fitz", "Abba").... these would belong to a group that is Group=A Individual_ID("Noir", "Rouge", "Bleue").... these would belong to a group called Group=B

In some instances, the individuals from different groups, can get mixed, so we have something like this Individual_ID("Adele", "Rouge", "Bleue")... so this would represent a mixed-group,

I would like to create a variable called GroupingID that can be either GroupA, GroupB, or MixedGroup For that I do not precise that all individuals of the group are present, but instead, that the representation of the individuals is neat or not neat with respect to their group.

In order to consider a mixed grouping, any combination involving at least two individuals from different groups is sufficient.

Could someone explain me how I could apply a condition AND/OR in mutate to create a variable Grouping?

Here how my data looks like

Date      IndividualsObserved    
1/1/2016   Abba,Adele
2/1/2016   Adele,Fitz
3/1/2016   Fitz,Rouge,Noir
4/1/2016   Fitz,Adele,Abba
5/1/2016   Rouge,Noir,Bleue
6/1/2016   Rouge,Abba,Fitz

(the different individuals appear separated by commas in each entry cell of the column IndividualsObserved)

So I would like to have a grouping category that is able to discern whether the grouping is neat (only one group identity, or whether the grouping is composed by a mixed of individuals from different groups). It would be something like this (GroupingID)

Date      IndividualsObserved   GroupingID
1/1/2016   Abba,Adele           GroupA
2/1/2016   Adele,Fitz           GroupA
3/1/2016   Fitz,Rouge,Noir      MixedGrouping
4/1/2016   Fitz,Adele,Abba      GroupA
5/1/2016   Rouge,Noir,Bleue     GroupB
6/1/2016   Rouge,Abba,Fitz      MixedGrouping
7/1/2016   Noir,Bleue,Abba      MixedGrouping

I tried this but did not work:

  mutate(GroupingID = case_when(IndividualsObserved %in% c("Adele","Abba", "Fitz") ~ "GroupA",
                                IndividualsObserved %in% c("Noir","Bleue", "Rouge") ~ "GroupB",
                                TRUE ~ ToCheck)) 

I would appreciate any insights you may have about how to approach this using the mutate option,

I tried using dplyr function mutate

Update:

Many thanks Mark, r2evans, and hello_friend for your helpful suggestions, Indeed, it works out in the different ways you propose!

Now that I have applied this to my extensive dataset, I realise I have a few challenging cases. Perhaps you have some ideas about how to:

-consider specific individuals as "ambiguous", meaning they do not belong to any group, so they cannot be considered group A or B as they are outsiders visiting the two. Could it be possible to assign these individuals a status that does not affect the MixedGroup? If they were there, but their presence or absence did not change the overall group composition, could they have a neutral status?

-create an additional column that says GroupDetails that could be GroupA or GroupB or GroupA+GroupB attending to the list provided with the individuals

-finally, because the list has some 30000 entries, would it be possible to request with an R function to obtain all the names of IndividualsObserved (the complete list is more extensive than the one I provided as an example)?

Thanks a lot


Solution

  • Similar to Mark's answer, but after creating a list-column, we can look for all(.. %in% ..) membership to define the groups.

    quux %>%
      mutate(IndividualsObserved = strsplit(IndividualsObserved, ",")) %>%
      rowwise() %>%
      mutate(
        GroupingID = case_when(
          all(IndividualsObserved %in% c("Adele","Abba", "Fitz")) ~ "GroupA", 
          all(IndividualsObserved %in% c("Noir","Bleue", "Rouge")) ~ "GroupB", 
          TRUE ~ "MixedGroup")
      ) %>%
      ungroup()
    # # A tibble: 6 × 3
    #   Date     IndividualsObserved GroupingID
    #   <chr>    <list>              <chr>     
    # 1 1/1/2016 <chr [2]>           GroupA    
    # 2 2/1/2016 <chr [2]>           GroupA    
    # 3 3/1/2016 <chr [3]>           MixedGroup
    # 4 4/1/2016 <chr [3]>           GroupA    
    # 5 5/1/2016 <chr [3]>           GroupB    
    # 6 6/1/2016 <chr [3]>           MixedGroup
    

    I'm generally not a fan of doing things rowwise(), but it works well-enough here and is unlikely to be a performance problem unless your real data is fairly large.


    Data

    quux <- structure(list(Date = c("1/1/2016", "2/1/2016", "3/1/2016", "4/1/2016", "5/1/2016", "6/1/2016"), IndividualsObserved = c("Abba,Adele", "Adele,Fitz", "Fitz,Rouge,Noir", "Fitz,Adele,Abba", "Rouge,Noir,Bleue", "Rouge,Abba,Fitz")), class = "data.frame", row.names = c(NA, -6L))