r, dplyr

How to check if two strings from different columns are in the same group, when the group is another column?


I have a data frame with three columns: species (the species name in that row); synonyms (a synonym species name that occurs at least once in the species column of the data frame); and groups, which holds groups of occurrences (each group can have one or more species names). This is a reproducible example of that df:

df <- data.frame(
  species = c("Species X","Species A", "Species Z", "Species A", "Species B", "Species C", "Species C", "Species D", "Species D", "Species A", "Species B", "Species E","Species Y","Species W","Species R"),
  synonyms = c("Species Y","Species B", "no_synonym", "Species B", "Species A", "Species E", "Species E", "no_synonym", "no_synonym", "Species B", "Species A", "Species C","Species X","Species R","Species W"),
  groups = c("G1","G1", "G1", "G1", "G1", "G2", "G2", "G3", "G3", "G1", "G4", "G5","G6","G7","G8")
)

I am trying to create a new column with "yes" or "no" that indicates, for each row, whether the row's synonym also occurs in the species column within the same group. For instance, in group "G1" the rows for Species A and Species B would be "yes", because "G1" has both Species A and Species B in the species column and they are synonyms of each other. Species Z, although it is in "G1", has no synonym, so it should be "no". Species X has a synonym in the df (Species Y, and vice versa), but no row in its group, "G1", has that synonym in the species column, so it should be "no". "G2" would be "no" because even though Species E is a synonym of Species C, "G2" doesn't contain both Species E and Species C in the species column. Rows with "no_synonym" would also be "no".

Basically this would be the output:

df <- data.frame(
  species = c("Species X","Species A", "Species Z", "Species A", "Species B", "Species C", "Species C", "Species D", "Species D", "Species A", "Species B", "Species E","Species Y","Species W","Species R"),
  synonyms = c("Species Y","Species B", "no_synonym", "Species B", "Species A", "Species E", "Species E", "no_synonym", "no_synonym", "Species B", "Species A", "Species C","Species X","Species R","Species W"),
  groups = c("G1","G1", "G1", "G1", "G1", "G2", "G2", "G3", "G3", "G1", "G4", "G5","G6","G7","G8"),
  at_least_two_synonyms_in_group = c("no","yes","no","yes","yes","no","no","no","no","yes","no","no","no","no","no")
)

I tried using dplyr, but I'm not getting the output I expect. For instance, the first and third rows get "yes", but they should be "no": even though Species Y is a synonym of Species X and occurs elsewhere in the df, it doesn't occur in "G1". Similarly, Species Z has no synonym in the df at all, so it should be "no" as well.

library(dplyr)

df <- df %>%
  group_by(groups) %>%
  mutate(
    at_least_two_synonyms_in_group = ifelse(
      any(synonyms %in% species) & any(species %in% synonyms) & n_distinct(intersect(synonyms, species)) >= 2,
      "yes",
      "no"
    )
  ) %>%
  ungroup()
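
A likely reason the attempt above marks every "G1" row as "yes" is that any() and n_distinct() each return a single value per group, so the ifelse() condition is one TRUE or FALSE that mutate() then recycles across all rows of the group. The base R sketch below illustrates this for "G1" only. It rebuilds the example df so it runs on its own; g1, group_level, and row_level are illustrative names, and length(intersect(...)) stands in for n_distinct(intersect(...)).

```r
# Rebuild the example data so the snippet is self-contained.
df <- data.frame(
  species = c("Species X","Species A", "Species Z", "Species A", "Species B", "Species C", "Species C", "Species D", "Species D", "Species A", "Species B", "Species E","Species Y","Species W","Species R"),
  synonyms = c("Species Y","Species B", "no_synonym", "Species B", "Species A", "Species E", "Species E", "no_synonym", "no_synonym", "Species B", "Species A", "Species C","Species X","Species R","Species W"),
  groups = c("G1","G1", "G1", "G1", "G1", "G2", "G2", "G3", "G3", "G1", "G4", "G5","G6","G7","G8")
)

# Look at a single group, roughly as group_by(groups) would.
g1 <- df[df$groups == "G1", ]

# The condition from the attempt collapses to ONE value for the whole group.
group_level <- any(g1$synonyms %in% g1$species) &
  any(g1$species %in% g1$synonyms) &
  length(intersect(g1$synonyms, g1$species)) >= 2

# A row-wise test yields one value per row instead.
row_level <- g1$synonyms %in% g1$species

print(group_level)
print(row_level)
```

Here group_level is a single TRUE, whereas row_level has one entry per row and is FALSE for the Species X and Species Z rows.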

Solution

  • Can you reduce the logic to just checking, within each group, whether synonyms is in species?

    df %>%
      group_by(groups) %>%
      mutate(
        ingroup = if_else(synonyms %in% species, "yes", "no")
      ) %>%
      ungroup()
    #      species   synonyms groups at_least_two_synonyms_in_group ingroup
    # 1  Species X  Species Y     G1                             no      no
    # 2  Species A  Species B     G1                            yes     yes
    # 3  Species Z no_synonym     G1                             no      no
    # 4  Species A  Species B     G1                            yes     yes
    # 5  Species B  Species A     G1                            yes     yes
    # 6  Species C  Species E     G2                             no      no
    # 7  Species C  Species E     G2                             no      no
    # 8  Species D no_synonym     G3                             no      no
    # 9  Species D no_synonym     G3                             no      no
    # 10 Species A  Species B     G1                            yes     yes
    # 11 Species B  Species A     G4                             no      no
    # 12 Species E  Species C     G5                             no      no
    # 13 Species Y  Species X     G6                             no      no