Search code examples
rdataframenestedcombinationsexpand

R: grouping and expanding the data.frame to include only possible pairs of names in a column


My goal is to expand my data.frame in R to include possible combinations (but not all possible combinations) from a column in R. Similar to the expand.grid command, but that function gives you all possible combinations, not just what is present.

To start, I need to group by each factor in the 1st column, and keep the information included in column 2. In column 3, I have character strings of the names of 'Animals'. I want to find each possible pair that occur in this column, row by row (but not all possible pairs). For example, if I have 'Dreadwing' and 'Scorcher' in the first two rows - that would be one pair: Dreadwing-Scorcher and it should not include Scorcher-Dreadwing. However, if rows 4 and five are T-Rex and T-Rex, the pair should appear once: T-Rex-T-Rex because T-Rex appears in 2 separate rows of the column 'Animals.' If T-Rex were to appear in 3 separate rows, then the pair should appear 3 times, etc., etc.

Lastly, pairs should expand the data.frame by 2 columns to store the pairs. In other words, the Dreadwing and Scorcher pair should be each in their own separate column, but in the same row.

I have manually put together this picture to show what my output should be from the data.frame I have (note: Area_1 and Area_2 are separated only to fit the results in one screenshot). The left: I have put arrows showing the desired combinations just from the first row, Dreadwing. On the right: the desired result for all of Area_1 and Area_2.

enter image description here

For the desired result, for Area_1, the Dreadwing-Dreadwing pair should not occur bc Dreadwing does not appear in any other row for Area_1. However, T-Rex appears in 2 separate rows, so the combination of T-Rex-T-Rex should be there, as well as the combination of each row of T-Rex combining with each row of Waterwing. So, 4 T-Rex-Waterwing combinations.

Reproducible Data

Creating the data.frame

v <- c(rep("Area_1", 7), rep("Area_2", 7))
w <- c(rep("Forest", 7), rep("Cave", 7))
y <- c("Waterwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex", "Dreadwing", 
"Waterwing", "Snake", "T-Rex", "T-Rex", "Dreadwing", "Snapmaw", "Scorcher", 
"Waterwing")

stack_df <- data.frame(Area = v, Location = w, Animals = y)
stack_df <- stack_df[order(stack_df$Area, stack_df$Location, stack_df$Animals), ]
row.names(stack_df) <- 1:nrow(stack_df)

Using the tidyR guidebook, I have found that the command expand in conjunction with the nesting command (to keep only combinations that appear in the data) does not work. For example:

library(tidyr)    
stack_df %>%
    dplyr::group_by(Area) %>%
    expand(nesting(Location, Animals, Animals))

will return only 11/14 rows.

I have tried multiple ways using the expand and crossing command. However, like the expand.grid command, these commands give you all possible combinations.

Despite this, using the expand command is the closest I have gotten to what I am aiming for.

stack_df %>%
dplyr::group_by(Area) %>%
expand(Location, Animals, Animals)

As you can see, all possibilities are included, which is not the desired result.

Any ideas on how I can get this done?


Solution

  • It sounds/looks to me like you want to find all combinations (within the Area/Location group) of pairs of Animals where the 1st Animal in the pair occurs on a row before the 2nd Animal in the pair.

    We can do this by adding a row number index and doing a self-join with an inequality constraint on the row numbers. (This requires dplyr version >= 1.1.0)

    library(dplyr)
    stack_df = stack_df |>
      mutate(group_i = row_number(), .by = c(Area, Location))
    
    stack_df |>
      inner_join(
        stack_df,
        by = join_by(Area, Location, group_i < group_i),
        suffix = c("..2", "..3")  
      ) |>
      select(-starts_with("group_"))
    #      Area Location Animals..2 Animals..3
    # 1  Area_1   Forest  Dreadwing   Scorcher
    # 2  Area_1   Forest  Dreadwing    Snapmaw
    # 3  Area_1   Forest  Dreadwing      T-Rex
    # 4  Area_1   Forest  Dreadwing      T-Rex
    # 5  Area_1   Forest  Dreadwing  Waterwing
    # 6  Area_1   Forest  Dreadwing  Waterwing
    # 7  Area_1   Forest   Scorcher    Snapmaw
    # 8  Area_1   Forest   Scorcher      T-Rex
    # 9  Area_1   Forest   Scorcher      T-Rex
    # 10 Area_1   Forest   Scorcher  Waterwing
    # 11 Area_1   Forest   Scorcher  Waterwing
    # 12 Area_1   Forest    Snapmaw      T-Rex
    # 13 Area_1   Forest    Snapmaw      T-Rex
    # 14 Area_1   Forest    Snapmaw  Waterwing
    # 15 Area_1   Forest    Snapmaw  Waterwing
    # 16 Area_1   Forest      T-Rex      T-Rex
    # 17 Area_1   Forest      T-Rex  Waterwing
    # 18 Area_1   Forest      T-Rex  Waterwing
    # 19 Area_1   Forest      T-Rex  Waterwing
    # 20 Area_1   Forest      T-Rex  Waterwing
    # 21 Area_1   Forest  Waterwing  Waterwing
    # 22 Area_2     Cave  Dreadwing   Scorcher
    # 23 Area_2     Cave  Dreadwing      Snake
    # 24 Area_2     Cave  Dreadwing    Snapmaw
    # 25 Area_2     Cave  Dreadwing      T-Rex
    # 26 Area_2     Cave  Dreadwing      T-Rex
    # 27 Area_2     Cave  Dreadwing  Waterwing
    # 28 Area_2     Cave   Scorcher      Snake
    # 29 Area_2     Cave   Scorcher    Snapmaw
    # 30 Area_2     Cave   Scorcher      T-Rex
    # 31 Area_2     Cave   Scorcher      T-Rex
    # 32 Area_2     Cave   Scorcher  Waterwing
    # 33 Area_2     Cave      Snake    Snapmaw
    # 34 Area_2     Cave      Snake      T-Rex
    # 35 Area_2     Cave      Snake      T-Rex
    # 36 Area_2     Cave      Snake  Waterwing
    # 37 Area_2     Cave    Snapmaw      T-Rex
    # 38 Area_2     Cave    Snapmaw      T-Rex
    # 39 Area_2     Cave    Snapmaw  Waterwing
    # 40 Area_2     Cave      T-Rex      T-Rex
    # 41 Area_2     Cave      T-Rex  Waterwing
    # 42 Area_2     Cave      T-Rex  Waterwing