Search code examples
rjoindplyrdata.tabletidyverse

Left join by group and condition (`tidyverse` or `data.table`)


I have a very large data frame that includes integer columns state and state_cyclen. Every row is a gameframe, while state describes the state a game is in at that frame and state_cyclen is coded to indicate n occurrence of that state (it is basically data.table::rleid(state)). Conditioning on state and cycling by state_cyclen I need to import several columns from other definitions data frames. Definition data frames store properties about state and their row ordering informs on the way these properties are cycled throughout the game (players encounter each game state many times).

A minimal example of the long data that should be left joined:

data <- data.frame(
  state        = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3),
  state_cyclen = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 4, 4)
)

data 
#>    state state_cyclen
#> 1      1            1
#> 2      1            1
#> 3      2            1
#> 4      2            1
#> 5      3            1
#> 6      3            1
#> 7      1            2
#> 8      1            2
#> 9      2            2
#> 10     2            2
#> 11     3            2
#> 12     3            2
#> 13     2            3
#> 14     2            3
#> 15     3            3
#> 16     3            3
#> 17     3            3
#> 18     4            1
#> 19     4            1
#> 20     3            4
#> 21     3            4

Minimal example for definition data frames storing the ordering:

def_one <- data.frame(
  prop = letters[1:3],
  others = LETTERS[1:3]
)  

def_two <- data.frame(
  prop = letters[4:10],
  others = LETTERS[4:10]
) 

def_three <- data.frame(
  prop = letters[11:12],
  others = LETTERS[11:12]
) 

I have a solution written in base R that gives the desired output, but it's neither very readable, nor probably very efficient.

# Add empty columns
data$prop <- NA
data$others <- NA

# Function that recycles numeric vector bounded by a upper limit 
bounded_vec_recyc <- function(vec, n) if(n == 1) vec else (vec - 1) %% n + 1

# My solution
vec_pos_one <- data[data[, "state"] == 1, ]$state_cyclen 
vec_pos_one <- bounded_vec_recyc(vec_pos_one, n = nrow(def_one))
data[data[, "state"] == 1, ][, c("prop", "others")] <- def_one[vec_pos_one,]
  

vec_pos_two <- data[data[, "state"] == 2, ]$state_cyclen 
vec_pos_two <- bounded_vec_recyc(vec_pos_two, n = nrow(def_two))
data[data[, "state"] == 2, ][, c("prop", "others")] <- def_two[vec_pos_two,]


vec_pos_three <- data[data[, "state"] == 3, ]$state_cyclen 
vec_pos_three <- bounded_vec_recyc(vec_pos_three, n = nrow(def_three))
data[data[, "state"] == 3, ][, c("prop", "others")] <- def_three[vec_pos_three,]

data
#>    state state_cyclen prop others
#> 1      1            1    a      A
#> 2      1            1    a      A
#> 3      2            1    d      D
#> 4      2            1    d      D
#> 5      3            1    k      K
#> 6      3            1    k      K
#> 7      1            2    b      B
#> 8      1            2    b      B
#> 9      2            2    e      E
#> 10     2            2    e      E
#> 11     3            2    l      L
#> 12     3            2    l      L
#> 13     2            3    f      F
#> 14     2            3    f      F
#> 15     3            3    k      K
#> 16     3            3    k      K
#> 17     3            3    k      K
#> 18     4            1 <NA>   <NA>
#> 19     4            1 <NA>   <NA>
#> 20     3            4    l      L
#> 21     3            4    l      L

Created on 2022-08-30 with reprex v2.0.2

TLDR: As you can see, I am basically trying to merge one by one these definition data frames to the main data frame on corresponding state by recycling the rows of the definition data frame while retaining their order, using the state_cyclen column to keep track of occurrences of each state throughout the game.

Is there a way to do this within the tidyverse or data.table that is faster or at least easier to read? I need this to be quite fast as I have many such gameframe files (in the hundreds) and they are lengthy (hundreds of thousands of rows).

P.S. Not sure if title is adequate for the operations I am doing, as I can imagine multiple ways of implementation. Edits on it are welcome.


Solution

  • Here is a data.table solution. Not sure it is easier to read, but pretty sure it is more efficient:

    library(data.table)
    
    dt <- rbind(setDT(def_one)[,state := 1],
                setDT(def_two)[,state := 2],
                setDT(def_three)[,state := 3])
    dt[,state_cyclen := 1:.N,by = state]
    
    data <- setDT(data)
    data[dt[,.N,by = state],
         state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
         on = "state",
         by = .EACHI]
    
    dt[data,on = c("state","state_cyclen")]
    
        prop others state state_cyclen
     1:    a      A     1            1
     2:    a      A     1            1
     3:    d      D     2            1
     4:    d      D     2            1
     5:    k      K     3            1
     6:    k      K     3            1
     7:    b      B     1            2
     8:    b      B     1            2
     9:    e      E     2            2
    10:    e      E     2            2
    11:    l      L     3            2
    12:    l      L     3            2
    13:    f      F     2            3
    14:    f      F     2            3
    15:    k      K     3            1
    16:    k      K     3            1
    17:    k      K     3            1
    18: <NA>   <NA>     4            1
    19: <NA>   <NA>     4            1
    20:    l      L     3            2
    21:    l      L     3            2
        prop others state state_cyclen
    

    By step: I bind the def_one, def_two and def_three dataframes to create a data.table with the variable you need to merge

    dt <- rbind(setDT(def_one)[,state := 1],
                setDT(def_two)[,state := 2],
                setDT(def_three)[,state := 3])
    dt[,state_cyclen := 1:.N,by = state]
    

    In case you want to merge a lot of dataframes, you can use rbindlist and a list of data.tables.

    I then modify your state_cyclen in data to do the same recycling than you:

    dt[,.N,by = state]
    
       state N
    1:     1 3
    2:     2 7
    3:     3 2
    

    gives the lengths you use to define your recycling.

    data[dt[,.N,by = state],
         state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
         on = "state",
         by = .EACHI]
    

    I use the by = .EACHI to modify the variable for each group during the merge, using the N variable from dt[,.N,by = state]

    Then I just have to do the left join:

    dt[data,on = c("state","state_cyclen")]