Tags: r, dataframe, dplyr, filter, longitudinal

Repeat first row per group id for time-varying variables in R data frame


A grouped/longitudinal data frame takes the form

id <- c(1,1,2,2,2,3,3,3)
x <- c(0,0,1,1,1,1,1,1)
t1 <- c(1,0,1,1,0,0,1,0)
t2 <- c(2.1,1.5,1.7,2.0,2.0,2.8,2.1,2.1)
t3 <- c(21,18,23,26,27,25,31,22)

df <- data.frame(id, x, t1, t2, t3)

Column x is time-invariant, while t1, t2, and t3 are time-varying. For t1 and t2, I want to replace every value with the value from the first row of the same id, so that the first-row value is repeated across all rows of that id. The values of x and t3 should stay unchanged. The expected output is

  id x t1  t2 t3
1  1 0  1 2.1 21
2  1 0  1 2.1 18
3  2 1  1 1.7 23
4  2 1  1 1.7 26
5  2 1  1 1.7 27
6  3 1  0 2.8 25
7  3 1  0 2.8 31
8  3 1  0 2.8 22

To do this, I first keep only the first row of each id using the dplyr package. I then add a column holding the number of times each id occurs (ntimes) and finally repeat each row ntimes times. My code is below:

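library(dplyr)

# Keep only the first row of each id
df <- df %>%
      group_by(id) %>%
      filter(row_number() == 1) %>%
      ungroup()
# Number of rows each id had in the original data (hard-coded)
df$ntimes <- c(2,3,3)
# Repeat each remaining row ntimes times, then drop the helper column
df <- as.data.frame(lapply(df, rep, df$ntimes))
df <- subset(df, select = -c(ntimes))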

This alters the t3 values as well, because only the first row of each id survives the filter before the rows are repeated. Hard-coding ntimes is also inconvenient, since my real data frame has many rows and columns. Is there a handy way to get the expected output? Help is greatly appreciated!


Solution

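  • Using dplyr::across and dplyr::first inside a grouped mutate, you can overwrite only t1 and t2 with their first value per id and leave every other column untouched (the .by argument requires dplyr 1.1.0 or later):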

    library(dplyr, warn.conflicts = FALSE)
    
    df |>
      mutate(across(c(t1, t2), first), .by = id)
    #>   id x t1  t2 t3
    #> 1  1 0  1 2.1 21
    #> 2  1 0  1 2.1 18
    #> 3  2 1  1 1.7 23
    #> 4  2 1  1 1.7 26
    #> 5  2 1  1 1.7 27
    #> 6  3 1  0 2.8 25
    #> 7  3 1  0 2.8 31
    #> 8  3 1  0 2.8 22
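
If dplyr is not available, or the installed version predates the `.by` argument, the same idea can be sketched in base R with `ave()`. This is only an illustration under the assumption that the rows are already ordered as desired within each id; the names `first_cols` and `result` are made up for this example and do not come from the original post.

```r
# Example data from the question
df <- data.frame(
  id = c(1, 1, 2, 2, 2, 3, 3, 3),
  x  = c(0, 0, 1, 1, 1, 1, 1, 1),
  t1 = c(1, 0, 1, 1, 0, 0, 1, 0),
  t2 = c(2.1, 1.5, 1.7, 2.0, 2.0, 2.8, 2.1, 2.1),
  t3 = c(21, 18, 23, 26, 27, 25, 31, 22)
)

# Columns to overwrite with their first value per id (hypothetical name)
first_cols <- c("t1", "t2")

result <- df
# ave() applies FUN within each id group and returns a vector as long as
# the input; the single value v[1] is recycled over the rows of the group
result[first_cols] <- lapply(
  df[first_cols],
  function(col) ave(col, df$id, FUN = function(v) v[1])
)

print(result)
```

Because only the columns listed in `first_cols` are reassigned, x and t3 keep their original values, and no per-id row count has to be written by hand.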