Search code examples
rvariancepanel-data

Variance over time with two dataframes in R


I have two dataframes. df1 is a dataframe where every row is a score that someone gives.

df1

title <- c("x","x","x","x","y","y","y","y","y")
day <- c(0,2,2,4,1,1,3,3,4)
score <- c(7,7,6,4,8,1,7,1,5)
df1 = data.frame(title,day,score)

df2 is title-day formatted panel dataset in long format with a lot of variables. I'm looking for a way to mutate the variance score on day x and the variance score over time (i.e. variance of the score(s) on day x and all the previous scores that are there).

It should look like this:

title <- c("x","x","x","x","x","y","y","y","y","y")
day <- c(0,1,2,3,4,0,1,2,3,4)
variance_day_x <- c(0,0,0.5,0,0,0,24.5,0,12,0)
variance_cumulative <- c(0,0,0.3333,0.3333,2,0,24.5,24.5,14.25,10.8)
df2 <- data.frame(title,day,variance_day_x,variance_cumulative)

As you can see I need to mutate 2 variables out of df1 into df2. The variance per day is the first variable, where variance = 0 when there is 0 or 1 score available on that day because there is nothing to calculate. The second variable is cumulative variance where the variance needs to be updated every time there are new scores available.

Hope this explained my problem well enough. I'm stuck at this moment, hope you guys can help!


Solution

  • Using tidyverse you could try something like this. First group_by title and use a custom cumulative variance function that can be called from mutate. The daily variance is computed after grouping by both title and day. complete will fill in missing days, and fill will carry forward the cumulative variance for those missing days. You can replace the NA with zero if you would like with replace_na.

    library(tidyverse)
    
    cumvar <- function(x)  {
      sapply(seq_along(x), function(i) var(x[1:i]))
    }
    
    df1 %>%
      group_by(title) %>%
      mutate(cvar = cumvar(score)) %>%
      group_by(title, day) %>%
      summarise(variance_day_x = var(score),
                variance_cumulative = last(cvar)) %>%
      complete(title, day = 0:4) %>%
      fill(variance_cumulative, .direction = "down")
    

    Output

    # A tibble: 10 x 4
    # Groups:   title [2]
       title   day variance_day_x variance_cumulative
       <chr> <dbl>          <dbl>               <dbl>
     1 x         0           NA                NA    
     2 x         1           NA                NA    
     3 x         2            0.5               0.333
     4 x         3           NA                 0.333
     5 x         4           NA                 2    
     6 y         0           NA                NA    
     7 y         1           24.5              24.5  
     8 y         2           NA                24.5  
     9 y         3           18                14.2  
    10 y         4           NA                10.8