Search code examples
rindexingtidyversevectorization

Vectorized sum of varying length vectors


I'm trying to simulate y observations from the following equation :

for all possible values of X and random values for the parameter Beta and the vector of parameters Delta.

Using R and packages from the tidyverse, this is what I got so far :

library(tidyverse)
library(extraDistr)
n <- 100
x_max <- 4

set.seed(1)
data <- 
    tibble(
        id = 1:100,
        beta = rnorm(100, mean = 0, sd = 1/(x_max - 1)),
        delta = cbind(0, rdirichlet(n, rep(1, x_max - 1)))
    ) %>% 
    expand_grid(x = 1:x_max)

data

# A tibble: 400 × 4
      id    beta delta[,1]   [,2]    [,3]   [,4]     x
   <int>   <dbl>     <dbl>  <dbl>   <dbl>  <dbl> <int>
 1     1 -0.209          0 0.169  0.170   0.661      1
 2     1 -0.209          0 0.169  0.170   0.661      2
 3     1 -0.209          0 0.169  0.170   0.661      3
 4     1 -0.209          0 0.169  0.170   0.661      4
 5     2  0.0612         0 0.0962 0.00297 0.901      1
 6     2  0.0612         0 0.0962 0.00297 0.901      2
 7     2  0.0612         0 0.0962 0.00297 0.901      3
 8     2  0.0612         0 0.0962 0.00297 0.901      4
 9     3 -0.279          0 0.241  0.714   0.0451     1
10     3 -0.279          0 0.241  0.714   0.0451     2
# ℹ 390 more rows
# ℹ Use `print(n = ...)` to see more rows

I'm stuck at the part of computing the y values from the formula above. The challenge is to use x as an indexing variable while computing the sums of delta parameters. What I'm looking for is something like this :

data %>% 
    rowwise() %>% 
    mutate(
        y = beta * sum(c_across(delta[,1]:delta[,x]))
    )

However, this code does not work and I get the following error message :

Error in `mutate()`:
ℹ In argument: `y = beta * sum(c_across(delta[, 1]:delta[, x]))`.
ℹ In row 1.
Caused by error in `c_across()`:
! Problem while evaluating `delta[, 1]`.
Caused by error:
! object 'delta' not found

I know there must be some problem with the nested matrix delta. However, I still feel like dplyr should be able to support vectorized indexing operations no?

From the example above, the result I expect to get is this :

# A tibble: 400 × 5
# Rowwise: 
      id    beta delta[,1]   [,2]    [,3]   [,4]     x        y
   <int>   <dbl>     <dbl>  <dbl>   <dbl>  <dbl> <int>    <dbl>
 1     1 -0.209          0 0.169  0.170   0.661      1  0      
 2     1 -0.209          0 0.169  0.170   0.661      2 -0.0352 
 3     1 -0.209          0 0.169  0.170   0.661      3 -0.0708 
 4     1 -0.209          0 0.169  0.170   0.661      4 -0.209  
 5     2  0.0612         0 0.0962 0.00297 0.901      1  0      
 6     2  0.0612         0 0.0962 0.00297 0.901      2  0.00589
 7     2  0.0612         0 0.0962 0.00297 0.901      3  0.00607
 8     2  0.0612         0 0.0962 0.00297 0.901      4  0.0612 
 9     3 -0.279          0 0.241  0.714   0.0451     1  0      
10     3 -0.279          0 0.241  0.714   0.0451     2 -0.0671 
# ℹ 390 more rows
# ℹ Use `print(n = ...)` to see more rows

Solution

  • Is this close to what you're looking for?

    library(tidyverse)
    library(extraDistr)
    
    set.seed(1)
    
    n <- 100
    x_max <- 4
    
    data <- 
      tibble(
        id = 1:n,
        beta = rnorm(n, mean = 0, sd = 1/(x_max - 1)),
        delta = cbind(0, rdirichlet(n, rep(1, x_max - 1)))
      ) %>% 
      expand_grid(x = 1:x_max)
    
    data %>% 
      rowwise() %>% 
      mutate(
        y = beta * sum(delta[,1:x])
      )
    #> # A tibble: 400 × 5
    #> # Rowwise: 
    #>       id    beta delta[,1]   [,2]    [,3]   [,4]     x        y
    #>    <int>   <dbl>     <dbl>  <dbl>   <dbl>  <dbl> <int>    <dbl>
    #>  1     1 -0.209          0 0.169  0.170   0.661      1  0      
    #>  2     1 -0.209          0 0.169  0.170   0.661      2 -0.0352 
    #>  3     1 -0.209          0 0.169  0.170   0.661      3 -0.0708 
    #>  4     1 -0.209          0 0.169  0.170   0.661      4 -0.209  
    #>  5     2  0.0612         0 0.0962 0.00297 0.901      1  0      
    #>  6     2  0.0612         0 0.0962 0.00297 0.901      2  0.00589
    #>  7     2  0.0612         0 0.0962 0.00297 0.901      3  0.00607
    #>  8     2  0.0612         0 0.0962 0.00297 0.901      4  0.0612 
    #>  9     3 -0.279          0 0.241  0.714   0.0451     1  0      
    #> 10     3 -0.279          0 0.241  0.714   0.0451     2 -0.0671 
    #> # ℹ 390 more rows
    

    I think the confusion that c_across is having is that delta is technically behaving as one (matrix) column, so can't be summed 'across':

    str(data)
    #> tibble [400 × 4] (S3: tbl_df/tbl/data.frame)
    #>  $ id   : int [1:400] 1 1 1 1 2 2 2 2 3 3 ...
    #>  $ beta : num [1:400] -0.2088 -0.2088 -0.2088 -0.2088 0.0612 ...
    #>  $ delta: num [1:400, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
    #>  $ x    : int [1:400] 1 2 3 4 1 2 3 4 1 2 ...