Search code examples
rtidyrtibblenested-tibble

Make "tidy" tibble dataset with matrix variable


I have a dataset in a one-row-per-person format, but I want to merge in a dataset as a matrix variable -- basically, if not for the need of the tidy format, I would use a fuzzy join to merge in data from a range of dates, creating n x 3 rows. Instead, I really want to merge them in as a matrix variable.

I've tried using the fuzzy join followed by a nest function, but my merge dataset contains several variables, so needing to explicitly state my variables for the nest function would be a ton of work and ultimately look messy (though, I suppose if this is my only way forward, I could commit to this).

By way of providing a reproducible example of what I am thinking, I've created the fake dataset below:

person_data <- tibble(ID=paste0("ID", 1:50), outcome=sample(0:1, 50, T), date=sample(5:145, 50, F),min_date=date-5)
value_data <- tibble(date=seq(1:150), exp1=sample(20:100, 150, T), exp2=sample(20:100, 150, T))

My first thought was to combine fuzzy join and the nest function like so:

merged <- fuzzyjoin::fuzzy_left_join(person_data, value_data, by=c("date" = "date", "min_date" = "date"), match_fun=c(`>=`,`<=`))

analysis_data <- merged %>% 
  select(-date.y) %>% 
  group_by(ID) %>% 
  nest(exp1=exp1, exp2=exp2)

The head of the resulting analysis_data tibble is here.

> analysis_data
# A tibble: 50 × 6
# Groups:   ID [50]
   ID    outcome date.x min_date exp1             exp2            
   <chr>   <int>  <int>    <dbl> <list>           <list>          
 1 ID1         1     39       34 <tibble [6 × 1]> <tibble [6 × 1]>
 2 ID2         1     87       82 <tibble [6 × 1]> <tibble [6 × 1]>
 3 ID3         0     23       18 <tibble [6 × 1]> <tibble [6 × 1]>
 4 ID4         1     60       55 <tibble [6 × 1]> <tibble [6 × 1]>
 5 ID5         1    120      115 <tibble [6 × 1]> <tibble [6 × 1]>
 6 ID6         0     35       30 <tibble [6 × 1]> <tibble [6 × 1]>
 7 ID7         0    131      126 <tibble [6 × 1]> <tibble [6 × 1]>
 8 ID8         0     83       78 <tibble [6 × 1]> <tibble [6 × 1]>
 9 ID9         0     68       63 <tibble [6 × 1]> <tibble [6 × 1]>
10 ID10        1    133      128 <tibble [6 × 1]> <tibble [6 × 1]>

And for the most part, it kind of does get me there because the resulting data has 50 rows but is there a better/simpler way of approaching this problem, especially if I had exp1-50 instead of exp1 and exp2? I'd also like to add an index matrix variable, but I'm unsure how to add that.

Realizing it doesn't fit with the rest of my examples, I basically want a dataset that looks like this in the end:

fake_data_goal <- tibble(ID=paste0("ID", 1:50), outcome=sample(0:1, 50, T), var=matrix(data=sample(50:100, 50*3, T), nrow=50), index=matrix(rep(c(1:3), 50), nrow=50, byrow=T))

> fake_data_goal
# A tibble: 50 × 4
   ID    outcome var[,1]  [,2]  [,3] index[,1]  [,2]  [,3]
   <chr>   <int>   <int> <int> <int>     <int> <int> <int>
 1 ID1         0      66    90    71         1     2     3
 2 ID2         0      95    98    75         1     2     3
 3 ID3         0      57    84    91         1     2     3
 4 ID4         1      78    89    64         1     2     3
 5 ID5         1      97    73    95         1     2     3
 6 ID6         0      52    96    76         1     2     3
 7 ID7         0      62    93    57         1     2     3
 8 ID8         0      62    76    81         1     2     3
 9 ID9         1      55    58    67         1     2     3
10 ID10        0      81    91    91         1     2     3

Thank you so much in advance!


Solution

  • As perhaps half of an answer which would at least mean you don't have to repeat the typing of exp, exp2... up to exp50 etc., you could pivot_longer before joining and pivot_wider after nesting to give the right number of list columns. These can then be plucked out into matrix columns by pattern matching column names:

    library(tidyverse)
    
    person_data <-
      tibble(
        ID = paste0("ID", 1:50),
        outcome = sample(0:1, 50, T),
        date = sample(5:145, 50, F),
        min_date = date - 5
      )
    
    value_data <-
      tibble(
        date = seq(1:150),
        exp1 = sample(20:100, 150, T),
        exp2 = sample(20:100, 150, T)
      )
    
    
    merged <-
      fuzzyjoin::fuzzy_left_join(
        person_data,
        value_data |> pivot_longer(-date, names_to = "exp", values_to = "val"),
        by = c("date" = "date", "min_date" = "date"),
        match_fun = c(`>=`, `<=`)
      )
    
    merged |> 
      select(-date.y) |> 
      group_by(ID, exp) |> 
      nest(val = val) |> 
      pivot_wider(names_from = exp,
                  values_from = val) |> 
      rowwise() |>
      mutate(across(starts_with("exp"), ~ t(as.matrix(.x$val))))
    #> # A tibble: 50 × 6
    #> # Rowwise:  ID
    #>    ID    outcome date.x min_date exp1[,1]  [,2]  [,3]  [,4]  [,5]  [,6] exp2[,1]
    #>    <chr>   <int>  <int>    <dbl>    <int> <int> <int> <int> <int> <int>    <int>
    #>  1 ID1         0    118      113       96    83    69    59    42    34       60
    #>  2 ID2         1      9        4       38    91    83    28    87    73       96
    #>  3 ID3         1     83       78       49    31    33    84    57    50       83
    #>  4 ID4         1     97       92       91    92    21    53    59    39       23
    #>  5 ID5         1     71       66       97    28    56    91    67    43       98
    #>  6 ID6         1     27       22       81    88    41    22    24    84       36
    #>  7 ID7         1     64       59       46    51    88    76    39    63       53
    #>  8 ID8         0     72       67       28    56    91    67    43    55       23
    #>  9 ID9         0     80       75       76    79    62    49    31    33       66
    #> 10 ID10        0     87       82       57    50    31    72    95    31       66
    #> # … with 40 more rows, and 1 more variable: exp2[2:6] <int>
    

    Created on 2022-08-10 by the reprex package (v2.0.1)

    As noted, it's not fully answering your question, but would at least get you to the half-way-acceptable stage you'd reached when producing analysis_data.

    The things remaining would be:

    1. is this where you expect your exp1 etc. variables to go?
    2. it's not clear where your index values would come from for each row?