Search code examples
rdataframepivotdata-manipulation

Interaction Effects of Students in R


I am working with a dataset df1 in R. It has two columns: Date and Student_ID. I first want to find out the number of pairs of students who have appeared together in a certain date. Suppose these pairs of students are in a column V. Now I want to create a dataframe with columns V such that corresponding to every Date, we have a 0 or 1 under every column if the pair of students appeared on the same date. I am having trouble writing this code as it is quite a complex data manipulation to come up with.

This is how the dataframe looks like:

> head(pair_combinations, 20)
# A tibble: 20 × 2
    DATE STUDENT_ID
      <int> <chr>      
 1 21500001 203145     
 2 21500001 200794     
 3 21500001 201143     
 4 21500001 2594       
 5 21500001 201952     
 6 21500001 200757     
 7 21500001 203934     
 8 21500001 203471     
 9 21500001 203488     
10 21500001 203118     
11 21500019 203145     
12 21500019 200794     
13 21500019 201143     
14 21500019 2594       
15 21500019 201952     
16 21500019 203471     
17 21500019 203934     
18 21500019 203118     
19 21500019 203501     
20 21500026 203145

I want the output like:

    DATE INTERACTION_203145_200794 INTERACTION_203145_201143 ...
      <int> <chr>      
 1 21500001          1                          1 
 2 21500019          1                          1 ...
...

Solution

  • It's not pretty, but I'm guessing this might work

    First, a helper function to get all pairs of values

    all_pairs<-function(x) {
      if (length(x) >= 2) {
        sapply(combn(sort(unique(x)),2, simplify=FALSE), paste, collapse="_")
      } else {
        NA
      }
    }
    

    We sort the values so the smaller ID always comes first so we don't repeat pairs (it will always be A_B and not B_A).

    Now we can do some reshaping

    library(dplyr)
    library(tidyr)
    
    dd %>% 
      group_by(DATE) %>% 
      reframe(intx = all_pairs(STUDENT_ID)) %>% 
      mutate(val = 1) %>% 
      pivot_wider(names_from = intx, values_from = val, values_fill = 0)
    

    Tested with

    dd <- read.table(text="
        DATE STUDENT_ID
     1 21500001 203145     
     2 21500001 200794     
     3 21500001 201143     
     4 21500001 2594       
     5 21500001 201952     
     6 21500001 200757     
     7 21500001 203934     
     8 21500001 203471     
     9 21500001 203488     
    10 21500001 203118     
    11 21500019 203145     
    12 21500019 200794     
    13 21500019 201143     
    14 21500019 2594       
    15 21500019 201952     
    16 21500019 203471     
    17 21500019 203934     
    18 21500019 203118     
    19 21500019 203501     
    20 21500026 203145", header=TRUE)