Efficiently find the number of different classmates from course-level data

I am stuck on how to efficiently compute the number of classmates for each student from a course-level database.

Consider this data.frame, where each row represents a course that a student has taken during a given semester:

dat <- 
  data.frame(
  student = c(1, 1, 2, 2, 2, 3, 4, 5),
  semester = c(1, 2, 1, 2, 2, 2, 1, 2),
  course = c(2, 4, 2, 3, 4, 3, 2, 4)
)

#   student semester course
# 1       1        1      2
# 2       1        2      4
# 3       2        1      2
# 4       2        2      3
# 5       2        2      4
# 6       3        2      3
# 7       4        1      2
# 8       5        2      4

Students take courses in a given semester. A student's classmates are the other students attending the same course during the same semester. For instance, across both semesters, student 1 has 3 classmates (students 2, 4 and 5).

How can I get the number of unique classmates each student has across both semesters? The desired output would be:

  student n
1       1 3
2       2 4
3       3 1
4       4 2
5       5 2

where n is the number of distinct classmates a student has had during the academic year.

I sense that an igraph solution could possibly work (hence the tag), but my knowledge of this package is too limited. I also feel like using joins could help, but again, I am not sure how.

Importantly, I would like this to work for larger datasets (mine has about 17M rows). Here is a larger example data set:

set.seed(1)
big_dat <- 
  data.frame(
    student = sample(1e4, 1e6, TRUE),
    semester = sample(2, 1e6, TRUE),
    course = sample(1e3, 1e6, TRUE)
  )

Solution

First try with igraph:

library(data.table)
library(igraph)

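setDT(dat)
# Offset class ids past the largest student id so vertex names never collide.
i <- max(dat$student)
# Edge list: one edge from each student to each (semester, course) class attended.
g <- graph_from_data_frame(
  dat[,.(student, class = .GRP + i), .(semester, course)][,-1:-2]
)
# Student vertices come first because students fill the first edge-list column.
v <- V(g)[1:uniqueN(dat$student)]
# Vertices exactly 2 steps away (student - class - student) are the classmates.
data.frame(student = as.integer(names(v)),
           n = ego_size(g, 2, v, mindist = 2))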
#>   student n
#> 1       1 3
#> 2       2 4
#> 3       4 2
#> 4       5 2
#> 5       3 1

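Note that the rows come back in order of first appearance in the grouped data, not sorted by student. Also, if student is not an integer, you'll need to create a temporary integer id with match() against the unique values, then use that id to index the unique values when building the final output.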
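A minimal sketch of that id remapping, in base R only. The character ids and the name dat_chr are made up for illustration; the solution code itself is left out and would run on the id column.

```r
# Hypothetical data: same structure as dat, but with character student ids.
dat_chr <- data.frame(
  student  = c("ann", "ann", "bob", "bob", "bob", "cat", "dan", "eve"),
  semester = c(1, 2, 1, 2, 2, 2, 1, 2),
  course   = c(2, 4, 2, 3, 4, 3, 2, 4),
  stringsAsFactors = FALSE
)

# Temporary integer id: position of each student in the vector of unique values.
u <- unique(dat_chr$student)
dat_chr$id <- match(dat_chr$student, u)

# ... run either solution on `id` instead of `student` ...

# Map integer ids back to the original labels by indexing into `u`.
labels_back <- u[dat_chr$id]
```

In the final output, the same indexing (u[id]) replaces the integer ids with the original student labels.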

With tcrossprod on a sparse matrix (Matrix package):

library(data.table)
library(Matrix)

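setDT(dat)
u <- unique(dat$student)
# sparseMatrix(i, j) without x builds a pattern (TRUE/FALSE) incidence matrix:
# one row per student id, one column per (semester, course) class.
# tcrossprod() then flags each pair of students sharing at least one class;
# subtracting 1 removes each student's own entry on the diagonal.
data.frame(
  student = u,
  n = colSums(
    tcrossprod(
      dat[,id := match(student, u)][
        ,.(i = id, j = .GRP), .(semester, course)
      ][,sparseMatrix(i, j)]
    )
  ) - 1L
)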
#>   student n
#> 1       1 3
#> 2       2 4
#> 3       3 1
#> 4       4 2
#> 5       5 2
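
As a cross-check on small inputs, the same counts can be reproduced with a plain self-join in base R. This is only a sketch for validating results, not one of the solutions above: the join materializes every pair of students per class, so it is unlikely to scale to millions of rows. The names mates, pairs and check are illustrative.

```r
dat <- data.frame(
  student  = c(1, 1, 2, 2, 2, 3, 4, 5),
  semester = c(1, 2, 1, 2, 2, 2, 1, 2),
  course   = c(2, 4, 2, 3, 4, 3, 2, 4)
)

# Copy of the data with the student column renamed, to join against.
mates <- setNames(dat, c("mate", "semester", "course"))

# Self-join on (semester, course): one row per pair of students sharing a class.
pairs <- merge(dat, mates, by = c("semester", "course"))

# Drop self-pairs, then keep each (student, mate) combination once.
pairs <- unique(pairs[pairs$student != pairs$mate, c("student", "mate")])

# Count distinct classmates per student.
check <- aggregate(mate ~ student, data = pairs, FUN = length)
names(check) <- c("student", "n")
check
#>   student n
#> 1       1 3
#> 2       2 4
#> 3       3 1
#> 4       4 2
#> 5       5 2
```

A student with no classmates at all drops out of this result instead of appearing with n = 0, so compare against the other outputs with that in mind.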