Tags: r, performance, join, igraph

Efficiently find the number of different classmates from course-level data


I am stuck trying to efficiently compute the number of classmates each student has from a course-level dataset.

Consider this data.frame, where each row represents a course that a student has taken during a given semester:

dat <- 
  data.frame(
  student = c(1, 1, 2, 2, 2, 3, 4, 5),
  semester = c(1, 2, 1, 2, 2, 2, 1, 2),
  course = c(2, 4, 2, 3, 4, 3, 2, 4)
)

#   student semester course
# 1       1        1      2
# 2       1        2      4
# 3       2        1      2
# 4       2        2      3
# 5       2        2      4
# 6       3        2      3
# 7       4        1      2
# 8       5        2      4

Students attend courses in a given semester. Their classmates are the other students attending the same course during the same semester. For instance, across both semesters, student 1 has 3 classmates (students 2, 4 and 5).

How can I get the number of unique classmates each student has across both semesters combined? The desired output would be:

  student n
1       1 3
2       2 4
3       3 1
4       4 2
5       5 2

where n is the number of distinct classmates a student has had during the academic year.

I sense that an igraph solution could possibly work (hence the tag), but my knowledge of this package is too limited. I also feel like using joins could help, but again, I am not sure how.
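
The join idea can be made concrete with a brute-force self-join. This is only a minimal base R sketch to pin down what n means; the names `mates`, `pairs` and `res` are illustrative. Because it materialises every pair of enrolments within a class, it is unlikely to be practical at 17M rows, and students with no classmates at all would be missing from the result.

```r
dat <- data.frame(
  student  = c(1, 1, 2, 2, 2, 3, 4, 5),
  semester = c(1, 2, 1, 2, 2, 2, 1, 2),
  course   = c(2, 4, 2, 3, 4, 3, 2, 4)
)

# Copy of the data with the student column renamed, to join against.
mates <- setNames(dat, c("mate", "semester", "course"))

# Pair every enrolment with every enrolment in the same (semester, course).
pairs <- merge(dat, mates, by = c("semester", "course"))

# Drop self-pairs, then keep one row per distinct (student, mate) pair.
pairs <- pairs[pairs$student != pairs$mate, ]
pairs <- unique(pairs[, c("student", "mate")])

# Count the distinct classmates of each student.
res <- aggregate(mate ~ student, data = pairs, FUN = length)
names(res) <- c("student", "n")
res
```

On the toy data this reproduces the desired output shown above.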

Importantly, I would like this to work for larger datasets (mine has about 17M rows). Here's an example data set:

set.seed(1)
big_dat <- 
  data.frame(
    student = sample(1e4, 1e6, TRUE),
    semester = sample(2, 1e6, TRUE),
    course = sample(1e3, 1e6, TRUE)
  )

Solution

  • First try with igraph:

    library(data.table)
    library(igraph)
    
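    setDT(dat)
    # Offset class ids so they cannot collide with student ids.
    i <- max(dat$student)
    # Bipartite edge list: one edge per (student, class), where a class is a
    # (semester, course) group.
    g <- graph_from_data_frame(
      dat[,.(student, class = .GRP + i), .(semester, course)][,-1:-2]
    )
    # Student vertices come first because they fill the first edge-list column.
    v <- V(g)[1:uniqueN(dat$student)]
    # Classmates are exactly the vertices at distance 2 from a student.
    data.frame(student = as.integer(names(v)),
               n = ego_size(g, 2, v, mindist = 2))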
    #>   student n
    #> 1       1 3
    #> 2       2 4
    #> 3       4 2
    #> 4       5 2
    #> 5       3 1
    

    Note that if student is not an integer, you'll need to create a temporary integer id using match() on the unique values, and then use that id to index back into the original values in the final output.
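
The temporary-id round trip described in the note could look roughly like this. It is a sketch under assumptions rather than part of the original answer: the character IDs in `dat_chr` are made up for illustration, and the final column selection is written out by name instead of using negative indices.

```r
library(data.table)
library(igraph)

# Hypothetical data: same structure as dat, but with character student IDs.
dat_chr <- data.table(
  student  = c("ann", "ann", "bob", "bob", "bob", "cat", "dan", "eve"),
  semester = c(1, 2, 1, 2, 2, 2, 1, 2),
  course   = c(2, 4, 2, 3, 4, 3, 2, 4)
)

u <- unique(dat_chr$student)
dat_chr[, id := match(student, u)]  # temporary integer id in 1..length(u)
n_students <- length(u)             # class ids are offset above this value

# Edge list of (student id, class id), one class per (semester, course).
edges <- dat_chr[, .(id, class = .GRP + n_students),
                 by = .(semester, course)][, .(id, class)]
g <- graph_from_data_frame(edges)

v <- V(g)[1:n_students]
res <- data.frame(
  student = u[as.integer(names(v))],  # map the ids back to the original labels
  n = ego_size(g, 2, v, mindist = 2)
)
res
```

The only changes from the integer case are the `match()` call that builds `id` and the `u[...]` lookup that restores the original labels.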

  • With tcrossprod:

    library(data.table)
    library(Matrix)
    
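    setDT(dat)
    u <- unique(dat$student)
    data.frame(
      student = u,
      # Column sums count the students sharing at least one class with each
      # student; subtract 1 to exclude the student themself.
      n = colSums(
        tcrossprod(
          # Student-by-class incidence matrix; id (added to dat by reference)
          # is the student's row index.
          dat[,id := match(student, u)][
            ,.(i = id, j = .GRP), .(semester, course)
          ][,sparseMatrix(i, j)]
        )
      ) - 1L
    )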
    #>   student n
    #> 1       1 3
    #> 2       2 4
    #> 3       3 1
    #> 4       4 2
    #> 5       5 2
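
To see why the tcrossprod approach gives the classmate count, the same computation can be sketched with a small dense matrix in base R. This is an illustration only, not a replacement for the sparse version: the names `cls`, `M` and `co` are made up here, and a dense student-by-student matrix would not fit in memory for large data. The explicit `> 0` step makes visible something the sparse version appears to rely on implicitly, namely that each pair of students is counted once however many classes they share; the output shown above is consistent with that.

```r
dat <- data.frame(
  student  = c(1, 1, 2, 2, 2, 3, 4, 5),
  semester = c(1, 2, 1, 2, 2, 2, 1, 2),
  course   = c(2, 4, 2, 3, 4, 3, 2, 4)
)

u   <- unique(dat$student)
cls <- interaction(dat$semester, dat$course, drop = TRUE)  # one level per class

# Student-by-class incidence matrix: M[s, k] is 1 if student s attended class k.
M <- matrix(0, nrow = length(u), ncol = nlevels(cls))
M[cbind(match(dat$student, u), as.integer(cls))] <- 1

# tcrossprod(M) is M %*% t(M): entry [s, t] is the number of classes shared by
# students s and t. Thresholding keeps only whether they share any class.
co <- tcrossprod(M) > 0

# Each column counts the student plus their classmates, so subtract 1.
res <- data.frame(student = u, n = colSums(co) - 1)
res
```

Without the threshold, student 1 would get 5 rather than 3, because students 1 and 2 share two classes and student 1's own diagonal entry is 2.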