Search code examples
mysqlsqlpostgresqlaggregate-functionsrelational-division

Find pair of students who take exactly the same classes


I have to find a pair of students who take exactly the same classes from table that has studentID and courseID.

studentID | courseID
    1           1
    1           2
    1           3
    2           1
    3           1
    3           2
    3           3 

Query should return (1, 3).
The result also should not have duplicate rows such as (1,3) and (3,1).


Solution

  • Given sample data:

    CREATE TABLE student_course (
       student_id integer,
       course_id integer,
       PRIMARY KEY (student_id, course_id)
    );
    
    INSERT INTO student_course (student_id, course_id)
    VALUES (1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3) ;
    

    Use array aggregation

    One option is to use a CTE to join on the ordered lists of courses each student is taking:

    WITH student_coursearray(student_id, courses) AS (
      SELECT student_id, array_agg(course_id ORDER BY course_id)
      FROM student_course
      GROUP BY student_id
    )
    SELECT a.student_id, b.student_id
    FROM student_coursearray a INNER JOIN student_coursearray b ON (a.courses = b.courses)
    WHERE a.student_id > b.student_id;
    

    array_agg is actually part of the SQL standard, as is the WITH common-table expression syntax. Neither are supported by MySQL so you'll have to express this a different way if you want to support MySQL.

    Find missing course pairings per-student

    Another way to think about this would be "for every student pairing, find out if one is taking a class the other is not". This would lend its self to a FULL OUTER JOIN, but it's pretty awkward to express. You have to determine the pairings of student IDs of interest, then for each pairing do a full outer join across the set of classes each takes. If there are any null rows then one took a class the other didn't, so you can use that with a NOT EXISTS filter to exclude such pairings. That gives you this monster:

    WITH student_id_pairs(left_student, right_student) AS (
      SELECT DISTINCT a.student_id, b.student_id
      FROM student_course a 
      INNER JOIN student_course b ON (a.student_id > b.student_id)
    )
    SELECT left_student, right_student 
    FROM student_id_pairs 
    WHERE NOT EXISTS (
      SELECT 1
      FROM (SELECT course_id FROM student_course WHERE student_id = left_student) a
      FULL OUTER JOIN (SELECT course_id FROM student_course b WHERE student_id = right_student) b
        ON (a.course_id = b.course_id)
      WHERE a.course_id IS NULL or b.course_id IS NULL
    );
    

    The CTE is optional and may be replaced by a CREATE TEMPORARY TABLE AS SELECT ... or whatever if your DB doesn't support CTEs.

    Which to use?

    I'm very confident that the array approach will perform better in all cases, particularly because for a really large data set you can take the WITH expression, create a temporary table from the query instead, add an index on (courses, student_id) to it and do crazy-fast equality searching that'll well and truly pay off the cost of the index creation time. You can't do that with the subquery joins approach.