I have to find a pair of students who take exactly the same classes from table that has studentID
and courseID
.
studentID | courseID
1 1
1 2
1 3
2 1
3 1
3 2
3 3
Query should return (1, 3)
.
The result also should not have duplicate rows such as (1,3)
and (3,1)
.
Given sample data:
CREATE TABLE student_course (
student_id integer,
course_id integer,
PRIMARY KEY (student_id, course_id)
);
INSERT INTO student_course (student_id, course_id)
VALUES (1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3) ;
One option is to use a CTE to join on the ordered lists of courses each student is taking:
WITH student_coursearray(student_id, courses) AS (
SELECT student_id, array_agg(course_id ORDER BY course_id)
FROM student_course
GROUP BY student_id
)
SELECT a.student_id, b.student_id
FROM student_coursearray a INNER JOIN student_coursearray b ON (a.courses = b.courses)
WHERE a.student_id > b.student_id;
array_agg
is actually part of the SQL standard, as is the WITH
common-table expression syntax. Neither are supported by MySQL so you'll have to express this a different way if you want to support MySQL.
Another way to think about this would be "for every student pairing, find out if one is taking a class the other is not". This would lend its self to a FULL OUTER JOIN
, but it's pretty awkward to express. You have to determine the pairings of student IDs of interest, then for each pairing do a full outer join across the set of classes each takes. If there are any null rows then one took a class the other didn't, so you can use that with a NOT EXISTS
filter to exclude such pairings. That gives you this monster:
WITH student_id_pairs(left_student, right_student) AS (
SELECT DISTINCT a.student_id, b.student_id
FROM student_course a
INNER JOIN student_course b ON (a.student_id > b.student_id)
)
SELECT left_student, right_student
FROM student_id_pairs
WHERE NOT EXISTS (
SELECT 1
FROM (SELECT course_id FROM student_course WHERE student_id = left_student) a
FULL OUTER JOIN (SELECT course_id FROM student_course b WHERE student_id = right_student) b
ON (a.course_id = b.course_id)
WHERE a.course_id IS NULL or b.course_id IS NULL
);
The CTE is optional and may be replaced by a CREATE TEMPORARY TABLE AS SELECT ...
or whatever if your DB doesn't support CTEs.
I'm very confident that the array approach will perform better in all cases, particularly because for a really large data set you can take the WITH
expression, create a temporary table from the query instead, add an index on (courses, student_id)
to it and do crazy-fast equality searching that'll well and truly pay off the cost of the index creation time. You can't do that with the subquery joins approach.