I am working with the ml100k dataset to write a query that can fetch me the most rated movie for every age.
Here's how my tables are defined.
users
id | age | gender | occupation | zipcode
ratings
userid | movieid | rating | ts
What I've done so far
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
This gave me the count for every movie for every age.
10 1 1
11 1 1
13 1 3
14 1 1
15 1 2
16 1 4
17 1 4
18 1 6
19 1 15
20 1 22
21 1 14
SELECT age, MAX(mcount) AS mc
FROM (
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
) t1
GROUP BY age
7 1
10 1
11 1
13 5
14 3
15 5
16 5
17 11
18 16
19 21
20 25
21 23
This gave me the age and the max count. However, I also want the corresponding movie ID and this where I have been stuck. My thought process is to join these results with the first table but it is not working. Are there any other alternatives I can try? This is the query I used.
SELECT users.age, ratings.movieid, count(*) as mc2
FROM ratings JOIN users ON id = userid
INNER JOIN
(
SELECT age, MAX(mcount) AS mc
FROM (
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
) t1
GROUP BY age
)t2
ON t2.age = users.age
WHERE mc2=t2.mc
GROUP BY users.age, ratings.movieid;
You can do it like this:
SELECT t.age, t.movieid, t.mcount
FROM (
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
) t
LEFT JOIN (
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
) t2
ON t.age = t2.age AND
t.movieid <> t2.movieid AND
t.mcount < t2.mcount
WHERE t2.age IS NULL
Explanation:
ratings
and users
on
the id
of users
age
and movieid
LEFT JOIN
the two groups by
age
movieid
(so we will have sensible count comparisons)mcount
being smaller than the secondWHERE
clause we exclude the scenario when we found higher count matches in the second group than in the first group