I have a table with 24 milion rows.
I want to run this query:
select r1.userID, r2.userID, sum(r1.rate * r2.rate) as sum
from dbo.Ratings as r1
join dbo.Ratings as r2
on r1.movieID = r2.movieID
where r1.userID <= r2.userID
group by r1.userID, r2.userID
As I tested, it took 24 hours to produce 0.02 percent of the final result.
How can I speed it up?
Here is the definition of the table:
CREATE TABLE [dbo].[Ratings](
[userID] [int] NOT NULL,
[movieID] [int] NOT NULL,
[rate] [real] NOT NULL,
PRIMARY KEY CLUSTERED
(
[userID] ASC,
[movieID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [IX_RatingsMovies] ON [dbo].[Ratings]
(
[movieID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [IX_RatingsUsers] ON [dbo].[Ratings]
(
[userID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
Here is the execution plan:
The workaround I suggested was to create a "reverse" index:
CREATE INDEX IX_Ratings_Reverse on Ratings(movieid, userid) include(rate);
and then force SQL Server to use it:
select r1.userID, r2.userID, sum(r1.rate * r2.rate) as sum
from dbo.Ratings as r1 join dbo.Ratings as r2
with (index(IX_Ratings_Reverse))
on r1.movieID = r2.movieID
where r1.userID <= r2.userID group by r1.userID, r2.userID