We are generating some crosssell-data for a shop. We want to display stuff like "customers who looked at this product, also looked at these products". To generate this data, we do this query on a daily routine from session-based product-viewed-data.
INSERT INTO
product_viewed_together
(
product,
product_associate,
viewed
)
SELECT
v.product,
v2.product,
COUNT(*)
FROM
product_view v
INNER JOIN
product_view v2
ON
v2.session = v.session
AND v2.product != v.product
AND DATE_ADD(v2.created, INTERVAL %d DAY) > NOW()
WHERE
DATE_ADD(v.created, INTERVAL %d DAY) > NOW()
GROUP BY
v.product,
v2.product;
Table product_view
is joined to itself. As this table is quite big (circa 26 million rows), the result is even bigger. The query issues a huge amount of performance and time.
I am not use, we choosed a layout fitting the problem in a good way. Is there a better way to store and generate this data?
Make the date tests sargable:
DATE_ADD(v.created, INTERVAL %d DAY) > NOW()
-->
v.created > NOW - INTERVAL %d DAY
Is product_view
a VIEW
? Or a TABLE
? If a table, provide two "covering" indexes:
INDEX(created, session, product) -- (for v)
INDEX(session, created, product) -- (for v2)
Perhaps all the counts you get are even? This bug can be fixed in about 3 ways, each will double the speed. I think the optimal one is to change one line in the ON
to
DATE_ADD(v2.created, INTERVAL %d DAY) > NOW()
-->
v2.created > v.created
I think that will double the speed.
However, the counts may not be exactly correct if you can have two different products with the same created
.
Another issue: You will end up with
prod assoc CT
123 234 43
234 123 76 -- same pair, opposite order
My revised test says that 234 came before 123 more often than the other way.
Give those things a try. But if you still need more; I have another, more invasive, thought.