Search code examples
mysqljoindatabase-performance

performance of big INSERT with data from huge select


We are generating some crosssell-data for a shop. We want to display stuff like "customers who looked at this product, also looked at these products". To generate this data, we do this query on a daily routine from session-based product-viewed-data.

            INSERT INTO
                product_viewed_together
            (
                product,
                product_associate,
                viewed
            )
            SELECT
                v.product,
                v2.product,
                COUNT(*)
            FROM
                product_view v
            INNER JOIN
                product_view v2
            ON
                v2.session = v.session
                AND v2.product != v.product
                AND DATE_ADD(v2.created, INTERVAL %d DAY) > NOW()
            WHERE
                DATE_ADD(v.created, INTERVAL %d DAY) > NOW()
            GROUP BY
                v.product,
                v2.product;

Table product_view is joined to itself. As this table is quite big (circa 26 million rows), the result is even bigger. The query issues a huge amount of performance and time.

I am not use, we choosed a layout fitting the problem in a good way. Is there a better way to store and generate this data?


Solution

  • Make the date tests sargable:

    DATE_ADD(v.created, INTERVAL %d DAY) > NOW()
    

    -->

    v.created > NOW - INTERVAL %d DAY
    

    Is product_view a VIEW? Or a TABLE? If a table, provide two "covering" indexes:

    INDEX(created, session, product)  -- (for v)
    INDEX(session, created, product)  -- (for v2)
    

    Perhaps all the counts you get are even? This bug can be fixed in about 3 ways, each will double the speed. I think the optimal one is to change one line in the ON to

    DATE_ADD(v2.created, INTERVAL %d DAY) > NOW()
    

    -->

    v2.created > v.created
    

    I think that will double the speed.

    However, the counts may not be exactly correct if you can have two different products with the same created.

    Another issue: You will end up with

    prod  assoc  CT
    123   234    43
    234   123    76  -- same pair, opposite order
    

    My revised test says that 234 came before 123 more often than the other way.

    Give those things a try. But if you still need more; I have another, more invasive, thought.