Search code examples
sqlmysqlquery-optimizationknowledge-graph

SQL Optimization query for recrusive subquery (MySQL)


I am totally new to MySQL and appreciate very much for your patience. I have the following MySQL table:

fact_id from_id relation_id to_id link_time
1 19 6 151 1
2 2233 2 57 1
3 182 23 112 1
4 22 17 21 1
5 3 8 742 1
6 507 2 55 1
7 154 25 56 1
8 100 83 18 1
9 1110 2 31 1
10 141 29 7 1
... ... ... ... ...

with the corresponding init code:

create table icews14s
(
    fact_id     int auto_increment
        primary key,
    from_id     int not null,
    relation_id int not null,
    to_id       int not null,
    link_time   int not null
);

create index icews14s_from_id_index
    on icews14s (from_id);

create index icews14s_link_time_index
    on icews14s (link_time);

create index icews14s_to_id_index
    on icews14s (to_id);

and one long query:

with target as (select fact_id, from_id, link_time from icews14s where fact_id = 69298),
             pre_nodes_1 as (select o.fact_id     as fact_id,
                                    o.from_id     as from_id,
                                    o.relation_id as relation_id,
                                    o.to_id       as to_id,
                                    o.link_time   as link_time,
                                    1             as degree
                             from icews14s o
                                      left join (select * from icews14s) t
                                                on o.from_id = t.from_id and o.link_time < t.link_time
                             where t.fact_id in (select fact_id from target)),
             pre_nodes_2 as (select o.fact_id     as fact_id,
                                    o.from_id     as from_id,
                                    o.relation_id as relation_id,
                                    o.to_id       as to_id,
                                    o.link_time   as link_time,
                                    2             as degree
                             from icews14s o
                                      left join (select * from icews14s) t on o.from_id = t.to_id and o.link_time < t.link_time
                             where t.fact_id in (select fact_id from pre_nodes_1)),
             pre_nodes_3 as (select o.fact_id     as fact_id,
                                    o.from_id     as from_id,
                                    o.relation_id as relation_id,
                                    o.to_id       as to_id,
                                    o.link_time   as link_time,
                                    3             as degree
                             from icews14s o
                                      left join (select * from icews14s) t on o.from_id = t.to_id and o.link_time < t.link_time
                             where t.fact_id in (select fact_id from pre_nodes_2))
        select fact_id, from_id, relation_id, to_id, link_time, 0 as degree from icews14s where fact_id = 69298
        union select * from pre_nodes_1
        union select * from pre_nodes_2
        union select * from pre_nodes_3
        order by fact_id desc limit 30;

the corresponding result is :

fact_id from_id relation_id to_id link_time degree
69298 1659 16 269 285 0
60977 1659 37 3176 253 1
58981 3176 1 3281 245 2
58757 1659 8 1884 245 1
58722 1659 0 1884 245 1
39282 1659 1 105 163 1
38740 105 7 1143 161 2
38570 105 29 1815 161 2
38440 105 19 2 160 2
38101 2 28 52 159 3
38061 105 0 581 158 2
37825 2 14 1057 157 3
37822 2 2 228 157 3
37606 2 2 1006 156 3
37597 2 9 9 156 3
37554 2 2 9 156 3
37390 2 99 9 156 3
37322 2 8 48 155 3
37277 2 2 9 155 3
37266 2 9 1068 155 3
37210 2 8 239 155 3
37120 2 2 90 155 3
37032 2 2 9 154 3
36993 2 8 28 154 3
36988 2 71 136 154 3
36971 2 2 90 154 3
36949 2 9 48 154 3
36896 2 29 9 154 3
36827 2 28 309 154 3
36798 2 10 52 153 3

The question is, this query is very slow in such a 100000 record table.

Is there any methods to make it quicker?

I tried to solve using recrusive temp table, but this statement never ends:

WITH RECURSIVE pre_nodes AS (
    SELECT
        fact_id,
        from_id,
        relation_id,
        to_id,
        link_time,
        0 AS degree
    FROM
        icews14s
    WHERE
        fact_id = 69298
    UNION
    SELECT
        o.fact_id,
        o.from_id,
        o.relation_id,
        o.to_id,
        o.link_time,
        n.degree + 1
    FROM
        icews14s o
    JOIN
    pre_nodes n ON IF(n.degree = 0, (o.from_id = n.from_id AND o.link_time < n.link_time),
                      (o.from_id = n.to_id AND o.link_time < n.link_time))
    WHERE
        o.fact_id != 69298
)
SELECT distinct
    fact_id,
    from_id,
    relation_id,
    to_id,
    link_time,
    degree
FROM
    pre_nodes
ORDER BY
    fact_id DESC
LIMIT
    30;

Solution

  • Your first query can be simplified quite a bit. Let's take the pre_nodes_1 cte as an example:

    pre_nodes_1 as (
        select o.fact_id     as fact_id,
               o.from_id     as from_id,
               o.relation_id as relation_id,
               o.to_id       as to_id,
               o.link_time   as link_time,
               1             as degree
        from icews14s o
        left join (select * from icews14s) t
            on o.from_id = t.from_id
           and o.link_time < t.link_time
        where t.fact_id in (select fact_id from target)
    )
    

    As O. Jones pointed out in the comments, the nesting of (select * from icews14s) is unnecessary, but the optimizer should spot this and remove the nesting. The left join will be implicitly turned into an inner join due to the criterion on t.fact_id. Given the t.fact_id in (select fact_id from target), I think the cte can be rewritten as:

    pre_nodes_1 as (
        select o.*, 1 as degree
        from icews14s o
        join target t
            on o.from_id = t.from_id and o.link_time < t.link_time
    )
    

    The same pattern applies to the other two pre_nodes_* ctes. Given the o.link_time < t.link_time criterion in each of the joins, it should not be necessary to dedupe the UNION, so switching to UNION ALL will reduce overhead by removing the dedupe step.

    with target as (
        select fact_id, from_id, link_time from icews14s where fact_id = 69298
    ),
    pre_nodes_1 as (
        select o.*, 1 as degree
        from icews14s o
        join target t
            on o.from_id = t.from_id and o.link_time < t.link_time
    ),
    pre_nodes_2 as (
        select o.*, 2 as degree
        from icews14s o
        join pre_nodes_1 t
            on o.from_id = t.to_id and o.link_time < t.link_time
    ),
    pre_nodes_3 as (
        select o.*, 3 as degree
        from icews14s o
        join pre_nodes_2 t
            on o.from_id = t.to_id and o.link_time < t.link_time
    )
    
    select fact_id, from_id, relation_id, to_id, link_time, 0 as degree
    from icews14s
    where fact_id = 69298
    union all
    select * from pre_nodes_1
    union all
    select * from pre_nodes_2
    union all
    select * from pre_nodes_3
    order by fact_id desc
    limit 30;
    

    This query will probably benefit from a composite index on (from_id, link_time):

    alter table icews14s
        add index idx_from_id_link_time (from_id, link_time);
    

    For your recursive cte, I suspect the performance issue is due to the join condition. Without a reasonable amount of test data I cannot test this, but I suspect rewriting the recursive part of the cte like this may allow it to use an index:

        SELECT o.*, n.degree + 1
        FROM pre_nodes n
        JOIN icews14s o
            ON o.from_id = IF(n.degree = 0, n.from_id, n.to_id)
            AND o.link_time < n.link_time
            AND o.fact_id <> 69298