Search code examples
mysqlquery-optimizationsql-execution-plan

MySQL query optimization - distinct, order by and limit


I am trying to optimize the following query:

select distinct this_.id as y0_
from Rental this_
    left outer join RentalRequest rentalrequ1_ 
      on this_.id=rentalrequ1_.rental_id
    left outer join RentalSegment rentalsegm2_ 
      on rentalrequ1_.id=rentalsegm2_.rentalRequest_id
where
    this_.DTYPE='B'
    and this_.id<=1848978
    and this_.billingStatus=1
    and rentalsegm2_.endDate between 1273631699529 and 1274927699529
order by rentalsegm2_.id asc
limit 0, 100;

This query is done multiple time in a row for paginated processing of records (with a different limit each time). It returns the ids I need in the processing. My problem is that this query take more than 3 seconds. I have about 2 million rows in each of the three tables.

Explain gives:

+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+----------------------------------------------+
| id | select_type | table        | type   | possible_keys                                       | key           | key_len | ref                                        | rows   | Extra                                        |
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | rentalsegm2_ | range  | index_endDate,fk_rentalRequest_id_BikeRentalSegment | index_endDate | 9       | NULL                                       | 449904 | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | rentalrequ1_ | eq_ref | PRIMARY,fk_rental_id_BikeRentalRequest              | PRIMARY       | 8       | solscsm_main.rentalsegm2_.rentalRequest_id |      1 | Using where                                  | 
|  1 | SIMPLE      | this_        | eq_ref | PRIMARY,index_billingStatus                         | PRIMARY       | 8       | solscsm_main.rentalrequ1_.rental_id        |      1 | Using where                                  | 
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+----------------------------------------------+

I tried to remove the distinct and the query ran three times faster. explain without the query gives:

+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+-----------------------------+
| id | select_type | table        | type   | possible_keys                                       | key           | key_len | ref                                        | rows   | Extra                       |
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+-----------------------------+
|  1 | SIMPLE      | rentalsegm2_ | range  | index_endDate,fk_rentalRequest_id_BikeRentalSegment | index_endDate | 9       | NULL                                       | 451972 | Using where; Using filesort | 
|  1 | SIMPLE      | rentalrequ1_ | eq_ref | PRIMARY,fk_rental_id_BikeRentalRequest              | PRIMARY       | 8       | solscsm_main.rentalsegm2_.rentalRequest_id |      1 | Using where                 | 
|  1 | SIMPLE      | this_        | eq_ref | PRIMARY,index_billingStatus                         | PRIMARY       | 8       | solscsm_main.rentalrequ1_.rental_id        |      1 | Using where                 | 
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+-----------------------------+

As you can see, the Using temporary is added when using distinct.

I already have an index on all fields used in the where clause. Is there anything I can do to optimize this query?

Thank you very much!

Edit: I tried to order by on this_.id as suggested and the query was 5x slower. Here is the explain plan:

+----+-------------+--------------+------+-----------------------------------------------------+---------------------------------------+---------+------------------------------+--------+----------------------------------------------+
| id | select_type | table        | type | possible_keys                                       | key                                   | key_len | ref                          | rows   | Extra                                        |
+----+-------------+--------------+------+-----------------------------------------------------+---------------------------------------+---------+------------------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | this_        | ref  | PRIMARY,index_billingStatus                         | index_billingStatus                   | 5       | const                        | 782348 | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | rentalrequ1_ | ref  | PRIMARY,fk_rental_id_BikeRentalRequest              | fk_rental_id_BikeRentalRequest        | 9       | solscsm_main.this_.id        |      1 | Using where; Using index; Distinct           | 
|  1 | SIMPLE      | rentalsegm2_ | ref  | index_endDate,fk_rentalRequest_id_BikeRentalSegment | fk_rentalRequest_id_BikeRentalSegment | 8       | solscsm_main.rentalrequ1_.id |      1 | Using where; Distinct                        | 
+----+-------------+--------------+------+-----------------------------------------------------+---------------------------------------+---------+------------------------------+--------+----------------------------------------------+

Solution

    1. From the execution plan we see that the optimizer is smart enough to understand that you do not require OUTER JOINs here. Anyway, you should better specify that explicitly.
    2. The DISTINCT modifier means that you want to GROUP BY all fields in SELECT part, that is ORDER BY all of the specified fields and then discard duplicates. In other words, order by rentalsegm2_.id asc clause does not make any sence here.

    The query below should return the equivalent result:

    select distinct this_.id as y0_
    from Rental this_
        join RentalRequest rentalrequ1_ 
          on this_.id=rentalrequ1_.rental_id
        join RentalSegment rentalsegm2_ 
          on rentalrequ1_.id=rentalsegm2_.rentalRequest_id
    where
        this_.DTYPE='B'
        and this_.id<=1848978
        and this_.billingStatus=1
        and rentalsegm2_.endDate between 1273631699529 and 1274927699529
    limit 0, 100;
    

    UPD

    If you want the execution plan to start with RentalSegment, you will need to add the following indices to the database:

    1. RentalSegment (endDate)
    2. RentalRequest (id, rental_id)
    3. Rental (id, DTYPE, billingStatus) or (id, billingStatus, DTYPE)

    The query then could be rewritten as the following:

    SELECT this_.id as y0_
    FROM RentalSegment rs
        JOIN RentalRequest rr
        JOIN Rental this_
    WHERE rs.endDate between 1273631699529 and 1274927699529
        AND rs.rentalRequest_id = rr.id
        AND rr.rental_id <= 1848978
        AND rr.rental_id = this_.id
        AND this_.DTYPE='D'
        AND this_.billingStatus = 1
    GROUP BY this_.id
    LIMIT 0, 100;
    

    If the execution plan will not start from RentalSegment you can force in with STRAIGHT_JOIN.