Search code examples
mysqlsql-serverjoinfilteroperator-precedence

Filtering rows efficiency in MySQL


I'm learning SQL coming from a scientific programming background (Matlab, Python). Order of operations greatly determine computational efficiency in those languages and I'm wondering if SQL is the same. I'd like to develop good programming strategies early, while learning SQL, rather than have to improve on them later. My question concerns efficiency.

Scenario: I'm trying to filter out rows from one table (job_log_test) using information from a column in another table (report_data). The two tables have identical index correspondence, each row pertains to an individual job. I've worked out two--probably non-optimal--solutions to this problem:

Solution 1: First join together all rows from relevant columns from the two tables, then filter rows based on values in the bill_customer column that originated from the report_data table.

SELECT xxx.job_number, xxx.customer_name 
FROM (
  SELECT job_number, customer_name, bill_customer 
      FROM job_log_test 
      INNER JOIN report_data 
      ON job_log_test.job_log_test_id = report_data.report_data_id
    )xxx 
WHERE bill_customer IS TRUE;

Solution 2: From report_data, run a sub-query to produce an list of indices of the rows to be extracted. Then, using the list, extract the desired rows from job_log_test.

SELECT job_number, customer_name 
FROM job_log_test 
WHERE job_log_test_id 
IN (
  SELECT report_data_id 
    FROM report_data 
    WHERE bill_customer IS TRUE
);

Which of these solutions, if any, does the experienced SQL programmer recommend?

Many thanks for your time!


Solution

  • Back in the early days when RDBMSes had rather simplistic implementations, you could actually optimize things by hand this way, all the time. The query with IN( would probably perform worse.

    Nowadays, you might still be able to optimize things like that, but on rare occasions only, because RDBMSes have become quite sophisticated, and they contain Query Optimizers (wikipedia) which analyse your query and restate it in a more optimal form, taking into account even things such as the number of rows in the various tables that you are referencing.

    So, your best bet is to state your query in its most simple and straightforward form:

      SELECT job_number, customer_name, bill_customer 
          FROM job_log_test 
          INNER JOIN report_data 
          ON job_log_test.job_log_test_id = report_data.report_data_id
          WHERE bill_customer IS TRUE;
    

    and let the query optimizer worry about the rest.

    Also, the existence of the query optimizers means that you can never be sure whether your attempts to hand-optimize a query will yield better or worse results.

    Sure, in certain cases your hand-optimized query might be better than what the query optimizer might come up with. In other cases, your hand-optimized query might be stated in such a way which prevents the query optimizer from improving it, so you might get worse performance than if you had used the simplest form. And finally, your hand-optimized query might confuse the query optimizer even more, thus yielding even worse performance.

    So, if you think you can hand-optimize your query, go ahead and give it a try, but always compare the performance of your hand-optimized query versus the unoptimized one, and you will see that in most (granted, not all) cases it is fruitless.