Search code examples
sqlpostgresqlperformanceinner-join

Position of ON and WHERE clauses and the efficiency performance


I have two tables, one called Health_User and the other called Diary. They have users' demographic information, and their recorded values respectively. What I want to do is retrieving the recorded values, but:

  1. Excluding testers (not real users) with the "is_tester" column (boolean values) in Health_User, and
  2. Excluding unreasonable values with too high or too low measurements in Diary.

So I have several queries which should get the same results:

# Query 1
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
    FROM Diary AS d
    JOIN (
        SELECT id
        FROM Health_User
        WHERE is_tester = false
    ) AS u
    ON d.user_id = u.id
    WHERE ((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
            OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'));

# Query 2
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
    FROM Diary AS d
    JOIN Health_User AS u
    ON d.user_id = u.id
    WHERE u.is_tester = false
        AND ((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
              OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'));

# Query 3
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
    FROM Health_User AS u
    JOIN (
        SELECT id, user_id, glucose_value, unit
        FROM Diary
        WHERE ((glucose_value >= 20 AND glucose_value <= 600 AND unit = 'mg/dL')
                OR (glucose_value >= 20/18.02 AND glucose_value <= 600/18.02 AND unit = 'mmol/L'))
    ) AS d
    ON d.user_id = u.id
    WHERE u.is_tester = false;

Here I have three questions:

Question 1: I would speculate that Query 1 would have better performance than Query 2, because a) it joins only one column instead of the whole table of Health_User and b) it filters out testers before joining the tables. Am I correct?

Question 2: The conditional limitation is more complex for Diary (See the last WHERE clause in Query 1). Is it better to switch Diary inside the JOIN and make Health_User outside like Query 3, or it makes no difference?

Question 3: Is there any even better solution in terms of performance?


Solution

  • There would be a difference if the database executed the queries in the order your queries suggest (first filter, then join or vice versa).

    As it is, PostgreSQL has a query optimizer that rearranges the query to find the most efficient execution order, and all your queries will end up with the same execution plan, which you can verify using the SQL statement EXPLAIN.

    For inner joins, it does not influence the result if you filter before or after the join; you could also write all the conditions into the join condition without changing the result. The optimizer knows that.

    You can speed up execution by creating appropriate indexes. It depends on the distribution of the data to know if a certain index is useful. The rule of thumb is that indexes on selective conditions (that filter out many data) are more useful. Work with EXPLAIN to find the best indexes.