I need to generate statistics for different types of disabilities and count how many women, who started suffering from each condition in 2018, were affected during each week of January (first month of the year). The conditions I need to track are:
The data is stored in a table, where I need to:
Retrieve the data for each condition. Break the data down by week (1st, 2nd, 3rd, and 4th week) for the first month of 2018. Filter the data for women, based on the fact that the second and third digits of the birth number are greater than or equal to 51.
SELECT
disability_name,
SUM(CASE WHEN TO_CHAR(start_date, 'IW') = '01' THEN 1 ELSE 0 END) AS first_week,
SUM(CASE WHEN TO_CHAR(start_date, 'IW') = '02' THEN 1 ELSE 0 END) AS second_week,
SUM(CASE WHEN TO_CHAR(start_date, 'IW') = '03' THEN 1 ELSE 0 END) AS third_week,
SUM(CASE WHEN TO_CHAR(start_date, 'IW') = '04' THEN 1 ELSE 0 END) AS fourth_week
FROM p_disabilities
JOIN p_disability_type
ON p_disabilities.id_disability = p_disability_type.disability_id
WHERE TO_CHAR(start_date, 'YYYY') = '2018'
AND TO_CHAR(start_date, 'MM') = '01'
AND TO_NUMBER(SUBSTR(birth_number, 3, 2)) >= 51
GROUP BY disability_name;
I am trying to simplify this SQL query to make it more efficient. Specifically, I want to improve the performance of the query since it's working with potentially large datasets. What are some ways I can simplify or optimize this query?
Additionally, what would be the most suitable index to use on the tables involved, particularly on start_date, birth_number, and disability_id, to ensure this query runs as quickly as possible?
Script for tables :
CREATE TABLE p_disabilities (
id_disability CHAR(6) NOT NULL PRIMARY KEY,
birth_number CHAR(11) NOT NULL,
start_date DATE NOT NULL,
end_date DATE NOT NULL,
disability_id NUMBER NOT NULL );
CREATE TABLE p_disability_type (
disability_id NUMBER NOT NULL,
disability_name VARCHAR2(50) );
Try to avoid applying functions on columns in the WHERE
clause. When working with date/time ranges we usually use an inclusive lower and an exclusive upper limit. As to the birth number substring: there is no way to avoid using SUBSTR
of course. (But second/third digit should be SUBSTR(birth_number, 2, 2)
if I am not mistaken.)
The query would then look like this:
SELECT
dt.disability_name,
SUM(CASE WHEN TO_CHAR(d.start_date, 'IW') = '01' THEN 1 ELSE 0 END) AS first_week,
SUM(CASE WHEN TO_CHAR(d.start_date, 'IW') = '02' THEN 1 ELSE 0 END) AS second_week,
SUM(CASE WHEN TO_CHAR(d.start_date, 'IW') = '03' THEN 1 ELSE 0 END) AS third_week,
SUM(CASE WHEN TO_CHAR(d.start_date, 'IW') = '04' THEN 1 ELSE 0 END) AS fourth_week
FROM p_disabilities d
JOIN p_disability_type dt ON dt.disability_id = d.disability_id
WHERE d.start_date >= DATE '2018-01-01'
AND d.start_date < DATE '2018-02-01'
AND TO_NUMBER(SUBSTR(d.birth_number, 2, 2)) >= 51
GROUP BY dt.disability_name
ORDER BY dt.disability_name;
The most selective column in the WHERE
clause is the start_date, where you pick a single month from the whole table. The birth_number substring is less selective, because you accept half of the possible values. So, an appropriate index would look like this:
CREATE INDEX idx ON p_disabilities (start_date, TO_NUMBER(SUBSTR(d.birth_number, 2, 2)));
This is a function index including a calculated value that we index. An alternative would be a computed column:
ALTER TABLE p_disabilities ADD sex AS (CASE WHEN TO_NUMBER(SUBSTR(birth_number, 2, 2)) >= 51 THEN 'f' ELSE 'm' END);
Then index that and use it in the query, which I find much more readable.
You can include columns at the end of the index that you want to use in the query, starting with the id_disability for the join and TO_CHAR(d.start_date, 'IW')
for the conditional aggregation. The index may then look like this:
CREATE INDEX idx ON p_disabilities (start_date, sex, disability_id, TO_CHAR(start_date, 'IW'));
A note on the dates/weeks: While using the ISO week is a great approach for many scenarios, I don't see it appropriate here, because the first ISO week can start in the previous year, and by looking at January only, you may end up with a four-day week. Consider changing this somehow (e.g. by looking at January 1-7, 8-14, ... instead).