I am using tableau for a dashboard that pulls data in from a MySQL database. It works perfectly but the query takes too long to execute. The data is limited to the previous 2 years and the most recent Saturday that has passed from today.
Background info: I have sales data on a daily cadence databased in a single table per account(client). There is a field called distributor_view which can have one of the following 2 values: Manufacturing and Sourcing. An item will sometimes disappear from the manufacturing view, but will always be on the sourcing. The revenue figures between the 2 are not equal, but for data completeness- if any items are missing from the manufacturing view (preferred) then the revenue is pulled from the sourcing.
Note: If all data is queried most of it will be duplicated on an itemId level- only about 5% of items are missing from the manufacturing view and need to be referenced by the sourcing view.
The logic of the query below is such that it will pull all items that are in the manufacturing view, and UNION ALL items that exist in the sourcing view and are missing from the manufacturing view.
Query:
SELECT * FROM
(
SELECT 'Client1' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
FROM table1
WHERE
distributor_view = 'Manufacturing'
AND
YEAR(date) >= YEAR(CURDATE()) - 2
AND
date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
UNION ALL
SELECT 'Client1' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
FROM table1
WHERE
distributor_view = 'Sourcing'
AND
(date, item_id) NOT IN (
SELECT date, item_id FROM table1 WHERE distributor_view = 'Manufacturing' AND YEAR(date) >= YEAR(CURDATE()) - 2 AND date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
)
AND
YEAR(date) >= YEAR(CURDATE()) - 2
AND
date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
UNION ALL
SELECT 'Client2' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
FROM table2
WHERE
distributor_view = 'Manufacturing'
AND
YEAR(date) >= YEAR(CURDATE()) - 2
AND
date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
UNION ALL
SELECT 'Client2' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
FROM table2
WHERE
distributor_view = 'Sourcing'
AND
(date, item_id) NOT IN (
SELECT date, item_id FROM table2 WHERE distributor_view = 'Manufacturing' AND YEAR(date) >= YEAR(CURDATE()) - 2 AND date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
)
AND
YEAR(date) >= YEAR(CURDATE()) - 2
AND
date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
) AS sales
ORDER BY date DESC
In an earlier iteration of this I used Python to create a script that would delete out items from sourcing that existed in manufacturing. The query would then be much simpler- a select with no regard to distributor_view. I don't like this method because it requires a script to be ran each time the data changes(daily).
Another option I'm brainstorming is to just pull all data and attempt to reconcile the two within tableau using some calculation.
(1) This is not sargable
AND YEAR(date) >= YEAR(CURDATE()) - 2
Instead do
AND date >= CONCAT(YEAR(CURDATE(), '-01-01')
(It is kludgy, but should allow for using an index involving the date
column.)
(2) I'm not sure about this syntax
AND date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
Let's change to
AND date <= CURDATE - INTERVAL DAYOFWEEK(CURDATE()) DAY
(3) Add this to both tables:
INDEX(distributor_view, date, item_id)
(4) Two problems here, but neither can be fully optimized:
AND (date, item_id) NOT IN ( SELECT ... )
First, "NOT IN". Let's see the EXPLAIN
in order to see if the Optimizer did anything good with it.
Second, "row constructors" -- such as (date, item_id)
should be avoided. Although such 'works', it is not likely to be efficient. There have been some improvements in very recent versions. What version of MySQL or MariaDB are you using?
Change to one of these:
AND NOT EXISTS ( SELECT ... )
or
LEFT JOIN ( SELECT ... ) ... WHERE ... IS NULL
(5) get rid of the outer SELECT *
. (Be sure to add the extra parens I show here.)
( SELECT ... )
UNION ALL
( SELECT ... )
UNION ALL
( SELECT ... )
UNION ALL
( SELECT ... )
ORDER BY date DESC
Without the extra parens, the ORDER BY
will [I think] apply only to the last SELECT
. In general, when using UNION
, extra parens is safe, and may be necessary.
(6) If these suggestions cannot be applied to the framework that generated the query, you have my condolences.