Given the following tables for a retail administration system:
STORES: store_id, name
PRODUCTS: product_id, name, cost
PRODUCT_ENTRIES: key, store_id, date
PRODUCT_ENTRIES_CONTENT: product_entries_key, product_id, quantity
PRODUCT_EXITS: key, store_id, product_id, quantity, status, date
SALES: key, store_id, date
SALES_CONTENT: sales_key, product_id, quantity
RETURNS: key, store_id, date
RETURNS_CONTENT: returns_key, product_id, quantity
In order to calculate stock values I run through the contents of the products table and for each product_id:
To calculate the cost of the inventory of each store, I'm running the following query through a PHP loop for each distinct store and outputting the result:
SELECT
SUM((((
(SELECT COALESCE(SUM(product_entries_content.quantity), 0)
FROM product_entries
INNER JOIN product_entries_content ON
product_entries_content.product_entries_key = product_entries.key
WHERE product_entries_content.product_id = products.id
AND product_entries.store_id = '.$row['id'].'
AND DATE(product_entries.date) <= DATE(NOW()))
-
(SELECT COALESCE(SUM(quantity), 0)
FROM sales_content
INNER JOIN sales ON sales.key = sales_content.sales_key
WHERE product_id = products.product_id AND sales.store_id = '.$row['id'].'
AND DATE(sales_content.date) <= DATE(NOW()))
+
(SELECT COALESCE(SUM(quantity), 0)
FROM returns_content
INNER JOIN returns ON returns.key = returns_content.returns_key
WHERE product_id = products.product_id AND returns.store_id = '.$row['id'].'
AND DATE(returns.date) <= DATE(NOW()))
-
(SELECT COALESCE(SUM(quantity), 0)
FROM product_exits
WHERE product_id = products.product_id AND (status = 2 OR status = 3)
AND product_exits.store_id = '.$row['id'].' #store_id
AND DATE(product_exits.date) <= DATE(NOW()))
) * products.cost) / 100) ) AS "'.$row['key'].'" #store_name
FROM products WHERE 1
All foreign keys and indexes are properly set. The problem is because of the large amount of stores and movements in each store the query is becoming increasingly heavy, and because inventory is calculated from the beginning of each store's history it only gets slower with time.
What could I do to optimize this scheme?
Ideally, SHOW CREATE TABLE tablename
for each table would really help a lot in any optimization question. The data type of each column is EXTREMELY important to performance.
That said, from the information you've given the following should be helpful, assuming the column data types are all appropriate.
Add the following indexes, if they do not exist. IMPORTANT: Single column indexes are NOT valid replacements for the following composite indexes. You stated that
All foreign keys and indexes are properly set.
but that tells us nothing about what they are, and if they are "proper" for optimization.
New indexes
ALTER TABLE sales
CREATE INDEX `aaaa` (`store_id`,`key`)
ALTER TABLE sales_content
CREATE INDEX `bbbb` (`product_id`,`sales_key`,`date`,`quantity`)
ALTER TABLE returns
CREATE INDEX `cccc` (`store_id`,`date`,`sales_key`)
ALTER TABLE returns_content
CREATE INDEX `dddd` (`product_id`,`returns_key`,`quantity`)
ALTER TABLE product_exits
CREATE INDEX `eeee` (`product_id`,`status`,`store_id`,`date`,`quantity`)
ALTER TABLE product_entries
CREATE INDEX `ffff` (`store_id`,`date`,`key`)
ALTER TABLE product_entries_content
CREATE INDEX `gggg` (`product_id`,`product_entries_key`,`quantity`)
(Use more appropriate names than aaaa
. I just used those to save time.)
Each of the above indexes will allow the database to read only one row for each table. Most performance issues involving joins comes from what is known as a double lookup.
Understanding indexes and double lookups
An index is just a copy of the table data. Each column listed in the index is copied from the table, in the order listed in the index, and then the primary key is appended to that row in the index. When the database uses an index to look up a value, if not all the information is contained in the index, the primary key will be used to access the clustered index of the table to obtain the rest of the information. This is what a double look up is, and it is VERY bad for performance.
Example
All the above indexes are designed to avoid double lookups. Let's look at the second subquery to see how the indexes related to that query will work.
ALTER TABLE sales
CREATE INDEX `aaaa` (`store_id`,`key`)
ALTER TABLE sales_content
CREATE INDEX `bbbb` (`product_id`,`sales_key`,`date`,`quantity`)
Subquery (I added aliases and adjusted how the date column is accessed, but otherwise it is unchanged):
SELECT COALESCE(SUM(sc.quantity), 0)
FROM sales_content sc
INNER JOIN sales s
ON s.key = sc.sales_key
WHERE sc.product_id = p.product_id
AND s.store_id = '.$row['id'].'
AND sc.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
Using the aaaa
index, the database will be able to look up only those rows in the sales
table that match the store_id
, since that is listed first in the index. Think of this in the same way as a phone book, where store_id
is the last name, and key
is the first name. If you have the last name, then it is EXTREMELY easy to flip to that point of the phone book, and quickly get all the first names that go with that last name. Likewise, the database is able to very quickly "flip" to the part of the index that contains the given store_id
value, and find all the key
values. In this case, we do not need the primary key at all (which would be the phone number, in the phone book example.)
So, done with the sales
table, and we have all the key
values we need from there.
Next, the database moves onto the bbbb
index. We already have product_id
from the main query, and we have the sales_key
from the aaaa
index. That is like having both first and last name in the phone book. The only thing left to compare is the date, which could be like the address in a phone book. The database will store all the dates in order, and so by giving it a cutoff value, it can just look at all the dates up to a certain point.
The last part of the bbbb
index is the quantity, which is there so that the database can quickly sum up all those quantities. To see why this is fast, consider again the phone book. Imagine in addition to last name, first name, and address information, that there is also a quantity column (of something, it doesn't matter what). If you wanted the sum of the quantities for a specific last name, first name, and for all addresses that start with the number 5 or less, that is easy, isn't it? Just find the first one, and add them up in order until you reach the first address that starts with a number greater than 5. The database benefits the same way when using the date column in this way (date is like the address column, in this example.)
The date columns
Finally, I noted earlier, I changed how the date column was accessed. You never want to run a function on a database column that you are comparing to another value. The reason is this: What would happen if you had to convert all the addresses into roman numerals, before you did any comparison? You wouldn't be able to just go down the list like we did earlier. You'd have to convert ALL the values, and THEN check each one to make sure it was within the limit, since we no longer know if the values are sorted correctly to just be able to do the "read them all and then stop at a certain value" shortcut I described above.
You and I may know that converting a datetime value to a date isn't going to change the order, but the database will not know (it might be possible it optimizes this conversion, but that's not something I want to assume.) So, keep the columns pure. The change I made was to just take the NOW() date, and add one day, and then make it a <
instead of a <=
. After all, comparing two values and saying the date must be equal to or less than today's date is equivalent to saying the datetime must be less than tomorrow's date.
The query
Below is my final query for you. As stated, not much has changed other than the date change and aliases. However, you had a typo in the first subquery where you accessed products.id
. I corrected the id
to be product_id
, given that that matches what you stated were the columns for the products
table.
SELECT
SUM(
(
(
(
(
SELECT COALESCE(SUM(pec.quantity), 0)
FROM product_entries pe
INNER JOIN product_entries_content pec
ON pec.product_entries_key = pe.key
WHERE pec.product_id = p.product_id
AND pe.store_id = '.$row['id'].'
AND pe.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
)
-
(
SELECT COALESCE(SUM(sc.quantity), 0)
FROM sales_content sc
INNER JOIN sales s
ON s.key = sc.sales_key
WHERE sc.product_id = p.product_id
AND s.store_id = '.$row['id'].'
AND sc.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
)
+
(
SELECT COALESCE(SUM(rc.quantity), 0)
FROM returns_content rc
INNER JOIN returns r
ON r.key = rc.returns_key
WHERE rc.product_id = p.product_id
AND r.store_id = '.$row['id'].'
AND r.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
)
-
(
SELECT COALESCE(SUM(pex.quantity), 0)
FROM product_exits pex
WHERE pex.product_id = p.product_id
AND (pex.status = 2 OR pex.status = 3)
AND pex.store_id = '.$row['id'].' #store_id
AND pex.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
)
)
* p.cost)
/ 100)
) AS "'.$row['key'].'" #store_name
FROM products p WHERE 1
You may be able to further optimize this by splitting the subquery on the product_exits
table into 2 separate sub queries, rather than using a OR
, which many times will perform poorly. Ultimately, you'll have to benchmark that to see how well the database optimizes the OR
on its own.