Search code examples
phpmysqlquery-optimization

Retail inventory Mysql query optimization


Given the following tables for a retail administration system:

STORES: store_id, name

PRODUCTS: product_id, name, cost

PRODUCT_ENTRIES: key, store_id, date

PRODUCT_ENTRIES_CONTENT: product_entries_key, product_id, quantity

PRODUCT_EXITS: key, store_id, product_id, quantity, status, date

SALES: key, store_id, date

SALES_CONTENT: sales_key, product_id, quantity

RETURNS: key, store_id, date

RETURNS_CONTENT: returns_key, product_id, quantity

In order to calculate stock values I run through the contents of the products table and for each product_id:

  • Sum quantities of product_entries_content as well as returns_content
  • Subtract quantities of product_exits_content (where status = 2 or 3) as well as sales_content

To calculate the cost of the inventory of each store, I'm running the following query through a PHP loop for each distinct store and outputting the result:

SELECT

    SUM((((

    (SELECT COALESCE(SUM(product_entries_content.quantity), 0)

    FROM product_entries

    INNER JOIN product_entries_content ON 
product_entries_content.product_entries_key = product_entries.key

    WHERE product_entries_content.product_id = products.id 
    AND product_entries.store_id = '.$row['id'].'   
    AND DATE(product_entries.date) <= DATE(NOW()))


    -

    (SELECT COALESCE(SUM(quantity), 0) 

    FROM sales_content

    INNER JOIN sales ON sales.key  = sales_content.sales_key

    WHERE product_id = products.product_id AND sales.store_id = '.$row['id'].'
    AND DATE(sales_content.date) <= DATE(NOW()))

    +

    (SELECT COALESCE(SUM(quantity), 0) 

    FROM returns_content

    INNER JOIN returns  ON returns.key = returns_content.returns_key

    WHERE product_id = products.product_id AND returns.store_id = '.$row['id'].'
    AND DATE(returns.date) <= DATE(NOW()))

    -

    (SELECT COALESCE(SUM(quantity), 0) 

    FROM product_exits

    WHERE product_id = products.product_id AND (status = 2 OR status = 3) 
AND product_exits.store_id = '.$row['id'].' #store_id
    AND DATE(product_exits.date) <= DATE(NOW()))     

    ) * products.cost) / 100) ) AS "'.$row['key'].'" #store_name

FROM products WHERE 1

All foreign keys and indexes are properly set. The problem is because of the large amount of stores and movements in each store the query is becoming increasingly heavy, and because inventory is calculated from the beginning of each store's history it only gets slower with time.

What could I do to optimize this scheme?


Solution

  • Ideally, SHOW CREATE TABLE tablename for each table would really help a lot in any optimization question. The data type of each column is EXTREMELY important to performance.

    That said, from the information you've given the following should be helpful, assuming the column data types are all appropriate.

    Add the following indexes, if they do not exist. IMPORTANT: Single column indexes are NOT valid replacements for the following composite indexes. You stated that

    All foreign keys and indexes are properly set.

    but that tells us nothing about what they are, and if they are "proper" for optimization.

    New indexes

    ALTER TABLE sales
    CREATE INDEX `aaaa` (`store_id`,`key`)
    
    ALTER TABLE sales_content
    CREATE INDEX `bbbb` (`product_id`,`sales_key`,`date`,`quantity`)
    
    ALTER TABLE returns
    CREATE INDEX `cccc` (`store_id`,`date`,`sales_key`)
    
    ALTER TABLE returns_content
    CREATE INDEX `dddd` (`product_id`,`returns_key`,`quantity`)
    
    ALTER TABLE product_exits
    CREATE INDEX `eeee` (`product_id`,`status`,`store_id`,`date`,`quantity`)
    
    ALTER TABLE product_entries
    CREATE INDEX `ffff` (`store_id`,`date`,`key`)
    
    ALTER TABLE product_entries_content
    CREATE INDEX `gggg` (`product_id`,`product_entries_key`,`quantity`)
    

    (Use more appropriate names than aaaa. I just used those to save time.)

    Each of the above indexes will allow the database to read only one row for each table. Most performance issues involving joins comes from what is known as a double lookup.

    Understanding indexes and double lookups

    An index is just a copy of the table data. Each column listed in the index is copied from the table, in the order listed in the index, and then the primary key is appended to that row in the index. When the database uses an index to look up a value, if not all the information is contained in the index, the primary key will be used to access the clustered index of the table to obtain the rest of the information. This is what a double look up is, and it is VERY bad for performance.

    Example

    All the above indexes are designed to avoid double lookups. Let's look at the second subquery to see how the indexes related to that query will work.

    ALTER TABLE sales
    CREATE INDEX `aaaa` (`store_id`,`key`)
    
    ALTER TABLE sales_content
    CREATE INDEX `bbbb` (`product_id`,`sales_key`,`date`,`quantity`)
    

    Subquery (I added aliases and adjusted how the date column is accessed, but otherwise it is unchanged):

    SELECT COALESCE(SUM(sc.quantity), 0) 
    FROM sales_content sc
    INNER JOIN sales s 
    ON s.key  = sc.sales_key
    WHERE sc.product_id = p.product_id 
    AND s.store_id = '.$row['id'].'
    AND sc.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
    

    Using the aaaa index, the database will be able to look up only those rows in the sales table that match the store_id, since that is listed first in the index. Think of this in the same way as a phone book, where store_id is the last name, and key is the first name. If you have the last name, then it is EXTREMELY easy to flip to that point of the phone book, and quickly get all the first names that go with that last name. Likewise, the database is able to very quickly "flip" to the part of the index that contains the given store_id value, and find all the key values. In this case, we do not need the primary key at all (which would be the phone number, in the phone book example.)

    So, done with the sales table, and we have all the key values we need from there.

    Next, the database moves onto the bbbb index. We already have product_id from the main query, and we have the sales_key from the aaaa index. That is like having both first and last name in the phone book. The only thing left to compare is the date, which could be like the address in a phone book. The database will store all the dates in order, and so by giving it a cutoff value, it can just look at all the dates up to a certain point.

    The last part of the bbbb index is the quantity, which is there so that the database can quickly sum up all those quantities. To see why this is fast, consider again the phone book. Imagine in addition to last name, first name, and address information, that there is also a quantity column (of something, it doesn't matter what). If you wanted the sum of the quantities for a specific last name, first name, and for all addresses that start with the number 5 or less, that is easy, isn't it? Just find the first one, and add them up in order until you reach the first address that starts with a number greater than 5. The database benefits the same way when using the date column in this way (date is like the address column, in this example.)

    The date columns

    Finally, I noted earlier, I changed how the date column was accessed. You never want to run a function on a database column that you are comparing to another value. The reason is this: What would happen if you had to convert all the addresses into roman numerals, before you did any comparison? You wouldn't be able to just go down the list like we did earlier. You'd have to convert ALL the values, and THEN check each one to make sure it was within the limit, since we no longer know if the values are sorted correctly to just be able to do the "read them all and then stop at a certain value" shortcut I described above.

    You and I may know that converting a datetime value to a date isn't going to change the order, but the database will not know (it might be possible it optimizes this conversion, but that's not something I want to assume.) So, keep the columns pure. The change I made was to just take the NOW() date, and add one day, and then make it a < instead of a <=. After all, comparing two values and saying the date must be equal to or less than today's date is equivalent to saying the datetime must be less than tomorrow's date.

    The query

    Below is my final query for you. As stated, not much has changed other than the date change and aliases. However, you had a typo in the first subquery where you accessed products.id. I corrected the id to be product_id, given that that matches what you stated were the columns for the products table.

    SELECT
    SUM(
    (
    (
    (
        (
        SELECT COALESCE(SUM(pec.quantity), 0)
        FROM product_entries pe
        INNER JOIN product_entries_content pec 
        ON pec.product_entries_key = pe.key
        WHERE pec.product_id = p.product_id 
        AND pe.store_id = '.$row['id'].' 
        AND pe.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
        )
        -
        (
        SELECT COALESCE(SUM(sc.quantity), 0) 
        FROM sales_content sc
        INNER JOIN sales s 
        ON s.key  = sc.sales_key
        WHERE sc.product_id = p.product_id 
        AND s.store_id = '.$row['id'].'
        AND sc.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
        )
        +
        (
        SELECT COALESCE(SUM(rc.quantity), 0)
        FROM returns_content rc
        INNER JOIN returns r 
        ON r.key = rc.returns_key
        WHERE rc.product_id = p.product_id 
        AND r.store_id = '.$row['id'].'
        AND r.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
        )
        -
        (
        SELECT COALESCE(SUM(pex.quantity), 0)
        FROM product_exits pex
        WHERE pex.product_id = p.product_id 
        AND (pex.status = 2 OR pex.status = 3)
        AND pex.store_id = '.$row['id'].' #store_id
        AND pex.date < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
        )
    ) 
    * p.cost) 
    / 100)
    ) AS "'.$row['key'].'" #store_name
    FROM products p WHERE 1
    

    You may be able to further optimize this by splitting the subquery on the product_exits table into 2 separate sub queries, rather than using a OR, which many times will perform poorly. Ultimately, you'll have to benchmark that to see how well the database optimizes the OR on its own.