I am writing a Python script that will be run regularly in a production environment where efficiency is key.
Below is an anonymized query that I have which pulls sales data for 3,000 different items.
I think I am getting slower results querying for all of them at once. When I try querying for different sizes, the amount of time it takes varies inconsistently (likely due to my internet connection). For example, sometimes querying for 1000 items 3 times is faster than all 3000 at once. However, running the same test 5 minutes later gets me different results. It is a production database where performance may be dependent on current traffic. I am not a database administrator but work in data science, using mostly similar select queries (I do the rest in Python).
Is there a best practice here? Some sort of logic that determines how many items to put in the WHERE IN clause?
date_min = pd.to_datetime('2021-11-01')
date_max = pd.to_datetime('2022-01-31')
sql = f"""
SELECT
product_code,
sales_date,
n_sold,
revenue
FROM
sales_daily
WHERE
product_code IN {tuple(item_list)}
and sales_date >= DATE('{date_min}')
and sales_date <= DATE('{date_max}')
ORDER BY
sales_date DESC, revenue
"""
df_act = pd.read_sql(sql, di.db_engine)
df_act
If your sales_date
column is indexed in the database, I think using a function in the where clause (DATE
) might cause the plan to not use that index. I believe you will have better luck if you concatenate date_min
and date_max
as strings (YYYY-MM-DD) into the SQL string and get rid of the function. Also, use BETWEEN...AND
rather than >= ... AND ... <=
.
As for IN
with 1000 items, strongly recommend you don't do that. Create a single-column temp table of those values and index the item, then join to product_code.
Generally, something like this:
DROP TABLE IF EXISTS _item_list;
CREATE TEMP TABLE _item_list
AS
SELECT item
FROM VALUES (etc) t(item);
CREATE INDEX idx_items ON _item_list (item);
SELECT
product_code,
sales_date,
n_sold,
revenue
FROM
sales_daily x
INNER JOIN _item_list y ON x.product_code = y.item
WHERE
sales_date BETWEEN '{date_min}' AND '{date_max}'
ORDER BY
sales_date DESC, revenue
As an addendum, try to have the items in the item list in the same order as the index on the product_code.