Search code examples
pythonsqldatabasepostgresqldatabase-administration

Postres/Python - is it better to use one large query, or several smaller ones? How do you decide number of items to include in 'WHERE IN' clause?


I am writing a Python script that will be run regularly in a production environment where efficiency is key.

Below is an anonymized query that I have which pulls sales data for 3,000 different items.

I think I am getting slower results querying for all of them at once. When I try querying for different sizes, the amount of time it takes varies inconsistently (likely due to my internet connection). For example, sometimes querying for 1000 items 3 times is faster than all 3000 at once. However, running the same test 5 minutes later gets me different results. It is a production database where performance may be dependent on current traffic. I am not a database administrator but work in data science, using mostly similar select queries (I do the rest in Python).

Is there a best practice here? Some sort of logic that determines how many items to put in the WHERE IN clause?

date_min = pd.to_datetime('2021-11-01')
date_max = pd.to_datetime('2022-01-31')
sql = f"""
    SELECT
        product_code,
        sales_date,
        n_sold,
        revenue
    FROM
        sales_daily
    WHERE
        product_code IN {tuple(item_list)}
        and sales_date >= DATE('{date_min}')
        and sales_date <= DATE('{date_max}')
    ORDER BY
        sales_date DESC, revenue
    """
df_act = pd.read_sql(sql, di.db_engine)
df_act

Solution

  • If your sales_date column is indexed in the database, I think using a function in the where clause (DATE) might cause the plan to not use that index. I believe you will have better luck if you concatenate date_min and date_max as strings (YYYY-MM-DD) into the SQL string and get rid of the function. Also, use BETWEEN...AND rather than >= ... AND ... <=.

    As for IN with 1000 items, strongly recommend you don't do that. Create a single-column temp table of those values and index the item, then join to product_code.

    Generally, something like this:

    DROP TABLE IF EXISTS _item_list;
    CREATE TEMP TABLE _item_list
    AS
    SELECT item
    FROM VALUES (etc) t(item);
    
    CREATE INDEX idx_items ON _item_list (item);
    
        SELECT
            product_code,
            sales_date,
            n_sold,
            revenue
        FROM
            sales_daily x
            INNER JOIN _item_list y ON x.product_code = y.item
        WHERE
            sales_date BETWEEN '{date_min}' AND '{date_max}'
        ORDER BY
            sales_date DESC, revenue
    

    As an addendum, try to have the items in the item list in the same order as the index on the product_code.