Search code examples
pythonmysqlsqlalchemy

Slow sqlalchemy ".in_" clause


I'm trying to execute the following query using sqlalchemy:

query = session.query(
                func.count(ExampleTable.id)
            ).filter(ExampleTable.environment == self.environment))
if ids_list:
    query = query.filter(ExampleTable.id.in_(ids_list))

I noticed it was slow, so i did a simple timing "benchmark":

start = perf_counter()
query.first()
print(format(perf_counter() - start, '.3f'))  # .3f to limit decimal cases
  • It does not have to be precise, it is just for debugging purposes.

The average result of this perf_counter is always close to 2.7 seconds.

I also thought it could be the ExampleTable.environment == self.environment, since it is a string comparison, but the average result was around 0.03s, executing the same "benchmark" above with only the environment comparison.

  • ExampleTable has 150k ~ 200k total entries.
  • ids_list has 22k integer values.

With all that in mind, i did the same queries on MySQL Workbench (using the __ str __ method from Query object):

SELECT 
    COUNT(example_table.id) AS count_1
FROM
    example_table
WHERE
    example_table.environment = 'production'
        AND example_table.id IN (775008, 790152, 793346, ...) -- 22k ids

The first execution, which is usually slower, took 0.110 sec / 0.000 sec (duration / fetch)

The next executions took around 0.030 ~ 0.040 / 0.000 (duration / fetch)

SELECT
    COUNT(example_table.id) AS count_1 
FROM
    example_table 
WHERE
    example_table.environment = 'production';

First execution: 0.016 sec / 0.000 sec (duration / fetch)

Next executions: 0.000 / 0.000 (duration / fetch)

Since i'm running the same query on sqlalchemy and workbench, the duration should be at least close, right?

The difference between 2.7s~ to 0.100s~ is a LOT.

  1. Am i doing something wrong here? On python or mysql
  2. If the question lacks information, please let me know.

Edit1: The echo=True shows not exactly what i'd expect:

This is the first log shown about the query i'm debugging:

2024-07-15 12:13:58,594 INFO sqlalchemy.engine.Engine SELECT count(example_table.id) AS count_1 
FROM example_table
WHERE example_table.environment = %(environment_1)s AND  example_table.id IN (%(id_1_1)s, %(id_1_2)s, %(id_1_3)s, ..., (id_1_21800)s, %(id_1_21801)s) 
 LIMIT %(param_1)s

All these references seem odd %(id_1_1)s, %(id_1_2)s, ...

And aparently the second one defines the references:

2024-07-15 12:46:10,584 INFO sqlalchemy.engine.Engine [generated in 0.01642s] {'id_1': 1, 'environment_1': 'production', 'param_1': 1, 'id_1_1': 335469, 'id_1_2': 412304, 'id_1_3': 468745, ..., 'id_1_21800': 365095, 'id_1_21801': 731841}

Could be this slowness from these defined variables? Its one per id existing in .in_ clause



Solution

  • The problem was not the .in_ itself, but with the engine url i was using:

    url_obj = URL.create('mysql+mysqlconnector', username=username, password=password, host=host, database=database)
    

    Just changed it to mysql instead of mysql+mysqlconnector

    url_obj = URL.create('mysql', username=username, password=password, host=host, database=database)
    

    This made the query time drop from ~2.7s to ~0.09s