Search code examples
pythonflasksqlalchemyflask-sqlalchemy

SQLAlchemy querying child records based on date range


I am working on a Flask+SQLAlchemy project that tracks API Testing results.

I've set up a 3 table, 1:M:M relationship in sql alchemy to track the test results for API tests. I am tracking the overall API Endpoint, the functionality if that API (TestCase) and the individual TestResult's.

The models are set up like so:

class Endpoint(Base):
    __tablename__ = "endpoint"
    # Primary Key
    id: Mapped[int] = mapped_column(primary_key=True)
    # Data
    api_route: Mapped[str] = mapped_column(String(300), unique=True, nullable=False)
    # Relationships
    ## Links TestCase.endpoint
    test_cases: Mapped[List['TestCase']] = relationship(back_populates="endpoint")

    @property
    def serialize(self):
        return {
            'id': self.id,
            'api_route': self.api_route,
            'test_cases': [i.serialize for i in self.test_cases],
            'health': self.health
        }
class TestCase(Base):
    __tablename__ = "test_case"
    # Primary Key
    id: Mapped[int] = mapped_column(primary_key=True)
    # Foreign Keys
    endpoint_fk: Mapped[int] = mapped_column(ForeignKey("endpoint.id"))
    # Relationships
    ## Links Endpoint.test_cases
    endpoint: Mapped['Endpoint'] = relationship(back_populates="test_cases")
    ## Links TestResult.test_case
    results: Mapped[List['TestResult']] = relationship(back_populates="test_case")
    # Data
    test_case: Mapped[str] = mapped_column(String(50))

    @property
    def serialize(self):
        return {
            'id': self.id,
            'endpoint_fk': self.endpoint_fk,
            'test_case': self.test_case,
            'results': [i.serialize for i in self.results],
            'health': self.health,
        }
class TestResult(Base):
    __tablename__ = "testresult"
    # Primary Key
    id: Mapped[int] = mapped_column(primary_key=True)
    # Foreign Keys
    test_case_fk: Mapped[int] = mapped_column(ForeignKey("test_case.id"))
    # Data
    result: Mapped[int] = mapped_column(Integer())
    timestamp: Mapped[DateTime] = mapped_column(DateTime(timezone=True), server_default=func.now())
    error: Mapped[str] = mapped_column(String(3000))
    # Relationships
    # Links TestCase.results
    test_case: Mapped['TestCase'] = relationship(back_populates="results")

    @property
    def serialize(self):
        return {
            'id': self.id,
            'test_case_fk': self.test_case_fk,
            'result': self.result,
            'timestamp': self.timestamp,
            'error': self.error
        }

I am now working on a dashboard that shows the last 24 hours of results, and it seems rather straightforward to just filter my query based on a start and end date. Im querying the tables like so to return the Endpoint objects that i then serialize to get all the child data via the defined relationships. I'd prefer to have my Endpoint objects returned since there's some additional properties on there that are calculated based on the child data without having to deal with aggregates in the query.

query = db.session.query(Endpoint) \
    .join(TestCase, Endpoint.id == TestCase.endpoint_fk) \
    .join(TestResult, TestCase.id == TestResult.test_case_fk) \
    .filter(TestResult.timestamp.between(one_day_ago, dt_now)) \
    .order_by(Endpoint.id)
logger.info(query)

res = [r.serialize for r in query.all()]

When executed, it's returning all of the data for all test results, instead of those just limited to being between one_day_ago and dt_now.

Honestly im a bit lost as to why im getting all this extra data, with TestResult records belonging outside the specified date range.

I tried just doing some raw SQL and something like this work and properly gives back the proper amount of rows.

SELECT *
FROM endpoint JOIN test_case ON endpoint.id = test_case.endpoint_fk JOIN testresult ON test_case.id = testresult.test_case_fk 
WHERE testresult.timestamp BETWEEN '2023-08-28T00:00:00.112993-04:00' AND '2023-08-29T23:59:59.112993-04:00'
ORDER BY endpoint.id

I turned on the database echoing and it looks like the lazy loading is doing what it's supposed to yet the original WHERE clause isn't being preserved when it's getting the child records.

2023-08-29 13:27:26,992 INFO sqlalchemy.engine.Engine
SELECT endpoint.id AS endpoint_id, endpoint.api_route AS endpoint_api_route 
FROM endpoint JOIN test_case ON endpoint.id = test_case.endpoint_fk JOIN testresult ON test_case.id = testresult.test_case_fk 
WHERE testresult.timestamp BETWEEN %(timestamp_1)s AND %(timestamp_2)s ORDER BY endpoint.id
2023-08-29 13:27:26,992 INFO sqlalchemy.engine.Engine [generated in 0.00015s] {'timestamp_1': datetime.datetime(2023, 8, 28, 13, 27, 26, 978617), 'timestamp_2': datetime.datetime(2023, 8, 29, 13, 27, 26, 978617)}

2023-08-29 13:27:26,998 INFO sqlalchemy.engine.Engine
SELECT test_case.endpoint_fk AS test_case_endpoint_fk, test_case.id AS test_case_id, test_case.test_case AS test_case_test_case 
FROM test_case 
WHERE test_case.endpoint_fk IN (%(primary_keys_1)s, %(primary_keys_2)s)
2023-08-29 13:27:26,998 INFO sqlalchemy.engine.Engine [generated in 0.00028s] {'primary_keys_1': 1, 'primary_keys_2': 2}

2023-08-29 13:27:27,000 INFO sqlalchemy.engine.Engine
SELECT testresult.test_case_fk AS testresult_test_case_fk, testresult.id AS testresult_id, testresult.result AS testresult_result, testresult.timestamp AS testresult_timestamp, testresult.error AS testresult_error 
FROM testresult 
WHERE testresult.test_case_fk IN (%(primary_keys_1)s, %(primary_keys_2)s, %(primary_keys_3)s, %(primary_keys_4)s, %(primary_keys_5)s, %(primary_keys_6)s, %(primary_keys_7)s, %(primary_keys_8)s, %(primary_keys_9)s, %(primary_keys_10)s, %(primary_keys_11)s, %(primary_keys_12)s, %(primary_keys_13)s, %(primary_keys_14)s, %(primary_keys_15)s, %(primary_keys_16)s, %(primary_keys_17)s, %(primary_keys_18)s, %(primary_keys_19)s, %(primary_keys_20)s, %(primary_keys_21)s, %(primary_keys_22)s)
2023-08-29 13:27:27,000 INFO sqlalchemy.engine.Engine [generated in 0.00017s] {'primary_keys_1': 1, 'primary_keys_2': 2, 'primary_keys_3': 3, 'primary_keys_4': 4, 'primary_keys_5': 5, 'primary_keys_6': 6, 'primary_keys_7': 7, 'primary_keys_8': 8, 'primary_keys_9': 9, 'primary_keys_10': 10, 'primary_keys_11': 11, 'primary_keys_12': 12, 'primary_keys_13': 13, 'primary_keys_14': 14, 'primary_keys_15': 15, 'primary_keys_16': 16, 'primary_keys_17': 17, 'primary_keys_18': 18, 'primary_keys_19': 19, 'primary_keys_20': 20, 'primary_keys_21': 21, 'primary_keys_22': 22}

Any ideas/suggestions/help for how i can limit the returned data and not get all these extra records? I had thought of maybe doing the query with a subquery to get TestResults within the date range then joining to that result set, but haven't figured out the syntax for that yet.


Solution

  • As i somewhat suspected the lazy loading was the culprit here for me and once the Endpoint records were selected, it was just getting all child records that matched that key, even if i had specified the filter/where clause.

    After a lot more reading of the docs, i found the necessary things to get the lazy loading to adhere to the conditions i wanted.

    Docs: https://docs.sqlalchemy.org/en/20/orm/queryguide/api.html#sqlalchemy.orm.with_loader_criteria

    The end result to query and get all the data via the relationships

    query = db.select(Endpoint) \
            .options(
                with_loader_criteria(TestResult, TestResult.timestamp.between(the_day_before, start_date))
            ).order_by(Endpoint.id)