SQLAlchemy - select rows with same identity reference, but which vary be a value in different sequence periods

I have a table with the following structure (this is abbreviated):

class Valuation(Base):
    __tablename__ = 'valuation'
    id = Column(Integer, primary_key=True)
    reference = Column(BigInteger, index=True)
    value = Column(Float)
    period = Column(String)

With example data:

reference	value	period
2433	110	2023-a
5435	120	2023-b
5435	110	2022-a
2433	100	2022-b
5435	105	2022-c
2433	100	2021-a

Data caveats:

References don't always have value measurements for each distinct period-sequence (year-character). A reference may not have a value for the latest measured sequence of a period.
The value should decrease or remain the same over time, so the max of any period should be less than the max of preceding periods.

I would like to select every reference where the most recent period value for that reference is greater than any of the latest values for any preceding period.

In the above, that would return:

reference	value	period
2433	110	2023-a
5435	120	2023-b

Reviewing this, indicates that an aliased approach would help, but I'm a bit at a loss as how best to structure this.

Where I am so far:

value2022 = aliased(Valuation, name="value2022")
value2021 = aliased(Valuation, name="value2021")
query = (
    db.query(Valuation)
    .outerjoin(value2022, (
            (Valuation.reference == value2022.reference)
            & (Valuation.value > value2022.value)
            & (Valuation.period.startswith("2023"))
            & (value2022.period.startswith("2022"))
        )
    )
    .outerjoin(value2021, (
            (Valuation.reference == value2021.reference)
            & (Valuation.value > value2021.value)
            & (Valuation.period.startswith("2023"))
            & (value2021.period.startswith("2021"))
        )
    )
    .order_by(
        Valuation.reference,
        Valuation.period.desc(),
    )
    .distinct(Valuation.reference)
    .all()
)

However, this doesn't give me the comparison of the value for the latest period vs the latest value for each preceding period and seems to wildly overfit. Can this be done?

Solution

I haven't used sqlalchemy for a long time, but maybe this will be useful.

from sqlalchemy import create_engine, Column, Integer, BigInteger, Float, String, text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

import pandas as pd

engine = create_engine('sqlite://')
Base = declarative_base(engine)
Session = sessionmaker(bind=engine)
session = Session()


class Valuation(Base):
    __tablename__ = 'valuation'
    id = Column(Integer, primary_key=True)
    reference = Column(BigInteger, index=True)
    value = Column(Float)
    period = Column(String)


Base.metadata.create_all(engine)
for valuation in [
    Valuation(reference=2433, value=110, period='2023-a'),
    Valuation(reference=5435, value=120, period='2023-b'),
    Valuation(reference=5435, value=110, period='2022-a'),
    Valuation(reference=2433, value=100, period='2022-b'),
    Valuation(reference=5435, value=105, period='2022-c'),
    Valuation(reference=2433, value=100, period='2021-a'),
]:
    session.add(valuation)

session.commit()

with engine.connect() as con:
    cursor = con.execute(text("""
        -- max by reference
        WITH r AS (
            SELECT reference,
                   max(value) AS value
              FROM valuation
             GROUP BY reference
        )
        
        SELECT r.*, p.period
          FROM r
          JOIN (
              SELECT reference, 
                     period,
                     value
                FROM valuation
          ) AS p ON (p.reference = r.reference AND p.value = r.value)
    """))

    print('result using sql:')
    print(cursor.all())
    # also you can use pandas

    df = pd.read_sql_query("""
        SELECT reference, 
               period,
               max(value) AS value
          FROM valuation
        GROUP BY reference, period
        ORDER BY value DESC
    """, con=con.connection)

    print('\ndataframe from database\n')
    print(df)
    print('\ndataframe after deduplication\n')
    print(df.drop_duplicates(['reference']))

Let's run:

result using sql:
[(2433, 110.0, '2023-a'), (5435, 120.0, '2023-b')]

dataframe from database

   reference  period  value
0       5435  2023-b  120.0
1       2433  2023-a  110.0
2       5435  2022-a  110.0
3       5435  2022-c  105.0
4       2433  2021-a  100.0
5       2433  2022-b  100.0

dataframe after deduplication

   reference  period  value
0       5435  2023-b  120.0
1       2433  2023-a  110.0

Process finished with exit code 0