Search code examples
pythonsqlalchemy

SQLAlchemy - select rows with same identity reference, but which vary be a value in different sequence periods


I have a table with the following structure (this is abbreviated):

class Valuation(Base):
    __tablename__ = 'valuation'
    id = Column(Integer, primary_key=True)
    reference = Column(BigInteger, index=True)
    value = Column(Float)
    period = Column(String)

With example data:

reference value period
2433 110 2023-a
5435 120 2023-b
5435 110 2022-a
2433 100 2022-b
5435 105 2022-c
2433 100 2021-a

Data caveats:

  • References don't always have value measurements for each distinct period-sequence (year-character). A reference may not have a value for the latest measured sequence of a period.
  • The value should decrease or remain the same over time, so the max of any period should be less than the max of preceding periods.

I would like to select every reference where the most recent period value for that reference is greater than any of the latest values for any preceding period.

In the above, that would return:

reference value period
2433 110 2023-a
5435 120 2023-b

Reviewing this, indicates that an aliased approach would help, but I'm a bit at a loss as how best to structure this.

Where I am so far:

value2022 = aliased(Valuation, name="value2022")
value2021 = aliased(Valuation, name="value2021")
query = (
    db.query(Valuation)
    .outerjoin(value2022, (
            (Valuation.reference == value2022.reference)
            & (Valuation.value > value2022.value)
            & (Valuation.period.startswith("2023"))
            & (value2022.period.startswith("2022"))
        )
    )
    .outerjoin(value2021, (
            (Valuation.reference == value2021.reference)
            & (Valuation.value > value2021.value)
            & (Valuation.period.startswith("2023"))
            & (value2021.period.startswith("2021"))
        )
    )
    .order_by(
        Valuation.reference,
        Valuation.period.desc(),
    )
    .distinct(Valuation.reference)
    .all()
)

However, this doesn't give me the comparison of the value for the latest period vs the latest value for each preceding period and seems to wildly overfit. Can this be done?


Solution

  • I haven't used sqlalchemy for a long time, but maybe this will be useful.

    from sqlalchemy import create_engine, Column, Integer, BigInteger, Float, String, text
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker
    
    import pandas as pd
    
    engine = create_engine('sqlite://')
    Base = declarative_base(engine)
    Session = sessionmaker(bind=engine)
    session = Session()
    
    
    class Valuation(Base):
        __tablename__ = 'valuation'
        id = Column(Integer, primary_key=True)
        reference = Column(BigInteger, index=True)
        value = Column(Float)
        period = Column(String)
    
    
    Base.metadata.create_all(engine)
    for valuation in [
        Valuation(reference=2433, value=110, period='2023-a'),
        Valuation(reference=5435, value=120, period='2023-b'),
        Valuation(reference=5435, value=110, period='2022-a'),
        Valuation(reference=2433, value=100, period='2022-b'),
        Valuation(reference=5435, value=105, period='2022-c'),
        Valuation(reference=2433, value=100, period='2021-a'),
    ]:
        session.add(valuation)
    
    session.commit()
    
    with engine.connect() as con:
        cursor = con.execute(text("""
            -- max by reference
            WITH r AS (
                SELECT reference,
                       max(value) AS value
                  FROM valuation
                 GROUP BY reference
            )
            
            SELECT r.*, p.period
              FROM r
              JOIN (
                  SELECT reference, 
                         period,
                         value
                    FROM valuation
              ) AS p ON (p.reference = r.reference AND p.value = r.value)
        """))
    
        print('result using sql:')
        print(cursor.all())
        # also you can use pandas
    
        df = pd.read_sql_query("""
            SELECT reference, 
                   period,
                   max(value) AS value
              FROM valuation
            GROUP BY reference, period
            ORDER BY value DESC
        """, con=con.connection)
    
        print('\ndataframe from database\n')
        print(df)
        print('\ndataframe after deduplication\n')
        print(df.drop_duplicates(['reference']))
    

    Let's run:

    result using sql:
    [(2433, 110.0, '2023-a'), (5435, 120.0, '2023-b')]
    
    dataframe from database
    
       reference  period  value
    0       5435  2023-b  120.0
    1       2433  2023-a  110.0
    2       5435  2022-a  110.0
    3       5435  2022-c  105.0
    4       2433  2021-a  100.0
    5       2433  2022-b  100.0
    
    dataframe after deduplication
    
       reference  period  value
    0       5435  2023-b  120.0
    1       2433  2023-a  110.0
    
    Process finished with exit code 0