I have a table with the following structure (this is abbreviated):
class Valuation(Base):
__tablename__ = 'valuation'
id = Column(Integer, primary_key=True)
reference = Column(BigInteger, index=True)
value = Column(Float)
period = Column(String)
With example data:
reference | value | period |
---|---|---|
2433 | 110 | 2023-a |
5435 | 120 | 2023-b |
5435 | 110 | 2022-a |
2433 | 100 | 2022-b |
5435 | 105 | 2022-c |
2433 | 100 | 2021-a |
Data caveats:
value
should decrease or remain the same over time, so the max of any period should be less than the max of preceding periods.I would like to select every reference where the most recent period value for that reference is greater than any of the latest values for any preceding period.
In the above, that would return:
reference | value | period |
---|---|---|
2433 | 110 | 2023-a |
5435 | 120 | 2023-b |
Reviewing this, indicates that an aliased
approach would help, but I'm a bit at a loss as how best to structure this.
Where I am so far:
value2022 = aliased(Valuation, name="value2022")
value2021 = aliased(Valuation, name="value2021")
query = (
db.query(Valuation)
.outerjoin(value2022, (
(Valuation.reference == value2022.reference)
& (Valuation.value > value2022.value)
& (Valuation.period.startswith("2023"))
& (value2022.period.startswith("2022"))
)
)
.outerjoin(value2021, (
(Valuation.reference == value2021.reference)
& (Valuation.value > value2021.value)
& (Valuation.period.startswith("2023"))
& (value2021.period.startswith("2021"))
)
)
.order_by(
Valuation.reference,
Valuation.period.desc(),
)
.distinct(Valuation.reference)
.all()
)
However, this doesn't give me the comparison of the value for the latest period vs the latest value for each preceding period and seems to wildly overfit. Can this be done?
I haven't used sqlalchemy
for a long time, but maybe this will be useful.
from sqlalchemy import create_engine, Column, Integer, BigInteger, Float, String, text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
import pandas as pd
engine = create_engine('sqlite://')
Base = declarative_base(engine)
Session = sessionmaker(bind=engine)
session = Session()
class Valuation(Base):
__tablename__ = 'valuation'
id = Column(Integer, primary_key=True)
reference = Column(BigInteger, index=True)
value = Column(Float)
period = Column(String)
Base.metadata.create_all(engine)
for valuation in [
Valuation(reference=2433, value=110, period='2023-a'),
Valuation(reference=5435, value=120, period='2023-b'),
Valuation(reference=5435, value=110, period='2022-a'),
Valuation(reference=2433, value=100, period='2022-b'),
Valuation(reference=5435, value=105, period='2022-c'),
Valuation(reference=2433, value=100, period='2021-a'),
]:
session.add(valuation)
session.commit()
with engine.connect() as con:
cursor = con.execute(text("""
-- max by reference
WITH r AS (
SELECT reference,
max(value) AS value
FROM valuation
GROUP BY reference
)
SELECT r.*, p.period
FROM r
JOIN (
SELECT reference,
period,
value
FROM valuation
) AS p ON (p.reference = r.reference AND p.value = r.value)
"""))
print('result using sql:')
print(cursor.all())
# also you can use pandas
df = pd.read_sql_query("""
SELECT reference,
period,
max(value) AS value
FROM valuation
GROUP BY reference, period
ORDER BY value DESC
""", con=con.connection)
print('\ndataframe from database\n')
print(df)
print('\ndataframe after deduplication\n')
print(df.drop_duplicates(['reference']))
Let's run:
result using sql:
[(2433, 110.0, '2023-a'), (5435, 120.0, '2023-b')]
dataframe from database
reference period value
0 5435 2023-b 120.0
1 2433 2023-a 110.0
2 5435 2022-a 110.0
3 5435 2022-c 105.0
4 2433 2021-a 100.0
5 2433 2022-b 100.0
dataframe after deduplication
reference period value
0 5435 2023-b 120.0
1 2433 2023-a 110.0
Process finished with exit code 0