Search code examples
sqlgreatest-n-per-group

How to select most recent record for each ID


I am going to financial data of peers of a company. I have 10 peers for a particular company and the financial data is captured at regular intervals (monthly, quarterly, etc). However since the data capturing does not happen for all together I end up having different most recent update date.

What I want to do is to select most recent row for each peer company ID in a way that I end up having only 11 rows in my table i.e. (1 for my company and 10 peers)

Below is the code that I am running as of now

select * from Financials_table

where PRD_END_DT = (select max(PRD_END_DT) from Financials_table ) -- Selecting the latest period end date
''')
peers_df.createOrReplaceTempView('peers_df')
print(shape('peers_df'))
head('peers_df', 50)

Note that I have a list of peers stored in peers_list and I'd like to get the most recent PRD_END_DT for each of the peers. Now what I am running returns the most recent PRD_END_DT value but not all peers have data as on that date.


Solution

  • There are several ways to get the most recent row per company ID. You haven't tagged your request with your DBMS, so some methods may work for you while others may not yet be supported by your DBMS. Here are some options:

    Get the maximum prd_end_dt per company_id. Then select the matching rows:

    select *
    from table 
    where (company_id, prd_end_dt) in
    (
      select company_id, max(prd_end_dt)
      from financials_table
      group by company_id
    )
    order by company_id;
    

    Select the rows for which no newer prd_end_dt exists for the company_id:

    select *
    from financials_table ft
    where not exists
    (
      select null
      from financials_table newer
      where newer.company_id = ft.company_id
      and newer.prd_end_dt > ft.prd_end_dt
    )
    order by company_id;
    

    Get the maximum prd_end_dt on-the-fly. Then compare the dates:

    select *
    from
    (
      select ft.*, max(prd_end_dt) over (partition by company_id) as max_prd_end_dt
      from financials_table ft
      group by company_id
    ) with_max_prd_end_dt
    where prd_end_dt = max_prd_end_dt
    order by company_id;
    

    Rank each company's rows per date and only keep the newest:

    select *
    from financials_table
    order by rank() over (partition by company_id order by prd_end_dt desc)
    fetch first row with ties;