sql datetime apache-spark-sql greatest-n-per-group min

Trying to fetch the first record from a group in SQL

I'm trying to query the earliest timestamp for a candidate, for a project, for a contract on spark SQL.

spark.sql(
      """
        |SELECT
        | DISTICT
        | timestamp,
        | candidate_id,
        | project_id,
        | contract_id
        |FROM candidatesHistory
        |GROUP BY timestamp, candidate_id, project_id, contract_id
        |ORDER BY timestamp DESC
        |LIMIT 1
        |""".stripMargin)

This code doesn't do it, it just fetches one record - how do I get the oldest timestamp for a candidate for a project for a contract?

Any help appreciated

Solution

If you only have 4 columns in the table, then you can use aggregation:

select candidate_id, project_id, contract_id, min(timestamp) first_timestamp
from candidateshistory
group by candidate_id, project_id, contract_id

If there are more columns and you want to bring all of them, then you can use row_number() to filter the table:

select ch.*
from (
    select ch.*,
        row_number() over(partition by candidate_id, project_id, contract_id order by timestamp) rn
    from candidateshistory ch
) ch
where rn = 1

For each (candidate_id, project_id, contract_id) tuple, this gives you the row with the earliest timestamp.