I'm trying to query the earliest timestamp for a candidate, for a project, for a contract on spark SQL.
spark.sql(
"""
|SELECT
| DISTICT
| timestamp,
| candidate_id,
| project_id,
| contract_id
|FROM candidatesHistory
|GROUP BY timestamp, candidate_id, project_id, contract_id
|ORDER BY timestamp DESC
|LIMIT 1
|""".stripMargin)
This code doesn't do it, it just fetches one record - how do I get the oldest timestamp for a candidate for a project for a contract?
Any help appreciated
If you only have 4 columns in the table, then you can use aggregation:
select candidate_id, project_id, contract_id, min(timestamp) first_timestamp
from candidateshistory
group by candidate_id, project_id, contract_id
If there are more columns and you want to bring all of them, then you can use row_number()
to filter the table:
select ch.*
from (
select ch.*,
row_number() over(partition by candidate_id, project_id, contract_id order by timestamp) rn
from candidateshistory ch
) ch
where rn = 1
For each (candidate_id, project_id, contract_id)
tuple, this gives you the row with the earliest timestamp
.