Task: select sportsmen who participate in at least 2 competitions in a row (2 competitions go one after another; 1-2-3-4-5: 2&4 or 1&3&5 are not ok, 1&2 is ok, 1&2&3 is ok, 1&2 and 4&5 is ok). Question: find the best way (faster, less resources)
Working table:
There is single hold_date for each competition_id.
There is only one result for each sportsman_id per competition_id.
This works fine for 25 rows in result table:
SELECT DISTINCT sportsman_id, sportsman_name, rank, year_of_birth, personal_record, country
FROM
(
SELECT sportsman_id, hold_date,
LAG (comp_order, 1) OVER (PARTITION BY sportsman_id ORDER BY sportsman_id) prev_comp_number
, comp_order
FROM result
INNER JOIN
(
SELECT hold_date, ROW_NUMBER() OVER (ORDER BY hold_date) AS comp_order
FROM
(
SELECT DISTINCT hold_date
FROM result
)
) USING (hold_date)
ORDER BY sportsman_id, comp_order
)
INNER JOIN sportsman USING (sportsman_id)
WHERE comp_order-prev_comp_number=1
;
screenshot of code with comments:
sample data:
result of code above (=desired result)
Let's assume there are millions of rows (thousands of competitions and thousands of sportsmen). How reliable my code is?
What I think is decreasing the number of rows by excluding the rows if sportsman_id occurs just once (if sportsman took part (got result) at only 1 competition he obviously can't be the one). Something like this: (haven't implement tho (dunno how or most likely when/where))
SELECT re.hold_date, r.sportsman_id
FROM result r
INNER JOIN result re ON (re.sportsman_id=r.sportsman_id)
GROUP BY r.sportsman_id, re.hold_date
HAVING COUNT(r.sportsman_id) > 1
;
Then, I guess with LAG I only double existing column which is kinda fine?
Is there easier way with using PLSQL? Or there is a function which does some part of my code?
You can do it by only reading the table once using the Tabibitosan method to group sequential competitions together https://www.red-gate.com/simple-talk/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/#:%7E:text=The%20SQL%20of%20Gaps%20and%20Islands%20in%20Sequences,...%204%20Performance%20Comparison%20of%20Gaps%20Solutions.%20
Here you would have to use add_months because your competitions are months apart:
select sportsman_id, min(hold_date) , max(hold_date), comps_in_island
from (
select competition_id, sportsman_id, hold_date, island, count(*) over (partition by sportsman_id,island) comps_in_island
from (
select competition_id, sportsman_id, hold_date , add_months(hold_date,-1*row_number() over(partition by sportsman_id order by hold_date)) island
from result
)
)
where comps_in_island > 1
group by sportsman_id, island, comps_in_island;
DB fiddle: https://dbfiddle.uk/?rdbms=oracle_18&fiddle=1b707262722bc555ad851aee029b347a
-edit I got confused by some of the data, it looks like it's not the date that's important but the competition_id. This makes it simpler, if you have a gapless competition_id sequence (so competition 65786162213 was 65.7 billion events after 4)
select sportsman_id, min(competition_id) , max(competition_id), comps_in_island
from (
select competition_id, sportsman_id, hold_date, island, count(*) over (partition by sportsman_id,island) comps_in_island
from
select competition_id, sportsman_id, hold_date , competition_id -row_number() over(partition by sportsman_id order by competition_id)) island
from result
)
)
where comps_in_island > 1
group by sportsman_id, island, comps_in_island;
Or if you need to work out the competition numbers first you just need an additional subquery using dense_rank to rank the unique competition_ids accounting for ties :
select sportsman_id, min(competition_id) , max(competition_id), comps_in_island
from (
select competition_id, sportsman_id, hold_date, island, count(*) over (partition by sportsman_id,island) comps_in_island
from (
select competition_id, sportsman_id, hold_date , comp_number -row_number() over(partition by sportsman_id order by comp_number) island
from (
select competition_id, sportsman_id, hold_date , dense_rank() over (partition by null order by competition_id) comp_number
from result
)
)
)
where comps_in_island > 1
group by sportsman_id, island, comps_in_island;
This does assume that every possible competion_id you care about has a row in result.