Search code examples
sqloraclesubquerywindow-functionsgaps-and-islands

Oracle SQL or PLSQL. Select rows by partitions which values have specific order


Task: select sportsmen who participate in at least 2 competitions in a row (2 competitions go one after another; 1-2-3-4-5: 2&4 or 1&3&5 are not ok, 1&2 is ok, 1&2&3 is ok, 1&2 and 4&5 is ok). Question: find the best way (faster, less resources)

Working table:

enter image description here

There is single hold_date for each competition_id.

There is only one result for each sportsman_id per competition_id.

This works fine for 25 rows in result table:

SELECT DISTINCT sportsman_id, sportsman_name, rank, year_of_birth, personal_record, country
FROM
    (
    SELECT sportsman_id, hold_date,
        LAG (comp_order, 1) OVER (PARTITION BY sportsman_id ORDER BY sportsman_id) prev_comp_number
        , comp_order
    FROM result
    INNER JOIN
        (
        SELECT hold_date, ROW_NUMBER() OVER (ORDER BY hold_date) AS comp_order
        FROM
            (
            SELECT DISTINCT hold_date
            FROM result
            )
        ) USING (hold_date)
    ORDER BY sportsman_id, comp_order
    )
INNER JOIN sportsman USING (sportsman_id)
WHERE comp_order-prev_comp_number=1
;

screenshot of code with comments:

screenshot of code with comments

sample data:

sample data

result of code above (=desired result)

result of code above (=desired result)

Let's assume there are millions of rows (thousands of competitions and thousands of sportsmen). How reliable my code is?

What I think is decreasing the number of rows by excluding the rows if sportsman_id occurs just once (if sportsman took part (got result) at only 1 competition he obviously can't be the one). Something like this: (haven't implement tho (dunno how or most likely when/where))

SELECT re.hold_date, r.sportsman_id
FROM result r
INNER JOIN result re ON (re.sportsman_id=r.sportsman_id)
GROUP BY r.sportsman_id, re.hold_date
HAVING COUNT(r.sportsman_id) > 1
;

Then, I guess with LAG I only double existing column which is kinda fine?

Is there easier way with using PLSQL? Or there is a function which does some part of my code?


Solution

  • You can do it by only reading the table once using the Tabibitosan method to group sequential competitions together https://www.red-gate.com/simple-talk/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/#:%7E:text=The%20SQL%20of%20Gaps%20and%20Islands%20in%20Sequences,...%204%20Performance%20Comparison%20of%20Gaps%20Solutions.%20

    Here you would have to use add_months because your competitions are months apart:

    select sportsman_id, min(hold_date) , max(hold_date), comps_in_island
    from (
     select  competition_id, sportsman_id, hold_date, island, count(*) over (partition by sportsman_id,island) comps_in_island
     from (
      select  competition_id, sportsman_id, hold_date , add_months(hold_date,-1*row_number() over(partition by sportsman_id order by hold_date)) island
      from    result
     )
    )
    where comps_in_island > 1
    group by sportsman_id, island, comps_in_island;
    

    DB fiddle: https://dbfiddle.uk/?rdbms=oracle_18&fiddle=1b707262722bc555ad851aee029b347a

    -edit I got confused by some of the data, it looks like it's not the date that's important but the competition_id. This makes it simpler, if you have a gapless competition_id sequence (so competition 65786162213 was 65.7 billion events after 4)

    select sportsman_id, min(competition_id) , max(competition_id), comps_in_island
    from (
     select  competition_id, sportsman_id, hold_date, island, count(*) over (partition by sportsman_id,island) comps_in_island
     from 
      select  competition_id, sportsman_id, hold_date , competition_id -row_number() over(partition by sportsman_id order by competition_id)) island
      from    result
     )
    )
    where comps_in_island > 1
    group by sportsman_id, island, comps_in_island;
    

    Or if you need to work out the competition numbers first you just need an additional subquery using dense_rank to rank the unique competition_ids accounting for ties :

    select sportsman_id, min(competition_id) , max(competition_id), comps_in_island
    from (
     select  competition_id, sportsman_id, hold_date, island, count(*) over (partition by sportsman_id,island) comps_in_island
     from (
      select  competition_id, sportsman_id, hold_date , comp_number -row_number() over(partition by sportsman_id order by comp_number) island
      from (  
       select  competition_id, sportsman_id, hold_date , dense_rank() over (partition by null order by competition_id) comp_number
       from    result
      )
     )
    )
    where comps_in_island > 1
    group by sportsman_id, island, comps_in_island;
    

    This does assume that every possible competion_id you care about has a row in result.