Search code examples
sqlgoogle-bigquery

Filtering for unique pairs in a SQL join- simplify a working query


I have data for event registrations where users have an account on the website (user_id), submit a form for registration (nametag), and pay via stripe (order_id).

Each registration must have a unique payment associated with it. Sometimes one person will log in and register themselves and their friends, which trips up the current model.

The below query results in the correct answer. Is there a simpler way to do it?

-- registration form table
with forms as (
  select
    1 as user_id,
    'jeff' as nametag
  union all
  select
    1 as user_id,
    'john' as nametag
  union all
  select
    1 as user_id,
    'jim' as nametag
),

-- paid orders table
orders as (
  select
    1 as user_id,
    'abc' as order_id
  union all
  select
    1 as user_id,
    'def' as order_id
  union all
  select
    1 as user_id,
    'zyx' as order_id
),

-- my query
window1 as (
  select
      forms.*,
      orders.order_id,
      row_number() over (partition by order_id order by orders.order_id, forms.nametag) as row_num1
  from forms
  inner join orders
    on forms.user_id = orders.user_id
)

select * except (row_num1, row_num2)
from (
  select
      *,
      row_number() over (partition by row_num1 order by order_id) as row_num2
  from window1
)
where row_num1 = row_num2

output:

user_id nametag order_id
1 jeff abc
1 jim def
1 john xyz

Solution

  • It would be good for design to have some column to order by like transaction timestamp, and respectively user registration timestamp.

    Assuming it's not present the following should simplify your query:

    select
        vw_forms.user_id,
        vw_forms.nametag,
        vw_orders.order_id
    from 
        (select *, row_number() over (partition by user_id) as r_forms from forms) vw_forms
    inner join 
        (select *, row_number() over (partition by user_id) as r_orders from orders) vw_orders
    on 
        vw_forms.user_id = vw_orders.user_id and vw_forms.r_forms = vw_orders.r_orders
    

    I skip ordering, because from your description I cannot really discern why would you do it.