Search code examples
sqlsqlitejoingroup-bywindow-functions

How to pull a list of all visitor_ids that generated more than $500 combined in their first two sessions in the month of January 2020?


Tables:

  1. Sessions

    • session_ts
    • visitor_id
    • vertical
    • session_id
  2. Transactions

    • session_ts
    • session_id
    • rev_bucket
    • revenue

Currently have the following query (using SQLite):

SELECT
  s.visitor_id,
  sub.session_id,
  month,
  year,
  total_rev,
  CASE 
    WHEN (row_num IN (1,2) >= total_rev >= 500) THEN 'Yes'
    ELSE 'No' END AS High_Value_Transactions,
  sub.row_num
FROM
  sessions s
JOIN
  (
    SELECT
    s.visitor_id,
    t.session_id,
    strftime('%m',t.session_ts) as month,
    strftime('%Y',t.session_ts) as year,
    SUM(t.revenue) as total_rev,
    row_number() OVER(PARTITION BY s.visitor_id ORDER BY s.session_ts) as row_num
  FROM
    Transactions t
  JOIN
    sessions s
  ON
    s.session_id = t.session_id
  WHERE strftime('%m',t.session_ts) = '01'
  AND strftime('%Y',t.session_ts) = '2020'
  GROUP BY 1,2
) sub
ON
  s.session_id = sub.session_id
WHERE sub.row_num IN (1,2)
ORDER BY 1

I'm having trouble identifying the first two sessions that combine for $500. Open to any feedback and simplifying of query. Thanks!


Solution

  • You can use window functions and aggregation:

    select visitor_id, sum(t.revenue) total_revenue
    from (
        select 
            s.visitor_id, 
            t.revenue, 
            row_number() over(partition by s.visitor_id order by t.session_ts) rn
        from transactions t
        inner join sessions s on s.session_id = t.session_id
        where t.session_ts >= '2020-01-01' and t.session_ts < '2020-02-01'
    ) t
    where rn <= 2
    group by visitor_id
    having sum(t.revenue) >= 500
    

    The subquery joins the two tables, filters on the target month (note that using half-open interval predicates is more efficient than applying date functions on the date column), and ranks each row within groups of visits of the same customer.

    Then, the outer query filters on the first two visits per visitor, aggregates by visitor, computes the corresponding revenue, and filters it with a having clause.