Search code examples
sqlsql-servergaps-and-islands

SQL window functions - how to assign a trip number


I have a transactions table with columns user_id, transaction_timestamp, transaction_currency. Assume that home currency is 'ABC' and hence domestic transactions are all denoted by transaction_currency = 'ABC'. I want to define another column to the table denoting the trip number of each use based on the progression of transaction_currency. Given the ordered (by timestamp) list of transaction for a specific user, a new trip starts at each transaction where transaction_currency is not 'ABC' and the immediately previous transaction had transaction_currency = 'ABC'. Once a trip starts, all transactions are part of the same trip number until the immediately next transaction has transaction_currency = 'ABC'. Again, the very next time another currency appears (other than 'ABC'), a new trip number is assigned to all transactions until it is ended by another transaction having transaction_currency = 'ABC'. For all transactions with transaction_currency = 'ABC', trip number can be defaulted to 0. For others, trip number starts from 1 and increments by 1 for every new trip.

Here is the expected result for a sample dataset

USER TXN_TIMESTAMP TRANSACTION_CURRENCY TRIP_NUMBER
1 1 ABC 0
1 2 AED 1
1 3 AED 1
1 4 AED 1
1 5 ABC 0
1 6 ABC 0
1 7 AED 2
1 8 AED 2
1 9 AED 2
1 10 AED 2
1 11 ABC 0
1 12 ABC 0
1 13 ABC 0
1 14 THB 3
1 15 THB 3
1 16 THB 3
1 17 USD 3 (trip num should not change here)
1 18 USD 3
1 19 ABC 0
1 20 AED 5

I know this can be done by window functions, but I am a beginner.

I tried the following

WITH TripStarts AS (
  SELECT
    user_id,
    transaction_timestamp,
    transaction_currency,
    LAG(transaction_currency, 1, 'ABC') OVER (PARTITION BY user_id ORDER BY transaction_timestamp) AS prev_currency,
  FROM
    transactions
)

SELECT
  *, 
  CASE
    WHEN transaction_currency = 'ABC' THEN 0
    WHEN transaction_currency != 'ABC' AND prev_currency = 'ABC' THEN ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY transaction_timestamp)
  END AS trip_number
FROM TripStarts

But the trip_number keeps getting incremented for each transaction in the same trip.

Awesome if this can be done by commonly used window functions.


Solution

  • This is a gaps problem, i usually do a two step counter + sum solution for that:

    select *
    ,   case when transaction_currency = 'abc' then 0 else  SUM(flag) over(partition by [user] order by txn_timestamp rows between unbounded preceding and current row) end as trip_number_new
    from (
        select *, case when lag(transaction_currency) over(partition by [user] order by txn_timestamp) = 'ABC' and transaction_currency <> 'abc' then 1 else 0 end as flag
        from (
            VALUES  (1, 1, N'ABC', N'0')
            ,   (1, 2, N'AED', N'1')
            ,   (1, 3, N'AED', N'1')
            ,   (1, 4, N'AED', N'1')
            ,   (1, 5, N'ABC', N'0')
            ,   (1, 6, N'ABC', N'0')
            ,   (1, 7, N'AED', N'2')
            ,   (1, 8, N'AED', N'2')
            ,   (1, 9, N'AED', N'2')
            ,   (1, 10, N'AED', N'2')
            ,   (1, 11, N'ABC', N'0')
            ,   (1, 12, N'ABC', N'0')
            ,   (1, 13, N'ABC', N'0')
            ,   (1, 14, N'THB', N'3')
            ,   (1, 15, N'THB', N'3')
            ,   (1, 16, N'THB', N'3')
            ,   (1, 17, N'USD', N'3 (trip num should not change here)')
            ,   (1, 18, N'USD', N'3')
            ,   (1, 19, N'ABC', N'0')
            ,   (1, 20, N'AED', N'5')
        ) t ([USER],TXN_TIMESTAMP,TRANSACTION_CURRENCY,TRIP_NUMBER)
        ) x
    

    Step 1:

    Create a flag that changes when you want to get a new trip number. In this case, we want it to happen when previous trip was ABC and current trip isn't ABC: case when lag(transaction_currency) over(partition by [user] order by txn_timestamp) = 'ABC' and transaction_currency <> 'abc' then 1 else 0 end

    Step 2:

    Summarize the flags over ordered by timestamp to get the trip number. For ABC, we set trip number to zero:

    case when transaction_currency = 'abc' then 0 else SUM(flag) over(partition by [user] order by txn_timestamp rows between unbounded preceding and current row) end

    Output:

    USER TXN_TIMESTAMP TRANSACTION_CURRENCY TRIP_NUMBER flag trip_number_new
    1 1 ABC 0 0 0
    1 2 AED 1 1 1
    1 3 AED 1 0 1
    1 4 AED 1 0 1
    1 5 ABC 0 0 0
    1 6 ABC 0 0 0
    1 7 AED 2 1 2
    1 8 AED 2 0 2
    1 9 AED 2 0 2
    1 10 AED 2 0 2
    1 11 ABC 0 0 0
    1 12 ABC 0 0 0
    1 13 ABC 0 0 0
    1 14 THB 3 1 3
    1 15 THB 3 0 3
    1 16 THB 3 0 3
    1 17 USD 3 (trip num should not change here) 0 3
    1 18 USD 3 0 3
    1 19 ABC 0 0 0
    1 20 AED 5 1 4

    Note, that output doesn't match on the last trip, but i think that's a mistake in your data