Search code examples
sqlsqlitegroup-bycumulative-sum

SQL: Select Running Total For Each Category In Transactional Table, Sorted By Date


I have a transactional table providing: an alphanumeric PK, a timestamp, a user_id, a in/out string column, and an amount column.

id          time                  user_id   io    amount
38hw        2019-10-18 18:35:09   2         in     1
nv49        2019-10-18 18:35:10   3         in    50
83ha        2019-10-18 18:35:11   5         in     2
ja03        2019-10-18 18:35:12   4         out    2
019c        2019-10-18 18:35:13   1         out   75
ac5r        2019-10-18 18:35:14   3         in    20
as30        2019-10-18 18:35:15   3         in     3
34ds        2019-10-18 18:35:16   4         in     7
12my        2019-10-18 18:35:17   2         in    50
dk20        2019-10-18 18:35:18   4         in    50
sk18        2019-10-18 18:35:19   1         in     7
am35        2019-10-18 18:35:20   2         in     3
mc92        2019-10-18 18:35:21   2         out    8
alov        2019-10-18 18:35:22   3         in     4
ap34        2019-10-18 18:35:23   1         out    6

I am trying to create another column that provides the running total for that user_id each time it shows up. These user_id's do not initially show up with a 0 amount, so that must be assumed the first time they show up.

I've considered its likely possible to do this with some helper columns. My thought process is something like this:

  • Create a column to indicate how many times the user_id shows up prior to the value in time column. Maybe called occurence_num column
  • Create a column that makes amount easier to work with, like (case when io='in' then amount else -1*amount end) as balance_adjust
  • Group by user_id and (on each row) sum() all the balance_adjust values where occurence_num is less than the current records.

I'm having a hard time testing these ideas out though. I'm working in a fairly large database, SQLite with 22 million rows. The table can be altered/updated as needed. It was stored this way in favor of keeping the ETL as simple as possible because there was a lot of data to pull and a lot of pages to pull from. My desired output would look something like this:

id          time                  user_id   io    amount    running_total
38hw        2019-10-18 18:35:09   2         in     1         1
nv49        2019-10-18 18:35:10   3         in    50        50
83ha        2019-10-18 18:35:11   5         in     2         2
ja03        2019-10-18 18:35:12   4         out    2        -2
019c        2019-10-18 18:35:13   1         out   75       -75
ac5r        2019-10-18 18:35:14   3         in    20        70
as30        2019-10-18 18:35:15   3         in     3        73
34ds        2019-10-18 18:35:16   4         in     7         5
12my        2019-10-18 18:35:17   2         in    50        51
dk20        2019-10-18 18:35:18   4         in    50        55
sk18        2019-10-18 18:35:19   1         in     7       -68
am35        2019-10-18 18:35:20   2         in     3        54
mc92        2019-10-18 18:35:21   2         out    8        46
alov        2019-10-18 18:35:22   3         in     4        77
ap34        2019-10-18 18:35:23   1         out    6       -74

I can get the overall total of each user this way, but it takes a few minutes:

SELECT 
    user_id,
    sum(case when io='in' then amount else -1*amount end) as balance
FROM 
    transactions 
GROUP BY
    user_id

I think expanding on this, a OVER/PARTITION clause will be a good call, but I'm not sure if it's the right call given the size of this database.

Thanks for the help.

Edit: I should mention, the real data may include duplicates in the time column. Transactions could have occurred at the same time, as it's only granular to the second.


Solution

  • A small tweak to your attempt should do it. It's sufficient to turn your sum into a "running sum" using the corresponding window function, that will compute the running amount by partitioning on user and ordering on time.

    And if you have tied times, you can rely on ordering by id, which will break the tie and make the sum work correctly.

    SELECT *,
           SUM(CASE WHEN io = 'in' 
                    THEN amount 
                    ELSE -amount 
               END) OVER(PARTITION BY user_id ORDER BY time, id) as balance
    FROM transactions 
    ORDER BY time
    

    Output:

    id time user_id io amount balance
    38hw 2019-10-18 18:35:09 2 in 1 1
    nv49 2019-10-18 18:35:10 3 in 50 50
    83ha 2019-10-18 18:35:11 5 in 2 2
    ja03 2019-10-18 18:35:12 4 out 2 -2
    019c 2019-10-18 18:35:13 1 out 75 -75
    ac5r 2019-10-18 18:35:14 3 in 20 70
    as30 2019-10-18 18:35:15 3 in 3 73
    34ds 2019-10-18 18:35:16 4 in 7 5
    12my 2019-10-18 18:35:17 2 in 50 51
    dk20 2019-10-18 18:35:18 4 in 50 55
    sk18 2019-10-18 18:35:19 1 in 7 -68
    am35 2019-10-18 18:35:20 2 in 3 54
    mc92 2019-10-18 18:35:21 2 out 8 46
    alov 2019-10-18 18:35:22 3 in 4 77
    ap34 2019-10-18 18:35:23 1 out 6 -74

    Check the demo here.

    Note: The last ORDER BY clause is not needed: it's just for visualization purposes.