I have a transactional table providing: an alphanumeric PK, a timestamp, a user_id, a in/out string column, and an amount column.
id time user_id io amount
38hw 2019-10-18 18:35:09 2 in 1
nv49 2019-10-18 18:35:10 3 in 50
83ha 2019-10-18 18:35:11 5 in 2
ja03 2019-10-18 18:35:12 4 out 2
019c 2019-10-18 18:35:13 1 out 75
ac5r 2019-10-18 18:35:14 3 in 20
as30 2019-10-18 18:35:15 3 in 3
34ds 2019-10-18 18:35:16 4 in 7
12my 2019-10-18 18:35:17 2 in 50
dk20 2019-10-18 18:35:18 4 in 50
sk18 2019-10-18 18:35:19 1 in 7
am35 2019-10-18 18:35:20 2 in 3
mc92 2019-10-18 18:35:21 2 out 8
alov 2019-10-18 18:35:22 3 in 4
ap34 2019-10-18 18:35:23 1 out 6
I am trying to create another column that provides the running total for that user_id each time it shows up. These user_id's do not initially show up with a 0 amount, so that must be assumed the first time they show up.
I've considered its likely possible to do this with some helper columns. My thought process is something like this:
time
column. Maybe called occurence_num
column(case when io='in' then amount else -1*amount end) as balance_adjust
user_id
and (on each row) sum()
all the balance_adjust
values where occurence_num
is less than the current records.I'm having a hard time testing these ideas out though. I'm working in a fairly large database, SQLite with 22 million rows. The table can be altered/updated as needed. It was stored this way in favor of keeping the ETL as simple as possible because there was a lot of data to pull and a lot of pages to pull from. My desired output would look something like this:
id time user_id io amount running_total
38hw 2019-10-18 18:35:09 2 in 1 1
nv49 2019-10-18 18:35:10 3 in 50 50
83ha 2019-10-18 18:35:11 5 in 2 2
ja03 2019-10-18 18:35:12 4 out 2 -2
019c 2019-10-18 18:35:13 1 out 75 -75
ac5r 2019-10-18 18:35:14 3 in 20 70
as30 2019-10-18 18:35:15 3 in 3 73
34ds 2019-10-18 18:35:16 4 in 7 5
12my 2019-10-18 18:35:17 2 in 50 51
dk20 2019-10-18 18:35:18 4 in 50 55
sk18 2019-10-18 18:35:19 1 in 7 -68
am35 2019-10-18 18:35:20 2 in 3 54
mc92 2019-10-18 18:35:21 2 out 8 46
alov 2019-10-18 18:35:22 3 in 4 77
ap34 2019-10-18 18:35:23 1 out 6 -74
I can get the overall total of each user this way, but it takes a few minutes:
SELECT
user_id,
sum(case when io='in' then amount else -1*amount end) as balance
FROM
transactions
GROUP BY
user_id
I think expanding on this, a OVER
/PARTITION
clause will be a good call, but I'm not sure if it's the right call given the size of this database.
Thanks for the help.
Edit: I should mention, the real data may include duplicates in the time column. Transactions could have occurred at the same time, as it's only granular to the second.
A small tweak to your attempt should do it. It's sufficient to turn your sum into a "running sum" using the corresponding window function, that will compute the running amount by partitioning on user and ordering on time.
And if you have tied times, you can rely on ordering by id, which will break the tie and make the sum work correctly.
SELECT *,
SUM(CASE WHEN io = 'in'
THEN amount
ELSE -amount
END) OVER(PARTITION BY user_id ORDER BY time, id) as balance
FROM transactions
ORDER BY time
Output:
id | time | user_id | io | amount | balance |
---|---|---|---|---|---|
38hw | 2019-10-18 18:35:09 | 2 | in | 1 | 1 |
nv49 | 2019-10-18 18:35:10 | 3 | in | 50 | 50 |
83ha | 2019-10-18 18:35:11 | 5 | in | 2 | 2 |
ja03 | 2019-10-18 18:35:12 | 4 | out | 2 | -2 |
019c | 2019-10-18 18:35:13 | 1 | out | 75 | -75 |
ac5r | 2019-10-18 18:35:14 | 3 | in | 20 | 70 |
as30 | 2019-10-18 18:35:15 | 3 | in | 3 | 73 |
34ds | 2019-10-18 18:35:16 | 4 | in | 7 | 5 |
12my | 2019-10-18 18:35:17 | 2 | in | 50 | 51 |
dk20 | 2019-10-18 18:35:18 | 4 | in | 50 | 55 |
sk18 | 2019-10-18 18:35:19 | 1 | in | 7 | -68 |
am35 | 2019-10-18 18:35:20 | 2 | in | 3 | 54 |
mc92 | 2019-10-18 18:35:21 | 2 | out | 8 | 46 |
alov | 2019-10-18 18:35:22 | 3 | in | 4 | 77 |
ap34 | 2019-10-18 18:35:23 | 1 | out | 6 | -74 |
Check the demo here.
Note: The last ORDER BY
clause is not needed: it's just for visualization purposes.