I have a table in Postgres and I'm trying to duplicate rows based on the delimiter @ in the description column. Here is my table:
txn_id | description |
---|---|
3332654 | CAD@10.00@0.9397@$10.64@THB@150.00@23.8235@$6.30@KRW@36,000.00@860.3722@$41.84@FJD@20.00@1.5711@$12.73@EUR@5.00@0.6013@$8.32@SGP@15.00@10.6013@$18.32 |
3332655 | CAD@11.00@0.8197@$11.64@THB@110.00@21.8135@$61.30@KRW@36,001.00@861.3722@$411.84@FJD@21.00@11.5711@$11.73 |
Using the PostgreSQL code below, each row is split into multiple rows, one per group of four values:
select txn_id
,max(case when nr%4=1 then elem end) cncy_cd
,max(case when nr%4=2 then elem end) txn_cncy_amt
,max(case when nr%4=3 then elem end) txn_exch_rate
,max(case when nr%4=0 then elem end) txn_nzd_amt
from test t
left join lateral (select elem,row_number()over() nr
from unnest(string_to_array(t.description, '@')) AS a(elem)) ON true
group by txn_id,(nr-1)/4 ;
and the output will be:
txn_id | cncy_cd | txn_cncy_amt | txn_exch_rate | txn_nzd_amt |
---|---|---|---|---|
3332654 | SGP | 15.00 | 10.6013 | $18.32 |
3332654 | EUR | 5.00 | 0.6013 | $ 8.32 |
3332654 | KRW | 36,000.00 | 860.3722 | $41.84 |
3332655 | THB | 36,001.00 | 861.3722 | $411.84 |
3332654 | FJD | 150.00 | 23.8235 | $6.30 |
3332654 | CAD | 20.00 | 1.5711 | $12.73 |
3332654 | FJD | 10.00 | 0.9397 | $10.64 |
3332655 | CAD | 21.00 | 11.5711 | $11.73 |
3332655 | THB | 11.00 | 0.8197 | $11.64 |
3332655 | SGP | 110.00 | 21.8135 | $61.30 |
Here is a demo of the Postgres code: https://dbfiddle.uk/BGOeoIlM
I'm trying to achieve the same result using DuckDB and then convert it to a Polars DataFrame. Below is my code:
a = duckdb.sql("""
    create table test (txn_id int, description varchar(200));
    insert into test values
        (3332654,'CAD@10.00@0.9397@$10.64@THB@150.00@23.8235@$6.30@KRW@36,000.00@860.3722@$41.84@FJD@20.00@1.5711@$12.73@EUR@5.00@0.6013@$8.32@SGP@15.00@10.6013@$18.32'),
        (3332655,'CAD@11.00@0.8197@$11.64@THB@110.00@21.8135@$61.30@KRW@36,001.00@861.3722@$411.84@FJD@21.00@11.5711@$11.73');
    select
        txn_id,
        max(case when nr%4=1 then elem end) cncy_cd,
        max(case when nr%4=2 then elem end) txn_cncy_amt,
        max(case when nr%4=3 then elem end) txn_exch_rate,
        max(case when nr%4=0 then elem end) txn_nzd_amt
    from test t
    left join lateral (
        select elem, row_number() over() nr
        from unnest(string_to_array(t.description, '@')) AS a(elem)
    ) ON true
    group by txn_id, ((nr-1)//4);
""")
Below is the output from DuckDB, and it does not match the Postgres output.
txn_id | cncy_cd | txn_cncy_amt | txn_exch_rate | txn_nzd_amt |
---|---|---|---|---|
3332654 | FJD | $18.32 | 10.6013 | 15.00 |
3332655 | THB | 110.00 | 21.8135 | $61.30 |
3332654 | SGP | $8.32 | 0.6013 | 5.00 |
3332655 | CAD | 11.00 | 0.8197 | $11.64 |
3332654 | KRW | $6.30 | 23.8235 | 150.00 |
3332655 | KRW | 36,001.00 | 861.3722 | $411.84 |
3332655 | FJD | 21.00 | 11.5711 | $11.73 |
3332654 | THB | $10.64 | 0.9397 | 10.00 |
3332654 | EUR | $12.73 | 1.5711 | 20.00 |
3332654 | CAD | $41.84 | 860.3722 | 36,000.00 |
I'm not sure what I'm missing here. I have spent a lot of time on this and could not figure out why it behaves differently. It would be great if someone could help me with this.
Well, when using SQL you need to remember that, in general, a SQL result set is an unordered collection of rows. Essentially, this means there is no guaranteed order of rows when you select them without an order by clause. This is one of the key differences from modern DataFrame row collections and their chained-expression queries.
In your case it means that the numbering produced by row_number() over the unnested array can be quite random: once the array is unnested, you no longer have the order of its elements. It's fortunate that the Postgres version happens to behave stably, but it is not a surprise that DuckDB behaves differently.
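You can see this for yourself by inspecting the raw (nr, elem) pairs before the aggregation. Below is a minimal diagnostic sketch (not part of the fix) that assumes the test table from your duckdb.sql call still exists in the same default connection; if nr does not follow each element's position in the description string, the groups of four get shuffled exactly as in your output:
import duckdb

# Assumes `test` was already created and populated in the default connection.
duckdb.sql("""
    select t.txn_id, nr, elem
    from test t
    left join lateral (
        select elem, row_number() over() nr
        from unnest(string_to_array(t.description, '@')) AS a(elem)
    ) on true
    order by txn_id, nr
""").show()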
To make your code more production-ready, I'd suggest using the with ordinality clause in Postgres, which numbers the rows of the function's result set:
select
txn_id,
max(case when nr%4=1 then elem end) as cncy_cd,
max(case when nr%4=2 then elem end) as txn_cncy_amt,
max(case when nr%4=3 then elem end) as txn_exch_rate,
max(case when nr%4=0 then elem end) as txn_nzd_amt
from test as t
left join lateral unnest(string_to_array(t.description, '@')) with ordinality as a(elem, nr) on true
group by
txn_id,
(nr-1)/4 ;
In DuckDB you can use the generate_subscripts function, which generates indexes along the Nth (in your case, the first and only) dimension of the array:
duckdb.sql("""
drop table if exists test;
create table test (txn_id int, description varchar(200));
insert into test
values
(3332654,'CAD@10.00@0.9397@$10.64@THB@150.00@23.8235@$6.30@KRW@36,000.00@860.3722@$41.84@FJD@20.00@1.5711@$12.73@EUR@5.00@0.6013@$8.32@SGP@15.00@10.6013@$18.32'),
(3332655,'CAD@11.00@0.8197@$11.64@THB@110.00@21.8135@$61.30@KRW@36,001.00@861.3722@$411.84@FJD@21.00@11.5711@$11.73');
select
txn_id,
max(case when nr%4=1 then elem end) cncy_cd,
max(case when nr%4=2 then elem end) txn_cncy_amt,
max(case when nr%4=3 then elem end) txn_exch_rate,
max(case when nr%4=0 then elem end) txn_nzd_amt
from test t
left join lateral (select string_to_array(t.description, '@')) as a(data) on true
left join lateral (select unnest(a.data), generate_subscripts(a.data, 1)) as b(elem, nr) on true
group by
txn_id,
(nr-1)//4;
""")
┌─────────┬─────────┬──────────────┬───────────────┬─────────────┐
│ txn_id │ cncy_cd │ txn_cncy_amt │ txn_exch_rate │ txn_nzd_amt │
│ int32 │ varchar │ varchar │ varchar │ varchar │
├─────────┼─────────┼──────────────┼───────────────┼─────────────┤
│ 3332655 │ CAD │ 11.00 │ 0.8197 │ $11.64 │
│ 3332654 │ THB │ 150.00 │ 23.8235 │ $6.30 │
│ 3332655 │ FJD │ 21.00 │ 11.5711 │ $11.73 │
│ 3332654 │ FJD │ 20.00 │ 1.5711 │ $12.73 │
│ 3332655 │ KRW │ 36,001.00 │ 861.3722 │ $411.84 │
│ 3332654 │ EUR │ 5.00 │ 0.6013 │ $8.32 │
│ 3332654 │ SGP │ 15.00 │ 10.6013 │ $18.32 │
│ 3332655 │ THB │ 110.00 │ 21.8135 │ $61.30 │
│ 3332654 │ CAD │ 10.00 │ 0.9397 │ $10.64 │
│ 3332654 │ KRW │ 36,000.00 │ 860.3722 │ $41.84 │
├─────────┴─────────┴──────────────┴───────────────┴─────────────┤
│ 10 rows 5 columns │
└────────────────────────────────────────────────────────────────┘
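Finally, since your goal is a Polars DataFrame: duckdb.sql() returns a relation object, which can be converted with its .pl() method (the polars package must be installed). A minimal sketch that re-runs the query above, keeps the relation in a variable, and converts the result:
rel = duckdb.sql("""
    select
        txn_id,
        max(case when nr%4=1 then elem end) cncy_cd,
        max(case when nr%4=2 then elem end) txn_cncy_amt,
        max(case when nr%4=3 then elem end) txn_exch_rate,
        max(case when nr%4=0 then elem end) txn_nzd_amt
    from test t
    left join lateral (select string_to_array(t.description, '@')) as a(data) on true
    left join lateral (select unnest(a.data), generate_subscripts(a.data, 1)) as b(elem, nr) on true
    group by txn_id, (nr-1)//4
""")

df = rel.pl()  # convert the DuckDB relation to a polars.DataFrame
print(df)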