Search code examples
postgresqlpython-polarsrust-polarsduckdb

Duckdb sql and postgres results differ when using unnest


I have a table in postgres and basically i'm trying to duplicate rows based on the delimiter @ in the description column. Here is my table:

txn_id description
3332654 CAD@10.00@0.9397@$10.64@THB@150.00@23.8235@$6.30@KRW@36,000.00@860.3722@$41.84@FJD@20.00@1.5711@$12.73@EUR@5.00@0.6013@$8.32@SGP@15.00@10.6013@$18.32
3332655 CAD@11.00@0.8197@$11.64@THB@110.00@21.8135@$61.30@KRW@36,001.00@861.3722@$411.84@FJD@21.00@11.5711@$11.73

Using the below postgresql code, the rows are duplicated based on a set of four values:

select txn_id 
  ,max(case when nr%4=1 then elem end) cncy_cd
  ,max(case when nr%4=2 then elem end) txn_cncy_amt
  ,max(case when nr%4=3 then elem end) txn_exch_rate
  ,max(case when nr%4=0 then elem end) txn_nzd_amt
from test t
left join lateral (select elem,row_number()over() nr 
  from  unnest(string_to_array(t.description, '@')) AS a(elem)) ON true  
group by txn_id,(nr-1)/4 ;

and the output will be:

txn_id cncy_cd txn_cncy_amt txn_exch_rate txn_nzd_amt
3332654 SGP 15.00 10.6013 $18.32
3332654 EUR 5.00 0.6013 $ 8.32
3332654 KRW 36,000.00 860.3722 $41.84
3332655 THB 36,001.00 861.3722 $411.84
3332654 FJD 150.00 23.8235 $6.30
3332654 CAD 20.00 1.5711 $12.73
3332654 FJD 10.00 0.9397 $10.64
3332655 CAD 21.00 11.5711 $11.73
3332655 THB 11.00 0.8197 $11.64
3332655 SGP 110.00 21.8135 $61.30

Below is the demo of the postgres code: https://dbfiddle.uk/BGOeoIlM

I'm trying to achieve the same result using duckdb and convert it to a polars dataframe and below is my code:

 a = duckdb.sql("create table test (txn_id int, description varchar(200));insert into test values (3332654,'CAD@10.00@0.9397@$10.64@THB@150.00@23.8235@$6.30@KRW@36,000.00@860.3722@$41.84@FJD@20.00@1.5711@$12.73@EUR@5.00@0.6013@$8.32@SGP@15.00@10.6013@$18.32'),(3332655,'CAD@11.00@0.8197@$11.64@THB@110.00@21.8135@$61.30@KRW@36,001.00@861.3722@$411.84@FJD@21.00@11.5711@$11.73');select txn_id ,max(case when nr%4=1 then elem end) cncy_cd ,max(case when nr%4=2 then elem end) txn_cncy_amt, max(case when nr%4=3 then elem end) txn_exch_rate ,max(case when nr%4=0 then elem end) txn_nzd_amt from test t left join lateral (select elem,row_number()over() nr from  unnest(string_to_array(t.description, '@')) AS a(elem)) ON true  group by txn_id,((nr-1)//4);")

Below is the output from duckdb and it does not match with the postgres output.

txn_id cncy_cd txn_cncy_amt txn_exch_rate txn_nzd_amt
3332654 FJD $18.32 10.6013 15.00
3332655 THB 110.00 21.8135 $61.30
3332654 SGP $8.32 0.6013 5.00
3332655 CAD 11.00 0.8197 $11.64
3332654 KRW $6.30 23.8235 150.00
3332655 KRW 36,001.00 861.3722 $411.84
3332655 FJD 21.00 11.5711 $11.73
3332654 THB $10.64 0.9397 10.00
3332654 EUR $12.73 1.5711 20.00
3332654 CAD $41.84 860.3722 36,000.00

Not sure what i'm missing here as i tried spending lot of time on this and could not figure out why it would behave differently. Great if someone could please help me with this?


Solution

  • Well, when using SQL you need to remember that in general SQL recordset is unordered collection of rows. Essentially, it means that there's no guaranteed order of rows when you select them without order by clause. This is one of the key differences with modern 'DataFrame' row collections and 'chained expressions' queries.

    In your case it means that running row_number() over unnested array can be quite random - after array is unnested, you don't have order of array elements anymore. Basically, it's great that Postgres version works stable, but it's not a surprise that DuckDB doesn't work.

    To get your code more production-ready, I'd suggest to use with ordinality clause in Postgres, which would number the rows of the function result set:

    select
        txn_id,
        max(case when nr%4=1 then elem end) as cncy_cd,
        max(case when nr%4=2 then elem end) as txn_cncy_amt,
        max(case when nr%4=3 then elem end) as txn_exch_rate,
        max(case when nr%4=0 then elem end) as txn_nzd_amt
    from test as t
        left join lateral unnest(string_to_array(t.description, '@')) with ordinality as a(elem, nr) on true  
    group by
        txn_id,
        (nr-1)/4 ;
    

    db<>fiddle

    And in DuckDB you can use generate_subscripts function, which would generates indexes along Nth (in your case - first and only) dimension of the array:

    duckdb.sql("""
        drop table if exists test;
    
        create table test (txn_id int, description varchar(200));
        
        insert into test
        values
            (3332654,'CAD@10.00@0.9397@$10.64@THB@150.00@23.8235@$6.30@KRW@36,000.00@860.3722@$41.84@FJD@20.00@1.5711@$12.73@EUR@5.00@0.6013@$8.32@SGP@15.00@10.6013@$18.32'),
            (3332655,'CAD@11.00@0.8197@$11.64@THB@110.00@21.8135@$61.30@KRW@36,001.00@861.3722@$411.84@FJD@21.00@11.5711@$11.73');
            
        select
            txn_id,
            max(case when nr%4=1 then elem end) cncy_cd,
            max(case when nr%4=2 then elem end) txn_cncy_amt,
            max(case when nr%4=3 then elem end) txn_exch_rate,
            max(case when nr%4=0 then elem end) txn_nzd_amt
        from test t
            left join lateral (select string_to_array(t.description, '@')) as a(data) on true
            left join lateral (select unnest(a.data), generate_subscripts(a.data, 1)) as b(elem, nr) on true
        group by
            txn_id,
            (nr-1)//4;
    """)
    
    ┌─────────┬─────────┬──────────────┬───────────────┬─────────────┐
    │ txn_id  │ cncy_cd │ txn_cncy_amt │ txn_exch_rate │ txn_nzd_amt │
    │  int32  │ varchar │   varchar    │    varchar    │   varchar   │
    ├─────────┼─────────┼──────────────┼───────────────┼─────────────┤
    │ 3332655 │ CAD     │ 11.00        │ 0.8197        │ $11.64      │
    │ 3332654 │ THB     │ 150.00       │ 23.8235       │ $6.30       │
    │ 3332655 │ FJD     │ 21.00        │ 11.5711       │ $11.73      │
    │ 3332654 │ FJD     │ 20.00        │ 1.5711        │ $12.73      │
    │ 3332655 │ KRW     │ 36,001.00    │ 861.3722      │ $411.84     │
    │ 3332654 │ EUR     │ 5.00         │ 0.6013        │ $8.32       │
    │ 3332654 │ SGP     │ 15.00        │ 10.6013       │ $18.32      │
    │ 3332655 │ THB     │ 110.00       │ 21.8135       │ $61.30      │
    │ 3332654 │ CAD     │ 10.00        │ 0.9397        │ $10.64      │
    │ 3332654 │ KRW     │ 36,000.00    │ 860.3722      │ $41.84      │
    ├─────────┴─────────┴──────────────┴───────────────┴─────────────┤
    │ 10 rows                                              5 columns │
    └────────────────────────────────────────────────────────────────┘