Search code examples
sqljsonamazon-redshiftpartiql

How to read a SUPER data-type value with multiple values sharing the same property names?


I have in my redshift database a field of the SUPER type:

It contains

{
    "Order": {
      "OrderNumber": "Ab1234ZX",
      "ReasonDelay": "No reason",
      "ReasonDelay": false
    }
}

The problem is that I have two attributes with the same name (ReasonDelay) but different values ("No reason" and false).

select 
    e.contents."Order"."ReasonDelay" as first_reason,
    e.contents."Order"."ReasonDelay" as second_reason
from
    orders e

obviously gives me the same results. using [0] and [1] (obviously) does nothing:

select 
    e.contents."Order"[0]."ReasonDelay" as first_reason,
    e.contents."Order"[1]."ReasonDelay" as second_reason
from
    orders e
select 
    e.contents."Order"."ReasonDelay"[0] as first_reason,
    e.contents."Order"."ReasonDelay"[1] as second_reason
from
    orders e

How do I query this?


Solution

  • You are dealing with an ambiguous corner case in json and PartiQL. My recommendation is to remove the ambiguity as early in your data processing as possible. Different tools will handle this situation differently and the same tool is free to change how they handle this case across releases. Even if you find a way to get it working today you likely don't have a robust data solution for the long haul and not just for Redshift.

    Now if you want to disambiguate the data in Redshift (not the place I would recommend doing this) you write a regexp_replace to change your strings from

    {
        "Order": {
          "OrderNumber": "Ab1234ZX",
          "ReasonDelay": "No reason",
          "ReasonDelay": false
        }
    }
    

    To

    {
        "Order": {
          "OrderNumber": "Ab1234ZX",
          "ReasonDelay": ["No reason", false]
        }
    }
    

    for example. Clearly this needs to be done as text and then cast to super. The issue with this approach is that it will be dependent on text format and again not very robust.

    I think the best answer is to fix your data at the source or as early in the pipeline as possible.