amazon-web-services apache-spark aws-glue

JSON array field changes to "double/int/string/struct" format in AWS Glue


I have a JSON file downloaded from the web:

{
  "icao": "f3b100",
  "noRegData": true,
  "timestamp": 1690848000,
  "trace": [
    [
      51213.77,
      39.090519,
      -85.870361,
      4500,
      95,
      359.4,
      5,
      0,
      {
        "type": "tisb_icao",
        "track": 359.4,
        "geom_rate": 0,
        "nic": 0,
        "rc": 0,
        "nac_p": 0,
        "nac_v": 0,
        "sil": 0,
        "sil_type": "unknown",
        "alert": 0,
        "spi": 0
      },
      "tisb_icao",
      null,
      0,
      null,
      null,
      "f9adebf3"
    ]
  ]
}

I uploaded this JSON file to S3 and tried to read it with AWS Glue.

However, AWS Glue reads/parses it as:

(screenshot of the table schema as parsed by AWS Glue)

Note that the trace field becomes

[
  [
    {
      "double": 51213.77,
      "int": null,
      "string": null,
      "struct": null
    },
    {
      "double": 39.090519,
      "int": null,
      "string": null,
      "struct": null
    },
    {
      "double": -85.870361,
      "int": null,
      "string": null,
      "struct": null
    },
    {
      "double": null,
      "int": 4500,
      "string": null,
      "struct": null
    },
    {
      "double": 95,
      "int": null,
      "string": null,
      "struct": null
    },
    {
      "double": 359.4,
      "int": null,
      "string": null,
      "struct": null
    },
    {
      "double": null,
      "int": 5,
      "string": null,
      "struct": null
    },
    {
      "double": null,
      "int": 0,
      "string": null,
      "struct": null
    },
    {
      "double": null,
      "int": null,
      "string": null,
      "struct": {
        "type": "tisb_icao",
        "alt_geom": null,
        "track": 359.4,
        "geom_rate": 0,
        "nic": 0,
        "rc": 0,
        "nac_p": 0,
        "nac_v": 0,
        "sil": 0,
        "sil_type": "unknown",
        "alert": 0,
        "spi": 0
      }
    },
    {
      "double": null,
      "int": null,
      "string": "tisb_icao",
      "struct": null
    },
    {
      "double": null,
      "int": null,
      "string": null,
      "struct": null
    },
    {
      "double": null,
      "int": 0,
      "string": null,
      "struct": null
    },
    {
      "double": null,
      "int": null,
      "string": null,
      "struct": null
    },
    {
      "double": null,
      "int": null,
      "string": null,
      "struct": null
    },
    {
      "double": null,
      "int": null,
      "string": "f9adebf3",
      "struct": null
    }
  ]
]

On another page I saw that trace has this schema:

array<array<struct<double:double,int:int,string:string,struct:struct<type:string,track:double,geom_rate:int,nic:int,rc:int,nac_p:int,nac_v:int,sil:int,sil_type:string,alert:int,spi:int>>>>

which matches this new version of the trace field.

The original data format of the trace is gone. Is this the expected behavior?

Could someone please guide me on how to correctly read the original trace array? Thank you!


Solution

  • I believe I understand the origin of this "double/int/string/struct" format.

    trace is an array of arrays. If every value in the second-level array had the same type, say string, Glue would by default infer it as array<string>.

    In this case, however, the second-level array is a mixture of doubles, ints, strings, and a struct.

    To reconcile that mixture, AWS Glue falls back to a union-like struct with one field per observed type, which is exactly the "double/int/string/struct" format.
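    The resulting shape can be illustrated in plain Python: each element of the mixed array gets wrapped in a record with one slot per observed type, and only the matching slot is filled. This is just a sketch of the output shape, not Glue's actual implementation:

    ```python
    def to_union_record(value):
        """Wrap one element of a mixed-type array in a struct with one
        field per observed type, filling only the matching slot."""
        record = {"double": None, "int": None, "string": None, "struct": None}
        if isinstance(value, bool):
            record["int"] = int(value)  # booleans do not occur in the trace sample
        elif isinstance(value, float):
            record["double"] = value
        elif isinstance(value, int):
            record["int"] = value
        elif isinstance(value, str):
            record["string"] = value
        elif isinstance(value, dict):
            record["struct"] = value
        return record  # a null element stays as an all-null record

    # One inner trace row from the question, abbreviated
    row = [51213.77, 4500, "tisb_icao", {"type": "tisb_icao"}, None]
    union_rows = [to_union_record(v) for v in row]
    ```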

    There are a few potential approaches:

    1. Handle this new format within AWS Glue itself. However, I ran into errors because the inferred schema differs between files: a column that is all integer 0 in one file may hold actual floats in another, so this gets messy quickly.
    2. UNNEST_ARRAY the trace field using AWS Glue DataBrew as a pre-processing step, then continue in AWS Glue.
    3. Handle the raw JSON within AWS Glue directly with a PySpark script, using explode:
        from pyspark.sql.functions import col, explode

        # Read the whole multi-line JSON document
        df = spark.read.option("multiline", "true").json(json_file_path)

        # Produce one row per inner trace array
        df_exploded = df.select(
            col("icao"),
            col("noRegData"),
            col("timestamp"),
            explode(col("trace")).alias("trace_item")
        )

        # Pull the positional elements out into named columns
        df_result = df_exploded.select(
            "icao",
            "noRegData",
            "timestamp",
            col("trace_item").getItem(0).alias("col1"),
            col("trace_item").getItem(1).alias("col2"),
            col("trace_item").getItem(2).alias("col3"),
            col("trace_item").getItem(3).alias("col4"),
            # Continue extracting other elements as needed
        )