I have a JSON file that I downloaded from the web:
{
  "icao": "f3b100",
  "noRegData": true,
  "timestamp": 1690848000,
  "trace": [
    [
      51213.77,
      39.090519,
      -85.870361,
      4500,
      95,
      359.4,
      5,
      0,
      {
        "type": "tisb_icao",
        "track": 359.4,
        "geom_rate": 0,
        "nic": 0,
        "rc": 0,
        "nac_p": 0,
        "nac_v": 0,
        "sil": 0,
        "sil_type": "unknown",
        "alert": 0,
        "spi": 0
      },
      "tisb_icao",
      null,
      0,
      null,
      null,
      "f9adebf3"
    ]
  ]
}
I uploaded this JSON file to S3 and tried to use AWS Glue to read it. However, AWS Glue reads/parses it differently. Note that the trace field becomes
[
  [
    {"double": 51213.77, "int": null, "string": null, "struct": null},
    {"double": 39.090519, "int": null, "string": null, "struct": null},
    {"double": -85.870361, "int": null, "string": null, "struct": null},
    {"double": null, "int": 4500, "string": null, "struct": null},
    {"double": 95, "int": null, "string": null, "struct": null},
    {"double": 359.4, "int": null, "string": null, "struct": null},
    {"double": null, "int": 5, "string": null, "struct": null},
    {"double": null, "int": 0, "string": null, "struct": null},
    {"double": null, "int": null, "string": null, "struct": {"type": "tisb_icao", "alt_geom": null, "track": 359.4, "geom_rate": 0, "nic": 0, "rc": 0, "nac_p": 0, "nac_v": 0, "sil": 0, "sil_type": "unknown", "alert": 0, "spi": 0}},
    {"double": null, "int": null, "string": "tisb_icao", "struct": null},
    {"double": null, "int": null, "string": null, "struct": null},
    {"double": null, "int": 0, "string": null, "struct": null},
    {"double": null, "int": null, "string": null, "struct": null},
    {"double": null, "int": null, "string": null, "struct": null},
    {"double": null, "int": null, "string": "f9adebf3", "struct": null}
  ]
]
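For reference, this wrapped view is roughly what shows up if you crawl the S3 prefix and then load the resulting catalog table in a Glue job. A minimal sketch of how I inspect it (the database and table names below are placeholders, not anything from an actual setup):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Placeholder names -- use whatever database/table the crawler actually created
dyf = glue_context.create_dynamic_frame.from_catalog(database="adsb_db", table_name="traces")
df = dyf.toDF()

# The printed schema shows the double/int/string/struct wrapper around each trace element
df.printSchema()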
On another page I saw that trace has this schema:

array<array<struct<double:double,int:int,string:string,struct:struct<type:string,track:double,geom_rate:int,nic:int,rc:int,nac_p:int,nac_v:int,sil:int,sil_type:string,alert:int,spi:int>>>>

which matches this new version of the trace field. The original data format of the trace array is gone. Is this the expected behavior? Could someone please guide me on how to correctly read the original trace array? Thank you!
I believe I understand the origin of this "double/int/string/struct" format. trace is an array of arrays. For the second-level array, if all of its values share one type (for example, string), the inferred schema is simply array<string>. However, in this specific scenario the second-level array is a mixture of double, int, string, and struct elements. Because a single element type is required, AWS Glue wraps each element in a struct with one field per observed type, which is where the "double/int/string/struct" format comes from.
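If you keep the crawler's wrapped schema, the original values are still there, just spread across the type-specific fields, so they can be recovered by taking whichever field is non-null. A rough sketch, reusing the placeholder catalog names from the snippet above and rendering every element back to a string:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import col, coalesce, explode, to_json

glue_context = GlueContext(SparkContext.getOrCreate())
df = glue_context.create_dynamic_frame.from_catalog(database="adsb_db", table_name="traces").toDF()

# Explode the outer array (one row per trace point), then the inner array (one row per element)
points = df.select(col("icao"), explode(col("trace")).alias("point"))
elements = points.select(col("icao"), explode(col("point")).alias("e"))

# Exactly one of the wrapper fields is non-null per element, so coalesce recovers the value
recovered = elements.select(
    "icao",
    coalesce(
        col("e.double").cast("string"),
        col("e.int").cast("string"),
        col("e.string"),
        to_json(col("e.struct")),
    ).alias("original_value"),
)
recovered.show(truncate=False)

This does not restore the original positional array as-is, but it makes the values usable without re-reading the raw file.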
In my view, I have some potential approaches:

1. Pre-process the trace field with AWS Glue DataBrew, and then continue using AWS Glue.
2. Read the original file directly with Spark and use explode:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# Read the multiline JSON file (json_file_path points at the uploaded object)
df = spark.read.option("multiline", "true").json(json_file_path)

# Turn each inner trace array into its own row
df_exploded = df.select(
    col("icao"),
    col("noRegData"),
    col("timestamp"),
    explode(col("trace")).alias("trace_item")
)

# Pull the positional elements out of each inner array
df_result = df_exploded.select(
    "icao",
    "noRegData",
    "timestamp",
    col("trace_item").getItem(0).alias("col1"),
    col("trace_item").getItem(1).alias("col2"),
    col("trace_item").getItem(2).alias("col3"),
    col("trace_item").getItem(3).alias("col4"),
    # Continue extracting other elements as needed
)
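If the rest of the job stays in Glue, the exploded DataFrame can be checked and then wrapped back into a DynamicFrame. A small sketch, assuming an existing GlueContext named glue_context:

from awsglue.dynamicframe import DynamicFrame

# Sanity-check what the explode produced
df_result.printSchema()
df_result.show(truncate=False)

# Wrap the DataFrame back into a DynamicFrame for downstream Glue transforms/sinks
dyf_result = DynamicFrame.fromDF(df_result, glue_context, "dyf_result")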