I have this json data:
consumption_json = """
{
"count": 48,
"next": null,
"previous": null,
"results": [
{
"consumption": 0.063,
"interval_start": "2018-05-19T00:30:00+0100",
"interval_end": "2018-05-19T01:00:00+0100"
},
{
"consumption": 0.071,
"interval_start": "2018-05-19T00:00:00+0100",
"interval_end": "2018-05-19T00:30:00+0100"
},
{
"consumption": 0.073,
"interval_start": "2018-05-18T23:30:00+0100",
"interval_end": "2018-05-18T00:00:00+0100"
}
]
}
"""
and I would like to covert the results list to an Arrow table.
I have managed this by first converting it to python data structure, using python's json library, and then converting that to an Arrow table.
import json
consumption_python = json.loads(consumption_json)
results = consumption_python['results']
table = pa.Table.from_pylist(results)
print(table)
pyarrow.Table
consumption: double
interval_start: string
interval_end: string
----
consumption: [[0.063,0.071,0.073]]
interval_start: [["2018-05-19T00:30:00+0100","2018-05-19T00:00:00+0100","2018-05-18T23:30:00+0100"]]
interval_end: [["2018-05-19T01:00:00+0100","2018-05-19T00:30:00+0100","2018-05-18T00:00:00+0100"]]
But, for reasons of performance, I'd rather just use pyarrow exclusively for this.
I can use pyarrow's json reader to make a table.
reader = pa.BufferReader(bytes(consumption_json, encoding='ascii'))
table_from_reader = pa.json.read_json(reader)
And 'results' is a struct nested inside a list. (Actually, everything seems to be nested).
print(table_from_reader['results'].type)
list<item: struct<consumption: double, interval_start: timestamp[s], interval_end: timestamp[s]>>
How do I turn this into a table directly?
following this https://stackoverflow.com/a/72880717/3617057
I can get closer...
import pyarrow.compute as pc
flat = pc.list_flatten(table_from_reader["results"])
print(flat)
[
-- is_valid: all not null
-- child 0 type: double
[
0.063,
0.071,
0.073
]
-- child 1 type: timestamp[s]
[
2018-05-18 23:30:00,
2018-05-18 23:00:00,
2018-05-18 22:30:00
]
-- child 2 type: timestamp[s]
[
2018-05-19 00:00:00,
2018-05-18 23:30:00,
2018-05-17 23:00:00
]
]
flat
is a ChunkedArray
whose underlying arrays are StructArray
. To convert it to a table, you need to convert each chunks to a RecordBatch
and concatenate them in a table:
pa.Table.from_batches(
[
pa.RecordBatch.from_struct_array(s)
for s in flat.iterchunks()
]
)
If flat
is just a StructArray
(not a ChunkedArray
), you can call:
pa.Table.from_batches(
[
pa.RecordBatch.from_struct_array(flat)
]
)