Search code examples
pyarrowapache-arrow

Table from nested list, struct


I have this json data:

consumption_json = """
{
    "count": 48,
    "next": null,
    "previous": null,
    "results": [
        {
            "consumption": 0.063,
            "interval_start": "2018-05-19T00:30:00+0100",
            "interval_end": "2018-05-19T01:00:00+0100"
        },
        {
            "consumption": 0.071,
            "interval_start": "2018-05-19T00:00:00+0100",
            "interval_end": "2018-05-19T00:30:00+0100"
        },
        {
            "consumption": 0.073,
            "interval_start": "2018-05-18T23:30:00+0100",
            "interval_end": "2018-05-18T00:00:00+0100"
        }
    ]
}
"""

and I would like to covert the results list to an Arrow table.

I have managed this by first converting it to python data structure, using python's json library, and then converting that to an Arrow table.

import json

consumption_python = json.loads(consumption_json)

results = consumption_python['results']

table = pa.Table.from_pylist(results)

print(table)
pyarrow.Table
consumption: double
interval_start: string
interval_end: string
----
consumption: [[0.063,0.071,0.073]]
interval_start: [["2018-05-19T00:30:00+0100","2018-05-19T00:00:00+0100","2018-05-18T23:30:00+0100"]]
interval_end: [["2018-05-19T01:00:00+0100","2018-05-19T00:30:00+0100","2018-05-18T00:00:00+0100"]]

But, for reasons of performance, I'd rather just use pyarrow exclusively for this.

I can use pyarrow's json reader to make a table.

reader = pa.BufferReader(bytes(consumption_json, encoding='ascii'))
table_from_reader = pa.json.read_json(reader)

And 'results' is a struct nested inside a list. (Actually, everything seems to be nested).

print(table_from_reader['results'].type)
list<item: struct<consumption: double, interval_start: timestamp[s], interval_end: timestamp[s]>>

How do I turn this into a table directly?

following this https://stackoverflow.com/a/72880717/3617057

I can get closer...

import pyarrow.compute as pc

flat = pc.list_flatten(table_from_reader["results"])
print(flat)
[
  -- is_valid: all not null
  -- child 0 type: double
    [
      0.063,
      0.071,
      0.073
    ]
  -- child 1 type: timestamp[s]
    [
      2018-05-18 23:30:00,
      2018-05-18 23:00:00,
      2018-05-18 22:30:00
    ]
  -- child 2 type: timestamp[s]
    [
      2018-05-19 00:00:00,
      2018-05-18 23:30:00,
      2018-05-17 23:00:00
    ]
]

Solution

  • flat is a ChunkedArray whose underlying arrays are StructArray. To convert it to a table, you need to convert each chunks to a RecordBatch and concatenate them in a table:

    pa.Table.from_batches(
        [
            pa.RecordBatch.from_struct_array(s)
            for s in flat.iterchunks()
        ]
    )
    

    If flat is just a StructArray (not a ChunkedArray), you can call:

    pa.Table.from_batches(
        [
            pa.RecordBatch.from_struct_array(flat)
        ]
    )