What can a `RecordBatch` do that a `StructArray` cannot?


Among the different types of arrays that exist in Arrow, one of them is the StructArray. When converted to pandas using PyArrow, it becomes a pd.Series in which each row contains a dictionary:

>>> pa.array([{'x': 1, 'y': True}, {'z': 3.4, 'x': 4.5}]).to_pandas()
0    {'x': 1.0, 'y': True, 'z': None}
1     {'x': 4.5, 'y': None, 'z': 3.4}
dtype: object

On the other hand, a RecordBatch advertises itself as "a collection of equal-length array instances":

>>> pa.RecordBatch.from_pylist([{'x': 1, 'y': True, 'z': None}, {'z': 3.4, 'x': 4.5}]).to_pandas()
     x     y    z
0  1.0  True  NaN
1  4.5  None  3.4

However, even if one takes the form of a pd.Series and the other a pd.DataFrame, it seems to me that, essentially, they both contain the same information. One is an array, the other is a collection of arrays.

Therefore, what can a RecordBatch do that a StructArray cannot?


Solution

  • They're different abstractions. A StructArray is an Array with an associated struct type; it's a nested array. A RecordBatch is a…RecordBatch and contains a schema; abstractly, it's a 2D chunk of data where each column is contiguous in memory.

    They intentionally have similar interfaces because they "look similar" to each other, but for instance, a StructArray can't be written to a file by itself; it would need to be wrapped and converted into a RecordBatch.

    Also, StructArray (as an Array) has a top-level validity bitmap (i.e. each of its child arrays has its own validity information, and so does the StructArray itself). But a RecordBatch doesn't have its own validity bitmap.