Among the different types of arrays that exist in Arrow, one of them is the StructArray. When converted to a pandas structure using PyArrow, it returns a pd.Series in which each row contains a dictionary:
>>> import pyarrow as pa
>>> pa.array([{'x': 1, 'y': True}, {'z': 3.4, 'x': 4.5}]).to_pandas()
0    {'x': 1.0, 'y': True, 'z': None}
1    {'x': 4.5, 'y': None, 'z': 3.4}
dtype: object
On the other hand, a RecordBatch advertises itself as "a collection of equal-length array instances":
>>> pa.RecordBatch.from_pylist([{'x': 1, 'y': True, 'z': None}, {'z': 3.4, 'x': 4.5}]).to_pandas()
     x     y    z
0  1.0  True  NaN
1  4.5  None  3.4
However, even though one takes the form of a pd.Series and the other a pd.DataFrame, it seems to me that they essentially contain the same information. One is an array, the other is a collection of arrays.
Therefore, what can a RecordBatch do that a StructArray cannot?
They're different abstractions. A StructArray is an Array whose type is a struct type; it's a nested array. A RecordBatch is not an Array at all: it's its own container and carries a schema; abstractly, it's a 2D chunk of data where each column is contiguous in memory.
They intentionally have similar interfaces because they "look similar" to each other, but for instance, a StructArray can't be written to a file by itself; it would need to be wrapped and converted into a RecordBatch.
Also, a StructArray (as an Array) has a top-level validity bitmap (i.e. each of its child arrays has its own validity information, and so does the StructArray itself). But a RecordBatch doesn't have its own validity bitmap.