
Is there a way to deserialize PyArrow Table Schemas?


I've been using PyArrow tables as an intermediate step between a few sources of data and Parquet files. as_table = pa.Table.from_pylist(my_items) is really useful for what it does, but it doesn't allow for any real validation. You can vacuously call as_table.validate() on the resulting Table, but it's only validating against its own inferred types, and it won't catch anything about non-nullable fields.

I could make a bunch of schemas by hand, pa.Field by pa.Field, but some of these are large, or kind of complex. I'd hoped to be able to create an object from a known-good canonical example, serialize the resulting schema (or do some code generation or other means of preserving it), and then use that to validate future reads and writes. The fact that the schema object has a .serialize() method is tantalizing:

s = pa.Table.from_pylist(known_good_objects).schema
serialized = s.serialize().to_pybytes()
# What's in here? 
print(serialized)
b'\xff\xff\xff\xff\x99\x21\x...'

Okay, it's some sort of binary thing. I've trawled the official docs several times and there's not much help there. Trying to .decode() the bytes with various UTF encodings fails, and I can't find any equivalent .deserialize() method that does what I'd expect. Is there some IPC magic I can use here? Could I just pickle the resulting objects and load them later? What's the path to reusing or generating pyarrow Schemas from in-memory objects?


Solution

  • The Schema.serialize() method serializes the schema as an IPC message, as mentioned in the docstring (https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.serialize), i.e. using Arrow's specification for serialization (https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc).

    It's indeed not very well documented how to deserialize such a message (the user guide at https://arrow.apache.org/docs/python/ipc.html only shows this for actual record batch data, not for individual schema messages). But you can use the pyarrow.ipc module to work with IPC messages in general, and specifically, if you know you have a schema message, you can use read_schema:

    >>> schema = pa.schema([("col1", pa.int64()), ("col2", pa.float64())])
    >>> schema
    col1: int64
    col2: double
    
    >>> schema_serialized = schema.serialize().to_pybytes()
    >>> pa.ipc.read_schema(pa.py_buffer(schema_serialized))
    col1: int64
    col2: double
    
    

    That said, if you only need to serialize it for temporary storage within a Python project/script, then, as you mentioned, you can also use pickle:

    >>> import pickle
    >>> pickle.loads(pickle.dumps(schema))
    col1: int64
    col2: double
    

    The IPC message protocol is language-agnostic (not Python-specific, so you could share this schema message with non-Python libraries) and stable across Python/pyarrow versions. But depending on your needs, pickle can be sufficient and a bit easier to use.