Write and read a pyarrow schema from file


I'm transforming 120 JSON tables (each an in-memory List[Dict] in Python) with varying schemas to Arrow, so I can write them to .parquet files on ADLS using the pyarrow package.

I want to store the schema of each table in a separate file so I don't have to hardcode it for the 120 tables. As I iterate over my tables, I want to load each schema from file and transform the JSON data to Arrow by passing the schema.

import pyarrow as pa

data = [
    {"col1": 1, "col2": "a"},
    {"col1": 2, "col2": "b"},
    {"col1": 3, "col2": "c"},
    {"col1": 4, "col2": "d"},
    {"col1": 5, "col2": "e"}
]

# How to load the schema from file and parse it into a `pa.schema`?
my_schema = pa.schema([
    pa.field('col1', pa.int64()),
    pa.field('col2', pa.string())]
)
arrow_table = pa.Table.from_pylist(data, schema=my_schema)

# How to write this schema to file?
arrow_table.schema

I could write a custom file format for the schema and write a parser that reads the (e.g. txt) file, transforming its content into the pa.datatype() stuff, but I hope there is an easier, "official" solution to this?


Solution

  • You can store the schema in a metadata-only Parquet file using pyarrow.parquet.write_metadata and read it back using pyarrow.parquet.read_schema:

    import pyarrow as pa
    import pyarrow.parquet as pq
    
    table = pa.table({"col1": [1,2,3]})
    
    pq.write_metadata(table.schema, "table.metadata")
    schema = pq.read_schema("table.metadata")