I'm transforming 120 JSON tables (held in memory in Python as List[Dict]) with varying schemata to Arrow in order to write them as .parquet files to ADLS, using the pyarrow package.
I want to store the schema of each table in a separate file so I don't have to hardcode it for the 120 tables. As I iterate over my tables, I want to load each schema from file and transform the JSON data to Arrow by passing the schema.
import pyarrow as pa
data = [
{"col1": 1, "col2": "a"},
{"col1": 2, "col2": "b"},
{"col1": 3, "col2": "c"},
{"col1": 4, "col2": "d"},
{"col1": 5, "col2": "e"}
]
# How to load the schema from file and parse it into a `pa.schema`?
my_schema = pa.schema([
    pa.field('col1', pa.int64()),
    pa.field('col2', pa.string()),
])
arrow_table = pa.Table.from_pylist(data, schema=my_schema)
# How to write this schema to file?
arrow_table.schema
I could invent a custom file format for the schema and write a parser that reads the (e.g. txt) file and turns its contents into the corresponding pa.DataType objects, but I hope there is an easier, "official" solution to this?
You can store the schema using pyarrow.parquet.write_metadata and read it back using pyarrow.parquet.read_schema. write_metadata writes a metadata-only Parquet file, so there is no custom format to parse:
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col1": [1, 2, 3]})

# Writes a metadata-only Parquet file containing the schema
pq.write_metadata(table.schema, "table.metadata")

# Reads the schema back as a pa.Schema
schema = pq.read_schema("table.metadata")
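Applied to the original scenario, a minimal sketch of the full round trip could look like this. The table names, file paths, and the tables_data dict are illustrative assumptions, not part of the question:

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical stand-in for the 120 in-memory tables: name -> List[Dict]
tables_data = {
    "table_a": [{"col1": 1, "col2": "a"}, {"col1": 2, "col2": "b"}],
    "table_b": [{"year": 2023, "city": "Berlin"}],
}

# One-time step: persist each table's schema as a metadata-only Parquet file
for name, records in tables_data.items():
    inferred = pa.Table.from_pylist(records).schema  # infer once, then freeze
    pq.write_metadata(inferred, f"{name}.metadata")

# Per-run step: load each stored schema and apply it when building the table
for name, records in tables_data.items():
    schema = pq.read_schema(f"{name}.metadata")
    arrow_table = pa.Table.from_pylist(records, schema=schema)
    pq.write_table(arrow_table, f"{name}.parquet")

pq.write_table also accepts a filesystem= argument, so the same call can target ADLS through an fsspec-compatible filesystem (e.g. adlfs) instead of the local disk.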