Tags: google-cloud-dataflow, apache-beam, parquet, pyarrow, apache-beam-io

How to generate a pyarrow schema for dynamic values


I am trying to write a Parquet schema for a JSON message that needs to be written back to a GCS bucket using apache_beam.

My JSON looks like this:

data = {
    "name": "user_1",
    "result": [
        {
            "subject": "maths",
            "marks": 99
        },
        {
            "subject": "science",
            "marks": 76
        }
    ],
    "section": "A"
}

The result array in the example above can contain any number of entries; the minimum is one.


Solution

  • This is the schema you need:

    import pyarrow as pa
    
    schema = pa.schema(
        [
            pa.field("name", pa.string()),
            pa.field(
                "result",
                pa.list_(
                    pa.struct(
                        [
                            pa.field("subject", pa.string()),
                            pa.field("marks", pa.int32()),
                        ]
                    )
                ),
            ),
            pa.field("section", pa.string()),
        ]
    )
    
    

    If you have a file containing one record per line:

    {"name": "user_1", "result": [{"subject": "maths", "marks": 99}, {"subject": "science", "marks": 76}], "section": "A"}
    {"name": "user_2", "result": [{"subject": "maths", "marks": 10}, {"subject": "science", "marks": 75}], "section": "A"}
    

    You can load it using:

    from pyarrow import json as pa_json
    table = pa_json.read_json('filename.json', parse_options=pa_json.ParseOptions(explicit_schema=schema))