Tags: google-cloud-dataflow, apache-beam, parquet, pyarrow, apache-beam-io

How to generate a pyarrow schema for dynamic values


I am trying to write a Parquet schema for a JSON message that needs to be written back to a GCS bucket using apache_beam.

My JSON looks like this:

data = {
    "name": "user_1",
    "result": [
        {
            "subject": "maths",
            "marks": 99
        },
        {
            "subject": "science",
            "marks": 76
        }
    ],
    "section": "A"
}

The result array in the example above can contain any number of entries; the minimum is one.


Solution

  • This is the schema you need:

    import pyarrow as pa
    
    schema = pa.schema(
        [
            pa.field("name", pa.string()),
            pa.field(
                "result",
                pa.list_(
                    pa.struct(
                        [
                            pa.field("subject", pa.string()),
                            pa.field("marks", pa.int32()),
                        ]
                    )
                ),
            ),
            pa.field("section", pa.string()),
        ]
    )
    
    

    If you have a file containing one record per line:

    {"name": "user_1", "result": [{"subject": "maths", "marks": 99}, {"subject": "science", "marks": 76}], "section": "A"}
    {"name": "user_2", "result": [{"subject": "maths", "marks": 10}, {"subject": "science", "marks": 75}], "section": "A"}
    

    You can load it using:

    from pyarrow import json as pa_json
    table = pa_json.read_json('filename.json', parse_options=pa_json.ParseOptions(explicit_schema=schema))