I am trying to write a parquest schema for my json message that needs to be written back to a GCS bucket using apache_beam
My json is like below:
data = {
"name": "user_1",
"result": [
{
"subject": "maths",
"marks": 99
},
{
"subject": "science",
"marks": 76
}
],
"section": "A"
}
result array in the above example can have many value minimum is 1.
This is the schema you need:
import pyarrow as pa
schema = pa.schema(
[
pa.field("name", pa.string()),
pa.field(
"result",
pa.list_(
pa.struct(
[
pa.field("subject", pa.string()),
pa.field("marks", pa.int32()),
]
)
),
),
pa.field("section", pa.string()),
]
)
If you have a file containing one record per line:
{"name": "user_1", "result": [{"subject": "maths", "marks": 99}, {"subject": "science", "marks": 76}], "section": "A"}
{"name": "user_2", "result": [{"subject": "maths", "marks": 10}, {"subject": "science", "marks": 75}], "section": "A"}
You can load it using:
from pyarrow import json as pa_json
table = pa_json.read_json('filename.json', parse_options=pa_json.ParseOptions(explicit_schema=schema))