
Parsing multiple Avro schemas (avsc files) that refer to each other using Python (fastavro)


I have an Avro schema that currently lives in a single avsc file, shown below. I want to move the address record into a separate, common avsc file that can be referenced from many other avsc files, so Customer and AddressRecord would each have their own file. How can I separate them and have the customer avsc file reference the address avsc file? And how can both files then be processed in Python? I am currently using fastavro in Python 3 to process the single avsc file, but I am open to any other utility in Python 3 or PySpark.

File name - customer_details.avsc

[
{
    "type": "record",
    "namespace": "com.company.model",
    "name": "AddressRecord",
    "fields": [
        {
            "name": "streetaddress",
            "type": "string"
        },
        {
            "name": "city",
            "type": "string"
        },
        {
            "name": "state",
            "type": "string"
        },
        {
            "name": "zip",
            "type": "string"
        }
    ]
},
{
    "namespace": "com.company.model",
    "type": "record",
    "name": "Customer",
    "fields": [
        {
            "name": "firstname",
            "type": "string"
        },
        {
            "name": "lastname",
            "type": "string"
        },
        {
            "name": "email",
            "type": "string"
        },
        {
            "name": "phone",
            "type": "string"
        },
        {
            "name": "address",
            "type": {
                "type": "array",
                "items": "com.company.model.AddressRecord"
            }
        }
    ]
}
]

Current loading code:

import fastavro

s1 = fastavro.schema.load_schema('customer_details.avsc')

How can I split the schema into different files so that the address record file can be referenced from other avsc files? And how would I then process the multiple avsc files using fastavro (Python) or any other Python utility?


Solution

  • To do this, put the AddressRecord schema in its own file named after its full name (namespace plus name), com.company.model.AddressRecord.avsc, with the following contents:

    {
        "type": "record",
        "namespace": "com.company.model",
        "name": "AddressRecord",
        "fields": [
            {
                "name": "streetaddress",
                "type": "string"
            },
            {
                "name": "city",
                "type": "string"
            },
            {
                "name": "state",
                "type": "string"
            },
            {
                "name": "zip",
                "type": "string"
            }
        ]
    }
    

    The Customer schema doesn't strictly need to follow this naming convention, since it is the top-level schema that you load directly, but it's probably a good idea to follow it anyway. So it would be in a file named com.company.model.Customer.avsc with the following contents:

    {
        "namespace": "com.company.model",
        "type": "record",
        "name": "Customer",
        "fields": [
            {
                "name": "firstname",
                "type": "string"
            },
            {
                "name": "lastname",
                "type": "string"
            },
            {
                "name": "email",
                "type": "string"
            },
            {
                "name": "phone",
                "type": "string"
            },
            {
                "name": "address",
                "type": {
                    "type": "array",
                    "items": "com.company.model.AddressRecord"
                }
            }
        ]
    }
    

    The files must be in the same directory.
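
If the combined file already exists, the split described above could be scripted rather than done by hand. The following is a minimal sketch using only the standard library; `full_name` and `split_schema_file` are hypothetical helper names (not part of fastavro), and it assumes the combined file is a JSON list of top-level named schemas like customer_details.avsc in the question.

```python
import json
from pathlib import Path


def full_name(schema: dict) -> str:
    """Return the Avro full name (namespace + name) of a named schema."""
    name = schema["name"]
    namespace = schema.get("namespace")
    # A name that already contains a dot is treated as a full name.
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def split_schema_file(combined_path, out_dir) -> list:
    """Write each top-level schema of a combined .avsc into <full name>.avsc."""
    schemas = json.loads(Path(combined_path).read_text())
    if isinstance(schemas, dict):  # a single schema rather than a list
        schemas = [schemas]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for schema in schemas:
        target = out_dir / f"{full_name(schema)}.avsc"
        target.write_text(json.dumps(schema, indent=4))
        written.append(target)
    return written
```

Calling `split_schema_file("customer_details.avsc", "schemas")` would be expected to produce schemas/com.company.model.AddressRecord.avsc and schemas/com.company.model.Customer.avsc. The sketch only splits the top-level list; record definitions nested inline inside another record are left where they are.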

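    With both files in place, you should be able to load the top-level schema with fastavro.schema.load_schema('com.company.model.Customer.avsc'). fastavro resolves the com.company.model.AddressRecord reference by reading the sibling file with that name.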