Tags: apache-spark, pyspark, avro, confluent-schema-registry, spark-avro

Serializing a Spark dataframe to Avro using to_avro


I have a Spark dataframe with the following schema:

StructType(
    StructField(id,StringType,true),
    StructField(type,StringType,true)
)

which I need to serialize to Avro with the following Avro schema, using the to_avro function from spark-avro, like so: to_avro(spark_df, jsonFormatSchema)

{
  "type": "record",
  "name": "Value",
  "fields": [
    {
      "name": "id",
      "type": "string"
    },
    {
      "name": "type",
      "type": "string"
    },
    {
      "name": "x",
      "type": [
        "null",
        "string"
      ],
      "default": null
    },
    {
      "name": "y",
      "type": [
        {
          "type": "boolean",
          "connect.default": false
        },
        "null"
      ],
      "default": false
    }
  ]
}
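
For context, the call looks roughly like this (a minimal PySpark sketch; the schema file name is a placeholder, and the columns are wrapped in a struct because to_avro takes a single column):

from pyspark.sql.functions import struct
from pyspark.sql.avro.functions import to_avro

# jsonFormatSchema is the Avro schema above, loaded as a JSON string
# ("value.avsc" is just a placeholder file name)
jsonFormatSchema = open("value.avsc", "r").read()

# to_avro takes a single column, so all dataframe columns are wrapped in a struct
avro_df = spark_df.select(
    to_avro(struct(*spark_df.columns), jsonFormatSchema).alias("value")
)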

Now, obviously, my Spark dataframe does not have the columns x and y. How do I define the Avro schema so that the Avro binary my dataframe is serialized into contains null/default values for those fields, instead of to_avro throwing an IncompatibleSchemaException?

I thought the "null" entry in the type union would take care of fields that are not present in the input Spark dataframe, but that turned out to be wrong.


Solution

  • The problem is that default values are only used when decoding, not encoding (a workaround sketch follows below the quote). See this section in the Avro specification: https://avro.apache.org/docs/current/specification/#schema-record

    Specifically this part:

    A default value for this field, only used when reading instances that lack the field for schema evolution purposes. The presence of a default value does not make the field optional at encoding time.
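
In practice, then, the encoder has to be given columns for x and y before to_avro is called, for example by adding them with the schema's default values. A minimal PySpark sketch under that assumption (the literal values mirror the defaults in the schema above, and jsonFormatSchema is the same schema string as in the question):

from pyspark.sql.functions import lit, struct
from pyspark.sql.avro.functions import to_avro

# Add the fields the Avro schema expects but the dataframe lacks,
# using the schema's default values as literals.
df_with_defaults = (
    spark_df
    .withColumn("x", lit(None).cast("string"))  # nullable string, default null
    .withColumn("y", lit(False))                # boolean, default false
)

# jsonFormatSchema is the Avro schema from the question as a JSON string
avro_df = df_with_defaults.select(
    to_avro(struct(*df_with_defaults.columns), jsonFormatSchema).alias("value")
)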