Tags: apache-spark, pyspark, avro, confluent-schema-registry, spark-avro

Serializing a Spark dataframe to Avro using to_avro


I have a Spark dataframe with the following schema:

StructType(
    StructField(id,StringType,true),
    StructField(type,StringType,true)
)

which I need to serialize to Avro with the following Avro schema, using the to_avro function from spark-avro, like so: to_avro(spark_df, jsonFormatSchema)

{
  "type": "record",
  "name": "Value",
  "fields": [
    {
      "name": "id",
      "type": "string"
    },
    {
      "name": "type",
      "type": "string"
    },
    {
      "name": "x",
      "type": [
        "null",
        "string"
      ],
      "default": null
    },
    {
      "name": "y",
      "type": [
        {
          "type": "boolean",
          "connect.default": false
        },
        "null"
      ],
      "default": false
    }
  ]
}
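
For context, the call looks roughly like this (a minimal PySpark sketch; the schema file name is a placeholder, and the columns are wrapped in a struct because to_avro takes a single column):

from pyspark.sql.functions import struct
from pyspark.sql.avro.functions import to_avro

# jsonFormatSchema is the Avro schema above, loaded as a JSON string
# ("value.avsc" is just a placeholder file name)
jsonFormatSchema = open("value.avsc", "r").read()

# to_avro takes a single column, so all dataframe columns are wrapped in a struct
avro_df = spark_df.select(
    to_avro(struct(*spark_df.columns), jsonFormatSchema).alias("value")
)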

Now, obviously, my Spark dataframe does not have the columns x and y. How do I define the Avro schema so that the Avro binary my dataframe is serialized into contains null/default values for those fields, instead of to_avro throwing an IncompatibleSchemaException?

I thought the "null" entry in the type union would take care of fields that are not present in the input Spark dataframe, but that turned out to be wrong.


Solution

  • The problem is that default values are only used when decoding, not encoding (a workaround sketch follows below the quote). See this section in the Avro specification: https://avro.apache.org/docs/current/specification/#schema-record

    Specifically this part:

    A default value for this field, only used when reading instances that lack the field for schema evolution purposes. The presence of a default value does not make the field optional at encoding time.
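
In practice, then, the encoder has to be given columns for x and y before to_avro is called, for example by adding them with the schema's default values. A minimal PySpark sketch under that assumption (the literal values mirror the defaults in the schema above, and jsonFormatSchema is the same schema string as in the question):

from pyspark.sql.functions import lit, struct
from pyspark.sql.avro.functions import to_avro

# Add the fields the Avro schema expects but the dataframe lacks,
# using the schema's default values as literals.
df_with_defaults = (
    spark_df
    .withColumn("x", lit(None).cast("string"))  # nullable string, default null
    .withColumn("y", lit(False))                # boolean, default false
)

# jsonFormatSchema is the Avro schema from the question as a JSON string
avro_df = df_with_defaults.select(
    to_avro(struct(*df_with_defaults.columns), jsonFormatSchema).alias("value")
)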