Nullable Array Fields in Avro


I have the following JSON dataset:

{  
   "hashtags":null
}
{  
   "hashtags":[  
      "value1",
      "value2"
   ]
}

and the following Avro schema generated from the Kite SDK (which looks correct - a union of null or array of string):

{
  "type" : "record",
  "name" : "tweet",
  "fields" : [ {
    "name" : "hashtags",
    "type" : [ "null", {
      "type" : "array",
      "items" : "string"
    } ],
    "doc" : "Type inferred from 'null'"
  } ]
}

When I try to convert the data using

avro-tools fromjson --schema-file tweet.avsc twitter.json > twitter.avro

I get the following error (trimmed for brevity):

Exception in thread "main" org.apache.avro.AvroTypeException: Expected start-union. Got START_ARRAY

Changing the null case to an empty array:

{  "hashtags":null   } to {  "hashtags":[] }

with the schema changed to allow strings or null in the items field

"type" : {
      "type" : "array",
      "items" : [ "null", "string" ]
    }

works correctly once each string in the input JSON is qualified with its type, e.g. {"string": "value1"} instead of "value1".

As such, is it possible to have a nullable array field, or, from Avro's perspective, is the nullability handled with an empty array?


Solution

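  • Yes, a nullable array field is possible. However, whenever your schema contains a union, Avro's JSON encoding requires you to state explicitly which branch of the union each non-null value belongs to, by wrapping the value in an object keyed by the type name. With your original schema the second record would have to be written as {"hashtags": {"array": ["value1", "value2"]}}; a null stays a plain null. That missing wrapper is why the parser reports "Expected start-union. Got START_ARRAY". Alternatively, preprocess your data as you have done, so that the union is no longer needed. With the empty-array approach, you'll find that using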

    "type" : {
      "type" : "array",
      "items" : "string"
    }
    

    works too, because every value is now an array of plain strings. The null branch in items is unneeded.
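If you would rather keep the original nullable-array schema, the input can be rewritten into the wrapped form before it is passed to avro-tools. The following is a minimal sketch rather than a definitive tool: it assumes one JSON record per line, and the helper names wrap_union and convert_line are made up for illustration. The only Avro-specific detail it relies on is that a non-null value in a ["null", array] union is encoded as an object keyed by "array".

```python
import json


def wrap_union(value, branch="array"):
    """Encode a value for a ["null", <branch>] union in Avro's JSON encoding.

    null stays null; anything else is wrapped as {branch: value}.
    (Hypothetical helper, not part of any Avro library.)
    """
    if value is None:
        return None
    return {branch: value}


def convert_line(line, field="hashtags"):
    """Rewrite one JSON record so that `field` uses the union encoding.

    Assumes the record is a single JSON object on one line.
    """
    record = json.loads(line)
    record[field] = wrap_union(record.get(field))
    return json.dumps(record)


# The two records from the question, one per line.
sample = [
    '{"hashtags": null}',
    '{"hashtags": ["value1", "value2"]}',
]

converted = [convert_line(line) for line in sample]
for line in converted:
    print(line)
```

The first record is left as null and the second becomes {"hashtags": {"array": ["value1", "value2"]}}, which is the shape the union in the original schema expects.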