java avro confluent-platform avro-tools

Why does an Avro field that was string now require avro.java.string type?


In Avro IDL I have a Message record defined as follows:

record Message{

    MessageId id;
    array<string> dataField;
}

I am using this record in another record with a null union:

record Invoice{
    ...
    union {null,array<Message>} message;
}

We have a Java Kafka consumer (we're using Confluent Platform) whose classes are generated by the avro-maven-plugin version 1.10.2, configured with <stringType>String</stringType>.

When we make a call such as this:

List<String> msgList = message.getDataField();
for (String msg : msgList) {...}

we receive the following error on the second line: class org.apache.avro.util.Utf8 cannot be cast to class java.lang.String

Previously, the Invoice object was defined as:

 record Invoice{
    ...
    array<Message> message;
}

and we did not receive this error. We have found that in our schema file, changing from

 "name" : "dataField",
      "type" : {
        "type" : "array",
        "items" : "string"
      }

to

"name" : "dataField",
 "type" : {
   "type" : "array",
     "items" :{
        "type": "string",
        "avro.java.string" : "String"
   }
 }

corrects the problem.

I'm unclear as to why adding the union caused this change in behavior. Should I declare every string in the schema with the avro.java.string property, and if so, how do I do that with Avro IDL?


Solution

  • At this point, there appear to be three ways to resolve this, at least when using Confluent Platform version 5.5.1 or later. I consider the underlying problem to be an open defect in Avro.

    The first option is to update the Avro Schema file with a global search and replace of "type":"string" to

    "type": {
           "avro.java.string": "String",
           "type": "string"
        }
    

    This first option has to be applied after generating any schema files from Avro IDL, since IDL doesn't support this construct, which makes IDL less useful in this case. Strangely, this approach does not appear to affect records that come in via REST Proxy with a plain "type":"string" and no avro.java.string information. They appear able to use a schema defined either way; I was expecting the updated schema with the avro.java.string information to cause problems with REST Proxy requests that lack that detail.
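
The search and replace in the first option could be scripted roughly as follows. This is only a sketch: the class name JavaStringAnnotator and the annotate method are made up for illustration, and it works on the schema as plain text rather than through the Avro parser. It also rewrites "items" and "values", because the dataField case in the question is an array whose "items" is "string". It does not touch strings inside unions such as ["null","string"], and it is not idempotent, so it should be run once on freshly generated output and the result checked.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaStringAnnotator {
    // Matches "type", "items" or "values" keys whose value is the bare "string" primitive.
    private static final Pattern BARE_STRING =
        Pattern.compile("(\"(?:type|items|values)\"\\s*:\\s*)\"string\"");

    // Replacement value carrying the avro.java.string property.
    private static final String ANNOTATED =
        "{\"type\": \"string\", \"avro.java.string\": \"String\"}";

    // Hypothetical helper: returns the schema text with bare strings annotated.
    static String annotate(String schemaJson) {
        Matcher matcher = BARE_STRING.matcher(schemaJson);
        // $1 keeps the original key and spacing; only the value is swapped.
        return matcher.replaceAll("$1" + Matcher.quoteReplacement(ANNOTATED));
    }

    public static void main(String[] args) {
        String before = "{\"type\" : \"array\", \"items\" : \"string\"}";
        System.out.println(annotate(before));
    }
}
```

Because the rewrite is purely textual, it is worth parsing the output with the same Avro version the build uses before registering it.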

    The second option is to set auto.register.schemas=false and use.latest.version=true, though this may cause unintended consequences with compatibility in the future.
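
For the second option, the two settings are ordinary client properties. A minimal sketch of building them in Java is below; the class and method names are hypothetical, and it assumes the properties are merged into whatever configuration is passed to the Kafka client that uses the Confluent Avro serializer. Other required settings (bootstrap servers, schema registry URL, serializer classes) are left out.

```java
import java.util.Properties;

public class SchemaLookupConfigSketch {
    // Hypothetical helper: returns only the two settings named in the answer.
    static Properties schemaLookupProps() {
        Properties props = new Properties();
        // Do not register the schema embedded in the generated classes.
        props.setProperty("auto.register.schemas", "false");
        // Use the latest schema version already registered for the subject.
        props.setProperty("use.latest.version", "true");
        return props;
    }

    public static void main(String[] args) {
        schemaLookupProps().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```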

    The third option is to just not use the <stringType> directive in the Maven configuration for Avro Tools. This means a lot of coding around the CharSequence type that is used by default, usually in the form of .toString() calls.
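
The .toString() calls in the third option can be kept in one place with a small helper. The sketch below is an assumption about how one might do it, not part of the generated Avro code: toStrings is a hypothetical method that accepts any list of CharSequence values and copies them into real String objects. Since org.apache.avro.util.Utf8 implements CharSequence, iterating with a CharSequence loop variable should also avoid the cast to String that fails in the question.

```java
import java.util.ArrayList;
import java.util.List;

public class CharSequenceStrings {
    // Hypothetical helper: converts each element to java.lang.String via toString().
    static List<String> toStrings(List<? extends CharSequence> values) {
        List<String> result = new ArrayList<>(values.size());
        for (CharSequence value : values) {
            // Preserve nulls rather than failing on them.
            result.add(value == null ? null : value.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<CharSequence> mixed = new ArrayList<>();
        mixed.add(new StringBuilder("first")); // stand-in for a non-String CharSequence
        mixed.add("second");
        for (String msg : toStrings(mixed)) {
            System.out.println(msg);
        }
    }
}
```

With this in place, the loop from the question would iterate over toStrings(message.getDataField()) instead of the raw list.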