I am writing data with Avro using two different schemas. The first one is:
{
"type" : "record",
"name" : "DynamicFact_aSource",
"fields" : [ {
"name" : "user",
"type" : {
"type" : "record",
"name" : "User",
"namespace" : "foo",
"fields" : [...]
}
}, {
"name" : "event",
"type" : {
"type" : "record",
"name" : "Event",
"namespace" : "foo",
"fields" : [...]
}
}, {
"name" : "aLabel",
"type" : [ "null", {
"type" : "array",
"items" : {
"type" : "record",
"name" : "Label",
"namespace" : "foo",
"doc" : "* Holds a single string.",
"fields" : [ {
"name" : "value",
"type" : {
"type" : "string",
"avro.java.string" : "String"
}
} ]
}
} ],
"default" : null
} ]
}
and the second one is:
{
"type" : "record",
"name" : "DynamicFact_bSource",
"fields" : [ {
"name" : "user",
"type" : {
"type" : "record",
"name" : "User",
"namespace" : "foo",
"fields" : [...]
}
}, {
"name" : "event",
"type" : {
"type" : "record",
"name" : "Event",
"namespace" : "foo",
"fields" : [...]
}
}, {
"name" : "anotherLabel",
"type" : [ "null", {
"type" : "array",
"items" : {
"type" : "record",
"name" : "Label",
"namespace" : "foo",
"doc" : "* Holds a single string.",
"fields" : [ {
"name" : "value",
"type" : {
"type" : "string",
"avro.java.string" : "String"
}
} ]
}
} ],
"default" : null
} ]
}
Now I want to read data written with both of these schemas into one RDD. To do that, I am creating a new reader schema:
{
"type": "record",
"name": "DynamicFact_multiSourceAggregation",
"fields":
[
{
"name": "user",
"type":
{
"type": "record",
"name": "User",
"namespace": "foo",
"fields":
[]
}
},
{
"name": "event",
"type":
{
"type": "record",
"name": "Event",
"namespace": "foo",
"fields":
[]
}
},
{
"name": "theMultiSourceAttributeLabel",
"type":
[
"null",
{
"type": "array",
"items":
{
"type": "record",
"name": "Label",
"namespace": "foo",
"doc": "* Holds a single string.",
"fields":
[
{
"name": "value",
"type":
{
"type": "string",
"avro.java.string": "String"
}
}
]
}
}
],
"default": null,
"aliases":
[
"aLabel",
"anotherLabel"
]
}
]
}
The reader schema defines user and event exactly as the writer schemas do. For the third field, however, instead of the names aLabel and anotherLabel that the two writer schemas use, I am using a different name, theMultiSourceAttributeLabel, and setting aliases for it:
"aliases":
[
"aLabel",
"anotherLabel"
]
so that the reader can map Avro fields named aLabel or anotherLabel to theMultiSourceAttributeLabel. Although user and event are loaded successfully using the reader schema, the aLabel and anotherLabel fields are not: every theMultiSourceAttributeLabel field is null.
Should Avro be able to figure out this mapping once it is given the list of aliases? Why are all theMultiSourceAttributeLabel values in the resulting RDD null?
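The problem here is that the record names of the writer schemas (DynamicFact_aSource and DynamicFact_bSource) differ from the record name of the reader schema (DynamicFact_multiSourceAggregation). Aliasing the field names is therefore not enough: the reader schema's record name must be aliased as well. So when defining the reader schema, one must do: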
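def schema(): Schema = {
  // Record-level aliases: the record names of the writer schemas,
  // here DynamicFact_aSource and DynamicFact_bSource.
  val builder = SchemaBuilder
    .record(s"DynamicFact_$name")
    .aliases(getSourceSchemaAliases: _*)
    .fields()
  ...
}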
This produces:
{
"type": "record",
"name": "DynamicFact_multiSourceAggregation",
"fields":[...],
"aliases":
[
"DynamicFact_aSource",
"DynamicFact_bSource"
]
}
which is enough for the reader schema to read data written with both writer schemas.
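To check the combination of record-level and field-level aliases in isolation, a minimal round trip like the one below may help. It is only a sketch under simplifying assumptions: the label field is reduced to a nullable string instead of an array of Label records, the user and event fields are left out, and the object and helper names (AliasRoundTrip, writerSchema, readerSchema, roundTrip) are made up for illustration. It assumes the Apache Avro Java library is on the classpath.

```scala
import java.io.ByteArrayOutputStream

import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object AliasRoundTrip {

  // Simplified stand-in for a writer schema: one nullable string field
  // (the real schemas use an array of Label records here).
  def writerSchema(recordName: String, fieldName: String): Schema =
    SchemaBuilder.record(recordName).fields()
      .optionalString(fieldName)
      .endRecord()

  // Reader schema with aliases on the record AND on the field.
  def readerSchema(recordAliases: Seq[String]): Schema =
    SchemaBuilder.record("DynamicFact_multiSourceAggregation")
      .aliases(recordAliases: _*)
      .fields()
      .name("theMultiSourceAttributeLabel").aliases("aLabel", "anotherLabel")
      .`type`().optional().stringType()
      .endRecord()

  // Serialize one record with the writer schema, then decode it again
  // while resolving the writer schema against the reader schema.
  def roundTrip(writer: Schema, reader: Schema, fieldName: String, value: String): GenericRecord = {
    val record = new GenericData.Record(writer)
    record.put(fieldName, value)

    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](writer).write(record, encoder)
    encoder.flush()

    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    new GenericDatumReader[GenericRecord](writer, reader).read(null, decoder)
  }

  def main(args: Array[String]): Unit = {
    val reader = readerSchema(Seq("DynamicFact_aSource", "DynamicFact_bSource"))

    val fromA = roundTrip(writerSchema("DynamicFact_aSource", "aLabel"), reader, "aLabel", "from A")
    val fromB = roundTrip(writerSchema("DynamicFact_bSource", "anotherLabel"), reader, "anotherLabel", "from B")

    // Both records should expose their value under the reader's field name.
    println(fromA.get("theMultiSourceAttributeLabel"))
    println(fromB.get("theMultiSourceAttributeLabel"))
  }
}
```

Passing an empty Seq to readerSchema should reproduce the behaviour described in the question, with the field falling back to its null default, although the exact outcome may depend on the Avro version in use.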