Tags: java, apache-spark, apache-spark-sql, spark-avro

Map Avro files on Java class with different field names


I've got a problem with a simple Spark task that reads an Avro file and then saves it as a Hive Parquet table.

I've got two types of files. In general they are the same, but the key struct differs slightly in its field names.

Type 1

root
|-- pk: struct (nullable = true)
    |-- term_id: string (nullable = true)

Type 2

root
|-- pk: struct (nullable = true)
    |-- id: string (nullable = true)

I'm reading the Avro files using spark-avro and then mapping the DataFrame to a bean like this:

Dataset<SomeClass> df = avroDF.as(Encoders.bean(SomeClass.class));

SomeClass is a simple one-field class with a getter and a setter.

public class SomeClass{
    private String term_id;
    ...
}

So if I'm reading Avro type 1, it's OK. But if I'm reading Avro type 2, an error occurs. And vice versa if I change the field name to private String id;

Is there any universal solution to my problem? I found @AvroName, but it doesn't allow setting several names. Thanks.


Solution

  • A possible solution is:

    StructType avroExtendedSchema = avroDF.schema().add("id", DataTypes.StringType);
    avroDF.map((MapFunction<Row, Row>) row ->
                RowFactory.create(row.getStruct(0), row.getStruct(0).getString(0)),
            RowEncoder.apply(avroExtendedSchema)).toDF();
    

    So the second field of the DataFrame will be named "id" and will contain the string key. The original "pk" struct can be dropped later (note that drop returns a new Dataset rather than modifying the existing one):

    avroDF = avroDF.drop("pk");
    

    P.S. I found a third type of schema:

    root
    |-- pk: struct (nullable = true)
        |-- id: int (nullable = true)
    

    So the final code looks like this:

    DataType keyType = avroDF.select("pk.*").schema().fields()[0].dataType();
    StructType avroExtendedSchema = avroDF.schema().add("id", keyType);
    avroDF.map((MapFunction<Row, Row>) row ->
                RowFactory.create(row.getStruct(0), row.getStruct(0).get(0)),
            RowEncoder.apply(avroExtendedSchema)).drop("pk").toDF();
    

    This code works for any primitive or String key.
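
    An alternative universal approach (a sketch of my own, not part of the answer above) is to skip the row-level map entirely: read the name of the single field inside "pk" at runtime and promote it to a top-level column with withColumn. It assumes the "pk" struct always contains exactly one key field:

    ```java
    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Normalize the key field inside "pk" to a fixed top-level name,
    // whatever the incoming Avro file calls it (term_id, id, ...).
    String keyField = avroDF.select("pk.*").columns()[0]; // e.g. "term_id" or "id"
    Dataset<Row> normalized = avroDF
            .withColumn("id", col("pk." + keyField)) // copy the key out of the struct
            .drop("pk");                             // struct is no longer needed
    ```

    After this, normalized always exposes the key as a top-level "id" column regardless of the source schema and key type, and it can still be mapped to a bean with Encoders.bean as before.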