How to write a Parquet file from a Scala case class using AvroParquetWriter?


I have a case class like below:

case class Person(id:Int,name: String)

I wrote the method below to write a collection of T (for example a Seq[T]) to a Parquet file using AvroParquetWriter.

  def writeToFile[T](data: Iterable[T], schema: Schema, path: String, accessKey: String, secretKey: String): Unit = {
    val conf = new Configuration

    conf.set("fs.s3.awsAccessKeyId", accessKey)
    conf.set("fs.s3.awsSecretAccessKey", secretKey)

    val s3Path = new Path(path)
    val writer = AvroParquetWriter.builder[T](s3Path)
      .withConf(conf)
      .withSchema(schema)
      .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
      .build()
      .asInstanceOf[ParquetWriter[T]]

    data.foreach(writer.write)

    writer.close()
  }

The schema is:

val schema = SchemaBuilder
    .record("Person")
      .fields()
      .requiredInt("id")
      .requiredString("name")
      .endRecord()

When I call writeToFile with the code below, I get an exception:

val personData = Seq(Person(1,"A"),Person(2,"B"))

ParquetService.writeToFile[Person](
      data = personData,
      schema = schema,
      path = s3Path,
      accessKey = accessKey,
      secretKey = secretKey
)

java.lang.ClassCastException: com.entities.Person cannot be cast to org.apache.avro.generic.IndexedRecord

Why can't Person be cast to IndexedRecord? Is there anything extra I need to do to get rid of this exception?


Solution

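  • I had a similar issue. Without an explicit data model, the writer expects every record to implement IndexedRecord, which a plain case class does not, hence the ClassCastException. ReflectData reads the fields through Java reflection instead. According to this example

    https://github.com/apache/parquet-mr/blob/f84938441be49c665595c936ac631c3e5f171bf9/parquet-avro/src/test/java/org/apache/parquet/avro/TestReflectReadWrite.java#L141

    you are missing one method call on the writer builder.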

    val writer = AvroParquetWriter.builder[T](s3Path)
      .withConf(conf)
      .withSchema(schema)
      .withDataModel(ReflectData.get) // the missing call (org.apache.avro.reflect.ReflectData)
      .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
      .build()
    

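    Also, if you wish to support nulls in your data, you can use ReflectData.AllowNull.get() instead. The affected fields in the schema then need to be declared nullable as well (for example with optionalString rather than requiredString).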
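Putting the pieces together, the whole write path might look like the sketch below. It is only an illustration under a few assumptions: the object name ReflectWriteSketch and the method name writeWithReflect are made up for this example, the S3 credentials are left out so that a default Configuration writes to whatever filesystem the path resolves to, and the writer is closed in a finally block so that a failed write does not leak it. The Person class and schema are the ones from the question.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.reflect.ReflectData
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetFileWriter

case class Person(id: Int, name: String)

// Hypothetical container object, named only for this sketch.
object ReflectWriteSketch {

  // Same record layout as the schema in the question.
  val personSchema: Schema = SchemaBuilder
    .record("Person")
    .fields()
    .requiredInt("id")
    .requiredString("name")
    .endRecord()

  // Writes plain objects (such as case class instances) by letting
  // ReflectData read their fields via reflection.
  def writeWithReflect[T](data: Iterable[T],
                          schema: Schema,
                          path: String,
                          conf: Configuration = new Configuration): Unit = {
    val writer = AvroParquetWriter.builder[T](new Path(path))
      .withConf(conf)
      .withSchema(schema)
      .withDataModel(ReflectData.get()) // the call missing in the question
      .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
      .build()

    // Close the writer even if one of the writes throws.
    try data.foreach(writer.write)
    finally writer.close()
  }
}
```

A call such as ReflectWriteSketch.writeWithReflect(Seq(Person(1, "A"), Person(2, "B")), ReflectWriteSketch.personSchema, "/tmp/people.parquet") would then write both records; the path here is only a placeholder, and the S3 settings from the question can be passed in through the conf parameter.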