Tags: scala, apache-spark, hdfs, avro

Configure Avro file size written to HDFS by Spark


I am writing a Spark DataFrame in Avro format to HDFS. I would like to split large Avro files so that they fit the Hadoop block size without being too small. Are there any DataFrame or Hadoop options for that? How can I split the files being written into smaller ones?

Here is the way I write the data to HDFS:

dataDF.write
  .format("avro")
  .option("avroSchema", parseAvroSchemaFromFile("/avro-data-schema.json").toString)
  .save(dataDir)

Solution

  • I have researched this a lot and found that it is not possible to limit the output file size directly; you can only limit the number of Avro records per file. So the only solution is to map the desired file size to a record count, as sketched below.
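For illustration, here is a minimal sketch of that mapping, assuming Spark 2.2+ where DataFrameWriter supports the maxRecordsPerFile option. The helper name, the target size, and the average record size are placeholders; the average record size would have to be measured or estimated from the actual data.

    import org.apache.spark.sql.DataFrame

    object AvroSizedWriter {
      // Estimate how many records fit into the target file size and cap each
      // output file at that count. avgRecordBytes is a placeholder and must be
      // measured or estimated for the real data.
      def writeWithTargetSize(df: DataFrame,
                              dataDir: String,
                              avroSchemaJson: String,
                              targetFileBytes: Long = 128L * 1024 * 1024, // ~ HDFS block size
                              avgRecordBytes: Long = 1024L): Unit = {
        val recordsPerFile = math.max(1L, targetFileBytes / avgRecordBytes)

        df.write
          .format("avro")
          .option("avroSchema", avroSchemaJson)
          // Available since Spark 2.2: caps the number of records per output file
          .option("maxRecordsPerFile", recordsPerFile)
          .save(dataDir)
      }
    }

With this approach the file sizes are only approximate, since they depend on how well avgRecordBytes reflects the data and on Avro compression; refining the estimate from a sample write is one way to tune it.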