avro data-ingestion avro-tools

How can I generate a single .avro file from a large flat file with 30MB+ of data?


Currently, two .avro files are generated for a 10 KB input file. If I follow the same approach with my actual file (30MB+), I will end up with n files.

So I need a solution that generates only one or two .avro files even if the source file is large.

Also, is there any way to avoid declaring the column names manually?

Current approach:

spark-shell --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Manual schema declaration of the 'ind' and 'co' column names and types
val customSchema = StructType(Array(
  StructField("ind", StringType, true),
  StructField("co", StringType, true)))

val df = sqlContext.read.format("com.databricks.spark.csv").option("comment", "\"").option("quote", "|").schema(customSchema).load("/tmp/file.txt")

df.write.format("com.databricks.spark.avro").save("/tmp/avroout")

// Note: /tmp/file.txt is the input file/dir, and /tmp/avroout is the output dir


Solution

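  • Spark writes one output file per partition of the DataFrame, so the number of .avro files depends on how many partitions the DataFrame has at write time. Try reducing the number of partitions before writing the data as Avro (or any other format), using the DataFrame's repartition or coalesce function: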

    df.coalesce(1).write.format("com.databricks.spark.avro").save("/tmp/avroout")

    This writes only one .avro file to "/tmp/avroout".

    Hope this helps!
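
On the second part of the question (avoiding the manual column declaration), here is a minimal sketch rather than a definitive recipe. It assumes /tmp/file.txt starts with a header row containing the column names, which the question does not confirm, and it leaves out the question's "comment" and "quote" options for brevity. It uses spark-csv's "header" and "inferSchema" options, then checks the partition count before and after coalesce(1) so you can see why the number of output files changes. The name `single` is just an illustrative variable.

```scala
// Run inside spark-shell started with the same --packages as in the question,
// so that sqlContext and both Databricks data sources are available.

// Assumption: /tmp/file.txt starts with a header row holding the column names.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // take column names from the first line
  .option("inferSchema", "true")  // extra pass over the data to guess column types
  .load("/tmp/file.txt")

// Show the column names and types that were picked up.
df.printSchema()

// A plain write produces one part file per partition.
println(s"partitions before coalesce: ${df.rdd.partitions.length}")

// Collapse to a single partition so the write produces a single .avro file.
val single = df.coalesce(1)
println(s"partitions after coalesce: ${single.rdd.partitions.length}")

single.write.format("com.databricks.spark.avro").save("/tmp/avroout")
```

If the file has no header row, the "header" option should be left off; spark-csv then assigns generated column names instead (C0, C1, ... in spark-csv 1.x, as far as I know). Note that save fails by default if /tmp/avroout already exists; adding .mode("overwrite") before .format(...) replaces it. Also keep in mind that coalesce(1) sends all data through a single task, which is fine for a 30 MB file but can become a bottleneck for much larger inputs.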