Tags: r, apache-spark, avro, sparkr

SparkR 2.2.0: Writing AVRO fails


I am relatively new to Spark, which I access from SparkR. I am trying to write an AVRO file to disk, but I keep getting an error saying "Task failed while writing rows".

I am running SparkR 2.2.0-SNAPSHOT with Scala 2.11.8, and I started my SparkR session via:

sparkR.session(master = "spark://[some ip here]:7077",
               appName = "nateSparkRAVROTest",
               sparkHome = "/home/ubuntu/spark",
               enableHiveSupport = FALSE,
               sparkConfig = list(spark.executor.memory = "28g"),
               sparkPackages = c("org.apache.hadoop:hadoop-aws:2.7.3",
                                 "com.amazonaws:aws-java-sdk-pom:1.10.34",
                                 "com.databricks:spark-avro_2.11:3.2.0"))

I am wondering if I need to set up or install anything special. I include the com.databricks:spark-avro_2.11:3.2.0 package in my session launch command, have seen it download while the session starts, and am trying to write AVRO files via this command:

SparkR::write.df(myFormalClassSparkDataFrameObject, path = "/home/nathan/SparkRAVROTest/", source = "com.databricks.spark.avro", mode = "overwrite")
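For completeness, the read side I would use to verify the output is a minimal sketch reusing the path and data source from the write above:

# Sketch of the verification read; path and source match the write above.
df <- SparkR::read.df("/home/nathan/SparkRAVROTest/",
                      source = "com.databricks.spark.avro")
SparkR::head(df)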

I'm hoping someone with more experience using SparkR has experienced this error and could provide some insight. Thank you for your time.

Kind Regards, Nate


Solution

  • I was able to get it to work by using com.databricks:spark-avro_2.11:4.0.0 in my sparkPackages.

    Here is an example SparkR session configuration that worked:

    SparkR::sparkR.session(master = "local[*]",
                           sparkConfig = list(spark.driver.memory = "14g",
                                              # Use the v2 file output committer and skip _SUCCESS markers
                                              spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = "2",
                                              spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs = "FALSE",
                                              spark.kryoserializer.buffer.max = "1024m",
                                              spark.speculation = "FALSE",
                                              # Note: the valid key is spark.kryo.referenceTracking
                                              spark.kryo.referenceTracking = "FALSE"),
                           sparkPackages = c("org.apache.hadoop:hadoop-aws:2.7.3",
                                             "com.amazonaws:aws-java-sdk:1.7.4",
                                             "com.amazonaws:aws-java-sdk-pom:1.11.221",
                                             "com.databricks:spark-avro_2.11:4.0.0",
                                             "org.apache.httpcomponents:httpclient:4.5.2"))