I am relatively new to Spark, accessing it from SparkR, and I am trying to write an Avro file to disk, but I keep getting an error saying "Task failed while writing rows".
I am running SparkR 2.2.0-SNAPSHOT with Scala 2.11.8, and started my SparkR session via:
sparkR.session(master = "spark://[some ip here]:7077",
               appName = "nateSparkRAVROTest",
               sparkHome = "/home/ubuntu/spark",
               enableHiveSupport = FALSE,
               sparkConfig = list(spark.executor.memory = "28g"),
               sparkPackages = c("org.apache.hadoop:hadoop-aws:2.7.3",
                                 "com.amazonaws:aws-java-sdk-pom:1.10.34",
                                 "com.databricks:spark-avro_2.11:3.2.0"))
I am wondering whether I need to set up or install anything special. I include the com.databricks:spark-avro_2.11:3.2.0
package in my session launch command, have seen it download the package while the session starts, and am trying to write Avro files via this command:
SparkR::write.df(myFormalClassSparkDataFrameObject,
                 path = "/home/nathan/SparkRAVROTest/",
                 source = "com.databricks.spark.avro",
                 mode = "overwrite")
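For context, a minimal version of what I am attempting looks like the sketch below; the toy data frame and its columns are just placeholders standing in for my real data, but the path, source, and mode are the same as above.

# Minimal reproduction: build a small SparkDataFrame from a local data.frame
# (placeholder columns) and attempt the same Avro write
df <- SparkR::createDataFrame(data.frame(id = 1:3, name = c("a", "b", "c")))
SparkR::write.df(df,
                 path = "/home/nathan/SparkRAVROTest/",
                 source = "com.databricks.spark.avro",
                 mode = "overwrite")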
I'm hoping someone with more SparkR experience has run into this error and can provide some insight. Thank you for your time.
Kind Regards, Nate
I was able to get it to work by using com.databricks:spark-avro_2.11:4.0.0 in my Spark session packages.
Here is an example SparkR session configuration that worked for me:
SparkR::sparkR.session(master = "local[*]",
                       sparkConfig = list(
                         spark.driver.memory = "14g",
                         spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = "2",
                         spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs = "FALSE",
                         spark.kryoserializer.buffer.max = "1024m",
                         spark.speculation = "FALSE",
                         spark.referenceTracking = "FALSE"
                       ),
                       sparkPackages = c("org.apache.hadoop:hadoop-aws:2.7.3",
                                         "com.amazonaws:aws-java-sdk:1.7.4",
                                         "com.amazonaws:aws-java-sdk-pom:1.11.221",
                                         "com.databricks:spark-avro_2.11:4.0.0",
                                         "org.apache.httpcomponents:httpclient:4.5.2"))