Tags: amazon-web-services, amazon-emr, spark-avro

Bootstrapping spark-avro jar to Amazon EMR cluster


I want to read Avro files located in Amazon S3 from a Zeppelin notebook. I understand Databricks has a wonderful package for this, spark-avro. What steps do I need to take to bootstrap this jar file to my cluster and make it work?

When I write this in my notebook: val df = sqlContext.read.avro("s3n://path_to_avro_files_in_one_bucket/")

I get the following error: <console>:34: error: value avro is not a member of org.apache.spark.sql.DataFrameReader

I have had a look at this, but the solution posted there does not seem to work with the latest version of Amazon EMR.

If someone could give me pointers, that would really help.


Solution

  • Here is how I associate the spark-avro dependency. This method works for adding any other dependency to Spark as well.

    1. Make sure your Spark version is compatible with your spark-avro version. You'll find the dependency details here.

    2. I put my spark-avro jar file in an S3 bucket. You can use HDFS or any other store.

    3. While launching the EMR cluster, add the following JSON to the configuration: [{"classification":"spark-defaults", "properties":{"spark.files":"/path_to_spark-avro_jar_file", "spark.jars":"/path_to_spark-avro_jar_file"}, "configurations":[]}]

    This is not the only way to do this; please refer to this link for more details. A short usage sketch follows below.
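
    For reference, once the jar from step 3 is on the cluster, a Zeppelin cell along these lines should work. This is a minimal sketch assuming the Databricks spark-avro package and a Spark 1.x-style sqlContext as in the question; note that the read.avro shorthand also requires the import shown, while the explicit format string does not.

        // Assumes the spark-avro jar is on the driver and executor classpaths (step 3)
        import com.databricks.spark.avro._

        // Implicit read.avro syntax provided by the spark-avro package object
        val df = sqlContext.read.avro("s3n://path_to_avro_files_in_one_bucket/")

        // Equivalent without the import, using the explicit data source name
        val df2 = sqlContext.read
          .format("com.databricks.spark.avro")
          .load("s3n://path_to_avro_files_in_one_bucket/")

        df.printSchema()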