I want to read Avro files located in Amazon S3 from a Zeppelin notebook. I understand Databricks has a wonderful package for this, spark-avro.
What are the steps that I need to take in order to bootstrap this jar file to my cluster and make it work?
When I write this in my notebook,
val df = sqlContext.read.avro("s3n://path_to_avro_files_in_one_bucket/")
I get the error below:
<console>:34: error: value avro is not a member of org.apache.spark.sql.DataFrameReader
I have had a look at this. I guess the solution posted there does not work for the latest version of Amazon EMR.
If someone could give me pointers, that would really help.
Here is how I associate the spark-avro dependency. This method works for associating any other dependency with Spark as well.
Make sure your Spark version is compatible with your spark-avro version. You'll find the details of the dependencies here.
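If you're not sure which Spark and Scala versions your EMR release ships (and therefore which spark-avro artifact to pick), a quick check from the notebook looks like this; it assumes the Zeppelin Spark interpreter already provides sc:

// Print the Spark version the cluster is running, so you can pick a
// spark-avro build compiled against it (and against the right Scala version).
println(sc.version)                               // e.g. 2.1.0
println(scala.util.Properties.versionString)      // e.g. version 2.11.8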
I put my spark-avro jar in my S3 bucket. You can use HDFS or any other store.
While launching an EMR cluster, add the following JSON to the configuration:
[{"classification":"spark-defaults", "properties":{"spark.files":"/path_to_spark-avro_jar_file", "spark.jars":"/path_to_spark-avro_jar_file"}, "configurations":[]}]
Adding the JSON at cluster launch is not the only way to do this. Please refer to this link for more details.
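With the jar bootstrapped, the read from the question should work once the spark-avro implicits are imported. A minimal sketch (the bucket path is just a placeholder):

// Brings the .avro convenience method into scope on DataFrameReader.
import com.databricks.spark.avro._

val df = sqlContext.read.avro("s3n://path_to_avro_files_in_one_bucket/")

// Equivalent form that does not need the implicit import:
val df2 = sqlContext.read.format("com.databricks.spark.avro").load("s3n://path_to_avro_files_in_one_bucket/")

df.printSchema()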