Installing spark-avro

I'm trying to read avro files in pyspark. Found out from How to read Avro file in PySpark that spark-avro is the best way to do that but I can't figure out how to install that from their Github repo. There's no downloadable jar, do I build it myself? How?

It's Spark 1.6 (pyspark) running on a cluster. I didn't set it up so don't know much about the configs but I have sudo access so I guess I should be able to install stuff. But the machine doesn't have direct internet access so need to manually copy and install stuff to it.

Thank you.

Solution

You can add spark-avro as a package when running pyspark or spark-submit: https://github.com/databricks/spark-avro#with-spark-shell-or-spark-submit but this will require internet access on driver (driver will then distribute all files to the executors).

If you have no internet access on a driver you will need to build spark-avro yourself to a fat jar:

git clone https://github.com/databricks/spark-avro.git
cd spark-avro
# If you are using spark package other than newest, 
# checkout appropriate tag based on table in spark-avro README, 
# for example for spark 1.6:
# git checkout v2.0.1 
./build/sbt assembly

Then test it using pyspark shell:

./bin/pyspark --jars ~/git/spark-avro/target/scala-2.11/spark-avro-assembly-3.1.0-SNAPSHOT.jar

>>> spark.range(10).write.format("com.databricks.spark.avro").save("/tmp/output")
>>> spark.read.format("com.databricks.spark.avro").load("/tmp/output").show()
+---+
| id|
+---+
|  7|
|  8|
|  9|
|  2|
|  3|
|  4|
|  0|
|  1|
|  5|
|  6|
+---+