I'm trying to read Avro files in PySpark. I found out from "How to read Avro file in PySpark" that spark-avro is the best way to do that, but I can't figure out how to install it from its GitHub repo. There's no downloadable jar, so do I build it myself? How?
It's Spark 1.6 (PySpark) running on a cluster. I didn't set it up, so I don't know much about the configs, but I have sudo access, so I guess I should be able to install things. However, the machine doesn't have direct internet access, so I need to copy and install everything manually.
Thank you.
You can add spark-avro as a package when running pyspark or spark-submit (see https://github.com/databricks/spark-avro#with-spark-shell-or-spark-submit), but this requires internet access on the driver (the driver then distributes all files to the executors).
If you have no internet access on the driver, you will need to build spark-avro yourself into a fat jar:
git clone https://github.com/databricks/spark-avro.git
cd spark-avro
# If you are using a Spark version other than the newest,
# check out the appropriate tag based on the table in the spark-avro README,
# for example, for Spark 1.6:
# git checkout v2.0.1
./build/sbt assembly
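The directory and file name of the assembled jar depend on the tag you checked out and the Scala version it builds against, so the path is not the same for every build. A minimal sketch for locating it, assuming sbt writes the assembly to `target/scala-<version>/` inside the clone as in the path used in this answer (the helper name `find_assembly_jar` is made up for illustration):

```python
import glob
import os


def find_assembly_jar(repo_dir):
    """Return the path of the spark-avro assembly jar built under repo_dir."""
    # Both the scala-<version> directory and the jar version vary with the
    # checked-out tag, so match them with wildcards instead of hard-coding.
    pattern = os.path.join(
        repo_dir, "target", "scala-*", "spark-avro-assembly-*.jar"
    )
    matches = glob.glob(pattern)
    if not matches:
        raise IOError("no assembly jar matching " + pattern)
    # If several builds are present, prefer the most recently modified one.
    return max(matches, key=os.path.getmtime)
```

Calling `find_assembly_jar(os.path.expanduser("~/git/spark-avro"))` should give the path to pass to `--jars`; that single file is also what you would copy to the machine without internet access.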
Then test it using the pyspark shell (the exact jar path and name depend on the version you built):
./bin/pyspark --jars ~/git/spark-avro/target/scala-2.11/spark-avro-assembly-3.1.0-SNAPSHOT.jar
>>> spark.range(10).write.format("com.databricks.spark.avro").save("/tmp/output")
>>> spark.read.format("com.databricks.spark.avro").load("/tmp/output").show()
+---+
| id|
+---+
|  7|
|  8|
|  9|
|  2|
|  3|
|  4|
|  0|
|  1|
|  5|
|  6|
+---+
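The session above uses `spark`, the SparkSession object that the pyspark shell provides from Spark 2.0 onward. On Spark 1.6, as in the question, the shell exposes `sqlContext` instead. A minimal sketch of the same round trip written against that entry point, assuming the shell was started with `--jars` pointing at the assembly built from the matching tag (the helper names `write_avro` and `read_avro` are made up for illustration):

```python
# Data source name registered by the Databricks spark-avro package.
AVRO_FORMAT = "com.databricks.spark.avro"


def write_avro(df, path):
    """Write a DataFrame to path in Avro format."""
    df.write.format(AVRO_FORMAT).save(path)


def read_avro(sql_context, path):
    """Load the Avro files under path into a DataFrame."""
    return sql_context.read.format(AVRO_FORMAT).load(path)


# Usage in a Spark 1.6 pyspark shell, where sqlContext is predefined:
#   write_avro(sqlContext.range(10), "/tmp/output")
#   read_avro(sqlContext, "/tmp/output").show()
```

The helpers take the context as an argument rather than relying on a global, so the same code works with `sqlContext` on 1.6 and with `spark` on 2.x.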