pyspark spark-avro

Installing spark-avro


I'm trying to read Avro files in pyspark. I found out from "How to read Avro file in PySpark" that spark-avro is the best way to do that, but I can't figure out how to install it from their GitHub repo. There's no downloadable jar; do I build it myself? How?

It's Spark 1.6 (pyspark) running on a cluster. I didn't set it up, so I don't know much about the configs, but I have sudo access, so I guess I should be able to install things. The machine doesn't have direct internet access, though, so I need to copy and install everything manually.

Thank you.


Solution

  • You can add spark-avro as a package when running pyspark or spark-submit (see https://github.com/databricks/spark-avro#with-spark-shell-or-spark-submit), but this requires internet access on the driver, which then distributes the files to the executors.

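    If the driver has no internet access, you will need to build spark-avro into a fat jar yourself. The sbt build downloads its own dependencies, so run it on a machine that does have internet access and then copy the resulting jar to the cluster: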

    git clone https://github.com/databricks/spark-avro.git
    cd spark-avro
    # If you are using a Spark version other than the newest,
    # check out the appropriate tag based on the table in the spark-avro README,
    # for example for Spark 1.6:
    # git checkout v2.0.1
    ./build/sbt assembly
    

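    Then test it in the pyspark shell. The jar path depends on the Scala version and the tag you built, so adjust it to match your build output. The session below uses the `spark` object available from Spark 2.0 onwards; on Spark 1.6 the shell exposes `sqlContext` instead, so substitute `sqlContext` for `spark`: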

    ./bin/pyspark --jars ~/git/spark-avro/target/scala-2.11/spark-avro-assembly-3.1.0-SNAPSHOT.jar
    
    >>> spark.range(10).write.format("com.databricks.spark.avro").save("/tmp/output")
    >>> spark.read.format("com.databricks.spark.avro").load("/tmp/output").show()
    +---+
    | id|
    +---+
    |  7|
    |  8|
    |  9|
    |  2|
    |  3|
    |  4|
    |  0|
    |  1|
    |  5|
    |  6|
    +---+
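
Since the jar path passed to `--jars` above changes with the Scala version and the spark-avro tag that was built, it can be handy to locate the jar rather than type the path by hand. Here is a minimal sketch, not part of spark-avro itself: the helper names `find_assembly_jar` and `pyspark_command` are hypothetical, and the `target/scala-*/spark-avro-assembly-*.jar` layout is an assumption taken from the path shown above.

```python
import glob
import os


def find_assembly_jar(repo_dir):
    """Return the most recently modified assembly jar under repo_dir, or None.

    Assumes the layout target/scala-<version>/spark-avro-assembly-<version>.jar,
    as in the path used above; adjust the pattern if your build differs.
    """
    pattern = os.path.join(
        repo_dir, "target", "scala-*", "spark-avro-assembly-*.jar"
    )
    jars = glob.glob(pattern)
    if not jars:
        return None
    return max(jars, key=os.path.getmtime)


def pyspark_command(jar_path, pyspark_bin="./bin/pyspark"):
    """Build the argument list that starts the pyspark shell with the jar attached."""
    return [pyspark_bin, "--jars", jar_path]


if __name__ == "__main__":
    # Hypothetical checkout location, matching the path used above.
    repo = os.path.expanduser("~/git/spark-avro")
    jar = find_assembly_jar(repo)
    if jar is None:
        print("No assembly jar found; run ./build/sbt assembly first")
    else:
        print(" ".join(pyspark_command(jar)))
```

Run it on the machine where the jar was built or copied to, from the Spark installation directory, and paste the printed command into the shell.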