Tags: apache-spark, apache-spark-sql, emr, amazon-emr

Adding the spark-csv dependency in Zeppelin creates a network error


Adding the spark-csv dependency in Zeppelin creates a network error. I went to the Spark interpreter settings in Zeppelin and added the dependency com.databricks:spark-csv_2.10:1.2.0. I also added it to the interpreter's args option.

[screenshot: Spark interpreter dependency settings]
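
If the interpreter-settings route keeps failing, the same artifact can also be pulled in from a notebook paragraph with Zeppelin's %dep interpreter. This is a minimal sketch, assuming your Zeppelin build ships %dep and that the paragraph is run right after an interpreter restart, before the Spark context is created:

%dep
// Zeppelin's dynamic dependency loader; must run before the Spark
// interpreter starts, so restart the interpreter first
z.reset()
z.load("com.databricks:spark-csv_2.10:1.2.0")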

I restarted Zeppelin and ran the following code:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("https://github.com/databricks/spark-csv/raw/master/src/test/resources/cars.csv")
df.printSchema()

[screenshot: error output from the notebook paragraph]

Am I adding the dependency correctly?

UPDATE

I tried changing the library to com.databricks:spark-csv_2.11:jar:1.6.0 and got the following:

Error setting properties for interpreter 'spark.spark': Could not find artifact com.databricks:spark-csv_2.11:jar:1.6.0 in central (http://repo1.maven.org/maven2/)



Solution

  • It looks like you used a pretty old library version which, in addition, was built for Scala 2.10 (while your Spark appears to be built with Scala 2.11; a quick check is shown below).

    Change the package to com.databricks:spark-csv_2.11:1.5.0 and it should work.
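
    A quick way to confirm which Scala version the notebook's Spark is running on (and therefore whether the _2.10 or _2.11 artifact suffix is the right one) is to print the Scala version from a Spark paragraph. This is just a sanity check, not part of the fix:

    // Run in a Spark (Scala) paragraph; prints something like "version 2.11.8",
    // which tells you which spark-csv artifact suffix to use
    println(scala.util.Properties.versionString)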