apache-spark, azure-databricks, databricks-connect

Databricks connect fails with No FileSystem for scheme: abfss


I have set up Databricks Connect so that I can develop locally and get IntelliJ goodies while at the same time leveraging the power of a big Spark cluster on Azure Databricks.

When I want to read from or write to Azure Data Lake, e.g. spark.read.csv("abfss://blah.csv"), I get the following:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: abfss
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:355)
    at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:618)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:467)
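
For reference, a snippet along these lines is enough to trigger the error. This is only a sketch: the container and storage account names in the abfss URI are placeholders, not my real ones.

    // Hypothetical minimal repro; the abfss URI below is a placeholder.
    import org.apache.spark.sql.SparkSession

    object AbfssRepro {
      def main(args: Array[String]): Unit = {
        // With Databricks Connect, getOrCreate() returns a session that is
        // meant to submit work to the remote Databricks cluster.
        val spark = SparkSession.builder().getOrCreate()

        // This is the call that produces the stack trace above; the path check
        // in DataSource.checkAndGlobPathIfNecessary runs on the local driver.
        val df = spark.read.csv("abfss://container@account.dfs.core.windows.net/blah.csv")
        df.show()
      }
    }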

From this I had the impression that it would not be a problem to reference Azure Data Lake locally, since the code is executed remotely. Apparently I am mistaken.

Does anyone have a solution to this problem?


Solution

  • The reason for the problem was that I wanted to have the sources of Spark available locally while still being able to execute the workloads on Databricks. Unfortunately, the databricks-connect jars don't contain sources, which means I have to import the Spark dependencies into the project manually. And here is the rub, exactly as the docs say:

    ... If this is not possible, make sure that the JARs you add are at the front of the classpath. In particular, they must be ahead of any other installed version of Spark (otherwise you will either use one of those other Spark versions and run locally ...
    

    I did just that.

    Now I am able to bake my cake and eat it!

    The only problem is that whenever I add new dependencies I have to do this reordering once again; a sketch of how the same ordering idea could look in a build file follows below.
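
    In my case the reordering was done through the IDE's dependency list, but for illustration here is a hedged sketch of how the same idea might be expressed in an sbt build. The jar directory, the Spark version, and the use of Provided scoping are assumptions made for the sketch, not values taken from my actual project:

        // build.sbt: illustrative sketch only. The jar directory and Spark version
        // are placeholders; use whatever `databricks-connect get-jar-dir` prints.

        // Directory containing the Databricks Connect jars.
        val dbConnectJarDir = file("/path/printed/by/databricks-connect/get-jar-dir")

        // Add those jars as unmanaged jars so they should come ahead of the managed
        // Spark dependency on the classpath (sbt lists unmanaged jars before managed ones).
        Compile / unmanagedJars ++= (dbConnectJarDir ** "*.jar").classpath

        // Open-source Spark is pulled in only so the IDE can resolve sources;
        // Provided keeps it off the runtime classpath.
        libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3" % Provided

    Whether a build-tool ordering like this fully matches what the IDE does is something to verify; in my project, reordering the dependencies in IntelliJ was enough.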