It seems impossible to write to Azure Data Lake Gen2 using Spark unless you're using Databricks.
I'm using Jupyter with almond to run Spark in a notebook locally.
I have imported the hadoop dependencies:
import $ivy.`org.apache.hadoop:hadoop-azure:2.7.7`
import $ivy.`com.microsoft.azure:azure-storage:8.4.0`
which allows me to use the wasbs:// protocol when trying to write my DataFrame to Azure:
spark.conf.set(
"fs.azure.sas.[container].prodeumipsadatadump.blob.core.windows.net",
"?sv=2018-03-28&ss=b&srt=sco&sp=rwdlac&se=2019-09-09T23:33:45Z&st=2019-09-09T15:33:45Z&spr=https&sig=[truncated]")
The error occurs on the write:
val data = spark.read.json(spark.createDataset(
"""{"name":"Yin", "age": 25.35,"address":{"city":"Columbus","state":"Ohio"}}""" :: Nil))
data
.write
.orc("wasbs://[filesystem]@[datalakegen2storageaccount].blob.core.windows.net/lalalalala")
The write fails with a "Blob API is not yet supported for hierarchical namespace accounts" error:
org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Blob API is not yet supported for hierarchical namespace accounts.
So is this indeed impossible? Should I abandon Data Lake Gen2 and just use regular blob storage? Microsoft really dropped the ball by creating a "Data Lake" product without documenting a connector for Spark.
Working with ADLS Gen2 in Spark is straightforward, and Microsoft haven't "dropped the ball" so much as "the Hadoop binaries shipped with ASF Spark don't include the ABFS client". Those in HDInsight, Cloudera CDH 6.x, etc. do.
ADLS Gen2 is the best object store Microsoft have deployed: with hierarchical namespaces you get O(1) directory operations, which for Spark means high-performance task and job commits. Security and permissions are great too.
Yes, it is unfortunate that it doesn't work out of the box with the Spark distribution you have, but Microsoft are not in a position to retrofit a new connector to a set of artifacts released in 2017. You're going to have to upgrade your dependencies.
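As a rough sketch of what the upgraded setup might look like in the same almond notebook: the ABFS connector first shipped in hadoop-azure 3.2.0, so this assumes your Spark build runs against Hadoop 3.2 or later (the hadoop-azure version generally has to match the Hadoop version already on the classpath, so bumping this one import on top of a Hadoop 2.7 build is unlikely to be enough). It also assumes shared-key authentication rather than the SAS token from the question. The `AZURE_STORAGE_KEY` environment variable is a hypothetical name, and the bracketed account and filesystem placeholders are carried over from the question.

```scala
// Assumption: Spark is built against Hadoop 3.2+; ABFS is not in hadoop-azure 2.7.x.
import $ivy.`org.apache.hadoop:hadoop-azure:3.2.0`

// Placeholders from the question; substitute your own values.
val account    = "[datalakegen2storageaccount]"
val filesystem = "[filesystem]"

// Hypothetical environment variable holding the storage account access key.
val accountKey = sys.env("AZURE_STORAGE_KEY")

// ABFS reads the shared key from this per-account Hadoop property.
// Note the dfs.core.windows.net endpoint, not blob.core.windows.net.
spark.sparkContext.hadoopConfiguration.set(
  s"fs.azure.account.key.${account}.dfs.core.windows.net",
  accountKey)

// Same DataFrame as in the question, written through abfss:// instead of wasbs://.
data
  .write
  .orc(s"abfss://${filesystem}@${account}.dfs.core.windows.net/lalalalala")
```

The key differences from the failing attempt are the `abfss://` scheme and the `dfs` endpoint, which go through the Data Lake API rather than the Blob API that the error message complains about.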