I am trying to read a Delta table stored on Azure from a local Spark cluster. I am accessing it through Azure Data Lake Storage Gen2 (abfss://), not the legacy Blob Storage.
The final goal is a PySpark application but, to understand what is going on, I am first trying to read the table from a spark-shell. Here is how I launch it:
spark-shell \
--packages io.delta:delta-core_2.12:2.2.0,org.apache.hadoop:hadoop-azure:3.3.1 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
--conf "fs.azure.account.key.<storage_account>.dfs.core.windows.net=<storage_key>" \
Here is how I try to read the table:
val dt = spark.read.format("delta").load("abfss://hub@<storage_account>.dfs.core.windows.net/fec/fec.delta")
and here is the error I get:
org.apache.hadoop.fs.azurebfs.contracts.exceptions.KeyProviderException: Failure to initialize configuration
at org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
at org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:548)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1449)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:215)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:128)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.delta.DeltaTableUtils$.findDeltaTableRoot(DeltaTable.scala:184)
at org.apache.spark.sql.delta.sources.DeltaDataSource$.parsePathIdentifier(DeltaDataSource.scala:314)
at org.apache.spark.sql.delta.catalog.DeltaTableV2.x$1$lzycompute(DeltaTableV2.scala:70)
at org.apache.spark.sql.delta.catalog.DeltaTableV2.x$1(DeltaTableV2.scala:65)
at org.apache.spark.sql.delta.catalog.DeltaTableV2.timeTravelByPath$lzycompute(DeltaTableV2.scala:65)
at org.apache.spark.sql.delta.catalog.DeltaTableV2.timeTravelByPath(DeltaTableV2.scala:65)
at org.apache.spark.sql.delta.catalog.DeltaTableV2.$anonfun$timeTravelSpec$1(DeltaTableV2.scala:99)
I have the impression that I followed the documentation, and I was able to read the Delta table in Python using delta-rs with the very same credentials, so I am sure the credentials are correct.
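For reference, the delta-rs read looked roughly like this; a minimal sketch, assuming a recent deltalake release (the storage_options key names are an assumption, check the deltalake docs for your version):

from deltalake import DeltaTable

# Sketch: read the same table with delta-rs using the same account key.
# "account_name" / "account_key" are assumed option names for ADLS Gen2 auth.
dt = DeltaTable(
    "abfss://hub@<storage_account>.dfs.core.windows.net/fec/fec.delta",
    storage_options={
        "account_name": "<storage_account>",
        "account_key": "<storage_key>",
    },
)
df = dt.to_pandas()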
I probably forgot to set something but, the more docs I read, the more confused I get. I also tried OAuth2 authentication but ended up with the same exception. The more I think about it, the more I suspect that --conf "fs.azure.account.key.<storage_account>.dfs.core.windows.net=<storage_key>"
is simply not taken into account (but I have no idea why).
Turns out, authentication has to be set from the Spark session:
spark-shell \
--packages io.delta:delta-core_2.12:2.2.0,org.apache.hadoop:hadoop-azure:3.3.2 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
and then, inside the shell:
spark.conf.set("fs.azure.account.key.<storage_account>.dfs.core.windows.net", "<storage_key>")
val dt = spark.read.format("delta").load("abfss://hub@<storage_account>.dfs.core.windows.net/fec/fec.delta")
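Since the final goal is a PySpark application, here is a minimal sketch of the same fix there, under the same assumptions (same packages and placeholders as above; the app name is just an example):

from pyspark.sql import SparkSession

# Sketch: build the session with the Delta and hadoop-azure packages, then set
# the account key on the session, exactly as in the spark-shell workaround above.
spark = (
    SparkSession.builder
    .appName("read-delta-abfss")  # example name
    .config("spark.jars.packages",
            "io.delta:delta-core_2.12:2.2.0,org.apache.hadoop:hadoop-azure:3.3.2")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Set authentication on the session, not as a launch-time --conf.
spark.conf.set(
    "fs.azure.account.key.<storage_account>.dfs.core.windows.net",
    "<storage_key>",
)

dt = spark.read.format("delta").load(
    "abfss://hub@<storage_account>.dfs.core.windows.net/fec/fec.delta"
)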