I'm using Azure Blob Storage (AzureNativeFileSystemStore in org.apache.hadoop.fs.azure).
When I attempt to set my checkpoint directory to point at it, this fails:
spark.conf.set(
"fs.azure.account.key.myaccount.blob.core.windows.net",
"mykey"
)
spark.sparkContext.setCheckpointDir("wasbs://<container>@myaccount.blob.core.windows.net/raw/temp/")
The error message claims that my storage key is missing from the Spark configuration:
No credentials found for account myaccount.blob.core.windows.net in the configuration
Full stack from pyspark:
Py4JJavaError: An error occurred while calling o3578.setCheckpointDir.
: org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.AzureException: No credentials found for account xxx.blob.core.windows.net in the configuration, and its container datalake is not accessible using anonymous credentials. Please check if the container exists first. If it is not publicly available, you have to provide account credentials.
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1123)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:566)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1423)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3316)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:137)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3365)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3333)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:492)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.SparkContext.$anonfun$setCheckpointDir$2(SparkContext.scala:2595)
at scala.Option.map(Option.scala:230)
at org.apache.spark.SparkContext.setCheckpointDir(SparkContext.scala:2593)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hadoop.fs.azure.AzureException: No credentials found for account xxx.blob.core.windows.net in the configuration, and its container datalake is not accessible using anonymous credentials. Please check if the container exists first. If it is not publicly available, you have to provide account credentials.
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.connectUsingAnonymousCredentials(AzureNativeFileSystemStore.java:899)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1118)
... 22 more
This prevents my checkpoint directory from being set.
However, if I first touch the account through a slightly different API, such as "spark.read", then the checkpoint directory can be set without error:
spark.read.parquet("wasbs://<container>@myaccount.blob.core.windows.net/a/b/c/invalid.parquet")
spark.sparkContext.setCheckpointDir("wasbs://<container>@myaccount.blob.core.windows.net/raw/temp/")
This inconsistent behavior looks like a bug, right?
Is the problem in the Azure Blob Storage provider, or is Spark itself the source of the bug? It is troubling to get a misleading error about a missing configuration key when the key has been set (and it took a small miracle to identify the right key in the first place: "fs.azure.account.key.myaccount.blob.core.windows.net").
Any pointers would be appreciated. If there is an easier way to setCheckpointDir for HDFS, please let me know. Thanks in advance.
PS. In case the version matters, I found these jars in my synapse-spark environment:
/usr/hdp/current/hadoop-client/azure-storage-7.0.1.jar
/usr/hdp/current/hadoop-client/hadoop-azure-3.1.1.5.0-97710309.jar
It was bothersome that I kept being told that the credentials weren't provided:
"No credentials found for account myaccount.blob.core.windows.net in the configuration"
I never forgot to set the conf (using spark.conf.set), so I'm not sure why Spark/Hadoop were confused.
The solution ended up being to set the config before the session even starts.
p_CsharpSparkSessionBuilder
.Config($"spark.hadoop.fs.azure.account.key.{StorageAccountName}.blob.core.windows.net", StorageAccountKey)
.Config($"fs.azure.account.key.{StorageAccountName}.dfs.core.windows.net", StorageAccountKey)
.Config($"fs.azure.account.key.{StorageAccountName}.blob.core.windows.net", StorageAccountKey);
Manipulating the session builder is possible because my drivers are Livy batches. I'm not sure what the solution would have been if the code ran in a notebook, because Spark clearly wasn't letting me change this config after the session had started.
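For the notebook case, one thing that might be worth trying is writing the key straight into the SparkContext's Hadoop configuration. The stack trace shows setCheckpointDir resolving the filesystem through Path.getFileSystem, which as far as I understand uses the context-level Hadoop configuration rather than the session's runtime conf. The following is only a sketch in PySpark and is not tested on Synapse. The helper name set_hadoop_account_key is made up for illustration, and _jsc is a private PySpark attribute that could change between versions.

```python
def set_hadoop_account_key(spark, account: str, key: str) -> str:
    """Write a storage account key into the context-level Hadoop configuration.

    `spark` is expected to be an existing SparkSession. Returns the name of
    the configuration property that was set.
    """
    # Same property name that the question passes to spark.conf.set.
    name = f"fs.azure.account.key.{account}.blob.core.windows.net"

    # Assumption: _jsc is the py4j handle to the JavaSparkContext. It is a
    # private attribute, so treat this as a workaround, not a public API.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set(name, key)
    return name
```

Usage would be set_hadoop_account_key(spark, "myaccount", "mykey") before the call to spark.sparkContext.setCheckpointDir.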
For some reason I also needed to set a second config key with the Hadoop prefix ("spark.hadoop.fs.azure.account.key.whatever"). I'm not really sure why, but I intend to look up the differences between that and "fs.azure.account.key".
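On the prefix: the Spark configuration docs describe properties starting with "spark.hadoop." as being copied into the Hadoop configuration with the prefix stripped, which would explain why that form reaches the Azure filesystem code when it is supplied at session build time. Below is a rough PySpark equivalent of the builder setup above, as a sketch only. The names azure_key_confs and build_session are hypothetical, and whether the dfs entry is needed depends on the environment.

```python
def azure_key_confs(account: str, key: str, endpoints=("blob", "dfs")) -> dict:
    """Build the spark.hadoop.* properties that carry the account key."""
    confs = {}
    for endpoint in endpoints:
        hadoop_key = f"fs.azure.account.key.{account}.{endpoint}.core.windows.net"
        # The spark.hadoop. prefix asks Spark to copy the property into
        # the Hadoop configuration when the context is created.
        confs[f"spark.hadoop.{hadoop_key}"] = key
    return confs


def build_session(account: str, key: str, app_name: str = "checkpoint-demo"):
    """Create a session with the keys set before the context starts."""
    # Imported lazily so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for name, value in azure_key_confs(account, key).items():
        builder = builder.config(name, value)
    return builder.getOrCreate()
```

Note that getOrCreate returns an already running session if one exists, in which case these settings may not reach the existing context. That matches the notebook problem described above.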
Between the change in the session builder and the additional config key, everything started working as expected, and I'm now using wasbs for checkpoints rather than abfss. I know it may be a step backwards in terms of technology, but my checkpoints were EXPENSIVE on abfss and something had to be done.