Tags: azure, apache-spark, hadoop, azure-blob-storage, azure-hdinsight

Files not getting saved in Azure Blob using Spark in HDInsight cluster


We've set up an HDInsight cluster on Azure with Blob storage as the storage layer for Hadoop. When we uploaded files with the hadoop CLI, they were written to Azure Blob storage as expected.

Command used to upload:

hadoop fs -put somefile /testlocation

However, when we tried using Spark to write files to Hadoop, they were not uploaded to Azure Blob storage; instead they ended up on the local disks of the VMs, under the datanode directory specified in hdfs-site.xml.

Code used:

df1mparquet = spark.read.parquet("hdfs://hostname:8020/dataSet/parquet/")

df1mparquet.write.parquet("hdfs://hostname:8020/dataSet/newlocation/")

Strange behavior:

When we run:

hadoop fs -ls / => It lists the files from Azure Blob storage

hadoop fs -ls hdfs://hostname:8020/ => It lists the files from local storage

Is this expected behavior?


Solution

  • You need to look at the value of fs.defaultFS in core-site.xml.

    It sounds like the default filesystem is Blob storage.

    https://hadoop.apache.org/docs/current/hadoop-azure/index.html

    Regarding Spark: if it loads the same Hadoop configuration as the CLI, you shouldn't need to specify the namenode host/port at all. Just use plain file paths and they will resolve against the default filesystem, i.e. Blob storage, as sketched below.

    If you specify a full URI to a different filesystem, that filesystem will be used instead. Keep in mind that hdfs:// refers to the cluster's HDFS, which is not the same as the node-local file://.
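
    A rough PySpark sketch of the above, assuming the cluster's default filesystem is Blob storage. The container and account names are placeholders (not values from the question), and reading fs.defaultFS through the _jsc handle is an internal but commonly used shortcut:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("blob-default-fs-check").getOrCreate()

    # Inspect the Hadoop configuration Spark actually loaded; on an HDInsight
    # cluster backed by Blob storage this typically prints a wasbs:// URI.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    print(hadoop_conf.get("fs.defaultFS"))

    # Scheme-less paths resolve against fs.defaultFS, so these go to Blob storage:
    df = spark.read.parquet("/dataSet/parquet/")
    df.write.parquet("/dataSet/newlocation/")

    # A full URI overrides the default. hdfs://hostname:8020/... targets the
    # cluster-local HDFS (which is why the original write landed on the VM disks),
    # while an explicit wasbs:// URI targets Blob storage directly:
    # df.write.parquet("wasbs://<container>@<account>.blob.core.windows.net/dataSet/newlocation/")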