We've set up an HDInsight cluster on Azure with Blob storage as the storage layer for Hadoop. We tried uploading files with the hadoop CLI, and the files were uploaded to Azure Blob storage.
Command used to upload:
hadoop fs -put somefile /testlocation
However, when we tried using Spark to write files to Hadoop, they were not uploaded to Azure Blob storage but to the local disks of the VMs, in the datanode directory specified in hdfs-site.xml:
df1mparquet = spark.read.parquet("hdfs://hostname:8020/dataSet/parquet/")
When we run:
hadoop fs -ls / => It lists the files from Azure Blob storage
hadoop fs -ls hdfs://hostname:8020/ => It lists the files from local storage
Is this expected behavior?
You need to look at the value of fs.defaultFS in core-site.xml. It sounds like the default filesystem is blob storage.
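On HDInsight the setting typically looks like the fragment below; the container and storage-account names here are placeholders, not values from your cluster:

```xml
<!-- core-site.xml (illustrative; container/account names are placeholders) -->
<property>
  <name>fs.defaultFS</name>
  <value>wasb://mycontainer@myaccount.blob.core.windows.net</value>
</property>
```

With a wasb:// value here, any path without an explicit scheme is resolved against blob storage, which matches what you're seeing with hadoop fs -ls /.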
Regarding Spark: if it's loading the same Hadoop configs as the CLI, you shouldn't need to specify the namenode host/port. Just use plain file paths, and writes will also default to blob storage.
If you specify a full URI to a different filesystem, it'll use that instead; an hdfs:// URI points at the cluster's actual local HDFS on the VM disks, which is separate from the blob-backed default filesystem.
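The resolution rule behind both behaviors can be sketched in plain Python: a path that carries its own scheme is used as-is, while an unqualified path is qualified against fs.defaultFS. This is only an illustration of the logic, not Hadoop's actual implementation; the wasb:// value is an assumed placeholder.

```python
from urllib.parse import urlparse

# Assumed default filesystem for illustration; on HDInsight this would
# come from fs.defaultFS in core-site.xml.
DEFAULT_FS = "wasb://mycontainer@myaccount.blob.core.windows.net"

def resolve(path, default_fs=DEFAULT_FS):
    """Sketch of how Hadoop qualifies a path: a URI with its own scheme
    (e.g. hdfs://...) is used as-is; an unqualified path is resolved
    against the default filesystem."""
    if urlparse(path).scheme:
        return path
    return default_fs + path

# Unqualified paths land on the default filesystem (blob storage here):
print(resolve("/testlocation"))
# → wasb://mycontainer@myaccount.blob.core.windows.net/testlocation

# A fully qualified hdfs:// URI bypasses fs.defaultFS and hits the
# cluster-local HDFS on the VM disks:
print(resolve("hdfs://hostname:8020/dataSet/parquet/"))
# → hdfs://hostname:8020/dataSet/parquet/
```

This is why hadoop fs -ls / and hadoop fs -ls hdfs://hostname:8020/ list two different filesystems.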