databricks, azure-databricks, autoload, databricks-autoloader

Read data from mount in Databricks (using Autoloader)


I am using Azure Blob Storage to store data and feed it to Autoloader through a mount. I am looking for a way to have Autoloader load new files from any of the mounted containers. Let's say I have these folders in my mount:

mnt/
├─ blob_container_1
├─ blob_container_2

When I use .load('/mnt/'), no new files are detected, but when I point at each folder individually, e.g. .load('/mnt/blob_container_1'), it works fine.

I want to load files from both mount paths using Autoloader (running continuously).


Solution

  • You can provide prefix (glob) patterns in the load path to match multiple directories (an applied sketch follows at the end of this answer), for example:

    df = spark.readStream.format("cloudFiles") \
      .option("cloudFiles.format", "<format>") \
      .schema(schema) \
      .load("<base_path>/*/files")
    

    For example, if you would like to ingest only PNG files from a directory that contains files with different suffixes, you can do:

    df = spark.readStream.format("cloudFiles") \
      .option("cloudFiles.format", "binaryFile") \
      .option("pathGlobFilter", "*.png") \
      .load("<base_path>")
    

    Refer – https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#filtering-directories-or-files-using-glob-patterns
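
  • Applied to the mount layout from the question, a minimal end-to-end sketch could look like the following. The JSON file format, the placeholder schema, and the checkpoint/output paths are assumptions here – replace them with your own:

    from pyspark.sql.types import StructType, StructField, StringType

    # Placeholder schema – substitute the real schema of your files
    schema = StructType([StructField("id", StringType(), True)])

    # The glob /mnt/* matches both blob_container_1 and blob_container_2
    df = spark.readStream.format("cloudFiles") \
      .option("cloudFiles.format", "json") \
      .schema(schema) \
      .load("/mnt/*")

    # Run continuously: write to a Delta sink with a checkpoint so the
    # stream remembers which files it has already ingested
    df.writeStream \
      .format("delta") \
      .option("checkpointLocation", "/mnt/checkpoints/autoloader_demo") \
      .start("/mnt/output/autoloader_demo")

    If the glob does not pick up both containers (they are separate mounts, after all), running one stream per container, each with its own checkpoint location, is a common alternative.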