Tags: scala, apache-spark, apache-spark-sql, azure-databricks, data-partitioning

Partition Table on top of folders containing sub-folders which contains json files in spark


I am working with Spark in Databricks. I have a mount point for my storage location pointing to my directory; let's call the directory "/mnt/abc1/abc2" - path. Inside this "abc2" directory, let's say I have 10 folders named "xyz1" .. "xyz10". All of these "xyz%" folders contain JSON files; let's call them "xyz1_1.json", and so on. I need to build a table such that I can access each JSON in a Spark table by referring to it as path + "abc2.xyz1.xyz1_1.json"

var path = "/mnt/abc1/"
var data = spark.read.json(path)

This works when the JSON files sit directly under the path, not inside folders within it. I want to find a way to automatically detect the underlying folders and sub-folders containing the JSON files and build the table on top of them.


Solution

  • With Spark 3+, you can set the option recursiveFileLookup to true to search subdirectories:

    var path = "/mnt/abc1/"
    var data = spark.read.option("recursiveFileLookup","true").json(path)
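Since the question asks to refer back to individual JSON files by their location, a small sketch (assuming the same mount path and a Spark 3+ session in Databricks) can pair recursiveFileLookup with Spark's built-in input_file_name function to keep the source path alongside each record:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

// Sketch only: assumes an active Spark 3+ session and the mount from the question.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val path = "/mnt/abc1/"

// recursiveFileLookup makes Spark descend into sub-folders (xyz1 .. xyz10)
// and read every JSON file it finds under the path.
val data = spark.read
  .option("recursiveFileLookup", "true")
  .json(path)

// input_file_name() records the full file path (e.g. .../abc2/xyz1/xyz1_1.json)
// for each row, so rows from a single file can be selected out of the combined table.
val withSource = data.withColumn("source_file", input_file_name())

// Hypothetical usage: pull only the rows that came from one specific file.
withSource.filter($"source_file".contains("xyz1_1.json")).show()
```

Note that enabling recursiveFileLookup disables partition inference; if the sub-folders were instead named in key=value form, leaving the option off would expose them as partition columns of the table.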