python · apache-spark · pyspark · azure-databricks

Identifying Files with Extensions Using Wildcards


After mounting my data lake in Databricks, I try to load all JSON files into a dataframe using the *.json suffix, but it does not work:

df = spark.read.option("recursiveFileLookup", "true") \
    .json("/mnt/adls_gen/prod/**/*.json")

I get the following error when executing the above code:

[PATH_NOT_FOUND] Path does not exist: dbfs:/mnt/adls_gen/prod/**/*.json.

If I remove the file extension, the operation succeeds:

df = spark.read.option("recursiveFileLookup", "true") \
    .json("/mnt/adls_gen/prod/**/*")

...but it also reads other files, such as those with the extensions *.json_old and *.txt.

I'm not aware of any other options to use in this scenario. Is there another way to filter by file extension? My files in the data lake have various extensions, so I'm looking for a solution that accommodates this.

Apache Spark version is 3.4.1 (Scala 2.12).


Solution

  • I thought that was the purpose of having a Path Glob Filter among other generic file source options:

    Path Glob Filter
    pathGlobFilter is used to only include files with file names matching the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery.

    df_filtered = spark.read \
        .option("header", "true") \
        .option("recursiveFileLookup", "true") \
        .csv("s3a://mybucket/testdata/csvs", pathGlobFilter="*.csv")
    
    from pyspark.sql.functions import input_file_name
    df_filtered.select(input_file_name()).distinct().show(truncate=False)
    +-------------------------------------+
    |input_file_name()                    |
    +-------------------------------------+
    |s3a://mybucket/testdata/csvs/c000.csv|
    |s3a://mybucket/testdata/csvs/c001.csv|
    |         :         :          :      |
    +-------------------------------------+
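    For the JSON case from the question, the same option should apply. A minimal sketch, assuming the mount path from the question and an otherwise default reader (untested against your data lake):

    # Only pick up files whose names match *.json, even though the directory
    # tree also contains *.json_old and *.txt files.
    df_json = spark.read \
        .option("recursiveFileLookup", "true") \
        .option("pathGlobFilter", "*.json") \
        .json("/mnt/adls_gen/prod/")

    Note that, per the documentation quoted above, pathGlobFilter matches only file names, not directory paths, so recursiveFileLookup still controls which folders are traversed.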