xml, apache-spark, pyspark, databricks

Using spark.read.format("xml").option("recursiveFileLookup", "true") for xml files in subdirectories


I would like to recursively load all XML files into my DataFrame from a directory that contains additional subdirectories. With other file formats (txt, parquet, ...) the code below seems to work.

df = (
    spark.read
    .format("xml")
    .option("rowTag", "library")
    .option("wholetext", "true")
    .option("recursiveFileLookup","true")
    .option("pathGlobFilter", "*.xml")
    .load("path/to/dir")
)

I have tested this code with different file formats, but xml files are not found.
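
For illustration, a check like the one below (using the text source, which does find files) lists which files the recursive lookup actually picks up; the path is the same placeholder as above.

from pyspark.sql.functions import input_file_name

# Same lookup options, but with the built-in text source, just to see which
# files are discovered by the recursive lookup.
check_df = (
    spark.read
    .format("text")
    .option("recursiveFileLookup", "true")
    .option("pathGlobFilter", "*.xml")
    .load("path/to/dir")
)
check_df.select(input_file_name()).distinct().show(truncate=False)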


Solution

  • Looks like I found an answer right away, although it may not be entirely satisfactory. Basically, I have found two possibilities:

    1. Change the format from "xml" to "text".
      This allows recursive reading, but unfortunately the content of the xml files is not read in as nicely as before (see the sketch after this list).
    df = (
        spark.read
        .format("text")
        .option("rowTag", "library")
        .option("wholetext", "true")
        .option("recursiveFileLookup","true")
        .option("pathGlobFilter", "*.xml")
        .load("path/to/dir")
    )
    
    2. Append a glob pattern to the path in the load() call.
    df = (
        spark.read
        .format("xml")
        .option("rowTag", "library")
        .option("wholetext", "true")
        .load("path/to/dir/**/*.xml")
    )
    

    This makes the two options "recursiveFileLookup" and "pathGlobFilter" unnecessary:
    ** in the glob pattern searches recursively through all subdirectories, and
    *.xml matches files ending in .xml.
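
    A rough sketch of how the two results differ (paths are the placeholders from above; this is only for illustration): the text-based read (option 1) puts each file's raw content into a single "value" column that still has to be parsed afterwards, while the glob-based xml read (option 2) gives a schema inferred from the "library" row tag.

    # Option 1: text source + wholetext -> one row per file, a single "value"
    # column holding the raw XML string.
    raw_df = (
        spark.read
        .format("text")
        .option("wholetext", "true")
        .option("recursiveFileLookup", "true")
        .option("pathGlobFilter", "*.xml")
        .load("path/to/dir")
    )
    raw_df.printSchema()
    # root
    #  |-- value: string (nullable = true)

    # Option 2: xml source + glob path -> columns inferred from the children
    # of each <library> element.
    xml_df = (
        spark.read
        .format("xml")
        .option("rowTag", "library")
        .load("path/to/dir/**/*.xml")
    )
    xml_df.printSchema()   # schema derived from the XML structure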