I would like to recursively load all XML files from a directory that contains further subdirectories into my DataFrame. With other file formats (txt, parquet, ...) the code below works:
df = (
spark.read
.format("xml")
.option("rowTag", "library")
.option("wholetext", "true")
.option("recursiveFileLookup","true")
.option("pathGlobFilter", "*.xml")
.load("path/to/dir")
)
I have tested this code with different file formats; only the XML files are not found.
It looks like I found an answer right away, although it may not be entirely satisfactory. Basically, I found two possibilities: read the XML files as plain text, or pass a glob pattern directly to load:
df = (
spark.read
.format("text")
.option("wholetext", "true")
.option("recursiveFileLookup","true")
.option("pathGlobFilter", "*.xml")
.load("path/to/dir")
)
df = (
spark.read
.format("xml")
.option("rowTag", "library")
.option("wholetext", "true")
.load("path/to/dir/**/*.xml")
)
The second variant makes the two options "recursiveFileLookup" and "pathGlobFilter" unnecessary: ** in the glob pattern descends recursively through all subdirectories, and *.xml matches only files ending in .xml.
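The same ** / *.xml semantics can be demonstrated outside Spark with Python's pathlib, which may help verify what the pattern is expected to match. The directory layout below is invented for illustration:

```python
# Demonstrates recursive glob matching: ** descends into subdirectories
# (including zero levels, so files in the root match too), and *.xml
# keeps only files with that extension. The layout here is made up.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
(root / "sub" / "deeper").mkdir(parents=True)
(root / "a.xml").write_text("<library/>")
(root / "sub" / "b.xml").write_text("<library/>")
(root / "sub" / "deeper" / "c.xml").write_text("<library/>")
(root / "sub" / "notes.txt").write_text("ignored")  # filtered out by *.xml

matches = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.xml"))
print(matches)  # ['a.xml', 'sub/b.xml', 'sub/deeper/c.xml']
```

Keep in mind this shows Python's glob rules; Spark delegates path expansion to the underlying Hadoop filesystem, whose glob support can differ between versions.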