python regex pyspark glob

Read files from a folder whose names don't contain a string, using PySpark


I have a folder with files like these:

./env_california_0100.xml
./env_california_0200.xml
./env_california_0300.xml
./env_california_0400.xml
./env_0100.xml
./env_0200.xml
./env_0300.xml
./env_0400.xml

Using PySpark, if I want to read the files whose names contain the string 'california', I would use:

df=spark.read.format("com.databricks.spark.xml").option("rowTag","someTag").load("/some_folder/*california*.xml")

But how do I read the files whose names do not contain the string 'california'?


Solution

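  • Use glob to list the files, drop the ones whose names contain 'california', and pass the remaining paths to the load call:

       .load(",".join(p for p in glob.glob("/some_folder/*.xml") if "california" not in os.path.basename(p)))

    A pattern such as *[!california]*.xml does not do this. [!...] is a character class that matches any single character other than c, a, l, i, f, o, r or n, so it does not exclude the substring, and it still matches every file listed in the question.

    Note that load takes one path string or one list of paths, not separate positional arguments (a second positional argument is read as the format), so .load(path1, path2, ...) will not work. Joining the paths with commas should work here because the XML reader hands the path to Hadoop, which treats commas as path separators. With built-in sources such as csv or parquet, a plain list also works: .load([path1, path2, ...])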
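The filtering step can be sketched as a small helper. This is a minimal sketch under a few assumptions: the files are on a filesystem the driver can see with glob (it will not list HDFS or cloud storage paths), the helper names paths_without and read_xml_without are made up for illustration, and the comma-joined path is assumed to be accepted by the spark-xml reader as described above.

```python
import glob
import os


def paths_without(folder, needle, pattern="*.xml"):
    """Return sorted paths in `folder` matching `pattern` whose file name lacks `needle`."""
    candidates = glob.glob(os.path.join(folder, pattern))
    # Test only the file name, so a folder called e.g. /data/california/ is not excluded.
    return sorted(p for p in candidates if needle not in os.path.basename(p))


def read_xml_without(spark, folder, needle, row_tag):
    """Load every XML file in `folder` whose name does not contain `needle`.

    `spark` is an existing SparkSession; the format and rowTag option are the
    ones used in the question.
    """
    paths = paths_without(folder, needle)
    if not paths:
        raise FileNotFoundError(f"no files without {needle!r} in {folder}")
    return (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", row_tag)
        # Assumption: the reader passes this string to Hadoop, which splits on commas.
        .load(",".join(paths))
    )
```

Usage would look like df = read_xml_without(spark, "/some_folder", "california", "someTag"). For the exact file names in the question, a pure path glob such as /some_folder/env_[0-9]*.xml may also be enough, assuming every wanted file starts with env_ followed by a digit.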