apache-spark · pyspark

PySpark exclude files from list


When I use sc.textFile('*.txt'), I read every file in the directory.

I'd like to be able to filter out several files.

e.g. How can I read all files except ['bar.txt', 'foo.txt']?


Solution

  • This is more of a workaround:

    Get the file list (each line of hadoop fs -ls ends with the full path, so parse out the file name):

    import os
    ls_output = os.popen('hadoop fs -ls <your dir>').readlines()
    # Skip the "Found N items" header; the path is the last column of each line.
    file_list = [os.path.basename(line.split()[-1])
                 for line in ls_output if not line.startswith('Found')]
    

    Filter it:

    file_list = [x for x in file_list
                 if x not in ['bar.txt', 'foo.txt'] and x.endswith('.txt')]
    

    Read it:

    # textFile takes a comma-separated string of paths, not a list
    rdd = sc.textFile(','.join('<your dir>/' + x for x in file_list))
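The steps above can be sketched end to end without a cluster. Below, the `hadoop fs -ls` output is mocked with sample lines (the `/data` paths and file names are placeholders, not from the question); the filtering logic is the same:

```python
import os

# Mocked `hadoop fs -ls` output: permissions, replication, owner,
# group, size, date, time, and finally the full path.
ls_output = [
    "Found 4 items",
    "-rw-r--r--   3 user group  1024 2024-01-01 10:00 /data/bar.txt",
    "-rw-r--r--   3 user group  2048 2024-01-01 10:01 /data/baz.txt",
    "-rw-r--r--   3 user group   512 2024-01-01 10:02 /data/foo.txt",
    "-rw-r--r--   3 user group   256 2024-01-01 10:03 /data/notes.log",
]

exclude = ['bar.txt', 'foo.txt']

# The path is the last whitespace-separated column of each listing line.
file_list = [os.path.basename(line.split()[-1])
             for line in ls_output if not line.startswith('Found')]

# Keep only .txt files that are not in the exclusion list.
file_list = [x for x in file_list
             if x not in exclude and x.endswith('.txt')]

print(file_list)  # ['baz.txt']

# On a real cluster you would then read the survivors in one go:
# rdd = sc.textFile(','.join('<your dir>/' + x for x in file_list))
```

Note that sc.textFile happily accepts a comma-separated list of paths (and globs) in a single string, which is why the join at the end works.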