When I use sc.textFile('*.txt'), I read everything.
I'd like to be able to filter out several files.
E.g., how can I read all files except ['bar.txt', 'foo.txt']?
This is more of a workaround:
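Get the file list:
import os
# `hadoop fs -ls` prints one line per entry, with the path in the last column
lines = os.popen('hadoop fs -ls <your dir>').read().splitlines()
# Keep regular files only (their lines start with '-'); this skips the "Found N items" header
paths = [line.split()[-1] for line in lines if line.startswith('-')]
Filter it:
excluded = ['bar.txt', 'foo.txt']
file_list = [p for p in paths
             if os.path.basename(p) not in excluded and p.endswith('.txt')]
Read it:
# sc.textFile takes a comma-separated string of paths, not a Python list
rdd = sc.textFile(','.join(file_list))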
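
If it helps, the listing-and-filter part can be checked without a cluster. Below is a minimal sketch: `select_paths` and the sample listing are made up for illustration, and it assumes the usual `hadoop fs -ls` layout (path in the last column, no whitespace in paths).

```python
import posixpath


def select_paths(ls_lines, excluded, suffix='.txt'):
    """Build the comma-separated path string that sc.textFile accepts."""
    kept = []
    for line in ls_lines:
        # Regular files start with '-'; the header line and directories do not.
        if not line.startswith('-'):
            continue
        path = line.split()[-1]  # assumption: paths contain no whitespace
        if path.endswith(suffix) and posixpath.basename(path) not in excluded:
            kept.append(path)
    return ','.join(kept)


# Hypothetical listing, shaped like typical `hadoop fs -ls` output.
sample = [
    'Found 4 items',
    '-rw-r--r--   3 user group   120 2020-01-01 10:00 /data/a.txt',
    '-rw-r--r--   3 user group   120 2020-01-01 10:00 /data/bar.txt',
    'drwxr-xr-x   - user group     0 2020-01-01 10:00 /data/logs',
    '-rw-r--r--   3 user group   120 2020-01-01 10:00 /data/b.txt',
]
result = select_paths(sample, {'bar.txt', 'foo.txt'})
```

The resulting string would then go straight into sc.textFile(result). If nothing survives the filter, the string is empty, so it is probably worth checking for that before calling Spark.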