In my notebook folder there are 2000 files, named part-00000.xml.gz, part-00001.xml.gz, ..., part-02000.xml.gz.
I would like to use sc.textFile to load 200 of them at a time into an RDD, repeating 10 times to get 10 RDDs. How can I write this in Python? Thank you very much.
If your files are small, I would advise using wholeTextFiles to load all of the files at once into a single RDD:
textFilesRDD = sc.wholeTextFiles(dirPath)
Otherwise, if you want to load n chunks of files into separate RDDs, it can be done via the Hadoop API, as already described in this answer.
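Alternatively, since sc.textFile accepts a comma-separated list of paths, you can build ten path strings of 200 files each and load each group as its own RDD. A minimal sketch, assuming the files run part-00000.xml.gz through part-01999.xml.gz (2000 files) under a directory I'm calling dirPath:

```python
def batch_paths(dir_path, num_files=2000, batch_size=200):
    # Build the full list of part-file names, e.g. "dirPath/part-00042.xml.gz".
    names = ["%s/part-%05d.xml.gz" % (dir_path, i) for i in range(num_files)]
    # Join every batch_size consecutive names into one comma-separated
    # string; sc.textFile loads all files in such a string into one RDD.
    return [",".join(names[i:i + batch_size])
            for i in range(0, num_files, batch_size)]

# Usage (requires an active SparkContext `sc`):
# rdds = [sc.textFile(paths) for paths in batch_paths("dirPath")]
```

This gives you a list of 10 RDDs without touching the Hadoop API; Spark also decompresses the .xml.gz files transparently.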