python-3.x, pyspark, rdd

Access 200 files at a time in an RDD in PySpark


In my notebook folder there are 2000 files, named part-00000.xml.gz, part-00001.xml.gz, ..., part-02000.xml.gz.

I would like to use sc.textFile to load every 200 of them into an RDD at a time, repeating 10 times to get 10 RDDs.

How can I write code in Python to do this? Thank you very much.


Solution

  • If your files are small, I would advise going with wholeTextFiles to load all of the files into an RDD at once:

    # dirPath points at the folder containing the part-*.xml.gz files
    textFilesRDD = sc.wholeTextFiles(dirPath)
    
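    wholeTextFiles returns a pair RDD with one (filePath, fileContent) record per file, so if you only need the contents you can drop the paths; a small usage sketch (the variable name is just illustrative):

        fileContentsRDD = textFilesRDD.values()  # keep only the file contents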

    Otherwise, if you want to load the files as a number of separate chunks into RDDs, this can be done via the Hadoop API, as already described in this answer.
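
    For the specific 200-files-at-a-time layout from the question, a minimal sketch using plain sc.textFile is also possible, since textFile accepts a comma-separated list of paths (file names and chunk size are taken from the question; this assumes the part files sit in the current notebook directory and that sc is an existing SparkContext):

        # Load the 2000 part files in batches of 200, producing 10 RDDs.
        # Spark reads .gz files transparently (each gzip file becomes one partition).
        chunk_size = 200
        num_files = 2000

        rdds = []
        for start in range(0, num_files, chunk_size):
            # sc.textFile accepts a comma-separated list of input paths.
            paths = ",".join(
                "part-%05d.xml.gz" % i for i in range(start, start + chunk_size)
            )
            rdds.append(sc.textFile(paths))

        # rdds[0] covers part-00000 ... part-00199, rdds[1] the next 200, and so on.

    Each call to sc.textFile is lazy, so the 10 RDDs are only materialized once you run an action on them.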