python-3.x, pyspark, rdd

Access 200 files at a time in an RDD in PySpark


In my notebook folder there are 2000 files, named part-00000.xml.gz, part-00001.xml.gz, ..., part-02000.xml.gz.

I would like to use sc.textFile to load every 200 of them into an RDD at a time, repeating 10 times to get 10 RDDs.

How can I write code in Python to do this? Thank you very much.


Solution

  • If your files are small, I would advise going with wholeTextFiles to load all of the files into an RDD at once:

    # dirPath points at the folder containing the part-*.xml.gz files
    textFilesRDD = sc.wholeTextFiles(dirPath)
    
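    wholeTextFiles returns a pair RDD with one (filePath, fileContent) record per file, so if you only need the contents you can drop the paths; a small usage sketch (the variable name is just illustrative):

        fileContentsRDD = textFilesRDD.values()  # keep only the file contents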

    Otherwise, if you want to load the files as a number of separate chunks into RDDs, this can be done via the Hadoop API, as already described in this answer.
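
    For the specific 200-files-at-a-time layout from the question, a minimal sketch using plain sc.textFile is also possible, since textFile accepts a comma-separated list of paths (file names and chunk size are taken from the question; this assumes the part files sit in the current notebook directory and that sc is an existing SparkContext):

        # Load the 2000 part files in batches of 200, producing 10 RDDs.
        # Spark reads .gz files transparently (each gzip file becomes one partition).
        chunk_size = 200
        num_files = 2000

        rdds = []
        for start in range(0, num_files, chunk_size):
            # sc.textFile accepts a comma-separated list of input paths.
            paths = ",".join(
                "part-%05d.xml.gz" % i for i in range(start, start + chunk_size)
            )
            rdds.append(sc.textFile(paths))

        # rdds[0] covers part-00000 ... part-00199, rdds[1] the next 200, and so on.

    Each call to sc.textFile is lazy, so the 10 RDDs are only materialized once you run an action on them.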