Tags: python, multithreading, download, huggingface, huggingface-datasets

How can I download a HuggingFace dataset using multiple threads?


I want to download a HuggingFace dataset, e.g. uonlp/CulturaX:

from datasets import load_dataset
ds = load_dataset("uonlp/CulturaX", "en")

However, this downloads on a single thread at 50 MB/s, while my network link is 10 Gbps. Since the dataset is 16 TB, I'd like to download it faster so that I don't have to wait several days. How can I download a HuggingFace dataset using multiple threads?
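For a sense of scale, here is a rough back-of-the-envelope estimate of the two download times. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), a fully saturated link, and no protocol overhead; the function and constant names are only illustrative.

```python
def transfer_time_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Return the idealized transfer time in hours, ignoring any overhead."""
    return size_bytes / rate_bytes_per_s / 3600


DATASET_BYTES = 16e12        # 16 TB, assuming decimal terabytes
SINGLE_THREAD_RATE = 50e6    # 50 MB/s, the observed single-thread speed
LINK_RATE = 10e9 / 8         # 10 Gbps converted to bytes per second

single_thread_hours = transfer_time_hours(DATASET_BYTES, SINGLE_THREAD_RATE)
full_link_hours = transfer_time_hours(DATASET_BYTES, LINK_RATE)

print(f"Single thread: {single_thread_hours:.1f} h ({single_thread_hours / 24:.1f} days)")
print(f"Full 10 Gbps link: {full_link_hours:.1f} h")
```

Under these assumptions the single-threaded download takes about 3.7 days, whereas saturating the link would take roughly 3.6 hours.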


Solution

  • One can use the num_proc argument of load_dataset (thanks to Quentin Lhoest for pointing me to it):

    from datasets import load_dataset
    ds = load_dataset("uonlp/CulturaX", "en", num_proc=8)
    

    Note that uonlp/CulturaX has become a gated dataset since the question was posted. One must therefore first log in from a terminal:

    huggingface-cli login --token $HUGGINGFACE_TOKEN
    

    where $HUGGINGFACE_TOKEN holds an access token from https://huggingface.co/settings/tokens, and then go to the uonlp/CulturaX dataset page to accept the dataset access agreement.
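The login and the parallel download can also be combined in one Python script. The following is only a minimal sketch: it assumes the token has been exported as the HUGGINGFACE_TOKEN environment variable and that the access agreement has already been accepted. The helper names choose_num_proc and load_culturax_en are hypothetical, and capping the process count at the number of CPU cores is just one conservative choice (downloading is mostly network-bound, so other values may work as well).

```python
import os


def choose_num_proc(max_procs: int = 8) -> int:
    """Hypothetical helper: cap the process count at the number of CPU cores."""
    return max(1, min(max_procs, os.cpu_count() or 1))


def load_culturax_en(num_proc: int):
    """Log in with the token from the environment, then load with num_proc processes."""
    # Imported here so that the helper above can be used without these packages.
    from datasets import load_dataset
    from huggingface_hub import login

    # Assumption: the token was exported as HUGGINGFACE_TOKEN beforehand.
    login(token=os.environ["HUGGINGFACE_TOKEN"])
    return load_dataset("uonlp/CulturaX", "en", num_proc=num_proc)
```

Calling load_culturax_en(choose_num_proc()) would then start the download with up to 8 processes.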