Download multiple files concurrently


I'm using fsspec to interact with remote filesystems. In my case it's GCS, but I believe the solution would be general.

For a single file, I'm using the following code (if you need the helper function code, it's here):

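import typing as t
from contextlib import contextmanager
from pathlib import PurePosixPath

import fsspec

# get_protocol_and_path and get_filepath_str are the helper functions mentioned above.


# The function yields an open file, so it is presumably meant to be used
# as a context manager: `with open_any_file(path) as f: ...`
@contextmanager
def open_any_file(filepath: str, mode: str = "r", **kwargs) -> t.Generator[t.IO, None, None]: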
    """
    Open file and close it after use. Works for local, remote, http, https, s3, gcs, etc.

    :param filepath: Filepath.
    :param mode: Mode.
    :param kwargs: Keyword arguments.
    :return: File object.
    """

    protocol, path = get_protocol_and_path(filepath)
    filepath = PurePosixPath(path)
    filesystem = fsspec.filesystem(protocol)

    load_path = get_filepath_str(filepath, protocol)

    # Figure out content type
    if "content_type" not in kwargs and filepath.suffix == ".json":
        kwargs["content_type"] = "application/json"

    with filesystem.open(load_path, mode=mode, **kwargs) as f:
        yield f

Assuming I have a thousand JSONs to download, what would be the most efficient way to do so? Should I go for parallelization, threading, or async?

What would be the optimal choice in terms of execution-time, and what would be the implementation for it?


Solution

  • The function you want is here: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.generic.rsync

    You will pass it two directories, source and destination, and fsspec will figure out which filesystem implementation to use for each, and do concurrent copies if the backend supports it. fsspec is async internally for s3, gcs, abfs and http.

    For copying a bunch of files from a particular backend that match a pattern ("*.json"), you will need the implementation-specific get() method (copy to local files) or cat() (read into in-memory bytes). This is because rsync does not support patterns (yet?).

    Example with rsync:

    remote = "gcs://mybucket/dir"
    local = "/path/to/jsons"
    
    import fsspec.generic
    fsspec.generic.rsync(remote, local)
    

    If you need to pass configuration options to GCS, you can either use fsspec's config, or first make a GenericFileSystem (see its docstrings) and pass it to rsync with fs=.

    Example with get and a glob pattern:

    fs = fsspec.filesystem("gcs", ...)
    fs.get("gcs://bucket/path/*.json", "/path/to/jsons/")
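
For the cat() route mentioned above (in-memory bytes instead of local files), a minimal sketch could look like the following. The helper name load_json_files is hypothetical and not part of fsspec; it assumes fs is an fsspec filesystem instance, that every matched file contains valid JSON, and that everything fits in memory. It expands the pattern with glob() first, then passes the list of paths to cat(), which returns a mapping of path to bytes when given a list and can fetch the files concurrently on async backends such as gcs.

```python
import json
from typing import Any, Dict


def load_json_files(fs: Any, pattern: str) -> Dict[str, Any]:
    """Read every file matching `pattern` into memory and parse it as JSON.

    `fs` is assumed to be an fsspec filesystem (it only needs `glob` and `cat`).
    This helper is illustrative and not part of fsspec itself.
    """
    # Expand the pattern ourselves so that "no matches" yields an empty
    # result instead of an error from cat().
    paths = fs.glob(pattern)
    if not paths:
        return {}

    # Given a list of paths, cat() returns {path: bytes}.
    raw = fs.cat(paths)

    # Decode each payload; json.loads accepts bytes directly.
    return {path: json.loads(data) for path, data in raw.items()}
```

With a GCS filesystem created as in the get example, this might be called as load_json_files(fs, "bucket/path/*.json"). The keys of the result are the paths as returned by glob().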