I want to read all of the images (JPEG or PNG) in a GCS folder efficiently. I used a simple sequential read, but it is very slow.
from gcsfs import GCSFileSystem
from PIL import Image

gcs = GCSFileSystem(project=PROJECT, token=ADC_JSON_PATH)

# List the folder and keep only the image files.
is_image_path = lambda p: p.endswith('.jpeg') or p.endswith('.png')
image_gcspaths = [p for p in gcs.ls(GCS_FOLDER) if is_image_path(p)]

# Download and decode one image at a time (sequential, hence slow).
images = []
for image_gcspath in image_gcspaths:
    with gcs.open(image_gcspath, 'rb') as f:
        images.append(Image.open(f).convert('RGB'))
Is there a faster way to do this using asyncio? This answer states that GCS does not support asyncio. If that is the case, what is the fastest way to read all the images in a GCS bucket?
Posting this as a community wiki for others' sake.
As mentioned by @John Hanley:
The term very slow means nothing. It depends on what is slow. One HTTP data transfer typically consumes 25% to 100% of the available network bandwidth; the slower your network connection, the higher that percentage. The Google Cloud Storage Python SDK does not provide parallel downloads of multiple objects. You will need to write that functionality yourself or select a different library. However, make sure you know what you are optimizing, or you will see little or no improvement.
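Since each download is network-bound, a thread pool is the simplest way to write that parallelism yourself. Below is a minimal sketch built on the same gcsfs setup as the question; the helper names (read_image, read_images_parallel) and the worker count of 16 are illustrative choices, not part of any library API.

from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

from PIL import Image

def read_image(gcs, path):
    # cat() downloads the whole object in one request and returns its
    # bytes; decoding then happens entirely in memory.
    return Image.open(BytesIO(gcs.cat(path))).convert('RGB')

def read_images_parallel(gcs, paths, max_workers=16):
    # Threads overlap the network waits: the GIL is released during I/O,
    # so 16 workers can keep roughly 16 downloads in flight.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: read_image(gcs, p), paths))

With the gcs and image_gcspaths from the question, the sequential loop becomes images = read_images_parallel(gcs, image_gcspaths). Tune max_workers to your bandwidth; once the link is saturated, more threads add nothing.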
You can check this link for references:
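On the asyncio point specifically: gcsfs, which the question already uses, is built on fsspec's async machinery, and its cat() method accepts a list of paths and fetches them concurrently on an internal event loop, so you may not need to write asyncio code yourself. A sketch, assuming the same gcs and image_gcspaths as above:

from io import BytesIO

from PIL import Image

# Passing a list to cat() returns {path: bytes}; the underlying HTTP
# requests are issued concurrently rather than one at a time.
blobs = gcs.cat(image_gcspaths)
images = [Image.open(BytesIO(data)).convert('RGB') for data in blobs.values()]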