Options for storing many small images for fast batch access on Google Cloud?

We have a few datasets of small images, where each image is about 100KB and there are about 50K images per dataset (around 5GB per dataset). We typically batch-load the images incrementally into the memory of a Google VM instance to run machine learning studies. This is done several times a day.

Currently, a few of us each have our own Google Persistent Disk attached to the VM, with the datasets replicated on each. This is not ideal because the disks are pricey; however, data access is very fast, which lets us iterate on our studies fairly rapidly. We don't share one disk because of the inconvenience of managing read/write settings when sharing Google disks.

Is there an alternative Google Cloud option for this use case? Google Cloud Storage buckets are too slow because we are reading many small files.


  • If your main interest is rapid I/O, your best bet is an SSD. What I don't understand is why you don't want to share one disk. You can have one SSD attached to one of your instances as R/W for loading and modifying your datasets, and mount it read-only on the instances that need to fetch the data.

    I'm not sure how much faster this solution will be compared to using a bucket, though. I guess you are aware that gsutil has an option for parallel (multi-threaded) transfers, which can greatly increase transfer speed, especially when transferring a lot of small files? The flag is -m.
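
    As a rough sketch of what a parallel copy with -m could look like: the bucket path and destination directory below are hypothetical placeholders, and the function only prints the command unless DRY_RUN is set to 0.

```shell
# Copy a whole dataset directory from a bucket using parallel transfers.
# -m enables parallel operations, -r recurses into the directory.
fetch_dataset() {
  local src="$1" dest="$2"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only show what would be executed.
    echo "gsutil -m cp -r $src $dest"
  else
    mkdir -p "$dest"
    gsutil -m cp -r "$src" "$dest"
  fi
}

# Hypothetical bucket and target directory; replace with your own.
fetch_dataset "gs://my-datasets/dataset-a" "/mnt/data"
```

    Copying the dataset once to local disk and reading it from there avoids paying per-object latency on every training run.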

    If you want to go with one instance holding the R/W SSD and multiple read-only clients, see below:

    One option is to set up an NFS share on your SSD: one instance acts as the NFS server with R/W rights and the rest have read-only access. I will be using Ubuntu 16.04, but the process is similar on all distros:

    1 - Install the required packages on both server and clients:

    Server: sudo apt install nfs-kernel-server 
    Client: sudo apt install nfs-common 

    2 - Mount the SSD on the server (after partitioning it and formatting it with the filesystem you want to use):


    jordim@instance-5:~$ lsblk 
    sdb      8:16   0  50G  0 disk  <--- My extra SSD disk
    sda      8:0    0  10G  0 disk 
    └─sda1   8:1    0  10G  0 part /
    jordim@instance-5:~$ sudo fdisk /dev/sdb
    (create a single primary partition; it is formatted as ext4 below)
    jordim@instance-5:~$ lsblk 
    sdb      8:16   0  50G  0 disk 
    └─sdb1   8:17   0  50G  0 part <- Newly created partition
    sda      8:0    0  10G  0 disk 
    └─sda1   8:1    0  10G  0 part /
    jordim@instance-5:~$ sudo mkfs.ext4 /dev/sdb1
    jordim@instance-5:~$ sudo mkdir /mount
    jordim@instance-5:~$ sudo mount /dev/sdb1 /mount/

    Make a dir for your NFS share folder:

    jordim@instance-5:/mount$ sudo mkdir share

    Now configure the exports on your server. Add the folder to share and the private IPs of the clients. You can also set permissions here: use "ro" for read-only or "rw" for read-write.

    jordim@instance-5:/mount$ sudo vim /etc/exports 

    (inside the exports file, note the IP is the private IP of the client instance):
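
    For illustration, an exports entry generally has the form "directory client(options)". The sketch below prints one such line; the client IP 10.128.0.4 and the option list are assumptions for the example, not values from this setup.

```shell
# Build a single /etc/exports entry: <directory> <client>(<options>)
exports_line() {
  printf '%s %s(%s)\n' "$1" "$2" "$3"
}

# Hypothetical client private IP; "ro" makes the share read-only for it.
exports_line "/mount/share" "10.128.0.4" "ro,sync,no_subtree_check"

# To apply on the server (requires root):
#   exports_line ... | sudo tee -a /etc/exports
#   sudo exportfs -ra    # re-read /etc/exports
```

    Add one entry per client, or use a subnet in place of the single IP if all clients live in the same private network.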


    Now start the nfs service on the server:

    root@instance-5:/mount# systemctl start nfs-server

    Now create the mount point on the client:

    jordim@instance-4:~$ sudo mkdir -p /nfs/share

    And mount the folder:

    jordim@instance-4:~$ sudo mount <server-private-ip>:/mount/share /nfs/share
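
    If the clients should remount the share after a reboot, an /etc/fstab entry is one way to do it. This is a minimal sketch; the server IP 10.128.0.5 is a hypothetical value standing in for the NFS server's private IP.

```shell
# Build an /etc/fstab entry for a read-only NFS mount:
# <server>:<export>  <mount point>  nfs  <options>  <dump>  <pass>
fstab_line() {
  local server="$1" export_dir="$2" mount_point="$3"
  printf '%s:%s %s nfs ro,defaults 0 0\n' "$server" "$export_dir" "$mount_point"
}

# Hypothetical server private IP; replace with your NFS server's address.
fstab_line "10.128.0.5" "/mount/share" "/nfs/share"

# To apply on the client (requires root):
#   fstab_line ... | sudo tee -a /etc/fstab
#   sudo mount /nfs/share
```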

    Now let's test it:


    jordim@instance-5:/mount/share$ touch test


    jordim@instance-4:/nfs/share$ ls

    Also, see the mounts:

    jordim@instance-4:/nfs/share$ df -h
    Filesystem               Size  Used Avail Use% Mounted on
    udev                     1.8G     0  1.8G   0% /dev
    tmpfs                    370M  9.9M  360M   3% /run
    /dev/sda1                9.7G  1.5G  8.2G  16% /
    tmpfs                    1.9G     0  1.9G   0% /dev/shm
    tmpfs                    5.0M     0  5.0M   0% /run/lock
    tmpfs                    1.9G     0  1.9G   0% /sys/fs/cgroup
    tmpfs                    370M     0  370M   0% /run/user/1001
    <server-private-ip>:/mount/share   50G   52M   47G   1% /nfs/share

    There you go: you now have a single instance with an R/W disk and as many clients as you want with read-only access.
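
    Since I can't say up front how the share compares with a local disk or a bucket, it is worth measuring. A simple sketch that reads every file under a directory once and reports the elapsed time (the function name is my own, and the directory in the usage comment is just the client mount point from above):

```shell
# Read every file under a directory once and report bytes read and seconds taken.
time_read_pass() {
  local dir="$1" start end count
  start=$(date +%s)
  # Concatenate all files and count the bytes, so each file is fully read.
  count=$(find "$dir" -type f -exec cat {} + | wc -c | tr -d ' ')
  end=$(date +%s)
  echo "read $count bytes from $dir in $((end - start))s"
}

# Usage on a client:
#   time_read_pass /nfs/share
```

    Keep in mind that a second pass over the same files may be served from the page cache, so compare first passes when judging the storage itself.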