
Options for storing many small images for fast batch access on Google Cloud?


We have a few datasets of small images, where each image is about 100KB and there are about 50K images per dataset (around 5GB per dataset). We typically batch-load each image incrementally into the memory of a Google Compute Engine VM instance in order to perform machine learning studies. This is done several times a day.

Currently, a few of us each have our own Google Persistent Disk attached to the VM, with the datasets replicated on each. This is not ideal since the disks are pricey; however, data access is very fast, which lets us iterate on our studies fairly rapidly. We don't share one disk because of the inconvenience of managing read/write settings when sharing Google persistent disks.

Is there an alternative Google Cloud option to handle this use case? Google Cloud Storage buckets are too slow for us because they involve reading many small files.


Solution

  • If your main interest is rapid I/O, your best bet is an SSD persistent disk, for obvious reasons. What I don't understand is why you don't want to share one disk. You can have one SSD attached to one of your instances as read/write for loading and modifying your datasets, and mounted read-only on the instances that only need to fetch the data.
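
    For reference, this is roughly how a persistent disk is attached with gcloud; the instance, disk and zone names below are placeholders, not from the question. Note that attaching the same disk directly to several instances requires every attachment to be read-only, which is why the NFS setup described further down is useful if you want one writer and several readers at the same time:

    # Attach the SSD persistent disk read/write to the instance that manages the data
    # (instance, disk and zone names are placeholders)
    gcloud compute instances attach-disk data-server \
        --disk datasets-ssd --mode rw --zone us-east1-b

    # Or attach it read-only to any number of instances
    # (only possible while no instance has it attached read/write)
    gcloud compute instances attach-disk reader-1 \
        --disk datasets-ssd --mode ro --zone us-east1-b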

    I'm not sure how much faster this solution will be compared to using a bucket, though. Are you aware that gsutil has an option for parallel transfers, which can significantly increase transfer speed, especially when transferring a lot of small files? The flag is -m (a usage example follows the help text below):

     -m           Causes supported operations (acl ch, acl set, cp, mv, rm, rsync,
                  and setmeta) to run in parallel. This can significantly improve
                  performance if you are performing operations on a large number of
                  files over a reasonably fast network connection.
                  gsutil performs the specified operation using a combination of
                  multi-threading and multi-processing, using a number of threads
                  and processors determined by the parallel_thread_count and
                  parallel_process_count values set in the boto configuration
                  file. You might want to experiment with these values, as the
                  best values can vary based on a number of factors, including
                  network speed, number of CPUs, and available memory.  
    
                  Using the -m option may make your performance worse if you
                  are using a slower network, such as the typical network speeds
                  offered by non-business home network plans. It can also make
                  your performance worse for cases that perform all operations
                  locally (e.g., gsutil rsync, where both source and destination
                  URLs are on the local disk), because it can "thrash" your local
                  disk.  
    
                  If a download or upload operation using parallel transfer fails
                  before the entire transfer is complete (e.g. failing after 300 of
                  1000 files have been transferred), you will need to restart the
                  entire transfer.  
    
                  Also, although most commands will normally fail upon encountering
                  an error when the -m flag is disabled, all commands will
                  continue to try all operations when -m is enabled with multiple
                  threads or processes, and the number of failed operations (if any)
                  will be reported at the end of the command's execution.
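
    For example, a parallel copy of a whole dataset from a bucket to the VM could look like this (the bucket and directory names are placeholders):

    # Parallel copy of one dataset from the bucket to local disk
    gsutil -m cp -r gs://my-datasets-bucket/dataset-a ./dataset-a

    # Or keep a local copy in sync with the bucket
    gsutil -m rsync -r gs://my-datasets-bucket/dataset-a ./dataset-a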
    

    If you want to go with one instance holding the R/W SSD and multiple read-only clients, see below:

    One option is to set up an NFS share on your SSD: one instance will act as the NFS server with read/write access, and the rest will mount the share with read-only permissions. I will be using Ubuntu 16.04, but the process is similar on most distros:

    1 - Install the required packages on both server and clients:

    Server: sudo apt install nfs-kernel-server 
    Client: sudo apt install nfs-common 
    

    2 - Mount the SSD disk on the server (after formatting it with the filesystem you want to use):

    Server:

    jordim@instance-5:~$ lsblk 
    NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sdb      8:16   0  50G  0 disk  <--- My extra SSD disk
    sda      8:0    0  10G  0 disk 
    └─sda1   8:1    0  10G  0 part /
    
    jordim@instance-5:~$ sudo fdisk /dev/sdb
    (create a single primary partition; it will be formatted as ext4 below)
    
    jordim@instance-5:~$ lsblk 
    NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sdb      8:16   0  50G  0 disk 
    └─sdb1   8:17   0  50G  0 part <- Newly created partition
    sda      8:0    0  10G  0 disk 
    └─sda1   8:1    0  10G  0 part /
    
    jordim@instance-5:~$ sudo mkfs.ext4 /dev/sdb1
    (...)
    jordim@instance-5:~$ sudo mkdir /mount
    
    jordim@instance-5:~$ sudo mount /dev/sdb1 /mount/
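
    (If you prefer to script the partitioning instead of using interactive fdisk, something along these lines should give the same result; it assumes the extra SSD really is /dev/sdb, as in the lsblk output above:)

    # Non-interactive alternative to the fdisk + mkfs steps above (assumes /dev/sdb)
    sudo parted --script /dev/sdb mklabel msdos mkpart primary ext4 0% 100%
    sudo mkfs.ext4 /dev/sdb1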
    

    Make a directory for your NFS share:

    jordim@instance-5:/mount$ sudo mkdir share
    

    Now configure the exports on your server. Add the folder to share and the private IPs of the clients. You can also tweak permissions here: use "ro" for read-only or "rw" for read/write access.

    jordim@instance-5:/mount$ sudo vim /etc/exports 
    

    (inside the exports file, note the IP is the private IP of the client instance):

    /mount/share    10.142.0.5(ro,sync,no_subtree_check)
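
    (If you have several read-only clients you can list one entry per client, or export to a whole subnet; the IPs and subnet below are only examples:)

    # one entry per client:
    /mount/share    10.142.0.5(ro,sync,no_subtree_check) 10.142.0.7(ro,sync,no_subtree_check)
    # or an entire subnet:
    /mount/share    10.142.0.0/20(ro,sync,no_subtree_check)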
    

    Now start the nfs service on the server:

    root@instance-5:/mount# systemctl start nfs-server
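
    You may also want to enable the service at boot and re-read /etc/exports whenever you change it (nfs-kernel-server is the unit name on Ubuntu 16.04):

    root@instance-5:/mount# systemctl enable nfs-kernel-server
    root@instance-5:/mount# exportfs -ra    # re-export everything in /etc/exports
    root@instance-5:/mount# exportfs -v     # verify what is currently exported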
    

    Now create the mountpoint on the client:

    jordim@instance-4:~$ sudo mkdir -p /nfs/share
    

    And mount the folder:

    jordim@instance-4:~$ sudo mount 10.142.0.6:/mount/share /nfs/share
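
    (To make this mount survive reboots, a line like the following can be added to /etc/fstab on the client; the IP is the server's private IP from this example:)

    # NFS share from the example above, mounted read-only
    10.142.0.6:/mount/share  /nfs/share  nfs  ro  0  0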
    

    Now let's test it:

    Server:

    jordim@instance-5:/mount/share$ touch test
    

    Client:

    jordim@instance-4:/nfs/share$ ls
    test
    

    Also, see the mounts:

    jordim@instance-4:/nfs/share$ df -h
    Filesystem               Size  Used Avail Use% Mounted on
    udev                     1.8G     0  1.8G   0% /dev
    tmpfs                    370M  9.9M  360M   3% /run
    /dev/sda1                9.7G  1.5G  8.2G  16% /
    tmpfs                    1.9G     0  1.9G   0% /dev/shm
    tmpfs                    5.0M     0  5.0M   0% /run/lock
    tmpfs                    1.9G     0  1.9G   0% /sys/fs/cgroup
    tmpfs                    370M     0  370M   0% /run/user/1001
    10.142.0.6:/mount/share   50G   52M   47G   1% /nfs/share
    

    There you go: now you have only one instance with a read/write disk and as many clients as you want with read-only access.
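
    As a final sanity check, a write attempt from a client should be rejected with something like this (the exact message may vary):

    jordim@instance-4:/nfs/share$ touch should-fail
    touch: cannot touch 'should-fail': Read-only file system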