
Options for storing many small images for fast batch access on Google Cloud?


We have a few datasets of small images, where each image is about 100KB and there are about 50K images per dataset (around 5GB per dataset). We typically batch-load each image incrementally into the memory of a Google Compute Engine VM instance in order to perform machine learning studies. This is done several times a day.

Currently, a few of us each have our own Google Persistent Disk attached to the VM, with the datasets replicated on each. This is not ideal since the disks are pricey; however, data access is very fast, which lets us iterate on our studies fairly rapidly. We don't share one disk because of the inconvenience of managing read/write settings when sharing Google persistent disks.

Is there an alternative Google Cloud option to handle this use case? Google Cloud Storage buckets are too slow for us because they involve reading many small files.


Solution

  • If your main interest is rapid I/O, your best bet is an SSD persistent disk, for obvious reasons. What I don't understand is why you don't want to share one disk. You can have one SSD attached to one of your instances as read/write for loading and modifying your datasets, and mounted read-only on the instances that only need to fetch the data.
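
    For reference, this is roughly how a persistent disk is attached with gcloud; the instance, disk and zone names below are placeholders, not from the question. Note that attaching the same disk directly to several instances requires every attachment to be read-only, which is why the NFS setup described further down is useful if you want one writer and several readers at the same time:

    # Attach the SSD persistent disk read/write to the instance that manages the data
    # (instance, disk and zone names are placeholders)
    gcloud compute instances attach-disk data-server \
        --disk datasets-ssd --mode rw --zone us-east1-b

    # Or attach it read-only to any number of instances
    # (only possible while no instance has it attached read/write)
    gcloud compute instances attach-disk reader-1 \
        --disk datasets-ssd --mode ro --zone us-east1-b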

    I'm not sure how much faster this solution will be compared to using a bucket, though. Are you aware that gsutil has an option for parallel transfers, which can significantly increase transfer speed, especially when transferring a lot of small files? The flag is -m (a usage example follows the help text below):

     -m           Causes supported operations (acl ch, acl set, cp, mv, rm, rsync,
                  and setmeta) to run in parallel. This can significantly improve
                  performance if you are performing operations on a large number of
                  files over a reasonably fast network connection.
                  gsutil performs the specified operation using a combination of
                  multi-threading and multi-processing, using a number of threads
                  and processors determined by the parallel_thread_count and
                  parallel_process_count values set in the boto configuration
                  file. You might want to experiment with these values, as the
                  best values can vary based on a number of factors, including
                  network speed, number of CPUs, and available memory.  
    
                  Using the -m option may make your performance worse if you
                  are using a slower network, such as the typical network speeds
                  offered by non-business home network plans. It can also make
                  your performance worse for cases that perform all operations
                  locally (e.g., gsutil rsync, where both source and destination
                  URLs are on the local disk), because it can "thrash" your local
                  disk.  
    
                  If a download or upload operation using parallel transfer fails
                  before the entire transfer is complete (e.g. failing after 300 of
                  1000 files have been transferred), you will need to restart the
                  entire transfer.  
    
                  Also, although most commands will normally fail upon encountering
                  an error when the -m flag is disabled, all commands will
                  continue to try all operations when -m is enabled with multiple
                  threads or processes, and the number of failed operations (if any)
                  will be reported at the end of the command's execution.
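
    For example, a parallel copy of a whole dataset from a bucket to the VM could look like this (the bucket and directory names are placeholders):

    # Parallel copy of one dataset from the bucket to local disk
    gsutil -m cp -r gs://my-datasets-bucket/dataset-a ./dataset-a

    # Or keep a local copy in sync with the bucket
    gsutil -m rsync -r gs://my-datasets-bucket/dataset-a ./dataset-a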
    

    If you want to go with one instance holding the R/W SSD and multiple read-only clients, see below:

    One option is to set up an NFS share on your SSD: one instance will act as the NFS server with read/write access, and the rest will mount the share with read-only permissions. I will be using Ubuntu 16.04, but the process is similar on most distros:

    1 - Install the required packages on both server and clients:

    Server: sudo apt install nfs-kernel-server 
    Client: sudo apt install nfs-common 
    

    2 - Mount the SSD disk on the server (after formatting it with the filesystem you want to use):

    Server:

    jordim@instance-5:~$ lsblk 
    NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sdb      8:16   0  50G  0 disk  <--- My extra SSD disk
    sda      8:0    0  10G  0 disk 
    └─sda1   8:1    0  10G  0 part /
    
    jordim@instance-5:~$ sudo fdisk /dev/sdb
    (create a single primary partition; it will be formatted as ext4 below)
    
    jordim@instance-5:~$ lsblk 
    NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sdb      8:16   0  50G  0 disk 
    └─sdb1   8:17   0  50G  0 part <- Newly created partition
    sda      8:0    0  10G  0 disk 
    └─sda1   8:1    0  10G  0 part /
    
    jordim@instance-5:~$ sudo mkfs.ext4 /dev/sdb1
    (...)
    jordim@instance-5:~$ sudo mkdir /mount
    
    jordim@instance-5:~$ sudo mount /dev/sdb1 /mount/
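
    (If you prefer to script the partitioning instead of using interactive fdisk, something along these lines should give the same result; it assumes the extra SSD really is /dev/sdb, as in the lsblk output above:)

    # Non-interactive alternative to the fdisk + mkfs steps above (assumes /dev/sdb)
    sudo parted --script /dev/sdb mklabel msdos mkpart primary ext4 0% 100%
    sudo mkfs.ext4 /dev/sdb1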
    

    Make a directory for your NFS share:

    jordim@instance-5:/mount$ sudo mkdir share
    

    Now configure the exports on your server. Add the folder to share and the private IPs of the clients. You can also tweak permissions here: use "ro" for read-only or "rw" for read/write access.

    jordim@instance-5:/mount$ sudo vim /etc/exports 
    

    (inside the exports file, note the IP is the private IP of the client instance):

    /mount/share    10.142.0.5(ro,sync,no_subtree_check)
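
    (If you have several read-only clients you can list one entry per client, or export to a whole subnet; the IPs and subnet below are only examples:)

    # one entry per client:
    /mount/share    10.142.0.5(ro,sync,no_subtree_check) 10.142.0.7(ro,sync,no_subtree_check)
    # or an entire subnet:
    /mount/share    10.142.0.0/20(ro,sync,no_subtree_check)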
    

    Now start the nfs service on the server:

    root@instance-5:/mount# systemctl start nfs-server
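
    You may also want to enable the service at boot and re-read /etc/exports whenever you change it (nfs-kernel-server is the unit name on Ubuntu 16.04):

    root@instance-5:/mount# systemctl enable nfs-kernel-server
    root@instance-5:/mount# exportfs -ra    # re-export everything in /etc/exports
    root@instance-5:/mount# exportfs -v     # verify what is currently exported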
    

    Now create the mountpoint on the client:

    jordim@instance-4:~$ sudo mkdir -p /nfs/share
    

    And mount the folder:

    jordim@instance-4:~$ sudo mount 10.142.0.6:/mount/share /nfs/share
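
    (To make this mount survive reboots, a line like the following can be added to /etc/fstab on the client; the IP is the server's private IP from this example:)

    # NFS share from the example above, mounted read-only
    10.142.0.6:/mount/share  /nfs/share  nfs  ro  0  0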
    

    Now let's test it:

    Server:

    jordim@instance-5:/mount/share$ touch test
    

    Client:

    jordim@instance-4:/nfs/share$ ls
    test
    

    Also, see the mounts:

    jordim@instance-4:/nfs/share$ df -h
    Filesystem               Size  Used Avail Use% Mounted on
    udev                     1.8G     0  1.8G   0% /dev
    tmpfs                    370M  9.9M  360M   3% /run
    /dev/sda1                9.7G  1.5G  8.2G  16% /
    tmpfs                    1.9G     0  1.9G   0% /dev/shm
    tmpfs                    5.0M     0  5.0M   0% /run/lock
    tmpfs                    1.9G     0  1.9G   0% /sys/fs/cgroup
    tmpfs                    370M     0  370M   0% /run/user/1001
    10.142.0.6:/mount/share   50G   52M   47G   1% /nfs/share
    

    There you go: now you have only one instance with a read/write disk and as many clients as you want with read-only access.
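
    As a final sanity check, a write attempt from a client should be rejected with something like this (the exact message may vary):

    jordim@instance-4:/nfs/share$ touch should-fail
    touch: cannot touch 'should-fail': Read-only file system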