
What is the best way to feed image data (tfrecords) from GCS to your model?


I set myself the goal of tackling the MNIST Skin Cancer dataset using only Google Cloud: GCS and Kubeflow on Google Kubernetes Engine.

I converted the data from JPEG to TFRecord format with the following script: https://github.com/tensorflow/tpu/blob/master/tools/datasets/jpeg_to_tf_record.py

I have seen a lot of examples of feeding a CSV file to a model, but none that use image data.

Would it be smart to copy all the TFRecords to Google Cloud Shell and feed the data to my model from there? Or are there better methods available?

Thanks in advance.


Solution

  • If you are using Kubeflow, I would suggest using Kubeflow Pipelines.

    For the preprocessing you could use an image that is built on top of the standard pipeline Dataflow image, gcr.io/ml-pipeline/ml-pipeline-dataflow-tft:latest, into which you simply copy your Dataflow code and run it:

    FROM gcr.io/ml-pipeline/ml-pipeline-dataflow-tft:latest
    RUN mkdir /{folder}
    COPY run_dataflow_pipeline.py /{folder}
    ENTRYPOINT ["python", "/{folder}/run_dataflow_pipeline.py"]
    

    See this boilerplate for the Dataflow code that does exactly this. The idea is that the Dataflow job writes the TFRecords to Google Cloud Storage (GCS), so nothing needs to be copied to Cloud Shell.

    Subsequently you could use Google Cloud's ML Engine for the actual training. In this case you can also start from the image google/cloud-sdk:latest and copy over the required files, probably together with a bash script that runs the gcloud commands to start the training job.

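    FROM google/cloud-sdk:latest
    # WORKDIR creates the directory if needed and persists for later instructions
    # (a `cd` inside RUN does not carry over to COPY or ENTRYPOINT).
    WORKDIR /{src}
    COPY train.sh ./
    ENTRYPOINT ["bash", "./train.sh"]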
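
As a rough illustration of the kind of gcloud call such a script would issue, here is a minimal Python sketch that only assembles the command line. It assumes the `gcloud ai-platform jobs submit training` command (the later name of the ML Engine CLI); the trainer package layout (`trainer.task`, `trainer/`), the `jobs/` folder in the bucket, and the `--train-files` argument are hypothetical and not taken from the answer.

```python
import shlex


def build_training_command(job_name, bucket, region,
                           module_name="trainer.task",  # hypothetical module
                           package_path="trainer/",     # hypothetical package dir
                           extra_args=None):
    """Return the gcloud argv list that submits a training job."""
    cmd = [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        "--region", region,
        # Job outputs (checkpoints, logs) go to this GCS folder (assumed layout).
        "--job-dir", f"gs://{bucket}/jobs/{job_name}",
        "--module-name", module_name,
        "--package-path", package_path,
    ]
    if extra_args:
        # Everything after a bare "--" is passed through to the trainer itself.
        cmd += ["--"] + list(extra_args)
    return cmd


if __name__ == "__main__":
    command = build_training_command(
        "skin_cancer_1", "my-bucket", "europe-west1",
        extra_args=["--train-files", "gs://my-bucket/data/TFR/train"],
    )
    # Print a copy-pasteable line, e.g. for use inside train.sh.
    print(shlex.join(command))
```

The function returns a list rather than a string so that it can be handed to `subprocess.run` without shell quoting issues; `shlex.join` is only used for display.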
    

    An elegant way to pass the storage location of your TFRecords to your model is to use tf.data, which can read directly from gs:// paths:

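    import tensorflow as tf
    from google.cloud import storage

    # BUCKET_NAME is the name of your bucket; decode is your own function that
    # parses one serialized tf.train.Example into an (image, label) pair.
    bucket = storage.Client().bucket(BUCKET_NAME)

    # Construct a TFRecordDataset from every object under each prefix
    train_records = [f'gs://{BUCKET_NAME}/{f.name}' for f in
                     bucket.list_blobs(prefix='data/TFR/train')]
    validation_records = [f'gs://{BUCKET_NAME}/{f.name}' for f in
                          bucket.list_blobs(prefix='data/TFR/validation')]
    
    ds_train = tf.data.TFRecordDataset(train_records, num_parallel_reads=4).map(decode)
    ds_val = tf.data.TFRecordDataset(validation_records, num_parallel_reads=4).map(decode)
    
    # Potential additional steps for performance (Keras also expects batched
    # datasets, so add .batch(...) if decode returns single examples):
    # https://www.tensorflow.org/guide/performance/datasets
    
    # Train the model
    model.fit(ds_train,
              validation_data=ds_val,
              # ... other fit() arguments ...
              verbose=2)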
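
The `decode` function passed to `.map(...)` is not shown in the answer. A minimal sketch of what it might look like follows, assuming the records carry the ImageNet-style feature keys `image/encoded` and `image/class/label` (check the keys your conversion script actually writes). The image size, batch size, shuffle buffer, and the `make_dataset` helper are hypothetical choices for illustration.

```python
import tensorflow as tf

IMAGE_SIZE = 224  # hypothetical target resolution

# Assumed feature keys; adjust to match what your TFRecords contain.
FEATURES = {
    "image/encoded": tf.io.FixedLenFeature([], tf.string),
    "image/class/label": tf.io.FixedLenFeature([], tf.int64),
}


def decode(serialized):
    """Parse one serialized tf.train.Example into an (image, label) pair."""
    parsed = tf.io.parse_single_example(serialized, FEATURES)
    image = tf.io.decode_jpeg(parsed["image/encoded"], channels=3)
    # Resize to a fixed shape and scale pixel values to [0, 1].
    image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE]) / 255.0
    label = tf.cast(parsed["image/class/label"], tf.int32)
    return image, label


def make_dataset(files, batch_size=32, training=True):
    """Build a batched, prefetched dataset from local or gs:// TFRecord files."""
    ds = tf.data.TFRecordDataset(files, num_parallel_reads=4)
    if training:
        ds = ds.shuffle(1024)  # shuffle serialized records before decoding
    ds = ds.map(decode, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

With this helper, `make_dataset(train_records)` and `make_dataset(validation_records, training=False)` could stand in for `ds_train` and `ds_val`; batching and prefetching are the usual first steps from the performance guide linked in the code.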
    

    Check out this blog post for an actual implementation of a similar (more complex) Kubeflow pipeline.