Tags: python, azure, object-detection-api, kubeflow, kubeflow-pipelines

How to set up data access in a distributed training job (TF Job) for Object Detection API on Azure


I've been trying to set up distributed training for the TensorFlow Object Detection API on Azure for a while, and I'm a bit confused about how exactly to get my data into the job.

Previously, I made this work pretty easily on gcloud using AI Platform. All I needed was:

gcloud ai-platform jobs submit training $JOB_NAME \
    --runtime-version $VERSION \
    --job-dir=$JOB_DIR \
    --packages $OBJ_DET,$SLIM,$PYCOCOTOOLS \
    --module-name object_detection.model_main \
    --region us-central1 \
    --config $CONF/config.yaml \
    -- \
    --model_dir=$MODEL_DIR \
    --pipeline_config_path=$PIPELINE_PATH

Here config.yaml contained the cluster configuration, and JOB_DIR, MODEL_DIR, and PIPELINE_PATH all pointed to their respective bucket locations (gs://*). My training data was stored in the bucket as well, and its location was specified in my pipeline.config.
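
For reference, the cluster configuration in that config.yaml was just an AI Platform trainingInput block along these lines (the machine types and counts here are only illustrative):

    trainingInput:
      scaleTier: CUSTOM              # custom layout for distributed training
      masterType: standard_gpu       # single-GPU master (illustrative)
      workerType: standard_gpu
      workerCount: 2
      parameterServerType: standard
      parameterServerCount: 1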

Now, on Azure, there doesn't seem to be a direct way to run a distributed training job. I've deployed a GPU-accelerated Kubernetes cluster with AKS and installed the NVIDIA drivers. I've also deployed Kubeflow and dockerized the Object Detection API.

My data, in the form of tfrecords, is in an Azure Blob Storage container. The Kubeflow examples/documentation I'm looking at (TFJob, AzureEndtoEnd) allocate persistent volumes, which seems great, but I don't understand how my job/training code will access my tfrecords.

(I've been wondering if I could do something in the preprocessing part of the Azure End to End pipeline: write a few lines of Python there to download the data using the azure-storage-blob library. This is still conjecture; I haven't tried it yet. A rough sketch of what I have in mind is below.)
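
A minimal sketch of that idea, assuming the storage connection string is injected as an environment variable and treating the container name and local path below as placeholders:

    import os
    from azure.storage.blob import BlobServiceClient

    # Placeholders: the connection string would come from a Kubernetes secret,
    # "tfrecords" is the blob container holding the data, and /mnt/data is
    # wherever pipeline.config expects to find the records.
    conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    container = BlobServiceClient.from_connection_string(conn_str).get_container_client("tfrecords")

    local_dir = "/mnt/data"
    os.makedirs(local_dir, exist_ok=True)

    # Download every blob in the container, preserving any folder structure.
    for blob in container.list_blobs():
        target = os.path.join(local_dir, blob.name)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as f:
            f.write(container.download_blob(blob.name).readall())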

So any help with this conundrum would be appreciated. I would also appreciate being pointed to any useful, up-to-date resources. Here are two of the other resources I've looked at:


Solution

  • Okay, I ended up figuring this out myself. It turns out you can define a persistent volume claim on top of a storage class, and the storage class can be set up to provision an Azure File share, which makes everything much more convenient.

    sc.yaml:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: <NAME>
    # the azure-file provisioner dynamically creates an Azure Files share in the storage account named below
    provisioner: kubernetes.io/azure-file
    mountOptions:
      - dir_mode=0777
      - file_mode=0777
      - uid=0
      - gid=0
      - mfsymlinks
      - cache=strict
    parameters:
      storageAccount: <STORAGE_ACC_NAME>
    

    pvc.yaml:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: <NAME>
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: <NAME>   # must match the StorageClass metadata.name
      resources:
        requests:
          storage: 20Gi
    

    The storage class and the persistent volume claim can then be created with:

    kubectl apply -f sc.yaml
    kubectl apply -f pvc.yaml
    

    After this, a file share shows up in the specified storage account, and you can use the usual Azure Files tooling to upload data into it (for example, azcopy to move data from your local machine or from an existing share/container). Once the PVC is mounted into the TFJob, the training code sees the tfrecords as ordinary files; sketches of both steps follow.
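
    For the upload, an azcopy invocation along these lines works against the provisioned share (the share name, local path and SAS token below are placeholders):

    azcopy copy "./tfrecords" "https://<STORAGE_ACC_NAME>.file.core.windows.net/<SHARE_NAME>/tfrecords?<SAS_TOKEN>" --recursive

    To make the data visible to the training code, the TFJob mounts the claimed PVC into each replica, and the paths in pipeline.config then point under the mount path. A trimmed sketch, with the image name, command, mount path and apiVersion all depending on your container and Kubeflow version:

    apiVersion: kubeflow.org/v1
    kind: TFJob
    metadata:
      name: object-detection-training
    spec:
      tfReplicaSpecs:
        Chief:
          replicas: 1
          template:
            spec:
              containers:
                - name: tensorflow                # TFJob looks for a container with this name
                  image: <YOUR_OBJ_DET_IMAGE>     # the dockerized Object Detection API
                  command:
                    - python
                    - object_detection/model_main.py
                    - --model_dir=/mnt/azure/model
                    - --pipeline_config_path=/mnt/azure/pipeline.config
                  volumeMounts:
                    - name: azure-files
                      mountPath: /mnt/azure       # tfrecords, pipeline.config and checkpoints live here
              volumes:
                - name: azure-files
                  persistentVolumeClaim:
                    claimName: <NAME>             # the PVC from pvc.yaml

    Worker replicas are declared the same way under tfReplicaSpecs, each mounting the same PVC, which is exactly what the ReadWriteMany access mode is for.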