Tags: python, azure, object-detection-api, kubeflow, kubeflow-pipelines

How to set up data access in a distributed training job (TF Job) for Object Detection API on Azure


I've been trying to set up distributed training for the TensorFlow Object Detection API on Azure for a while, and I'm a bit confused about how exactly to get my data into the job.

Previously, I made this work pretty easily on gcloud using AI Platform. All I needed was:

gcloud ai-platform jobs submit training $JOB_NAME \
    --runtime-version $VERSION \
    --job-dir=$JOB_DIR \
    --packages $OBJ_DET,$SLIM,$PYCOCOTOOLS \
    --module-name object_detection.model_main \
    --region us-central1 \
    --config $CONF/config.yaml \
    -- \
    --model_dir=$MODEL_DIR \
    --pipeline_config_path=$PIPELINE_PATH

Here config.yaml contained the cluster configuration, and JOB_DIR, MODEL_DIR, and PIPELINE_PATH all pointed to their respective bucket locations (gs://*). My training data was stored in the bucket as well, and its location was specified in my pipeline.config.
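
For reference, the cluster configuration in that config.yaml was just an AI Platform trainingInput block along these lines (the machine types and counts here are only illustrative):

    trainingInput:
      scaleTier: CUSTOM              # custom layout for distributed training
      masterType: standard_gpu       # single-GPU master (illustrative)
      workerType: standard_gpu
      workerCount: 2
      parameterServerType: standard
      parameterServerCount: 1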

Now, on Azure, there doesn't seem to be a direct way to run a distributed training job. I've deployed a GPU-accelerated Kubernetes cluster with AKS and installed the NVIDIA drivers. I've also deployed Kubeflow and dockerized the Object Detection API.

My data, in the form of tfrecords, is in an Azure Blob Storage container. The Kubeflow examples/documentation I'm looking at (TFJob, AzureEndtoEnd) allocate persistent volumes, which seems great, but I don't understand how my job/training code will access my tfrecords.

(I've been wondering if I could do something in the preprocessing part of the Azure End to End pipeline: write a few lines of Python there to download the data using the azure-storage-blob library. This is still conjecture; I haven't tried it yet. A rough sketch of what I have in mind is below.)
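
A minimal sketch of that idea, assuming the storage connection string is injected as an environment variable and treating the container name and local path below as placeholders:

    import os
    from azure.storage.blob import BlobServiceClient

    # Placeholders: the connection string would come from a Kubernetes secret,
    # "tfrecords" is the blob container holding the data, and /mnt/data is
    # wherever pipeline.config expects to find the records.
    conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    container = BlobServiceClient.from_connection_string(conn_str).get_container_client("tfrecords")

    local_dir = "/mnt/data"
    os.makedirs(local_dir, exist_ok=True)

    # Download every blob in the container, preserving any folder structure.
    for blob in container.list_blobs():
        target = os.path.join(local_dir, blob.name)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as f:
            f.write(container.download_blob(blob.name).readall())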

So any help with this conundrum would be appreciated. I would also appreciate being pointed to any useful, up-to-date resources. Here are two of the other resources I've looked at:


Solution

  • Okay, I ended up figuring this out myself. It turns out you can define a persistent volume claim on top of a storage class, and the storage class can be set up to provision an Azure File share, which makes everything much more convenient.

    sc.yaml:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: <NAME>
    # the azure-file provisioner dynamically creates an Azure Files share in the storage account named below
    provisioner: kubernetes.io/azure-file
    mountOptions:
      - dir_mode=0777
      - file_mode=0777
      - uid=0
      - gid=0
      - mfsymlinks
      - cache=strict
    parameters:
      storageAccount: <STORAGE_ACC_NAME>
    

    pvc.yaml:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: <NAME>
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: <NAME>   # must match the StorageClass metadata.name
      resources:
        requests:
          storage: 20Gi
    

    The storage class and the persistent volume claim can then be created with:

    kubectl apply -f sc.yaml
    kubectl apply -f pvc.yaml
    

    After this, a file share shows up in the specified storage account, and you can use the usual Azure Files tooling to upload data into it (for example, azcopy to move data from your local machine or from an existing share/container). Once the PVC is mounted into the TFJob, the training code sees the tfrecords as ordinary files; sketches of both steps follow.
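
    For the upload, an azcopy invocation along these lines works against the provisioned share (the share name, local path and SAS token below are placeholders):

    azcopy copy "./tfrecords" "https://<STORAGE_ACC_NAME>.file.core.windows.net/<SHARE_NAME>/tfrecords?<SAS_TOKEN>" --recursive

    To make the data visible to the training code, the TFJob mounts the claimed PVC into each replica, and the paths in pipeline.config then point under the mount path. A trimmed sketch, with the image name, command, mount path and apiVersion all depending on your container and Kubeflow version:

    apiVersion: kubeflow.org/v1
    kind: TFJob
    metadata:
      name: object-detection-training
    spec:
      tfReplicaSpecs:
        Chief:
          replicas: 1
          template:
            spec:
              containers:
                - name: tensorflow                # TFJob looks for a container with this name
                  image: <YOUR_OBJ_DET_IMAGE>     # the dockerized Object Detection API
                  command:
                    - python
                    - object_detection/model_main.py
                    - --model_dir=/mnt/azure/model
                    - --pipeline_config_path=/mnt/azure/pipeline.config
                  volumeMounts:
                    - name: azure-files
                      mountPath: /mnt/azure       # tfrecords, pipeline.config and checkpoints live here
              volumes:
                - name: azure-files
                  persistentVolumeClaim:
                    claimName: <NAME>             # the PVC from pvc.yaml

    Worker replicas are declared the same way under tfReplicaSpecs, each mounting the same PVC, which is exactly what the ReadWriteMany access mode is for.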