
How to train a PyTorch model with a custom dataset on a TPU?


FYI, I can only spare the initial $300 since I'm a student, so I need to minimize the trial-and-error phase.

I have a PyTorch-based model that currently runs on a local GPU with a ~100 GB dataset of frames in my local storage. I'm looking for a guide that shows how to set up a machine to train and test my model on TPUs, with the dataset in my Google Drive(?) (or any other recommended cloud storage).

The guides I found don't match my description: most of them run on a GPU, or on a TPU but with a dataset that ships with a dataset library. I'd prefer not to waste time and budget trying to assemble a puzzle from those pieces.


Solution

  • First, to use TPUs on Google Cloud you have to use the PyTorch/XLA library, as it is what enables PyTorch to run on TPUs.

    There are a couple of options for doing so: you can use Colab or create an environment on GCP. I understand you may want to know what working in a "real environment" is like, as opposed to Colab, but there is not much difference, and Colab is often used as the main environment for ML development. A minimal sketch of what PyTorch/XLA changes in a training script follows.
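
    The sketch below assumes torch_xla is installed; the model and tensors are toy placeholders. The key point is that xm.xla_device() plays the role that torch.device("cuda") plays on a GPU:

        import torch
        import torch.nn as nn
        import torch_xla.core.xla_model as xm

        device = xm.xla_device()             # TPU core, analogous to torch.device("cuda")
        model = nn.Linear(10, 2).to(device)  # any nn.Module moves over the same way
        x = torch.randn(4, 10).to(device)
        print(model(x).shape)                # torch.Size([4, 2]), computed on the TPU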

    • About your dataset, I recommend uploading it to Google Cloud Storage and accessing it via a gs:// URL like gs://bucket_name/data.csv. It also has a free tier. A sketch of reading frames straight from a bucket is shown below.
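
    One way to read frames straight from a bucket is the gcsfs package; here is a minimal sketch (the bucket path is a hypothetical placeholder, and the Dataset class is mine, not from any library):

        import io
        import gcsfs
        from PIL import Image
        from torch.utils.data import Dataset

        class GCSFrameDataset(Dataset):
            """Serves image frames directly out of a GCS bucket."""

            def __init__(self, bucket_prefix="gs://my-bucket/frames", transform=None):
                self.fs = gcsfs.GCSFileSystem()         # picks up your gcloud credentials
                self.paths = self.fs.ls(bucket_prefix)  # one object path per frame
                self.transform = transform

            def __len__(self):
                return len(self.paths)

            def __getitem__(self, idx):
                with self.fs.open(self.paths[idx], "rb") as f:
                    img = Image.open(io.BytesIO(f.read())).convert("RGB")
                return self.transform(img) if self.transform else img

    For ~100 GB of frames, copying the bucket to the VM's local disk once (gsutil cp -r) and training from there is usually faster than streaming every item over the network.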

    Also, keep in mind that a TPU instance plus a notebook on GCP will drain your $300 in a few days (or hours). A TPU v3 ready for PyTorch alone costs around $6k/month.

    On Colab:
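
    Select Runtime > Change runtime type > TPU, then install PyTorch/XLA in the runtime and check that the device is visible. The install command below is an assumption that changes between releases; check the PyTorch/XLA README for the current one:

        # Wheel/version spec is an assumption; see the PyTorch/XLA README.
        !pip install torch_xla

        import torch_xla.core.xla_model as xm
        print(xm.xla_device())  # should print an XLA device such as xla:0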

    On GCP:

    • Enable the TPU API and create a TPU instance.

    • Create a notebook to write your code.

    • Set the XRT_TPU_CONFIG environment variable with the IP of your TPU in the code:

        import os

        os.environ["XRT_TPU_CONFIG"] = "tpu_worker;0;10.0.200.XX:8470"

    • Follow the code examples on how to import and use the libraries correctly; a minimal end-to-end training sketch follows.
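
    A minimal training loop on a single TPU core, assuming torch_xla is installed; the model, data, and hyperparameters are toy placeholders, but xm.xla_device() and xm.optimizer_step() are the real PyTorch/XLA calls:

        import torch
        import torch.nn as nn
        import torch_xla.core.xla_model as xm
        from torch.utils.data import DataLoader, TensorDataset

        device = xm.xla_device()

        # Toy stand-ins for your own model and frame dataset
        model = nn.Linear(10, 2).to(device)
        data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
        loader = DataLoader(data, batch_size=16)

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = nn.CrossEntropyLoss()

        for epoch in range(2):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                optimizer.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                # barrier=True forces XLA to execute the step; without a
                # parallel loader this replaces the usual optimizer.step()
                xm.optimizer_step(optimizer, barrier=True)
            print(f"epoch {epoch}: loss {loss.item():.4f}")

    Scaling to all TPU cores later means wrapping this loop with PyTorch/XLA's multiprocessing helpers (torch_xla.distributed.xla_multiprocessing), but the single-core version above is enough to verify the whole setup.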