python, azure, azure-batch

Azure Batch Data Science VM: Python packages missing


I'm using the Linux DSVM image (publisher microsoft-dsvm, offer linux-data-science-vm-ubuntu, SKU linuxdsvmubuntu).

My Python code fails at the first line, import pandas as pd, with a "module not found" error.

When I SSH into the node and run pip install pandas, it tells me the package is already installed. The same goes for numpy and the others.

I've tried to set up a start task with /bin/bash -c "pip install pandas" and similar commands, but it fails with "pip: command not found".

Again, when I run it from the SSH shell, pip is on the PATH and works without any problem.

Can anyone point me in the right direction?

The simple tutorials from Microsoft work fine because they don't rely on any external packages. I'm able to download my Python file and datasets from blob storage onto the machine, and Python itself runs OK. It looks as if pip and all the data-science packages are missing while the task is running, yet they are there when I SSH into the node.

Bonus question: is Jupyter supposed to be running on port 8000?


Solution

  • First, install pip on your compute nodes.

    /bin/bash -c "sudo apt-get -y update && export DEBIAN_FRONTEND=noninteractive && sudo apt-get install -y python3-pip && sudo pip3 install pandas;"
    

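    Provide this command as the start task of the Azure Batch pool; it installs pip and pandas on every virtual machine in the pool. Because the command uses sudo and apt-get, the start task typically has to run with elevated (admin) access.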

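    In the same way, list all the libraries you want to install in a requirements.txt file and run sudo pip3 install -r requirements.txt after installing pip. The file has to be present on the node, for example as a resource file downloaded with the start task.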
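Since the question describes packages that are visible over SSH but not inside the task, it may help to check which interpreter the task actually uses before and after adding the start task. The following is a minimal diagnostic sketch, not part of the original answer; the function names find_missing and describe_environment and the module list are illustrative assumptions, and only the Python standard library is used.

```python
import importlib.util
import os
import sys


def find_missing(modules):
    """Return the top-level module names that this interpreter cannot find."""
    # find_spec returns None when a top-level module is not installed
    # for the interpreter that is running this script.
    return [name for name in modules if importlib.util.find_spec(name) is None]


def describe_environment(modules):
    """Collect details that can differ between an SSH shell and a Batch task."""
    return {
        "executable": sys.executable,          # which Python binary is running
        "version": sys.version.split()[0],     # e.g. "3.8.10"
        "path": os.environ.get("PATH", ""),    # PATH as seen by this process
        "missing": find_missing(modules),      # modules that cannot be imported
    }


if __name__ == "__main__":
    # Example module list (an assumption); use the imports your script needs.
    report = describe_environment(["pandas", "numpy"])
    for key, value in report.items():
        print(f"{key}: {value}")
```

Run the script once from the SSH shell and once as a Batch task, then compare the output. If the executable or PATH lines differ, the task is probably not using the same Python installation as the interactive shell, which would explain why pip and pandas appear to be missing. In that case the packages need to be installed for the interpreter the task really runs, or the task command line needs to call the other interpreter by its full path.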