I'm trying out a few experiments using Google's AI Platform and have a few questions regarding that.
Basically, my project is structured as per the docs, with a trainer task and a separate batch-prediction task. I want to understand how AI Platform allocates resources to the jobs I submit. My doubts arise when I compare it with distributed frameworks such as Spark, TensorFlow and PyTorch.
Those engines/libraries have distributed workers with dedicated coordination systems, and they ship distributed implementations of their machine learning algorithms. My tasks are written with scikit-learn, which has no such distributed computing capabilities, so how do these computations parallelize across the cluster that AI Platform provisions?
I'm following the docs here. The command I'm using:
gcloud ai-platform jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $TRAINING_PACKAGE_PATH \
  --module-name $MAIN_TRAINER_MODULE \
  --region $REGION \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  --scale-tier $SCALE_TIER
Any help/ clarifications would be appreciated!
Alas, AI Platform Training can't automatically distribute your scikit-learn tasks. It just provisions the machines you ask for (via --scale-tier), deploys your package to each node, and runs your module there; it's up to your code to coordinate any work across machines.
You might want to try a distributed backend such as Dask to scale the work out -- it provides a Joblib backend, so the parts of scikit-learn that already parallelize via Joblib (anything with an n_jobs parameter) can be farmed out to a Dask cluster instead of just local cores.
I found one tutorial here: https://matthewrocklin.com/blog/work/2017/02/07/dask-sklearn-simple
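For a rough idea, here's a minimal sketch of what that looks like. The scheduler address is a placeholder (point it at wherever your dask-scheduler is running), and the estimator/dataset are just illustrative:

from dask.distributed import Client  # importing distributed registers the "dask" joblib backend
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder address -- replace with your own Dask scheduler endpoint
client = Client("tcp://scheduler-address:8786")

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)

# Work that scikit-learn would normally parallelize with Joblib on local cores
# (here, fitting the individual trees) gets shipped to the Dask workers instead.
with joblib.parallel_backend("dask"):
    clf.fit(X, y)

Note that this only spreads out scikit-learn's existing Joblib-level parallelism; algorithms that don't use n_jobs won't magically become distributed, and the training data still has to fit in memory on the workers.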
Hope that helps!