Tags: python, dask, dask-distributed, dask-kubernetes

What causes Dask futures to get stuck in 'pending' state?


I created a very slightly modified Dockerfile based on the dask-docker Dockerfile; it installs adlfs and copies one of my custom libraries into the container so that it's available to all worker nodes. I deployed the container to my Kubernetes cluster and connected to it from a REPL on my local machine, creating a client and a function locally:

>>> from dask.distributed import Client
>>> def add1(n): return n + 1
...
>>> client = Client(my_ip + ':8786')

But when I run client.submit, I get either a distributed.protocol.pickle "Failed to deserialize b'...'" error or a Future stuck in the 'pending' state:

>>> f = client.submit(add1, 2)
>>> distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x05\x95\xba\x03\x00\x00\x00\x00\x00\x00\x8c\x16tblib.pickling_support...'
...
ValueError: unsupported pickle protocol: 5
>>>
>>> f = client.submit(add1, 2)
>>> f
<Future: pending, key: add1-d5d2ff94399d4bb4e41150868f4c6da7>

It seems like the pickle protocol error occurs only once, on the first job I submit; afterward, everything is just stuck in pending.
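
One way to check whether this is version skew between the local client and the cluster is distributed's built-in version report; the following is a minimal sketch against the standard Client API, reusing the client created above:

    from pprint import pprint

    # check=True makes get_versions raise a ValueError if the Python/dask/
    # distributed versions on the client, scheduler, and workers disagree.
    versions = client.get_versions(check=True)

    # On success, the raw report can still be compared by eye:
    pprint(versions["client"])
    pprint(versions["scheduler"])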

From kubectl, I see that I have:

  • one LoadBalancer service named dask-scheduler,
  • two deployments: 1x dask-scheduler and 3x dask-worker,
  • and the corresponding pods: one dask-scheduler-... and three dask-worker-...

What would cause this, and how can I debug it? I opened the web interface to my Dask scheduler, and it shows an instance of add1 that has erred, but it gives no details.
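
To pull out the details the dashboard doesn't show, the erred future itself can be interrogated; this is a sketch using the standard distributed Future API, with add1 and client as above:

    import traceback

    f = client.submit(add1, 2)

    # A future's status moves from 'pending' to 'finished' or 'error'.
    print(f.status)

    if f.status == "error":
        print(repr(f.exception()))          # the remote exception object
        traceback.print_tb(f.traceback())   # the remote traceback
    else:
        # Blocks; raises the remote error locally, or a TimeoutError
        # if the future is still pending after 10 seconds.
        print(f.result(timeout=10))

The scheduler and worker pod logs (e.g. kubectl logs dask-worker-...) should also contain the matching server-side deserialization errors.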

For what it's worth, the only changes I made to the Dockerfile were:

    # ...
    && find /opt/conda/lib/python*/site-packages/bokeh/server/static -type f,l -name '*.js' -not -name '*.min.js' -delete \
    && rm -rf /opt/conda/pkgs

RUN pip install adlfs==0.3.0          # new line

COPY prepare.sh /usr/bin/prepare.sh   # existing line
COPY foobar.sh /usr/bin/foobar.sh     # new line
COPY my_file.so /usr/bin/my_file.so   # new line

Edit: I'll note that if I deploy the stock Dask image (image: "daskdev/dask:2.11.0" in my K8s manifest), things work just fine, so something in my customized image seems to leave Dask misconfigured. I commented out my changes to the Dockerfile, ran docker rmi on my local and ACR images, tore down the deployed service and deployments, then rebuilt the container, pushed it, and redeployed, but it still fails.


Solution

  • Looks like the issue is a version mismatch between the Dask image I was successfully deploying and the Dockerfile on which I based my own image. The former bakes in Dask 2.11.0, while the latter bakes in Dask 2.16.0 and Python 3.8. Some difference in these versions causes the issue; notably, pickle protocol 5 was introduced in Python 3.8, so a client running an older Python cannot deserialize messages pickled by the newer cluster, which matches the "unsupported pickle protocol: 5" error above.

    When I update my Dockerfile to use 2.11.0 and remove the explicit Python dependency, things work fine.
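
    As a sanity check after rebuilding, this sketch verifies the fix from the client side; it assumes the local environment now pins dask/distributed to the same 2.11.0 the image bakes in, with my_ip as in the question:

        import sys
        import dask
        import distributed
        from dask.distributed import Client

        # Local versions should now match what the image bakes in (2.11.0 here).
        print(sys.version_info[:3], dask.__version__, distributed.__version__)

        client = Client(my_ip + ":8786")
        client.get_versions(check=True)   # raises ValueError if any skew remains

        # The originally failing submission should now round-trip cleanly.
        assert client.submit(lambda n: n + 1, 2).result() == 3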