TorchServe fails to load model in Docker while it runs locally

I have a TorchScript model (.pt) that I can successfully load and serve with TorchServe on my local machine. However, when I try to deploy it in the official TorchServe Docker image, it complains about the model and does not load it.

My local environment libraries are:

torchserve version: 0.5.2
torch-model-archiver version: 0.5.2
torch version: 1.10
java version: 17
Operating System and version: MacOS 11.4

With Docker I'm using pytorch/torchserve:latest-cpu, which I expected to have all versioning sorted out (I don't install specific versions).

I would like to know whether this is a bug in the latest images or a mistake on my side (and how to fix it). More details on the environment and how to reproduce the problem are below.

I have created a reproducible example in my repository, https://github.com/jiwidi/torchservebug. Clone it to reproduce it like this:

Run locally

From the root folder run

$ sh test.sh

This runs successfully.

Run with docker

From the root folder run

$ docker build . -t debug:v1

$ docker run debug:v1

This doesn't run: TorchServe can't load the model and outputs Java errors as well as some torch errors.

The full failure log from Docker can be found in this GitHub issue: https://github.com/pytorch/serve/issues/1402

Solution

The first thing to know is that Docker tags are just tags. An image being tagged "latest" does not guarantee that it is the most recent build. In fact, on Docker Hub you will see that 0.5.2-cpu is newer than latest-cpu.

Using pytorch/torchserve:0.5.2-cpu at least gets rid of the Java errors. Beyond that, be aware that the EXPOSE instruction in a Dockerfile is confusing: it does nothing except serve as documentation. If you need those ports to be accessible when you run the container, you have to publish them with the -p flag:

$ docker run -p 8080:8080 -p 8081:8081 debug:v1
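To confirm that the published ports are actually reachable from the host, you can poll TorchServe's health endpoint. The following is a minimal sketch, not part of the original repository: the function name wait_for_torchserve and its default values are made up for illustration, and it assumes curl is installed and that the inference API listens on port 8080 with a /ping route, as in a default TorchServe setup.

```shell
#!/bin/sh
# Poll TorchServe's health endpoint until it answers or we give up.
# Hypothetical helper; both arguments are optional:
#   $1  URL to poll           (assumed default: http://localhost:8080/ping)
#   $2  number of attempts    (assumed default: 30, one second apart)
wait_for_torchserve() {
    url="${1:-http://localhost:8080/ping}"
    attempts="${2:-30}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f: fail on HTTP error codes, -sS: no progress bar but show errors,
        # --max-time: do not hang if the port is published but nothing answers
        if curl -fsS --max-time 2 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "TorchServe did not answer at $url" >&2
    return 1
}
```

With the container started using the -p flags above, calling wait_for_torchserve without arguments should print the health response once the server is up. If the ports were not published, it returns a non-zero status after the last attempt.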

Hopefully that helps you on your way.