I am trying to create a Docker image for our training machine. Installing Horovod for Python fails. The problem seems to be that the code is not compiled as C++17.
My Dockerfile so far:
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
ENV DEBIAN_FRONTEND noninteractive
RUN echo "root:123123" | chpasswd
RUN groupadd -g 1000 user1 && useradd user1 -u 1000 -g 1000 -m -s /bin/bash
RUN apt-get -y update
RUN apt-get -y upgrade
RUN apt-get install -y python3-pip python3-setuptools python3-opencv sudo vim wget cmake
ENV PYTHON_BIN "python3.10"
ENV PYTHON_SITE_PACKAGES="/usr/local/lib/${PYTHON_BIN}/dist-packages"
RUN /usr/bin/update-alternatives --install /usr/bin/python python /usr/bin/${PYTHON_BIN} 10 && \
/usr/bin/update-alternatives --install /usr/bin/python3 python3 /usr/bin/${PYTHON_BIN} 10
RUN ${PYTHON_BIN} -m pip install torch torchvision pyyaml pandas scikit-image openexr
I build the image with
docker build --rm --no-cache --tag my_image --file "./Dockerfile" .
and start it with
docker run --gpus all -it --rm --user $(id -u):$(id -g) --entrypoint /bin/bash my_image
After entering the container and switching to root, I try to install Horovod with
MMVC_CUDA_ARGS="-std=c++17" HOROVOD_BUILD_ARCH_FLAGS="-std=c++17" HOROVOD_WITHOUT_MXNET=1 HOROVOD_WITHOUT_TENSORFLOW=1 pip install horovod[pytorch]
According to the documentation (https://horovod.readthedocs.io/en/stable/install.html), a C++17 compiler is required, and one should be installed:
# `which g++` --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
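A quick way to confirm that the installed compiler itself can handle C++17 is to compile a tiny program with -std=c++17. This is only a sketch: the function name check_cxx17 is made up for illustration, and it assumes the compiler accepts GCC-style options (-std=, -x c++, -fsyntax-only).

```shell
# Hypothetical helper (not part of Horovod or pip): returns 0 if the given
# compiler accepts -std=c++17 and reports a C++17 (or newer) __cplusplus value.
check_cxx17() {
    cxx="${1:-g++}"   # compiler to test; defaults to g++
    printf '%s\n' \
        '#if __cplusplus < 201703L' \
        '#error "compiler is not in C++17 mode"' \
        '#endif' \
        'int main() { return 0; }' \
        | "$cxx" -std=c++17 -x c++ -fsyntax-only -
}

# Usage (inside the container):
#   check_cxx17 /usr/bin/c++ && echo "C++17 is supported"
```

If this check succeeds, the compiler is capable of C++17 and the problem lies in the flags that the build passes to it.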
An excerpt from the error output suggests that the code is not compiled as C++17 (the compile command contains -std=c++14):
[ 97%] Building CXX object horovod/torch/CMakeFiles/pytorch.dir/cuda_util.cc.o
cd /tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/build/temp.linux-x86_64-3.10/RelWithDebInfo/horovod/torch && /usr/bin/c++ -DEIGEN_MPL2_ONLY=1 -DHAVE_CUDA=1 -DHAVE_GLOO=1 -DHAVE_GPU=1 -DHAVE_NVTX=1 -DPYTORCH_VERSION=2003000000 -DTORCH_API_INCLUDE_EXTENSION_H=1 -Dpytorch_EXPORTS -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/HTTPRequest/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/boost/assert/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/boost/config/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/boost/core/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/boost/detail/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/boost/iterator/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/boost/lockfree/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/boost/mpl/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/boost/parameter/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/boost/predef/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/boost/preprocessor/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/boost/static_assert/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/boost/type_traits/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/boost/utility/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/lbfgs/include -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/gloo 
-I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/eigen -I/tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/third_party/flatbuffers/include -isystem /usr/local/cuda/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -pthread -fPIC -Wall -ftree-vectorize -mf16c -mavx -mfma -O3 -g -DNDEBUG -fPIC -std=c++14 -MD -MT horovod/torch/CMakeFiles/pytorch.dir/cuda_util.cc.o -MF CMakeFiles/pytorch.dir/cuda_util.cc.o.d -o CMakeFiles/pytorch.dir/cuda_util.cc.o -c /tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/horovod/torch/cuda_util.cc
In file included from /tmp/pip-install-qqukqcka/horovod_0b6322654f564ffc82c68379b1882f61/horovod/torch/cuda_util.cc:22:
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/ATen.h:4:2: error: #error C++17 or later compatible compiler is required to use ATen.
4 | #error C++17 or later compatible compiler is required to use ATen.
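To see which language standard the build actually passes to the compiler, the -std= flags can be pulled out of a saved build log. A minimal sketch, assuming the verbose pip output has been saved to a file; the function name std_flags and the file name horovod_build.log are made up for illustration.

```shell
# Hypothetical helper: list every distinct -std=... flag in a build log,
# each prefixed with the number of times it occurs.
# Reads the file given as $1, or stdin when no file is given.
std_flags() {
    grep -o -- '-std=[^[:space:]]*' "${1:--}" | sort | uniq -c
}

# Usage (inside the container):
#   pip install -v 'horovod[pytorch]' 2>&1 | tee horovod_build.log
#   std_flags horovod_build.log
```

Applied to the compile command above, this would list -std=c++14, which is consistent with the ATen error about C++17.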
So I guess my attempts to enable C++17 were wrong. How can I tell pip to compile with C++17 during installation?
After a long odyssey I finally found the crucial hint in the Horovod issues section on GitHub. For some reason I did not find it at the beginning of my search.
HOROVOD_WITHOUT_MXNET=1 HOROVOD_WITHOUT_TENSORFLOW=1 pip install git+https://github.com/thomas-bouvier/horovod.git@compile-cpp17[pytorch]
They fixed the issue on another branch.