Installing Apex. fatal error: cuda_profiler_api.h: No such file or directory


I am trying to install Apex with the following steps:

git clone https://github.com/NVIDIA/apex
cd apex
pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
cd ..

When I start the installation, I get the following error:

  Compiling cuda extensions with
  nvcc: NVIDIA (R) Cuda compiler driver
  Copyright (c) 2005-2022 NVIDIA Corporation
  Built on Wed_Jun__8_16:49:14_PDT_2022
  Cuda compilation tools, release 11.7, V11.7.99
  Build cuda_11.7.r11.7/compiler.31442593_0
  from /3tb/share/anaconda3/envs/ak_env/bin

  running install
  /3tb/share/anaconda3/envs/ak_env/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
    warnings.warn(
  running build
  running build_py
  running build_ext
  /3tb/share/anaconda3/envs/ak_env/lib/python3.10/site-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
    warnings.warn(msg.format('we could not find ninja.'))
  building 'scaled_upper_triang_masked_softmax_cuda' extension
  gcc -pthread -B /3tb/share/anaconda3/envs/ak_env/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -fPIC -O2 -isystem /3tb/share/anaconda3/envs/ak_env/include -fPIC -O2 -isystem /3tb/share/anaconda3/envs/ak_env/include -fPIC -I ~/seq2seq/apex/csrc -I/3tb/share/anaconda3/envs/ak_env/lib/python3.10/site-packages/torch/include -I/3tb/share/anaconda3/envs/ak_env/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/3tb/share/anaconda3/envs/ak_env/lib/python3.10/site-packages/torch/include/TH -I/3tb/share/anaconda3/envs/ak_env/lib/python3.10/site-packages/torch/include/THC -I/3tb/share/anaconda3/envs/ak_env/include -I/3tb/share/anaconda3/envs/ak_env/include/python3.10 -c csrc/megatron/scaled_upper_triang_masked_softmax.cpp -o build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_upper_triang_masked_softmax.o -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
  /3tb/share/anaconda3/envs/ak_env/bin/nvcc -I ~/seq2seq/apex/csrc -I/3tb/share/anaconda3/envs/ak_env/lib/python3.10/site-packages/torch/include -I/3tb/share/anaconda3/envs/ak_env/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/3tb/share/anaconda3/envs/ak_env/lib/python3.10/site-packages/torch/include/TH -I/3tb/share/anaconda3/envs/ak_env/lib/python3.10/site-packages/torch/include/THC -I/3tb/share/anaconda3/envs/ak_env/include -I/3tb/share/anaconda3/envs/ak_env/include/python3.10 -c csrc/megatron/scaled_upper_triang_masked_softmax_cuda.cu -o build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_upper_triang_masked_softmax_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 -std=c++14
  csrc/megatron/scaled_upper_triang_masked_softmax_cuda.cu:21:10: fatal error: cuda_profiler_api.h: No such file or directory
     21 | #include <cuda_profiler_api.h>
        |          ^~~~~~~~~~~~~~~~~~~~~
  compilation terminated.
  csrc/megatron/scaled_upper_triang_masked_softmax_cuda.cu:21:10: fatal error: cuda_profiler_api.h: No such file or directory
     21 | #include <cuda_profiler_api.h>
        |          ^~~~~~~~~~~~~~~~~~~~~
  compilation terminated.
  error: command '/3tb/share/anaconda3/envs/ak_env/bin/nvcc' failed with exit code 255
  error: subprocess-exited-with-error
  
  × Running setup.py install for apex did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /3tb/share/anaconda3/envs/ak_env/bin/python -u -c '

Here is the output of nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

Some solutions I found suggest doing the following:

export PATH="/usr/local/cuda-11.7/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH"

However, /usr/local/cuda-11.7 does not exist on my system.

How can I solve this issue? For reference, here is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   46C    P0    37W / 180W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Solution

  • I was able to solve this issue by manually installing the CUDA 11.7 toolkit, even though I already had CUDA 11.7 installed through conda:

    https://developer.nvidia.com/cuda-11-7-0-download-archive?target_os=Linux

    After installing it, I followed these instructions:

    Please make sure that

    • PATH includes /usr/local/cuda-11.7/bin
    • LD_LIBRARY_PATH includes /usr/local/cuda-11.7/lib64, or, add /usr/local/cuda-11.7/lib64 to /etc/ld.so.conf and run ldconfig as root

    I did this by running the following commands before compiling apex:

    export PATH="/usr/local/cuda-11.7/bin:$PATH"
    export LD_LIBRARY_PATH="/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH"
    
    • Note: I had to compile as root because of issues I ran into when installing the toolkit; you may not need to do this. Afterwards I changed the ownership to the regular user. Compiling as root is not recommended.
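
Before re-running the apex build, it may help to confirm that the header nvcc complained about is actually present in the toolkit you are pointing at. The following is only a minimal sketch and not part of the original answer: the function name find_profiler_header and the CUDA_ROOT variable are made up for illustration, and the default path assumes the toolkit was installed to /usr/local/cuda-11.7 as described above. It also looks inside the active conda environment (via CONDA_PREFIX, if set), which is where the failing nvcc came from.

```shell
#!/bin/sh
# Sketch: locate cuda_profiler_api.h under a CUDA toolkit root.
# find_profiler_header and CUDA_ROOT are illustrative names, not part of apex or CUDA.

find_profiler_header() {
    # $1: directory to search (e.g. /usr/local/cuda-11.7 or "$CONDA_PREFIX").
    # Prints the first match and returns 0, or returns non-zero if nothing is found.
    root="$1"
    hit="$(find "$root" -name cuda_profiler_api.h -print -quit 2>/dev/null)"
    [ -n "$hit" ] && printf '%s\n' "$hit"
}

# Assumed default: the install location used in the answer above.
CUDA_ROOT="${CUDA_ROOT:-/usr/local/cuda-11.7}"

# Check the manually installed toolkit and, if one is active, the conda env.
for dir in "$CUDA_ROOT" "${CONDA_PREFIX:-}"; do
    [ -n "$dir" ] || continue
    if path="$(find_profiler_header "$dir")"; then
        echo "found:   $path"
    else
        echo "missing: no cuda_profiler_api.h under $dir"
    fi
done

# Show which nvcc the build would pick up; after the PATH export above
# it should resolve to a binary under $CUDA_ROOT/bin.
command -v nvcc || echo "nvcc is not on PATH"
```

If the header shows up under the manually installed toolkit but `command -v nvcc` still prints the conda path, the PATH export has not taken effect in the shell that runs pip.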