pytorch azure-databricks huggingface-transformers

Can't install Flash Attention in Azure Databricks GPU (for Hugging Face model)

I can successfully run the following code on a CPU cluster in Databricks.

import torch
import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

On the CPU Databricks cluster, I first installed PyTorch 2.0.1, Transformers 4.28.1, and einops 0.6.1.

However, the same Python code fails on a GPU cluster with the following error:

ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run pip install flash_attn
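Transformers raises this error when it finds an import in the remote modeling code that it cannot resolve in the current environment. To see which packages are missing on a given cluster before calling from_pretrained, a quick pre-flight check can help. This is a minimal sketch using importlib.util.find_spec; it is not necessarily how Transformers itself performs the check, and the package list is an assumption based on this question.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be located on the current sys.path.

    find_spec only locates a top-level module without importing it, so a package
    whose compiled extension is broken can still be reported as present.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Packages assumed to be needed by the MPT-7B remote code in this question.
    required = ["torch", "transformers", "einops", "flash_attn"]
    missing = missing_packages(required)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required packages can be located.")
```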

I then tried to install the required package (pip install flash-attn) on the Databricks GPU cluster, based on the instructions here.

However, I have been unable to install Flash Attention on the GPU cluster.

On the GPU cluster, I tried the following:

1. I ran pip install flash-attn, which failed with the following error:

Collecting flash-attn
  Downloading flash_attn-1.0.4.tar.gz (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 28.1 MB/s eta 0:00:0000:010:01
  Preparing metadata (setup.py) ... done
Requirement already satisfied: torch in /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages (from flash-attn) (1.13.1)
Requirement already satisfied: einops in /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages (from flash-attn) (0.6.1)
Requirement already satisfied: packaging in /databricks/python3/lib/python3.10/site-packages (from flash-attn) (21.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /databricks/python3/lib/python3.10/site-packages (from packaging->flash-attn) (3.0.9)
Requirement already satisfied: nvidia-cublas-cu11==11.10.3.66 in /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages (from torch->flash-attn) (11.10.3.66)
Requirement already satisfied: nvidia-cudnn-cu11==8.5.0.96 in /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages (from torch->flash-attn) (8.5.0.96)
Requirement already satisfied: typing-extensions in /databricks/python3/lib/python3.10/site-packages (from torch->flash-attn) (4.3.0)
Requirement already satisfied: nvidia-cuda-nvrtc-cu11==11.7.99 in /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages (from torch->flash-attn) (11.7.99)
Requirement already satisfied: nvidia-cuda-runtime-cu11==11.7.99 in /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages (from torch->flash-attn) (11.7.99)
Requirement already satisfied: setuptools in /databricks/python3/lib/python3.10/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch->flash-attn) (63.4.1)
Requirement already satisfied: wheel in /databricks/python3/lib/python3.10/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch->flash-attn) (0.37.1)
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [317 lines of output]
      
      
      torch.__version__  = 1.13.1+cu117
      
      
      fatal: not a git repository (or any of the parent directories): .git
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-310
      creating build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/flash_blocksparse_attn_interface.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/rotary.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/__init__.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/flash_attn_triton_tmp_og.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/flash_blocksparse_attention.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/flash_attn_triton.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/bert_padding.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/flash_attn_triton_og.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/flash_attention.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/fused_softmax.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/flash_attn_triton_varlen.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/flash_attn_triton_tmp.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/attention_kernl.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      copying flash_attn/flash_attn_triton_single_query.py -> build/lib.linux-x86_64-cpython-310/flash_attn
      creating build/lib.linux-x86_64-cpython-310/flash_attn/modules
      copying flash_attn/modules/embedding.py -> build/lib.linux-x86_64-cpython-310/flash_attn/modules
      copying flash_attn/modules/__init__.py -> build/lib.linux-x86_64-cpython-310/flash_attn/modules
      copying flash_attn/modules/mlp.py -> build/lib.linux-x86_64-cpython-310/flash_attn/modules
      copying flash_attn/modules/block.py -> build/lib.linux-x86_64-cpython-310/flash_attn/modules
      copying flash_attn/modules/mha.py -> build/lib.linux-x86_64-cpython-310/flash_attn/modules
      creating build/lib.linux-x86_64-cpython-310/flash_attn/utils
      copying flash_attn/utils/__init__.py -> build/lib.linux-x86_64-cpython-310/flash_attn/utils
      copying flash_attn/utils/generation.py -> build/lib.linux-x86_64-cpython-310/flash_attn/utils
      copying flash_attn/utils/benchmark.py -> build/lib.linux-x86_64-cpython-310/flash_attn/utils
      copying flash_attn/utils/distributed.py -> build/lib.linux-x86_64-cpython-310/flash_attn/utils
      copying flash_attn/utils/pretrained.py -> build/lib.linux-x86_64-cpython-310/flash_attn/utils
      creating build/lib.linux-x86_64-cpython-310/flash_attn/layers
      copying flash_attn/layers/rotary.py -> build/lib.linux-x86_64-cpython-310/flash_attn/layers
      copying flash_attn/layers/__init__.py -> build/lib.linux-x86_64-cpython-310/flash_attn/layers
      copying flash_attn/layers/patch_embed.py -> build/lib.linux-x86_64-cpython-310/flash_attn/layers
      creating build/lib.linux-x86_64-cpython-310/flash_attn/triton
      copying flash_attn/triton/__init__.py -> build/lib.linux-x86_64-cpython-310/flash_attn/triton
      copying flash_attn/triton/fused_attention.py -> build/lib.linux-x86_64-cpython-310/flash_attn/triton
      creating build/lib.linux-x86_64-cpython-310/flash_attn/losses
      copying flash_attn/losses/cross_entropy_apex.py -> build/lib.linux-x86_64-cpython-310/flash_attn/losses
      copying flash_attn/losses/__init__.py -> build/lib.linux-x86_64-cpython-310/flash_attn/losses
      copying flash_attn/losses/cross_entropy_parallel.py -> build/lib.linux-x86_64-cpython-310/flash_attn/losses
      copying flash_attn/losses/cross_entropy.py -> build/lib.linux-x86_64-cpython-310/flash_attn/losses
      creating build/lib.linux-x86_64-cpython-310/flash_attn/ops
      copying flash_attn/ops/__init__.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops
      copying flash_attn/ops/layer_norm.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops
      copying flash_attn/ops/gelu_activation.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops
      copying flash_attn/ops/rms_norm.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops
      copying flash_attn/ops/fused_dense.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops
      copying flash_attn/ops/activations.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops
      creating build/lib.linux-x86_64-cpython-310/flash_attn/models
      copying flash_attn/models/gptj.py -> build/lib.linux-x86_64-cpython-310/flash_attn/models
      copying flash_attn/models/vit.py -> build/lib.linux-x86_64-cpython-310/flash_attn/models
      copying flash_attn/models/__init__.py -> build/lib.linux-x86_64-cpython-310/flash_attn/models
      copying flash_attn/models/llama.py -> build/lib.linux-x86_64-cpython-310/flash_attn/models
      copying flash_attn/models/gpt.py -> build/lib.linux-x86_64-cpython-310/flash_attn/models
      copying flash_attn/models/bert.py -> build/lib.linux-x86_64-cpython-310/flash_attn/models
      copying flash_attn/models/gpt_j.py -> build/lib.linux-x86_64-cpython-310/flash_attn/models
      copying flash_attn/models/opt.py -> build/lib.linux-x86_64-cpython-310/flash_attn/models
      copying flash_attn/models/gpt_neox.py -> build/lib.linux-x86_64-cpython-310/flash_attn/models
      running build_ext
      building 'flash_attn_cuda' extension
      creating /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/build/temp.linux-x86_64-cpython-310
      creating /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/build/temp.linux-x86_64-cpython-310/csrc
      creating /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/build/temp.linux-x86_64-cpython-310/csrc/flash_attn
      creating /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/build/temp.linux-x86_64-cpython-310/csrc/flash_attn/src
      Emitting ninja build file /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/build/temp.linux-x86_64-cpython-310/build.ninja...
      Compiling objects...
      Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
      [1/9] /usr/local/cuda/bin/nvcc  -I/tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn -I/tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn/src -I/tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn/cutlass/include -I/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/include -I/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/include/TH -I/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-69bd3443-3436-4892-a827-1f1b494c1c35/include -I/usr/include/python3.10 -c -c /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn/src/fmha_bwd_hdim32.cu -o /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/build/temp.linux-x86_64-cpython-310/csrc/flash_attn/src/fmha_bwd_hdim32.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --ptxas-options=-v -lineinfo -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      FAILED: /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/build/temp.linux-x86_64-cpython-310/csrc/flash_attn/src/fmha_bwd_hdim32.o
      /usr/local/cuda/bin/nvcc  -I/tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn -I/tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn/src -I/tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn/cutlass/include -I/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/include -I/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/include/TH -I/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-69bd3443-3436-4892-a827-1f1b494c1c35/include -I/usr/include/python3.10 -c -c /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn/src/fmha_bwd_hdim32.cu -o /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/build/temp.linux-x86_64-cpython-310/csrc/flash_attn/src/fmha_bwd_hdim32.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --ptxas-options=-v -lineinfo -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      In file included from /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn/src/fmha.h:39,
                       from /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn/src/fmha_bwd_launch_template.h:6,
                       from /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn/src/fmha_bwd_hdim32.cu:5:
      /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:6:10: fatal error: cusparse.h: No such file or directory
          6 | #include <cusparse.h>
            |          ^~~~~~~~~~~~
      compilation terminated.
      In file included from /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn/src/fmha.h:39,
                       from /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn/src/fmha_bwd_launch_template.h:6,
                       from /tmp/pip-install-nluf5697/flash-attn_79c1dfb03cc4482ba86c435b2db7a8b6/csrc/flash_attn/src/fmha_bwd_hdim32.cu:5:
      /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:6:10: fatal error: cusparse.h: No such file or directory
          6 | #include <cusparse.h>
            |          ^~~~~~~~~~~~
      compilation terminated.


[... middle of error message removed ...]

          cmd_obj.run()
        File "/databricks/python/lib/python3.10/site-packages/setuptools/command/build.py", line 24, in run
          super().run()
        File "/databricks/python/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 132, in run
          self.run_command(cmd_name)
        File "/databricks/python/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
          self.distribution.run_command(command)
        File "/databricks/python/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
          super().run_command(command)
        File "/databricks/python/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 992, in run_command
          cmd_obj.run()
        File "/databricks/python/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 79, in run
          _build_ext.run(self)
        File "/databricks/python/lib/python3.10/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
          _build_ext.build_ext.run(self)
        File "/databricks/python/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
          self.build_extensions()
        File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
          build_ext.build_extensions(self)
        File "/databricks/python/lib/python3.10/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
          _build_ext.build_ext.build_extensions(self)
        File "/databricks/python/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 466, in build_extensions
          self._build_extensions_serial()
        File "/databricks/python/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 492, in _build_extensions_serial
          self.build_extension(ext)
        File "/databricks/python/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
          _build_ext.build_extension(self, ext)
        File "/databricks/python/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 547, in build_extension
          objects = self.compiler.compile(
        File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
          _write_ninja_file_and_compile_objects(
        File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1573, in _write_ninja_file_and_compile_objects
          _run_ninja_build(
        File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
          raise RuntimeError(message) from e
      RuntimeError: Error compiling objects for extension
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> flash-attn

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

2. I upgraded torch on the GPU cluster to 2.0.1 (to match the working CPU setup) and tried to reinstall flash-attn. This still didn't work.

My suspicion is that the code works on the CPU cluster because that cluster is not prepackaged with an earlier version of PyTorch: I installed PyTorch 2.0.1 there myself, and it works fine. The GPU cluster, however, comes with an earlier PyTorch preinstalled (1.13.1+cu117, according to the build log), and Flash Attention will not install on it.
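One way to test this suspicion is to compare what each cluster actually has installed against the versions that worked on the CPU cluster. A minimal sketch using importlib.metadata; the KNOWN_GOOD mapping is taken from the versions listed above, and ignoring the local build tag (such as +cu117) when comparing is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions that worked on the CPU cluster, as reported in the question.
KNOWN_GOOD = {"torch": "2.0.1", "transformers": "4.28.1", "einops": "0.6.1"}


def version_mismatches(expected):
    """Map each package whose installed version differs from `expected` to an
    (installed, expected) tuple; `installed` is None if the package is absent."""
    mismatches = {}
    for name, want in expected.items():
        try:
            have = version(name)
        except PackageNotFoundError:
            have = None
        # Drop a local build tag such as "+cu117" before comparing.
        if have is None or have.split("+")[0] != want:
            mismatches[name] = (have, want)
    return mismatches


if __name__ == "__main__":
    for name, (have, want) in version_mismatches(KNOWN_GOOD).items():
        print(f"{name}: installed {have}, expected {want}")
```

Running it on both clusters would show whether the difference is only the preinstalled torch or something else as well.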

Solution

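This is likely caused by missing CUDA development headers. The build fails because nvcc cannot find cusparse.h, which PyTorch's ATen/cuda/CUDAContext.h includes, so the cluster image appears to lack the corresponding CUDA -dev packages.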
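To confirm this on a given cluster before installing anything, one could check whether the headers are present. A minimal sketch, assuming the CUDA toolkit lives at $CUDA_HOME or /usr/local/cuda (the path nvcc is invoked from in the log above); cusparse.h is the header the log reports, while the other three names are assumptions matched to the four -dev packages installed below.

```python
import os

# cusparse.h is the header reported missing in the build log; the others are
# assumed to correspond to the cublas, cusolver and curand -dev packages.
HEADERS = ["cusparse.h", "cublas_v2.h", "cusolverDn.h", "curand.h"]


def missing_cuda_headers(cuda_home=None, headers=HEADERS):
    """Return the entries of `headers` that are not present in <cuda_home>/include."""
    # Fall back to $CUDA_HOME, then to the path used by nvcc in the failed build.
    cuda_home = cuda_home or os.environ.get("CUDA_HOME", "/usr/local/cuda")
    include_dir = os.path.join(cuda_home, "include")
    return [h for h in headers if not os.path.isfile(os.path.join(include_dir, h))]


if __name__ == "__main__":
    missing = missing_cuda_headers()
    print("Missing CUDA headers:", missing or "none")
```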

Try restarting the cluster, running the following in a notebook cell, and reinstalling flash-attn; then give it another shot:

!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcusparse-dev-11-7_11.7.3.50-1_amd64.deb -O /tmp/libcusparse-dev-11-7_11.7.3.50-1_amd64.deb && \
  wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcublas-dev-11-7_11.10.1.25-1_amd64.deb -O /tmp/libcublas-dev-11-7_11.10.1.25-1_amd64.deb && \
  wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcusolver-dev-11-7_11.4.0.1-1_amd64.deb -O /tmp/libcusolver-dev-11-7_11.4.0.1-1_amd64.deb && \
  wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcurand-dev-11-7_10.2.10.91-1_amd64.deb -O /tmp/libcurand-dev-11-7_10.2.10.91-1_amd64.deb && \
  dpkg -i /tmp/libcusparse-dev-11-7_11.7.3.50-1_amd64.deb && \
  dpkg -i /tmp/libcublas-dev-11-7_11.10.1.25-1_amd64.deb && \
  dpkg -i /tmp/libcusolver-dev-11-7_11.4.0.1-1_amd64.deb && \
  dpkg -i /tmp/libcurand-dev-11-7_10.2.10.91-1_amd64.deb
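If you would rather keep these steps in Python, the same download-and-install sequence can be expressed with subprocess. This is a sketch under assumptions: like the shell version it must run as root on the driver, build_commands and main are hypothetical helper names, and the MAX_JOBS value is an arbitrary choice (the build log notes that ninja's worker count can be overridden with MAX_JOBS=N).

```python
import os
import subprocess
import sys

# Same NVIDIA repository and .deb files as the shell commands above.
BASE_URL = "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64"
DEBS = [
    "libcusparse-dev-11-7_11.7.3.50-1_amd64.deb",
    "libcublas-dev-11-7_11.10.1.25-1_amd64.deb",
    "libcusolver-dev-11-7_11.4.0.1-1_amd64.deb",
    "libcurand-dev-11-7_10.2.10.91-1_amd64.deb",
]


def build_commands(debs=DEBS, download_dir="/tmp"):
    """Return the wget commands followed by the dpkg commands, as argv lists."""
    paths = [os.path.join(download_dir, deb) for deb in debs]
    downloads = [["wget", f"{BASE_URL}/{deb}", "-O", path] for deb, path in zip(debs, paths)]
    installs = [["dpkg", "-i", path] for path in paths]
    return downloads + installs


def main(max_jobs="4"):
    # check=True stops at the first failing command, like the && chain above.
    for cmd in build_commands():
        subprocess.run(cmd, check=True)
    # Cap parallel compile jobs for the flash-attn CUDA build (assumed value).
    env = dict(os.environ, MAX_JOBS=max_jobs)
    subprocess.run([sys.executable, "-m", "pip", "install", "flash-attn"], check=True, env=env)
```

In a notebook cell you would call main() after defining these functions; nothing runs on import, so the helpers can also be inspected without side effects.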

Update: I've added both the init script and general instructions to this repo: https://github.com/rafaelvp-db/databricks-llm-prompt-engineering