python-3.x, cuml

cuBLAS error encountered when trying GPU-supported BERTopic


The following error occurs when I try to run GPU-accelerated BERTopic:

Traceback (most recent call last):
  File "/mnt/JianFeng/2_Language_Models/2_3_0_BERTopic_Debug.py", line 19, in <module>
    topics, probs = topic_model.fit_transform(all_file_doc)
  File "/home/ai_user/.local/lib/python3.9/site-packages/bertopic/_bertopic.py", line 408, in fit_transform
    umap_embeddings = self._reduce_dimensionality(embeddings, y)
  File "/home/ai_user/.local/lib/python3.9/site-packages/bertopic/_bertopic.py", line 3355, in _reduce_dimensionality
    self.umap_model.fit(embeddings, y=y)
  File "/home/ai_user/.local/lib/python3.9/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
  File "/home/ai_user/.local/lib/python3.9/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
  File "/home/ai_user/.local/lib/python3.9/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
  File "base.pyx", line 687, in cuml.internals.base.UniversalBase.dispatch_func
  File "umap.pyx", line 603, in cuml.manifold.umap.UMAP.fit
RuntimeError: cuBLAS error encountered at: file=/__w/cuml/cuml/python/_skbuild/linux-x86_64-3.9/cmake-build/_deps/raft-src/cpp/include/raft/core/resource/cublas_handle.hpp line=75: call='cublasSetStream(ret, get_cuda_stream(res))', Reason=1:CUBLAS_STATUS_NOT_INITIALIZED
Obtained 41 stack frames...

Here is my code:

import numpy as np
import pandas as pd
import os
import sys
import pickle as pkl
from bertopic import BERTopic
import cuml
os.environ['CUDA_VISIBLE_DEVICES'] = "1"

# GPU support
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# CPU (very slow if sample size is large)
#from umap import UMAP
#from hdbscan import HDBSCAN

all_file_doc = ['hi, good day!', 'how are you', 'abc def ghi', 'xyz 1234 qqq', 'nasdaq amex', 'aapl msft']
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, nr_topics=3, calculate_probabilities=True)
topics, probs = topic_model.fit_transform(all_file_doc)

If I use the CPU versions instead, the error disappears, so I suspect a missing prerequisite or a configuration issue.

The code runs on Ubuntu 20.04.3 with "NVIDIA-SMI 470.223.02 Driver Version: 470.223.02 CUDA Version: 11.4" and 4× RTX 3090 GPUs (24268 MiB of memory each). My Python version is 3.9.5. I installed cuML with the following command (copied from https://docs.rapids.ai/install#prerequisites):

pip install \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu11==23.12.* dask-cudf-cu11==23.12.* cuml-cu11==23.12.* \
    cugraph-cu11==23.12.* cuspatial-cu11==23.12.* cuproj-cu11==23.12.* \
    cuxfilter-cu11==23.12.* cucim-cu11==23.12.* pylibraft-cu11==23.12.* \
    raft-dask-cu11==23.12.*
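
After installing, a quick sanity check (my own addition, not from the original post) is to confirm that the cu11 wheels import cleanly and report the expected versions:

# Sanity check: the RAPIDS cu11 wheels should import and report 23.12.*
import cudf
import cuml
import rmm

print("cudf:", cudf.__version__)
print("cuml:", cuml.__version__)
print("rmm:", rmm.__version__)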

Thank you so much!

I googled the cuBLAS error; most results relate to out-of-memory conditions or mismatched size parameters, but given that the same code runs fine on the CPU, the error must have some other cause.

I also checked pip list and found that bertopic (0.16) automatically installed many "...-cu12" packages, for example nvidia-cublas-cu12. I suspect (but am not sure) that this is the reason. Downgrading bertopic did not help.
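
For reference, this standard-library snippet (a sketch of my own, not from the original post) lists the CUDA-suffixed wheels in an environment; seeing both -cu11 and -cu12 packages side by side would support that theory:

from importlib.metadata import distributions

# List installed packages whose names mention cu11 or cu12; mixing the two
# CUDA major versions in one environment is a common source of cuBLAS
# initialization failures.
for dist in sorted(distributions(), key=lambda d: (d.metadata["Name"] or "").lower()):
    name = dist.metadata["Name"] or ""
    if "cu11" in name or "cu12" in name:
        print(f"{name}=={dist.version}")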


Solution

  • It could be a similar error to this issue.

    When running the code you posted, I noticed that BERTopic uses torch. The error you're getting can happen when PyTorch reserves a memory pool over part of your GPU, cuML uses another memory pool over the rest of your GPU memory, and cuBLAS then tries to initialize memory outside of these pools, resulting in a cuBLAS initialization error.

    A solution is to add the following snippet between your imports and the rest of your code, so that everything shares a single memory pool:

    import torch
    from rmm.allocators.torch import rmm_torch_allocator

    # Route all of PyTorch's GPU allocations through RMM so that PyTorch,
    # cuML, and cuBLAS draw from a single memory pool.
    torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
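
    Applied to your script, the allocator swap goes right after setting CUDA_VISIBLE_DEVICES and before anything allocates GPU memory (a sketch of the ordering only; the rest of the script is unchanged):

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = "1"  # must be set before CUDA is initialized

    import torch
    from rmm.allocators.torch import rmm_torch_allocator

    # Single shared memory pool for PyTorch, cuML, and cuBLAS
    torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

    from bertopic import BERTopic
    from cuml.cluster import HDBSCAN
    from cuml.manifold import UMAP
    # ... rest of the original script unchanged ...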