I am training a Mamba model on two different GPU architectures: RTX 4090 and RTX A6000. Despite setting all random seeds and using deterministic algorithms, I am observing significant non-deterministic behavior in the training process on the RTX 4090, while the RTX A6000 exhibits almost deterministic results. This issue is due to the use of atomic adds in the backward pass of the Mamba model, as discussed in this GitHub issue.
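For context, the order sensitivity of floating-point atomic adds is easy to reproduce in isolation. The snippet below is only an illustration using a scatter-style reduction, not the actual Mamba Triton kernel:
import torch

# Reduce one million floats into a single slot; on CUDA this uses atomic adds.
x = torch.randn(1_000_000, device="cuda")
idx = torch.zeros(1_000_000, dtype=torch.long, device="cuda")

out1 = torch.zeros(1, device="cuda").index_add_(0, idx, x)
out2 = torch.zeros(1, device="cuda").index_add_(0, idx, x)

# The accumulation order is not fixed, so the two results typically differ
# in the last few bits even though the inputs are identical.
print((out1 - out2).abs().item())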
I would like to maintain the current hyperparameters, as they provide the best performance. Therefore, I am looking for any tricks or settings that can make the RTX 4090 behave more like the RTX A6000 environment to achieve consistent results. Here are the details of my setup:
RTX 4090
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:26:00.0 Off | Off |
| 30% 22C P8 12W / 450W | 17MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 426218 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
RTX A6000
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:24:00.0 Off | Off |
| 30% 22C P8 25W / 300W | 12MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 7388 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
So far I've tried setting:
import os
import torch

# Required for deterministic cuBLAS behaviour (CUDA >= 10.2)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
# Disable TF32 so matmuls run in full FP32 on both GPUs
torch.backends.cuda.matmul.allow_tf32 = False
torch.set_default_dtype(torch.float32)
# Force deterministic cuDNN kernels and disable autotuning
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
I've also tried using torch.use_deterministic_algorithms(True), but it raises the error:
File "/path/mamba_ssm/ops/triton/ssd_chunk_state.py", line 845, in _chunk_state_bwd_db
torch.cumsum(ddA_cumsum, dim=-1, out=ddA_cumsum)
RuntimeError: cumsum_cuda_kernel does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.
Any help is appreciated.
You can, and should, try to replicate your experimental setup exactly on both GPUs by doing e.g.:
import torch
import numpy as np
import random

seed = 42
# Seed every RNG that can influence training
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# Force deterministic cuDNN kernels and disable autotuning
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
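As for the RuntimeError raised by torch.use_deterministic_algorithms(True): the error message itself offers the escape hatch. With warn_only=True, PyTorch still selects deterministic implementations wherever they exist and only warns for ops like cumsum_cuda_kernel that have none. A minimal sketch (this does not make the Triton atomic adds deterministic, it just keeps the rest of the stack reproducible):
import os
import torch

# Required for deterministic cuBLAS behaviour (CUDA >= 10.2)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
# Use deterministic kernels where available; warn instead of erroring
# for ops (e.g. cumsum on CUDA) that have no deterministic implementation
torch.use_deterministic_algorithms(True, warn_only=True)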
However, the GPUs you mention have different architectures (Ada Lovelace vs. Ampere), which affects floating-point arithmetic, rounding behavior, the order of parallel reductions, and so on. Achieving bit-for-bit determinism across different GPU architectures is EXTREMELY hard, if not completely impossible. In my experience, training a model on an A100 vs. a V100, for example, with the same hyperparameters, seeds, etc., can and more often than not will yield different results.
For experiments that require comparing the performance of models, such as ablation studies, you MUST train all the models on the same hardware. Otherwise there is no way to draw reliable conclusions from your results.
That being said, it is still important to set seeds so that running an experiment on the same hardware always yields the same results.
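Data loading order is another per-run source of variance worth pinning down. The sketch below follows the standard PyTorch recipe for seeding DataLoader workers; dataset here is a placeholder for your own Dataset instance:
import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Derive each worker's NumPy/random state from the base torch seed
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

loader = DataLoader(
    dataset,                      # placeholder for your Dataset instance
    batch_size=32,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,   # reseeds NumPy/random in every worker
    generator=g,                  # fixes the shuffling order
)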