I am training a Mamba model on two different GPU architectures: RTX 4090 and RTX A6000. Despite setting all random seeds and using deterministic algorithms, I am observing significant non-deterministic behavior in the training process on the RTX 4090, while the RTX A6000 exhibits almost deterministic results. This issue is due to the use of atomic adds in the backward pass of the Mamba model, as discussed in this GitHub issue.
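For context, the order sensitivity of floating-point atomic adds is easy to reproduce in isolation. The snippet below is only an illustration using a scatter-style reduction, not the actual Mamba Triton kernel:
import torch

# Reduce one million floats into a single slot; on CUDA this uses atomic adds.
x = torch.randn(1_000_000, device="cuda")
idx = torch.zeros(1_000_000, dtype=torch.long, device="cuda")

out1 = torch.zeros(1, device="cuda").index_add_(0, idx, x)
out2 = torch.zeros(1, device="cuda").index_add_(0, idx, x)

# The accumulation order is not fixed, so the two results typically differ
# in the last few bits even though the inputs are identical.
print((out1 - out2).abs().item())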
I would like to maintain the current hyperparameters, as they provide the best performance. Therefore, I am looking for any tricks or settings that can make the RTX 4090 behave more like the RTX A6000 environment to achieve consistent results. Here are the details of my setup:
RTX 4090
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:26:00.0 Off | Off |
| 30% 22C P8 12W / 450W | 17MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 426218 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
RTX A6000
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:24:00.0 Off | Off |
| 30% 22C P8 25W / 300W | 12MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 7388 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
So far I've tried setting:
import os
import torch

# Required for deterministic cuBLAS behaviour (CUDA >= 10.2)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
# Disable TF32 so matmuls run in full FP32 on both GPUs
torch.backends.cuda.matmul.allow_tf32 = False
torch.set_default_dtype(torch.float32)
# Force deterministic cuDNN kernels and disable autotuning
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
I've also tried using torch.use_deterministic_algorithms(True), but it raises the error:
File "/path/mamba_ssm/ops/triton/ssd_chunk_state.py", line 845, in _chunk_state_bwd_db
torch.cumsum(ddA_cumsum, dim=-1, out=ddA_cumsum)
RuntimeError: cumsum_cuda_kernel does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.
Any help is appreciated.
You can, and should, try to replicate your experimental setup exactly on both GPUs by doing e.g.:
import torch
import numpy as np
import random

seed = 42
# Seed every RNG that can influence training
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# Force deterministic cuDNN kernels and disable autotuning
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
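As for the RuntimeError raised by torch.use_deterministic_algorithms(True): the error message itself offers the escape hatch. With warn_only=True, PyTorch still selects deterministic implementations wherever they exist and only warns for ops like cumsum_cuda_kernel that have none. A minimal sketch (this does not make the Triton atomic adds deterministic, it just keeps the rest of the stack reproducible):
import os
import torch

# Required for deterministic cuBLAS behaviour (CUDA >= 10.2)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
# Use deterministic kernels where available; warn instead of erroring
# for ops (e.g. cumsum on CUDA) that have no deterministic implementation
torch.use_deterministic_algorithms(True, warn_only=True)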
However, the GPUs you mention have different architectures (Ada Lovelace vs. Ampere), which affects floating-point arithmetic, rounding behavior, the order of parallel reductions, and so on. Achieving bit-for-bit determinism across different GPU architectures is EXTREMELY hard, if not completely impossible. In my experience, training a model on an A100 vs. a V100, for example, with the same hyperparameters, seeds, etc., can and more often than not will yield different results.
For experiments that require comparing the performance of models, such as ablation studies, you MUST train all the models on the same hardware. Otherwise there is no way to draw reliable conclusions from your results.
That being said, it is still important to set seeds so that running an experiment on the same hardware always yields the same results.
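Data loading order is another per-run source of variance worth pinning down. The sketch below follows the standard PyTorch recipe for seeding DataLoader workers; dataset here is a placeholder for your own Dataset instance:
import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Derive each worker's NumPy/random state from the base torch seed
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

loader = DataLoader(
    dataset,                      # placeholder for your Dataset instance
    batch_size=32,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,   # reseeds NumPy/random in every worker
    generator=g,                  # fixes the shuffling order
)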