I'm trying to wrap my head around the implications of Nvidia's GPU sharing strategies: time-slicing, MPS, MIG, and the default behaviour.
Given how opaque I've found their docs to be on the subject, I've so far been piecing together my understanding of each by experimenting with each option and reading relevant source code, e.g. Nvidia's device plugin.
The item I'm currently looking at is benchmarking each strategy. I ran 7 replicas of the same app in k8s for all four variants, on an A100 with 80 GB of VRAM:
import os

# Set YOLOv8 to quiet mode
os.environ['YOLO_VERBOSE'] = 'False'

from prometheus_client import start_http_server, Histogram
from ultralytics import YOLO
import torch

start_http_server(8000)

device = torch.device("cuda")
model = YOLO("yolov8n.pt").to(device=device)

h = Histogram('gpu_stress_inference_yolov8_milliseconds_duration', 'Description of histogram', buckets=(1, 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, 150, 200, 500, 1000, 5000))

def run_model():
    results = model("https://ultralytics.com/images/bus.jpg")
    # print(model.device.type)
    h.observe(results[0].speed['inference'])

while True:
    run_model()
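(Note: results[0].speed reports per-stage times in milliseconds — 'preprocess', 'inference', 'postprocess' — so a variant that observes the end-to-end figure rather than just the forward pass would look something like this sketch:)

def run_model_end_to_end():
    results = model("https://ultralytics.com/images/bus.jpg")
    # sum the per-stage times (all in ms) to get end-to-end request latency
    h.observe(sum(results[0].speed.values()))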
The results are as follows:
Therefore, if inference speed is best with the default settings, and the default can already support multiple applications talking to one GPU, then why bother using any other strategy? I understand their docs state that a strategy like MIG gives you memory tolerance, i.e. one application sharing the GPU can't bring down another, but if we put that to one side, is there really any good reason to use these strategies if you're prioritising performance?
To add to the confusion, if I do a matmul with two enormous matrices, there's zero difference in performance between the strategies:
import torch
import time
from prometheus_client import start_http_server, Histogram

# Check if CUDA is available and Tensor Cores are supported
if not torch.cuda.is_available():
    raise SystemError("CUDA is not available on this system")

device = torch.device("cuda")
torch.cuda.set_sync_debug_mode(debug_mode="warn")
torch.set_default_device(device)  # ensure we actually use the GPU and don't do the calculations on the CPU

h = Histogram('gpu_stress_mat_mul_seconds_duration', 'Description of histogram', buckets=(0.001, 0.005, 0.01, 0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 50.0, 100.0, 200.0, 500.0, 1000.0))

def mat_mul(m1, m2):
    return torch.matmul(m1, m2)

# Function to perform matrix multiplication using Tensor Cores
def stress(matrix_size=16384):
    # Create random matrices on the GPU
    m1 = torch.randn(matrix_size, matrix_size, dtype=torch.float16)
    m2 = torch.randn(matrix_size, matrix_size, dtype=torch.float16)
    # Perform matrix multiplications indefinitely
    while True:
        start = time.time()
        output = torch.matmul(m1, m2)
        print(output.any())
        end = time.time()
        h.observe(end - start)

if __name__ == "__main__":
    start_http_server(8000)
    stress()
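(One caveat with my own timing here: torch.matmul launches asynchronously, and the wall-clock measurement only captures the kernel because print(output.any()) forces a device-to-host sync. A sketch of a more explicit way to time the GPU work, using CUDA events:)

def timed_matmul(m1, m2):
    # CUDA events time the work on the GPU itself, independent of host-side sync points
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    start_evt.record()
    output = torch.matmul(m1, m2)
    end_evt.record()
    torch.cuda.synchronize()
    return output, start_evt.elapsed_time(end_evt) / 1000.0  # elapsed_time() is in ms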
I must be missing something here, as their docs (e.g. for MPS) seem to imply that these strategies are better for GPU sharing.
Update
Thanks to the answer given by @robert-crovella, I went back to the results and found that whilst latency is worse with the sharing strategies, throughput is much better.
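For anyone measuring the same thing, a minimal sketch of how throughput can be tracked alongside the latency histogram, using a prometheus_client Counter (the metric name is just illustrative):

from prometheus_client import Counter

c = Counter('gpu_stress_inference_requests_total', 'Completed inference requests')

def run_model():
    results = model("https://ultralytics.com/images/bus.jpg")
    h.observe(results[0].speed['inference'])
    c.inc()  # rate(gpu_stress_inference_requests_total[1m]) in Prometheus then gives requests/second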
TL;DR: Your initial observation may be sensible, given some additional definition of the test case and noting that you seem to be measuring latency of an individual inference request. The default case probably is the best setting considering only the latency measured for each request.
Longer:
Defining terms:
You haven't clearly defined or shown what you mean by the "Default" and "Time-Slicing" cases.
Your first graph data appears sensible to me under the following definitions:
From the NVIDIA GPU Operator docs on time-slicing: "A typical resource request provides exclusive access to GPUs. A request for a time-sliced GPU provides shared access."
It's also important to note that your test is measuring (it seems) latency of inference. One could also measure inference throughput, and the ranking of the cases might be different. The benefit implied/suggested in the MPS case almost certainly applies more directly to throughput than to latency.
Brief description of the modes under the above proviso:
MIG: The GPU is physically partitioned at the hardware level. Each partitioned instance (up to 7, currently) has a portion of the GPU capability in terms of compute throughput, memory size, and memory bandwidth. Compared to running on an unpartitioned GPU, running the same individual test/measurement on a MIG instance will almost certainly run slower. It is like running on a GPU that has (e.g.) 1/7th of the capability of the unpartitioned GPU.
Default: The GPU and its behavior is unmodified. The normal/default behavior for a CUDA GPU is to allow multiple tenants to share the GPU in a mostly unspecified fashion. However note the above statement about K8s use with GPU Operator - the default case implies exclusive per-process access to the GPU. This means for each inference request, the request will have access to the entire GPU for that request. But multiple requests from separate tenants will incur some kind of serialization.
MPS: The GPU has an additional "adapter" attached to the front-end work-delivery. Work is delivered to the GPU as if it emanated from a single process. This means that the individual tenants do not get exclusive access to the GPU; somehow the GPU is shared. Furthermore, some kind of overlap of activity may happen, which would not otherwise happen in the default case for multi-tenant/multi-process. In particular, kernel execution overlap is possible in the MPS case.
Time-sliced: As mentioned previously, either vGPU is in use or the K8s GPU operator is being used to modify GPU behavior. At a high level, the effect of this is similar to my description of the default case without K8s/operator intervention: the GPU is shared in some unspecified fashion. On modern GPUs, by observation, the sharing is time-sliced, even at the CUDA kernel level (i.e. CUDA kernel pre-emption may be in use).
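For reference, a minimal sketch of how you could check which of the above modes is actually in effect on a node, assuming the nvidia-ml-py (pynvml) bindings are installed (the MPS environment variable may or may not be set depending on how the daemon was configured):

import os
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# MIG: is the GPU partitioned at the hardware level?
current, _pending = pynvml.nvmlDeviceGetMigMode(handle)
print("MIG enabled:", current == pynvml.NVML_DEVICE_MIG_ENABLE)

# Default vs. exclusive: what compute mode is the device in?
mode = pynvml.nvmlDeviceGetComputeMode(handle)
print("Exclusive-process mode:", mode == pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS)

# MPS: clients are typically pointed at the control daemon via this variable
print("MPS pipe dir:", os.environ.get("CUDA_MPS_PIPE_DIRECTORY"))

pynvml.nvmlShutdown()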
Performance description:
It's important to note that you are measuring inference latency (I believe). I think the best way to describe why the results seem sensible is to compare the cases pair-wise.
In the default case, including the exclusive note above, each inference request runs as if it had an entire A100 GPU all to itself. From a latency perspective (the elapsed time for that particular request to complete), this is undoubtedly the best scenario.
In the MIG case (let's pretend you partitioned the GPU into 7 instances), each inference request will have 1/7th of the GPU to run on. This will certainly take longer than the default case, considering only the latency of each request.
In the MPS case and time-sliced case, the GPU is no longer exclusive to a single inference request. Multiple requests can be using the GPU at a particular time, and one inference request could be halted mid-processing to allow another request to proceed. This is the nature of time-slicing. Considering the latency of a particular request, it will certainly be slower if it happens to be interrupted mid-processing by the time-slicer to allow another request to be processed.
So default is probably the best setting, considering only the latency per request.
But the other metric that people sometimes care about is throughput: how many requests per second are completed, independent of the time each individual request takes (i.e. ignoring latency)? For high-volume inferencing, this is basically a measurement of efficiency. In that case, the default setting might not be the best. It results in the smallest overlap between processing of requests (i.e. serialization of activity is the highest).
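To make the distinction concrete with a toy model (the numbers are purely illustrative, and the linear-scaling assumption for MIG is a simplification):

t_full = 0.010  # hypothetical latency of one inference on the whole A100, in seconds

# Default/exclusive: each request sees the whole GPU, but requests serialize.
latency_default = t_full              # best possible per-request latency
throughput_default = 1 / t_full       # ~100 requests/s for the GPU as a whole

# MIG, 7 instances: each request runs on ~1/7th of the GPU, 7 run concurrently.
latency_mig = 7 * t_full              # ~70 ms per request under the linear assumption
throughput_mig = 7 / latency_mig      # still ~100 requests/s in aggregate

# MPS / time-slicing: per-request latency also degrades when sharing, but kernel
# overlap (MPS) can fill gaps that serialization leaves, so aggregate requests/s
# can exceed the exclusive case when individual kernels don't saturate the GPU.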
Use suggestions:
MIG particularly shines when you want to provide fixed resources to each tenant/client/process. This means that the activity of one client has little impact on the observed behavior of another client. But in the high-volume inferencing case this does not necessarily minimize latency nor maximize throughput. One of its most important benefits is around QoS in a shared setting.
MPS is almost certainly better than the default case any time you are interested in throughput or efficiency in a multi-client/tenant/process case. It does not make the performance of any single client higher, however. It only improves performance when an aggregate view is taken, such as throughput or efficiency. Furthermore, if you can equivalently partition your work in such a way that a single process can dispatch all the work to the GPU rather than multiple processes sharing a GPU, the single process (and non-MPS) case is usually more efficient in my experience. But in many cases we are inherently using or must use a multi-process work breakdown structure, and MPS helps with efficiency/throughput in those cases (a toy sketch of that multi-process pattern follows at the end of this answer).
In high-volume inferencing, a figure of merit that is sometimes advanced is maximum throughput (or efficiency) subject to a maximum/upper bound on latency for each request. None of the above technologies by themselves will conveniently allow for that. In that case, an additional technology like Triton Inference Server may be of interest.
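On the MPS point above, here is a toy sketch of the multi-process pattern where MPS tends to help: several processes each submitting modest kernels that individually don't saturate the GPU (the sizes and worker count are arbitrary, not a tuned benchmark):

import torch
import torch.multiprocessing as mp

def worker(n_iters=1000, size=1024):
    # Each process submits relatively small matmuls; without MPS these compete
    # via time-slicing/serialization, with MPS their kernels may overlap.
    device = torch.device("cuda")
    m1 = torch.randn(size, size, device=device, dtype=torch.float16)
    m2 = torch.randn(size, size, device=device, dtype=torch.float16)
    for _ in range(n_iters):
        torch.matmul(m1, m2)
    torch.cuda.synchronize()

if __name__ == "__main__":
    mp.set_start_method("spawn")
    procs = [mp.Process(target=worker) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()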