Tags: pytorch, neural-network, gpu, scheduling, hardware-acceleration

How is a neural network mapped to a GPU?


I want to understand how, when a GPU executes a neural network, the operations are mapped to the GPU's hardware resources. I am familiar with the architecture of GPUs (especially NVIDIA's) and generally know how an NN is executed on them, but I do not know how to obtain the detailed, fine-grained scheduling of operations onto the hardware resources or how the cores execute them. I am wondering if there is a tool, or a set of tools, for that.

To be more specific, let's imagine that I have a pre-trained neural network in PyTorch and want to run it on an NVIDIA RTX 3090 GPU. How can I get the detailed scheduling of the operations (at the level of MAC operations, or of neurons/channels/layers of the NN) onto the corresponding hardware resources, i.e., SMs or threads?


Solution

  • This question is extremely broad, but I can give some information.

    When you load a neural network, you are loading tensors of weights. These weights are typically loaded into CPU memory first, then transferred to GPU device memory (HBM on data-center GPUs; GDDR on consumer cards like the 3090).

    In addition to the weights, you have the model logic (i.e., the forward method of a PyTorch model). Note that the model logic is separate from the weights themselves.

    The model logic determines which operations are executed on which weights, and in what order.
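
    For illustration, here is a minimal sketch of that split between weights and logic; the checkpoint file name my_model.pt and the target device cuda:0 are assumptions for the example:

    import torch
    import torch.nn as nn

    # The "logic" is ordinary code; the "weights" are tensors stored in a state dict.
    model = nn.Linear(32, 8)

    # Pre-trained weights are loaded into CPU memory first...
    state = torch.load("my_model.pt", map_location="cpu")
    model.load_state_dict(state)

    # ...then the weight tensors are moved to GPU memory.
    # The Python forward logic itself stays on the host.
    model = model.to("cuda:0")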

    Say we have the model:

    import torch
    import torch.nn as nn

    class MyModel(nn.Module):
        def __init__(self):
            super().__init__()

            # Two fully connected layers: 32 -> 8 -> 1
            self.layer1 = nn.Linear(32, 8)
            self.layer2 = nn.Linear(8, 1)

        def forward(self, x):
            x = self.layer1(x)
            x = torch.relu(x)
            x = self.layer2(x)
            return x
    

    Our weights in the model's state dict are the weight/bias tensors of layer1 and layer2. Our model execution logic is the layer1/relu/layer2 code in the forward method.

    When we run inference on the model, the forward method determines the order of operations.
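
    Building on the model above, a quick sketch of what that looks like from the PyTorch side (the batch size of 4 and the use of cuda are assumptions):

    model = MyModel().to("cuda")
    print(list(model.state_dict().keys()))
    # ['layer1.weight', 'layer1.bias', 'layer2.weight', 'layer2.bias']

    x = torch.randn(4, 32, device="cuda")
    with torch.no_grad():
        y = model(x)   # forward() runs layer1 -> relu -> layer2, in that order
    print(y.shape)     # torch.Size([4, 1])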

    Each layer maps to one or more GPU kernels. The kernel's launch configuration decides how the input weights/activations are broken down into a grid of thread blocks and distributed among the GPU's SMs.

    Typically the GPU executes one layer at a time, using as much compute as possible for that layer.
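
    To make "grids/blocks" concrete, here is a purely illustrative back-of-the-envelope calculation; the batch size and tile sizes are assumptions, not what cuBLAS actually chooses for this shape:

    import math

    # layer1 computes y = x @ W.T + b, with x of shape (batch, 32) and W of shape (8, 32).
    batch, out_features = 1024, 8

    # Hypothetical output-tile sizes; real GEMM kernels pick these per architecture and problem shape.
    TILE_M, TILE_N = 64, 64

    grid_m = math.ceil(batch / TILE_M)         # thread blocks along the batch dimension
    grid_n = math.ceil(out_features / TILE_N)  # thread blocks along the output-feature dimension
    print(f"launch grid: ({grid_m}, {grid_n}) = {grid_m * grid_n} thread blocks scheduled across the SMs")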

    With the model above, it would look something like this:

    1. Given some input x and the model weights in HBM
    2. Move x and the layer1 weights into SRAM
    3. Execute the GPU kernel for torch.nn.Linear with x and the layer1 weights
    4. Write the result back to HBM
    5. Move x (now the result of layer1) into SRAM
    6. Execute the GPU kernel for torch.relu on x
    7. Write the result back to HBM
    8. Move x (now the result of the relu) and the layer2 weights into SRAM
    9. Execute the GPU kernel for torch.nn.Linear with x and the layer2 weights
    10. Write the result back to HBM

    For the above, each kernel launch distributes the inputs/weights across a grid of thread blocks; the exact distribution depends on the logic of the kernel itself.

    This gets more complicated with the potential for kernel fusion (e.g., fusing the relu into the preceding linear layer's kernel) and so on, but this is the basic idea.
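
    As an aside, torch.compile (available in PyTorch 2.x) can apply this kind of fusion automatically in some cases; whether the relu actually gets fused here depends on the backend and is not guaranteed:

    compiled_model = torch.compile(model)   # may fuse elementwise ops such as relu into neighboring kernels
    with torch.no_grad():
        y = compiled_model(x)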

    You can use the PyTorch profiler to see which kernels are executed, and when, during the forward pass.
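
    A minimal sketch of profiler usage, continuing from the example above (the sort key and row limit are just example choices):

    from torch.profiler import profile, ProfilerActivity

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            model(x)

    # One row per kernel/op, with CUDA time, call counts, etc.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))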