Tags: python, deep-learning, neural-network, pytorch

Why should I use a 2**N value and how do I choose the right one?


I'm working through the lessons on building a neural network and I'm confused as to why 512 is used for the linear_relu_stack in the example code:

from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
            nn.ReLU()
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

I started googling around and saw many examples of torch.nn.Linear being used with various 2**N values, but it isn't clear to me why they use powers of 2 or how they choose which value to use.


Solution

  • The reason is how the hardware performs the computation. In deep learning, matrix operations are the main computations and the main source of floating point operations (FLOPs).

    Single Instruction, Multiple Data (SIMD) operations on CPUs work on fixed-width vector registers, and those widths are powers of 2. Take a look here if you are interested:

    https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37631.pdf

    And for GPUs:

    https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html

    Memory allocated through the CUDA Runtime API, such as via cudaMalloc(), is guaranteed to be aligned to at least 256 bytes. Therefore, choosing sensible thread block sizes, such as multiples of the warp size (i.e., 32 on current GPUs), facilitates memory accesses by warps that are properly aligned. (Consider what would happen to the memory addresses accessed by the second, third, and subsequent thread blocks if the thread block size was not a multiple of warp size, for example.)

    This means that layer sizes that are multiples of 32 keep memory accesses aligned, and thus speed up processing, when you are running on a GPU. A rough way to check this on your own machine is sketched below.
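
    As a purely illustrative sketch (the input size of 1024, the batch size of 256, and the widths 512 vs. 509 are my own arbitrary choices, and on small layers or on CPU the difference may be lost in the noise), you could time two otherwise identical linear layers whose output widths differ only in whether they are a multiple of 32:

    import time
    import torch
    from torch import nn

    def time_linear(width, iters=200,
                    device="cuda" if torch.cuda.is_available() else "cpu"):
        # Same batch and input size; only the output width changes.
        layer = nn.Linear(1024, width).to(device)
        x = torch.randn(256, 1024, device=device)
        # Warm-up so one-time setup and kernel selection are not measured.
        for _ in range(10):
            layer(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            layer(x)
        if device == "cuda":
            torch.cuda.synchronize()
        return time.perf_counter() - start

    print("width 512 (multiple of 32):      ", time_linear(512))
    print("width 509 (not a multiple of 32):", time_linear(509))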

    About the right value: a pyramidal shape usually works well, because as it gets deeper the network tends to build increasingly abstract, hierarchical internal representations of the data, so it needs progressively fewer units. A good first guess is therefore to decrease the number of neurons at each layer as you get closer to the output, e.g.:

    self.flatten = nn.Flatten()
    self.linear_relu_stack = nn.Sequential(
        nn.Linear(28*28, 512),   # 784 flattened pixels -> 512 hidden units
        nn.ReLU(),
        nn.Linear(512, 128),     # narrower as we go deeper
        nn.ReLU(),
        nn.Linear(128, 10),      # 10 output classes
        nn.ReLU()
    )
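
    To sanity-check the shapes, here is a minimal self-contained sketch of the same stack (the 28*28 input size and the batch of 64 random images are just assumptions to mirror the MNIST-style example in the question):

    import torch
    from torch import nn

    stack = nn.Sequential(
        nn.Flatten(),
        nn.Linear(28*28, 512),
        nn.ReLU(),
        nn.Linear(512, 128),
        nn.ReLU(),
        nn.Linear(128, 10),
        nn.ReLU()
    )

    batch = torch.randn(64, 28, 28)   # 64 fake 28x28 images
    print(stack(batch).shape)         # torch.Size([64, 10])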
    

    But there is no general rule, and there are whole fields of study (such as Neural Architecture Search) devoted to finding optimal hyper-parameters for neural networks. In practice, a simple sweep over a few candidate widths, as in the rough sketch below, is often a reasonable starting point.
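
    As a purely illustrative sketch (the random stand-in data, the candidate widths, and the 20 optimization steps per candidate are my own arbitrary choices, not anything from the question or the paper below), a crude manual sweep could look like this:

    import torch
    from torch import nn

    def make_model(hidden):
        # Same pyramidal pattern as above, parameterized by the first hidden width.
        return nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden // 4),
            nn.ReLU(),
            nn.Linear(hidden // 4, 10),
        )

    # Random data stands in for a real dataset; swap in your own DataLoader.
    x = torch.randn(512, 28, 28)
    y = torch.randint(0, 10, (512,))

    for hidden in (128, 256, 512):          # candidates kept as multiples of 32
        model = make_model(hidden)
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(20):                 # a few quick steps per candidate
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
        n_params = sum(p.numel() for p in model.parameters())
        print(f"hidden={hidden:4d}  params={n_params:7d}  final loss={loss.item():.3f}")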

    You can take a look here for more in-depth information:

    https://arxiv.org/pdf/1608.04064.pdf