deep-learning, artificial-intelligence, gpt-3, large-language-model

Figuring out general specs for running LLM models


I have three questions:

Given the number of LLM parameters in billions, how can you figure out how much GPU RAM you need to run the model?

If you have enough CPU RAM (i.e. no GPU), can you run the model, even if it is slow?

Can you run LLM models (like h2ogpt, open-assistant) with mixed GPU RAM and CPU RAM?


Solution

  • How do you calculate the amount of RAM needed? I'm assuming that you mean just inference, no training.

    The paper "Reducing Activation Recomputation in Large Transformer Models" has good information on calculating the size of a Transformer layer.

    b: batch size
    s: sequence length
    l: number of layers
    a: number of attention heads
    h: hidden dimension
    p: bytes of precision
    
    activations per layer = s*b*h*(34 + (5*a*s)/h)
    

    The paper calculated this at 16-bit precision, so the formula above is already in bytes (2 bytes per value). If we divide by 2, we get a count of values, which we can then multiply by however many bytes of precision we actually use.

    activations = l * ((5/2)*a*b*s^2 + 17*s*b*h)   #divided by 2, simplified, and summed over all l layers
    
    total = p * (params + activations)
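    
    As a rough sketch, here is the same estimate in Python (the function name and layout are mine, not from the paper):
    
    def estimate_inference_bytes(params, l, a, b, s, h, p):
        # Per-layer activation bytes from the paper are s*b*h*(34 + 5*a*s/h) at 16-bit precision;
        # halve that to get a value count, sum over l layers, then scale by p bytes per value.
        activations = l * ((5 / 2) * a * b * s**2 + 17 * s * b * h)
        return p * (params + activations)
    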
    

    Let's look at llama2 7b for an example:

    params = 7*10^9
    
    p = 4    #bytes of precision (fp32)
    b = 1    #batchsize 
    s = 2048 #sequence length
    l = 32   #layers
    a = 32   #attention heads
    h = 4096 #hidden dimension
    
    activations => 15,300,820,992
    p * (activations + params) => about 83 GiB
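    
    Plugging those numbers into the sketch above:
    
    estimate_inference_bytes(params=7 * 10**9, l=32, a=32, b=1, s=2048, h=4096, p=4)
    # => 89,203,283,968 bytes, i.e. about 83 GiB
    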
    

    Note that you can drastically reduce the memory needed with quantization. At 4-bit quantization (0.5 bytes per value) that comes down to a little over 10 GiB.
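
    For example, re-running the sketch at a few different precisions (an illustration of the same formula, not measured figures):
    
    for bits in (32, 16, 8, 4):
        p = bits / 8  # bytes per value
        gib = estimate_inference_bytes(7 * 10**9, 32, 32, 1, 2048, 4096, p) / 2**30
        print(f"{bits}-bit: ~{gib:.0f} GiB")
    # 32-bit: ~83 GiB, 16-bit: ~42 GiB, 8-bit: ~21 GiB, 4-bit: ~10 GiB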

    I hope that helps and that I didn't miss anything important.

    Edit: I found this resource; I'm not sure how accurate it is, but it looks nice: https://vram.asmirnov.xyz