deep-learning, artificial-intelligence, gpt-3, large-language-model

Figuring out general specs for running LLM models


I have three questions:

Given the number of LLM parameters in billions, how can you figure out how much GPU RAM you need to run the model?

If you have enough CPU RAM (i.e. no GPU), can you run the model, even if it is slow?

Can you run LLM models (like h2ogpt, open-assistant) with mixed GPU RAM and CPU RAM?


Solution

  • How do you calculate the amount of RAM needed? I'm assuming that you mean just inference, no training.

    The paper "Reducing Activation Recomputation in Large Transformer Models" has good information on calculating the size of a Transformer layer.

    b: batch size
    s: sequence length
    l: number of layers
    a: number of attention heads
    h: hidden dimension
    p: bytes of precision
    
    activations per layer = s*b*h*(34 + (5*a*s)/h)
    

    The paper calculated this at 16-bit precision, so the formula above is already in bytes (2 bytes per value). If we divide by 2, we get a count of values, which we can then multiply by however many bytes of precision we actually use.

    activations = l * ((5/2)*a*b*s^2 + 17*s*b*h)   #divided by 2, simplified, and summed over all l layers
    
    total = p * (params + activations)
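    
    As a rough sketch, here is the same estimate in Python (the function name and layout are mine, not from the paper):
    
    def estimate_inference_bytes(params, l, a, b, s, h, p):
        # Per-layer activation bytes from the paper are s*b*h*(34 + 5*a*s/h) at 16-bit precision;
        # halve that to get a value count, sum over l layers, then scale by p bytes per value.
        activations = l * ((5 / 2) * a * b * s**2 + 17 * s * b * h)
        return p * (params + activations)
    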
    

    Let's look at llama2 7b for an example:

    params = 7*10^9
    
    p = 4    #bytes of precision (fp32)
    b = 1    #batchsize 
    s = 2048 #sequence length
    l = 32   #layers
    a = 32   #attention heads
    h = 4096 #hidden dimension
    
    activations => 15,300,820,992
    p * (activations + params) => about 83 GiB
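    
    Plugging those numbers into the sketch above:
    
    estimate_inference_bytes(params=7 * 10**9, l=32, a=32, b=1, s=2048, h=4096, p=4)
    # => 89,203,283,968 bytes, i.e. about 83 GiB
    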
    

    Note that you can drastically reduce the memory needed with quantization. At 4-bit quantization (0.5 bytes per value) that comes down to a little over 10 GiB.
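
    For example, re-running the sketch at a few different precisions (an illustration of the same formula, not measured figures):
    
    for bits in (32, 16, 8, 4):
        p = bits / 8  # bytes per value
        gib = estimate_inference_bytes(7 * 10**9, 32, 32, 1, 2048, 4096, p) / 2**30
        print(f"{bits}-bit: ~{gib:.0f} GiB")
    # 32-bit: ~83 GiB, 16-bit: ~42 GiB, 8-bit: ~21 GiB, 4-bit: ~10 GiB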

    I hope that helps and that I didn't miss anything important.

    Edit: I found this resource; I'm not sure how accurate it is, but it looks nice: https://vram.asmirnov.xyz