Computation

  • In full precision (float32): every parameter = 32 bits = 4 bytes
    • 7B model = 7 billion parameters * 4 bytes = 28 billion bytes = 28 GB of GPU memory required for inference.
  • In half precision (float16/bfloat16): every parameter = 32/2 = 16 bits = 2 bytes
    • 7B model = 14 GB
  • And so on: for a 7B model, 8-bit = 7 GB and 4-bit = 3.5 GB
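The same back-of-the-envelope arithmetic as a minimal sketch in Python. The helper name is my own, and it uses the 1 GB = 10^9 bytes convention from the figures above:

```python
def inference_memory_gb(num_params_billion: float, bits_per_param: int) -> float:
    """Rough GPU memory needed just to hold the weights for inference."""
    bytes_per_param = bits_per_param / 8
    # params * bytes/param, expressed in GB (1 GB = 1e9 bytes)
    return num_params_billion * bytes_per_param

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits}-bit: {inference_memory_gb(7, bits):.1f} GB")
# 28.0, 14.0, 7.0, 3.5
```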

For training, the memory required depends on the optimizer you use.

  • If you use regular AdamW, you need 8 bytes per parameter (as it keeps not only the parameters, but also the gradients and two optimizer states: the first and second moments of the gradients). Hence, for a 7B model you would need 8 bytes per parameter * 7 billion parameters = 56 GB of GPU memory.
  • If you use AdaFactor, you need 4 bytes per parameter, or 28 GB of GPU memory.
  • With the 8-bit optimizers from bitsandbytes (like 8-bit AdamW), you need 2 bytes per parameter, or 14 GB of GPU memory.
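These per-optimizer rules of thumb can be written down the same way. A minimal sketch; the dictionary keys and helper name are illustrative, not identifiers from any library:

```python
# Approximate bytes of training state per parameter, following the numbers above.
BYTES_PER_PARAM = {
    "adamw": 8,        # regular AdamW
    "adafactor": 4,    # AdaFactor
    "adamw_8bit": 2,   # bitsandbytes 8-bit AdamW
}

def training_memory_gb(num_params_billion: float, optimizer: str) -> float:
    """Rough GPU memory estimate for training, in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * BYTES_PER_PARAM[optimizer]

for opt in BYTES_PER_PARAM:
    print(f"7B model with {opt}: {training_memory_gb(7, opt):.0f} GB")
# adamw: 56 GB, adafactor: 28 GB, adamw_8bit: 14 GB
```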