LLMs at 1/50th the power consumption
Reading: Scalable MatMul-free Language Modeling, Zhu et al. (2024), pre-print (via Tom’s Hardware).
There are a bunch of things in there, including hardware, but I wanted to understand this one feature:
By constraining the weights to the set {−1,0,+1} and applying additional quantization techniques, MatMul [matrix multiplication] operations are replaced with addition and negation operations. This reduces computational cost and memory utilization, while preserving the expressiveness of the network.
Weights in neural networks are typically floating-point numbers, and you multiply them by other floats during both inference and training. That is, you multiply each activation (a float) by a weight (a float) and sum the products to get an activation in the next layer (another float).
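As a toy sketch of that multiply-and-accumulate (the variable names here are mine, not the paper's):

```python
# Toy sketch: one neuron's activation in a standard float network.
# Values and names are illustrative only.
activations = [0.5, -1.2, 3.0, 0.7]   # outputs of the previous layer
weights = [0.25, -0.8, 0.1, 1.5]      # one neuron's float weights

# Multiply each activation by its weight and sum: a dot product.
next_activation = sum(a * w for a, w in zip(activations, weights))
```

Every term costs a floating-point multiplication, and a real layer does this for every neuron.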
With weights restricted to -1, 0, or 1, that multiply-and-accumulate is replaced by: the sum of the activations where the weight is 1, minus the sum of the activations where the weight is -1 (activations where the weight is 0 are simply skipped). This is a big deal, because there are a lot of weights: billions or tens of billions in an LLM.
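A minimal sketch of the same dot product with ternary weights (again, illustrative values of my own choosing):

```python
# Toy sketch with ternary weights in {-1, 0, +1}: the dot product
# collapses into additions and subtractions, with no multiplications.
activations = [0.5, -1.2, 3.0, 0.7]
ternary_weights = [1, 0, -1, 1]       # illustrative values

# Sum activations where the weight is +1, subtract where it is -1,
# and skip where it is 0.
next_activation = (
    sum(a for a, w in zip(activations, ternary_weights) if w == 1)
    - sum(a for a, w in zip(activations, ternary_weights) if w == -1)
)

# Same result as actually multiplying, without a single multiply.
assert abs(next_activation
           - sum(a * w for a, w in zip(activations, ternary_weights))) < 1e-12
```

Addition is far cheaper than multiplication in hardware, which is where the power savings come from.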
That you can train from scratch with this architecture — and it’s not a compromise — blows my mind.