Demystifying 6ND FLOPs

Sep 14, 2025

\(\mathrm{FLOPs} = 6 \cdot \mathrm{params} \cdot \mathrm{tokens}.\) This is a nearly magical formula that approximates the floating-point operations required to train neural networks. Maybe you’ve heard of it and wondered when to use it, or how something as complex as training a neural network can be boiled down to such a simple expression.

Let me look under the hood and show that this formula is not magical for at least two reasons.

  1. With some thought, it’s not hard to derive.
  2. It does not always work (architectures can become complex enough that the simplifications behind the $6ND$ formula become too crude).

Let’s start with point 1 and derive the formula. Point 2 will naturally fall from the derivation.

FLOPs for matmul

Saying that modern GPUs are popular for training neural networks because they are great at parallelizing mathematical operations is a bit of a simplification. Yes, they can do a lot of things at once, but there is one thing that they do much better than anything else: multiplying matrices. Matrix multiplication (or matmul for short) is also the single most important building block of any neural network. So let’s focus on it.

Let $A, B$ be matrices with shapes $(p, q), (q, r)$. Our goal is to understand the number of FLOPs required to evaluate the product $AB$. We know that the result has $pr$ elements. Each element of the resulting matrix corresponds to a dot product of vectors with dimension $q$. A dot product needs roughly $2q$ FLOPs (more precisely $q$ multiplications and $q-1$ additions). Therefore, in total, we need $2pqr$ FLOPs to do a matrix multiplication of matrices with shapes $(p, q)$ and $(q, r)$.

To get matmul FLOPs, we just need to multiply the shapes! There is something beautiful in that result. And since neural networks are fundamentally built from matmuls, if you understand where the $2pqr$ came from, you're good to continue.
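
To make this concrete, here is a minimal Python sketch (helper names are mine) that compares the $2pqr$ approximation with the exact operation count of the naive triple-loop algorithm:

```python
def matmul_flops(p: int, q: int, r: int) -> int:
    """Approximate FLOPs for multiplying a (p, q) matrix by a (q, r) matrix."""
    return 2 * p * q * r


def naive_matmul_ops(p: int, q: int, r: int) -> int:
    """Exact count for the naive algorithm: each of the p*r output elements
    needs q multiplications and q - 1 additions."""
    return p * r * (q + (q - 1))


p, q, r = 128, 256, 64
print(matmul_flops(p, q, r))      # 4194304
print(naive_matmul_ops(p, q, r))  # 4186112 -- the gap is exactly the p*r "missing" additions
```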

But you promised neural nets, not matmuls!

That’s right, I promised nets. So let’s deliver on that and start with something simple: calculating the FLOPs of a fully connected layer. By that, I mean the following

\[W x + b.\]

Now, let's sweep the bias under the rug. It only adds a number of FLOPs equal to the output dimension, which is insignificant compared to the rest of the calculation. We are left with

\[W x.\]

Ha! Our old friend matmul. Assuming that $W$ has shape $(p, q)$ and $x$ is a column vector with shape $(q, 1)$, we conclude that the FLOPs needed for the forward pass are $2pq$.

When training neural networks, we typically do not work with one example but process many of them. Let’s denote the total number of examples $D$. We have

\[\mathrm{FLOPs} = 2pqD.\]

Now, we just need to realize that all parameters of our layer are stored in the matrix $W$, which has $pq$ parameters. Let’s denote the number of parameters as $N$. We obtain

\[\mathrm{FLOPs} = 2ND.\]

Two is not six, so we are not quite there yet. However, so far we’ve ignored the backward pass. The above formula counts only computation going forward.
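
As a quick sanity check, here is a small sketch (the sizes are illustrative) expressing the forward-pass count in terms of the weight matrix and the number of examples:

```python
p, q, D = 512, 1024, 10_000    # output dim, input dim, number of examples (illustrative)

N = p * q                      # parameters of the layer: the entries of W
forward_flops = 2 * p * q * D  # matmul of W (p, q) with the stacked inputs (q, D)

assert forward_flops == 2 * N * D  # the forward pass is 2ND
```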

Going backward

For now, let’s assume that the layer we are dealing with is somewhere inside the network. In a backward pass, we need to solve two problems:

  1. What are the partial derivatives with respect to the weights?
  2. What are the partial derivatives with respect to the inputs of the layer? (We need these to backpropagate information to the preceding layer.)

Let us denote the output of our layer as $o$, so the layer is \(o = W x.\)

Assume we already know the partial derivatives of the loss $L$ with respect to all elements of $o$, i.e., the vector $\frac{\partial L}{\partial o}$, which has the same shape as $o$. Applying the chain rule, the gradients we need are

\(\frac{\partial L}{\partial o} \, x^T\) for $W$ and

\(W^T \frac{\partial L}{\partial o}\) for $x$.

Both of these are matrix multiplications. Based on the shapes, both take $2pq$ FLOPs. To be completely fair, we are overcounting somewhat, as $\frac{\partial L}{\partial o} \, x^T$ is a degenerate (outer-product) matrix multiplication in which no additions occur.

In total, pushing one example forward-backward through the layer requires $3 (2pq) = 6pq$ FLOPs. Now let us call $D$ the number of inputs to our network. In total, all examples require $6pqD$ FLOPs. How many parameters are in our layer? We have just one $p\times q$ matrix. Therefore, $pq$ parameters, which we will represent with $N$. Voilà, $6ND$ is born!
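
Tallying the three matmuls per example gives the following sketch (the function name is mine):

```python
def linear_layer_flops(p: int, q: int, num_examples: int) -> int:
    """Approximate forward + backward FLOPs for o = W x, with W of shape (p, q)."""
    forward = 2 * p * q  # W x
    grad_w = 2 * p * q   # (dL/do) x^T  (outer product, counted as a matmul)
    grad_x = 2 * p * q   # W^T (dL/do)
    return (forward + grad_w + grad_x) * num_examples


p, q, D = 512, 1024, 10_000
N = p * q
assert linear_layer_flops(p, q, D) == 6 * N * D
```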

How much is under the rug?

We swept some things under the rug, but how much?

We ignored bias, but this shouldn’t be a big problem. Naive matrix-vector multiplication is $\mathcal{O}(n^2)$. Compared to it, a vector-vector addition is only $\mathcal{O}(n)$, which is negligible for any reasonably sized layer.

In the backward pass, we assumed that the optimizer purely evaluates the gradient; we ignored the fact that the gradient needs to be multiplied by the learning rate and added to the weights, and that modern optimizers track many more statistics about the variables. Here we undercount, possibly by quite a lot. On the other hand, $6ND$ is really just a rough approximation, and accounting for different optimizers would require a lot of additional work.

Similarly, we ignore activations. This could also bite us. However, taking all the different activations into account is probably even harder than working with optimizers, and we want to keep things simple.

One more issue is that we are always assuming that derivatives with respect to the input of the layer are evaluated, which is not needed for the first layer.

Overall, we have some undercounting and some overcounting tendencies at the same time. The bias is probably safe to ignore; the rest might need some thought based on the specific network in question and how precise a FLOP count is required.

Adding more layers

For now, let's stay in the realm of fully connected layers. Stacking $l$ of them together, we get

\[\mathrm{FLOPs} = D (6p_1q_1 + 6p_2q_2 + \dots + 6p_lq_l).\]

Once more, $D$ is the number of examples pushed through the network. $(p_i, q_i)$ are shapes of the weight matrices. Using $N = \sum_{i=1}^l p_i q_i$, we end up with \(\mathrm{FLOPs} = 6ND.\)
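
In code, the same bookkeeping looks like this (the layer shapes are made up):

```python
# Weight matrix shapes (p_i, q_i) for a small stack of fully connected layers.
layer_shapes = [(512, 784), (512, 512), (10, 512)]  # arbitrary example
D = 50_000                                          # examples pushed through the network

flops = D * sum(6 * p * q for p, q in layer_shapes)

N = sum(p * q for p, q in layer_shapes)             # total parameter count
assert flops == 6 * N * D
```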

Nice! We’ve shown that for networks with fully connected layers, the $6ND$ formula is reasonable.

Sadly, not all networks are fully connected…

The Tale of Transformers

In the rest of this post, I will talk about transformers. There are certainly other interesting layers, but transformer layers now sit at the root of so many systems that looking into them is well worth the effort.

By a transformer, we mean a somewhat simplified self-attention layer followed by a fully connected layer. Standard attention is

\[\mathrm{softmax}(QK^T) V.\]

The above is computed by each head. In our simplified version, we will assume only one head. The whole problem then boils down to understanding the FLOPs of

  • computing $QK^T$, applying $\mathrm{softmax}$ to get $A = \mathrm{softmax}(QK^T)$, and then multiplying by $V$ (i.e., $AV$),
  • projections that transform some input matrix of tokens $X$ into $Q, K, V$ matrices and the layer that is applied after attention.

The operations are suggestively split into two groups for a good reason. Notice that the first group does not use any learned parameters. This might be a problem for the $6ND$ formula; it only takes into account matmuls happening between a layer’s weight matrix and the output of the previous layer. But attention also requires some computation between $Q, K, V$ matrices, which are not weights but results of other layers. So are we screwed? The answer is: It depends…

Now, we will get into math, and that might be a bit tricky. Not because it's hard, but because I have not yet figured out the best way to minimize the number of letters we need. Nevertheless, if you understand the intuition from the paragraph above, you know the main problem with transformers, and the rest is just my attempt to show you what that "depends" means.

Transformers operate on tokens, so let us denote the number of tokens in one example as $T$. Tokens are vectors with some specific dimension $d$, so our input is a matrix $X$ with shape $(T, d)$. Tokens are projected into $Q$, $K$, $V$ by matrices $W_Q, W_K, W_V$ that all have shape $(d, h)$, where $h$ is the dimension of the representation used in the attention. To obtain $Q, K, V$, we need three matmuls, all requiring the same compute. For now, we will once more ignore the backward pass and use the rule that a matmul needs two times the product of its shapes. For the three projections, we get $3 \cdot 2Tdh = 6Tdh$ FLOPs.

Next, let's turn our attention (I wonder whether everybody who writes about transformers ends up making this pun) to the operations done between $Q, K, V$. We are dealing with two matrix multiplications with a $\mathrm{softmax}$ in between. For simplicity, we are going to ignore the $\mathrm{softmax}$ FLOPs here; thus, we are dealing with only two matmuls. Notice that all three matrices have the shape $(T, h)$. The multiplication $QK^T$ thus requires $2 T h T$ FLOPs and produces a result with the shape $(T, T)$. The subsequent multiplication by $V$ (i.e., $AV$ with $A=\mathrm{softmax}(QK^T)$) requires $2 T T h$ FLOPs. In total, we need $4 h T^2$ FLOPs for the attention magic. Notice the square of $T$, in which all the danger lies.

To close our investigation, we must not forget the final projection layer. It is, however, just a straightforward multiplication of the attention result, of shape $(T, h)$, by a matrix of shape $(h, d)$, which requires $2dhT$ FLOPs.

Now, splitting the computation into two groups as we’ve done previously, we have

  • $4 h T^2$ FLOPs (attention part)
  • and $8 dhT$ FLOPs (projections).
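
Here is a sketch of this forward-pass bookkeeping for the simplified single-head layer (function and variable names are mine):

```python
def attention_forward_flops(T: int, d: int, h: int) -> tuple[int, int]:
    """Forward FLOPs of the simplified single-head layer, split into
    (attention part, projection part)."""
    qkv_proj = 3 * 2 * T * d * h  # X @ W_Q, X @ W_K, X @ W_V
    scores = 2 * T * h * T        # Q @ K^T              -> shape (T, T)
    weighted = 2 * T * T * h      # softmax(Q K^T) @ V   -> shape (T, h)
    out_proj = 2 * T * h * d      # final projection by the (h, d) matrix
    return scores + weighted, qkv_proj + out_proj  # (4hT^2, 8dhT)


T, d, h = 1024, 768, 768  # illustrative sizes
attention_part, projection_part = attention_forward_flops(T, d, h)
assert attention_part == 4 * h * T**2
assert projection_part == 8 * d * h * T
```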

Next, let us add the backward pass. For the second group, this is easy and was briefly discussed in the section on fully connected networks: we will just multiply the current number by 3. I will not go into the derivation of $\frac{\partial L}{\partial Q}, \frac{\partial L}{\partial K}, \frac{\partial L}{\partial V}$, but since our attention is so simplified, it boils down to a few matrix multiplications, and we can again multiply by $3$.

In total, we have $12 h T^2$ FLOPs for the QKV part and $24 dhT$ FLOPs for the rest. Now notice that each of the four projection matrices in the layer ($W_Q$, $W_K$, $W_V$, and the output projection) has $dh$ parameters, so we can rewrite $24 dhT$ as $6 NT$, which matches the form of the $6ND$ formula. Therefore, $6ND$ gives us a lower bound on the total compute, but it misses part of it. As $\frac{24 dhT}{12 h T^2} = \frac{2d}{T}$, the problem boils down to the ratio between the embedding dimension $d$ and the number of tokens $T$.
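
A small sketch makes the gap concrete: it compares the full count, $12hT^2 + 24dhT$, with the plain $6ND$ estimate ($N = 4dh$, $D = T$ tokens) for a few illustrative sequence lengths.

```python
d = h = 1024   # embedding and attention dims (illustrative)
N = 4 * d * h  # parameters of the four projection matrices

for T in (128, 1024, 8192):
    full = 12 * h * T**2 + 24 * d * h * T
    estimate = 6 * N * T  # 6ND with D = T tokens
    print(f"T = {T:5d}: full / 6ND = {full / estimate:.2f}")  # grows as 1 + T / (2d)
```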

So, can we use $6ND$ to calculate transformer FLOPs? We surely can; however, it’s important to keep in mind that as the number of tokens increases, the formula will become less and less accurate.

One more remark: To use $6ND$ for transformers, we must be careful about what $D$ means. Previously, I stated that it is the number of examples the model sees, but that is not the case here. For transformers, the number of tokens in each example matters, so $D$ should be the total number of tokens passed through the model.