> It is also worth noting that, generally speaking, an encoder-decoder with 2N parameters has the same compute cost as a decoder-only model of N parameters, which gives it a different FLOP-to-parameter-count ratio.
Can someone explain this to me? I'm not sure how the compute costs are the same between the 2N and N nets.
You can break the sequence into two parts: the source tokens go through the encoder and the target tokens go through the decoder. Each token therefore only passes through one of the two N-parameter stacks, so the per-token compute matches a decoder-only model of N parameters even though the total parameter count is 2N.
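To make that concrete, here is a minimal sketch using the common ~2 FLOPs-per-parameter-per-token approximation for a transformer forward pass. The model sizes, sequence lengths, and the 50/50 source/target split are illustrative assumptions, not numbers from the thread.

```python
# Rough FLOP accounting with the standard ~2 * params * tokens
# approximation for a transformer forward pass.
# All sizes below are illustrative assumptions.

def forward_flops(params: int, tokens: int) -> int:
    """Approximate forward-pass FLOPs: ~2 FLOPs per parameter per token."""
    return 2 * params * tokens

N = 1_000_000_000          # decoder-only model size (params)
enc = dec = N              # encoder-decoder: N params per stack, 2N total

seq = 1024                 # total tokens in an example
src, tgt = 512, 512        # enc-dec splits the sequence: source vs target

# Decoder-only: all 1024 tokens pass through the full N-parameter stack.
decoder_only = forward_flops(N, seq)

# Encoder-decoder: source tokens pass through the encoder only and
# target tokens through the decoder only -- each token sees N params.
enc_dec = forward_flops(enc, src) + forward_flops(dec, tgt)

assert decoder_only == enc_dec  # same compute despite 2N vs N parameters
```

The equality holds for any split of the sequence, since each token is charged against exactly N parameters either way; what changes is only the FLOP-to-parameter ratio, because the encoder-decoder spends the same FLOPs over twice the parameters.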
I think it's because most of the compute comes from decoding, since that's done autoregressively, while the encoder only needs a single forward pass to produce the embeddings. So really all it's saying is that the N-parameter decoder is the compute bottleneck; hence an encoder-decoder with N+N parameters has the same order of compute cost as a decoder-only model with N.