The Future of AI is Serial
The dominance of the transformer architecture lies not only in its performance, but also in its efficiency. Unlike its most successful predecessor, the LSTM, the transformer is simple to train. There is no iterating across the sequence. There is no backpropagation through time. The transformer ingests the entire sequence at once, feeding hungry GPU cores with no interruptions.
However, this advantage may prove to be the transformer's fatal limitation.
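To make the contrast concrete, here is a minimal sketch in plain Python. It is not a real LSTM or transformer: `recurrent_states` is a toy scalar recurrence standing in for the LSTM, `causal_attention_output` is a toy single-head, weight-free causal attention standing in for the transformer, and every function name and parameter is made up for illustration. The point is only the dependency structure: the recurrence must walk the sequence in order, while each attention output depends on the inputs alone and can be computed in any order (or all at once).

```python
import math


def recurrent_states(xs, w=0.5, u=1.0):
    """Toy scalar recurrence h_t = tanh(w * h_{t-1} + u * x_t) (illustrative only)."""
    h = 0.0
    states = []
    for x in xs:
        # Step t needs the result of step t-1, so this loop cannot be reordered.
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states


def causal_attention_output(xs, t):
    """Toy scalar causal self-attention at position t (no learned weights).

    The output reads only the inputs xs[0..t]; it never reads another
    position's output, so positions are independent of one another.
    """
    scores = [xs[t] * xs[s] for s in range(t + 1)]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(score - m) for score in scores]
    z = sum(weights)
    return sum(wt * xs[s] for s, wt in enumerate(weights)) / z


def attention_outputs(xs, order=None):
    """Compute every position's output, visiting positions in any order."""
    order = range(len(xs)) if order is None else order
    out = [None] * len(xs)
    for t in order:
        out[t] = causal_attention_output(xs, t)
    return out


if __name__ == "__main__":
    xs = [0.5, -1.0, 2.0, 0.25]
    # Shuffling the visiting order changes nothing for attention...
    print(attention_outputs(xs) == attention_outputs(xs, order=[3, 1, 0, 2]))
    # ...whereas the recurrence has a chain of len(xs) dependent steps.
    print(recurrent_states(xs))
```

In the attention sketch the longest chain of dependent operations does not grow with the number of positions being computed, which is what lets a GPU process them side by side. In the recurrence that chain grows with the sequence length.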
The Serial Scaling Hypothesis
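The serial scaling hypothesis formalizes this concern: some problems may demand long chains of dependent computation steps, and adding more parallel hardware cannot shortcut them.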
Recent Developments
- Chain-of-thought (CoT) reasoning
- Iterative models for ARC
- Test-time training (TTT) sequence modelling architectures
The Future
- Scaling along the batch dimension
- Training for longer amounts of time with fewer machines
- A new computing architecture?