Architecture Without Crutches: Initialization Dynamics in Normalization-Free Transformers

Removing normalization layers such as LayerNorm remains one of the main goals in optimizing the training of large language models. On April 16, 2026, the paper "Subcritical Signal Propagation at Initialization in Normalization-Free Transformers" appeared on arXiv.

The authors examine the mechanics of vanishing and exploding gradients at network initialization. They show that in architectures without normalization the signal propagates subcritically, which limits the depth to which activations propagate effectively. This fundamental limitation explains the convergence difficulties seen when training ultra-deep networks. Understanding these signal dynamics could pave the way for new weight initialization schemes that speed up the training of very large transformers and save cluster compute.
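To make the idea of depth-limited signal propagation concrete, here is a minimal toy sketch. It is not the paper's model or method. It assumes a plain tanh MLP with no normalization (not a transformer) and treats a weight scale below 1 as a stand-in for a subcritical regime. The function name `signal_norms` and the parameters `sigma_w`, `width`, and `depth` are hypothetical, chosen only for illustration.

```python
import numpy as np

def signal_norms(depth, width, sigma_w, seed=0):
    """Mean squared pre-activation per layer in a toy tanh MLP
    with no normalization layers (illustrative assumption only)."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)  # random input, mean square close to 1
    norms = []
    for _ in range(depth):
        # Gaussian weights with variance sigma_w**2 / width
        W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
        h = W @ np.tanh(h)
        norms.append(float(np.mean(h ** 2)))
    return norms

# Weight scale below 1: the signal shrinks layer after layer.
sub = signal_norms(depth=50, width=512, sigma_w=0.8)
# Weight scale above 1: the signal settles at a nonzero level.
sup = signal_norms(depth=50, width=512, sigma_w=1.5)

print("sigma_w=0.8: first layer", sub[0], "last layer", sub[-1])
print("sigma_w=1.5: first layer", sup[0], "last layer", sup[-1])
```

In this toy setting the small-weight network loses its signal well before layer 50, while the large-weight network keeps a signal of roughly constant size. A real normalization-free transformer has residual connections and attention, so the numbers here say nothing quantitative about the paper's results.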

Source: arXiv
Tags: Science, Transformers, Deep Learning, Optimization, arXiv