The authors examine the mechanics of vanishing and exploding gradients at network initialization. They prove that in architectures without normalization the signal behaves subcritically, which limits the depth to which activations propagate effectively. This fundamental limitation explains the convergence difficulties seen when training ultra-deep networks. Understanding these signal dynamics paves the way for new weight initialization schemes that could speed up the training of very large transformers and save cluster compute.
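To make the depth limit concrete, here is a minimal sketch of signal decay at initialization. It is not the authors' construction: it assumes a plain tanh network with no biases and no normalization, Gaussian weights of standard deviation `sigma_w / sqrt(width)`, and a hypothetical helper named `preactivation_variance`. With `sigma_w` below 1 the pre-activation variance shrinks roughly geometrically with depth, while a larger `sigma_w` keeps it at a nonzero level.

```python
import numpy as np


def preactivation_variance(depth, width, sigma_w, seed=0):
    """Track the mean squared pre-activation per layer of a random tanh MLP.

    Assumptions (illustrative only): no biases, no normalization layers,
    weights drawn i.i.d. from N(0, sigma_w**2 / width).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)  # random input with unit variance per coordinate
    variances = []
    for _ in range(depth):
        # Fresh random weight matrix for each layer, scaled by 1/sqrt(width).
        W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
        z = W @ x                    # pre-activations of this layer
        variances.append(float(np.mean(z ** 2)))
        x = np.tanh(z)               # activations fed to the next layer
    return variances


if __name__ == "__main__":
    for sigma_w in (0.5, 2.0):
        v = preactivation_variance(depth=30, width=256, sigma_w=sigma_w)
        print(f"sigma_w={sigma_w}: layer 1 variance {v[0]:.3e}, layer 30 variance {v[-1]:.3e}")
```

In this toy setting the small-`sigma_w` run loses its signal within a few dozen layers, which is the kind of depth limit the summary describes. The choice of tanh and of the two `sigma_w` values is an assumption made for illustration.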
Source: arXiv