Abstract
State-of-the-art architectures for sequence generation and understanding, based on attention or recurrent units, are often hard to optimize and tune to peak performance. In this talk, we focus on understanding these challenges in the recently proposed S4 model: a successful deep transformer-like architecture, introduced in 2022, that uses linear continuous-time dynamical systems as token-mixing components. S4 achieves state-of-the-art performance on the Long Range Arena, a Google benchmark for sequence classification, surpassing attention models by a large margin. However, despite being motivated by the theory of optimal polynomial projections, S4's superior performance compared to simpler deep recurrent models, such as deep LSTMs, has perplexed researchers. Drawing on insights from optimization theory and Koopman operators, we identify the primary sources of S4's success and match its performance with a much simpler architecture, which we call the Linear Recurrent Unit (LRU). The insights derived in our work provide a set of best practices for the initialization and parametrization of fast and accurate modular architectures for the understanding and generation of long-sequence data (e.g., for NLP, genomics, music generation, etc.).
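To give a concrete feel for the kind of token-mixing layer discussed in the talk, here is a minimal sketch of a diagonal linear recurrence in plain NumPy. It is not the authors' implementation: the function name, the sizes, and the stable exponential parametrization of the complex eigenvalues are illustrative assumptions, included only to show that the recurrence itself contains no nonlinearity.

```python
import numpy as np

def lru_scan(x, nu_log, theta_log, B, C):
    """Sketch of a diagonal linear recurrent layer (LRU-style).

    x:        (T, d_in) real input sequence.
    nu_log, theta_log: (N,) parameters defining complex eigenvalues
              lambda = exp(-exp(nu_log) + 1j * exp(theta_log)),
              so that |lambda| < 1 and the recurrence is stable.
    B:        (N, d_in) complex input matrix.
    C:        (d_out, N) complex readout matrix.
    Returns a (T, d_out) real output sequence.
    """
    lam = np.exp(-np.exp(nu_log) + 1j * np.exp(theta_log))  # diagonal state matrix
    h = np.zeros(lam.shape[0], dtype=np.complex128)          # hidden state
    ys = []
    for t in range(x.shape[0]):
        h = lam * h + B @ x[t]        # linear recurrence: no nonlinearity inside
        ys.append((C @ h).real)       # read out the real part
    return np.stack(ys)

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
T, d_in, d_out, N = 16, 4, 4, 8
x = rng.normal(size=(T, d_in))
nu_log = np.log(rng.uniform(0.01, 0.1, size=N))          # keeps |lambda| close to 1
theta_log = np.log(rng.uniform(0.0, np.pi, size=N) + 1e-3)
B = (rng.normal(size=(N, d_in)) + 1j * rng.normal(size=(N, d_in))) / np.sqrt(2 * d_in)
C = (rng.normal(size=(d_out, N)) + 1j * rng.normal(size=(d_out, N))) / np.sqrt(N)
print(lru_scan(x, nu_log, theta_log, B, C).shape)  # (16, 4)
```

Because the recurrence is linear and diagonal, the loop above can be replaced by a parallel scan at training time; the talk discusses why the initialization and parametrization of these eigenvalues, rather than the polynomial-projection machinery of S4, appear to drive performance.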
Our speaker
My name is Antonio, and I come from the beautiful city of Venice, Italy. I am a final-year PhD student at ETH Zurich, supervised by Thomas Hofmann. My research focuses on the intersection between mathematical optimization and large-scale machine learning. Specifically, my work centers on the theory of accelerated optimization, the dynamics and generalization properties of stochastic gradient descent in high-dimensional non-convex models, and the interaction between adaptive optimizers and the structure and initialization of deep neural networks. I am highly excited by the impact of deep learning on the future of science and technology. My goal is to assist scientists and engineers by contributing to the development of a deep-learning toolbox that is theoretically sound, with clearly outlined architectural choices and best practices for training. To this end, my research experiences at DeepMind, Meta, MILA, and INRIA Paris have allowed me to gain valuable insights into the practical challenges associated with training modern deep networks. In the future, I aim to further combine my passion for applications with my theoretical background by developing innovative, architecture-aware optimizers that can enhance generalization and accelerate training in various classes of deep learning models, with a particular focus on sequence modeling and understanding.
To become a member of the Rough Path Interest Group, register here for free.