Recently, the theory of infinite-width neural networks led to the first technology, muTransfer, for tuning enormous neural networks that are too expensive to train more than once. For example, this allowed us to tune the 6.7 billion parameter version of GPT-3 using only 7% of its pretraining compute budget, and with some asterisks, we get a performance cmparable to the original GPT-3 model with twice the parameter count. In this talk, I will explain the core insight behind this theory. In fact, this is an instance of what I call the *Optimal Scaling Thesis*, which connects infinite-size limits for general notions of “size” to the optimal design of large models in practice, illustrating a way for theory to reliably guide the future of AI. I'll end with several concrete key mathematical research questions whose resolutions will have incredible impact on how practitioners scale up their NNs.
Greg Yang is a researcher at Microsoft Research in Redmond, Washington. He joined MSR after he obtained Bachelor's in Mathematics and Master's degrees in Computer Science from Harvard University, respectively advised by ST Yau and Alexander Rush. He won the Hoopes prize at Harvard for best undergraduate thesis as well as Honorable Mention for the AMS-MAA-SIAM Morgan Prize, the highest honour in the world for an undergraduate in mathematics. He gave an invited talk at the International Congress of Chinese Mathematicians 2019.
To become a member of the Rough Path Interest Group, register here for free.