The AI Scaling Hypothesis
How far will this go?
The past decade of progress in AI can largely be summed up by one word: scale. The era of deep learning that began around 2010 has seen a continual increase in the size of state-of-the-art models, and this growth has only accelerated over the past several years, leading many to believe in the “AI Scaling Hypothesis”: the idea that more computation and training data may be the best path to achieving the AI field’s long-term goals. This article provides an overview of what the scaling hypothesis is, what we know about scaling laws, and the latest results achieved by scaling.
The Path to the Scaling Hypothesis
In March of 2019, the pioneering AI researcher Rich Sutton published The Bitter Lesson, which he summarized as follows:
“The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.”
This came at the end of a decade in which much of the field of AI made enormous strides by relying on Deep Learning, or, in Sutton’s terms, by “scaling computation by learning”. It also came in the midst of a more recent trend in the AI subfield of Natural Language Processing (NLP), in which ever-larger (and more computationally intensive) models were ‘pre-trained’ on vast swaths of data for the task of language modeling and then ‘fine-tuned’ for downstream tasks such as translation or question answering. This was deemed NLP’s ImageNet Moment: NLP was adopting a paradigm that Computer Vision had already relied on for much of the 2010s, namely fine-tuning large models pre-trained on the large ImageNet dataset.
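To make the pre-train / fine-tune paradigm concrete, here is a minimal sketch using the Hugging Face transformers library (a toolchain chosen purely for illustration; the model name, toy data, and labels below are our assumptions, not anything from the works discussed). A model whose weights were pre-trained on large unlabeled text is loaded, a small task-specific head is attached, and a few gradient steps adapt it to a downstream task.

```python
# Minimal sketch of "pre-train then fine-tune" (illustrative, not from any paper discussed).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The expensive pre-training (language modeling on huge corpora) is already done;
# we just download the weights and attach a fresh 2-way classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A tiny toy fine-tuning dataset for a sentiment-style downstream task.
texts = ["a delightful film", "a tedious, boring mess"]
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = tokenizer(texts, padding=True, return_tensors="pt")

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss against the task labels
outputs.loss.backward()
optimizer.step()                         # one fine-tuning step on the downstream task
```

The point of the paradigm is that only this cheap second stage is task-specific; the same pre-trained weights can be reused for many downstream tasks.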
Perhaps inspired by this trend, in January of 2020 AI researchers at OpenAI released the paper Scaling Laws for Neural Language Models. It presented an analysis of how the performance of language models changes as a function of model size (parameters), dataset size, and the compute used for training. The researchers’ conclusion was this:
“These results show that language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute. We expect that larger language models will perform better and be more sample efficient than current models.”
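“Smoothly and predictably” here refers to power-law fits: the paper models test loss as roughly L(X) ≈ (X_c / X)^α for X being parameters, tokens, or compute. The sketch below encodes approximately the constants reported in that paper; the exact values are rough fits, and the helper function names are ours, introduced only for illustration.

```python
# Approximate power-law scaling laws in the spirit of Kaplan et al. (2020),
# "Scaling Laws for Neural Language Models". Constants are rough fits from the
# paper; treat them as illustrative rather than exact.

def loss_from_params(n_params: float) -> float:
    """Predicted LM test loss (nats/token) vs. non-embedding parameter count."""
    N_c, alpha_N = 8.8e13, 0.076
    return (N_c / n_params) ** alpha_N

def loss_from_data(n_tokens: float) -> float:
    """Predicted LM test loss vs. dataset size in tokens."""
    D_c, alpha_D = 5.4e13, 0.095
    return (D_c / n_tokens) ** alpha_D

def loss_from_compute(pf_days: float) -> float:
    """Predicted LM test loss vs. optimally allocated training compute (PF-days)."""
    C_c, alpha_C = 3.1e8, 0.050
    return (C_c / pf_days) ** alpha_C

if __name__ == "__main__":
    # Each 10x increase in parameters shaves a predictable amount off the loss.
    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.2f}")
```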
Several months later, OpenAI showed that prediction to be true with the large language model GPT-3. Introduced in the paper Language Models are Few-Shot Learners, GPT-3 is essentially the same as GPT-2 in all but one way: it is more than a hundred times bigger, with 175 billion parameters to GPT-2’s 1.5 billion. This made GPT-3 by far the largest AI model trained up to that point. Its size manifested not only in substantial quantitative performance gains, but also in an important qualitative shift in capabilities: it turned out that scaling to such an extent made GPT-3 capable of performing many NLP tasks (translation, question answering, and more) without additional training, despite not having been trained to do those tasks – the model just needed to be presented with several examples of the task as input. This emergent “few-shot learning” behavior was an entirely new discovery and had major implications: what other capabilities might models attain if they were just scaled up more?
“Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.”
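Concretely, “specified purely via text interaction with the model” means a prompt like the hypothetical one sketched below: a task description, a few worked examples, and a new input for the model to complete. No gradient updates are involved; the model’s continuation of the text is its answer.

```python
# A minimal sketch of few-shot prompting. The example text is hypothetical and
# only illustrates the format: task description, a few demonstrations, then a
# new input left for the model to complete.

few_shot_prompt = """Translate English to French.

English: The cat sat on the mat.
French: Le chat était assis sur le tapis.

English: I would like a cup of coffee.
French: Je voudrais une tasse de café.

English: Where is the train station?
French:"""

# This string is fed to the model as-is; its completion serves as the translation.
print(few_shot_prompt)
```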
Although scaling had always been part of the Deep Learning paradigm, GPT-3 marked a shift in how scaling was perceived: it could not only enable better ‘narrow’ AI systems that master a single task, but also lead to ‘general’ AI systems, like GPT-3 itself, that are capable of many tasks. As Gwern Branwen wrote in his essay The Scaling Hypothesis:
“GPT-3, announced by OpenAI in May 2020, is the largest neural network ever trained, by over an order of magnitude. Trained on Internet text data, it is the successor to GPT-2, which had surprised everyone by its natural language understanding & generation ability. To the surprise of most (including myself), this vast increase in size did not run into diminishing or negative returns, as many expected, but the benefits of scale continued to happen as forecasted by OpenAI. These benefits were not merely learning more facts & text than GPT-2, but qualitatively distinct & even more surprising in showing meta-learning: while GPT-2 learned how to do common natural language tasks like text summarization, GPT-3 instead learned how to follow directions and learn new tasks from a few examples.”
Based on this, Branwen articulated the “scaling hypothesis”:
“The strong scaling hypothesis is that, once we find a scalable architecture like self-attention or convolutions, which like the brain can be applied fairly uniformly … we can simply train ever larger NNs and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks & data. More powerful NNs are ‘just’ scaled-up weak NNs, in much the same way that human brains look much like scaled-up primate brains.”