The AI Scaling Hypothesis
How far will this go?
The past decade of progress in AI can largely be summed up by one word: scale. The era of deep learning that started around 2010 has witnessed a continued increase in the size of state-of-the-art models. This trend has only accelerated over the past several years, leading many to believe in the “AI Scaling Hypothesis”: the idea that more computational resources and training data may be the best path to achieving the AI field’s long-term goals. This article will provide an overview of what the scaling hypothesis is, what we know about scaling laws, and the latest results achieved by scaling.
Last Week in AI is a reader-supported publication. Please consider becoming a free or paid subscriber :)
The Path to the Scaling Hypothesis
In March of 2019, the pioneering AI researcher Rich Sutton published The Bitter Lesson, with the lesson being this:
“The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.”
This came at the end of a decade in which much of the field of AI made enormous strides by relying on Deep Learning, or in other terms by “scaling computation by learning”. It also came in the midst of a more recent trend in the AI subfield of Natural Language Processing (NLP), in which ever larger (more computationally intensive) models were ‘pre-trained’ on vast swaths of data for the task of language modeling and then ‘fine-tuned’ for downstream tasks such as translation or question answering. This was deemed NLP’s ImageNet Moment, meaning that NLP was adopting a paradigm that Computer Vision had already relied on for much of the 2010s: fine-tuning large models pre-trained on the large ImageNet dataset.
Perhaps inspired by this trend, in January of 2020 AI researchers at OpenAI released the paper Scaling Laws for Neural Language Models. It presented an analysis of how the performance of AI systems optimized for language modeling changes depending on the scale of the parameters, data, or computation used. The researchers’ conclusion was this:
“These results show that language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute. We expect that larger language models will perform better and be more sample efficient than current models.”
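The “smooth and predictable” improvement the authors describe takes the form of a power law in model size. A minimal sketch of that relationship, using constants of the form the paper fits empirically for language modeling (treat them as illustrative, not as exact reproductions of the paper’s numbers):

```python
def loss_from_params(n_params, n_c=8.8e13, alpha=0.076):
    """Power-law loss curve of the form L(N) = (N_c / N)^alpha.

    n_c and alpha are illustrative constants in the style of the
    paper's empirical fits; the key point is the functional form,
    under which loss falls smoothly as parameter count N grows.
    """
    return (n_c / n_params) ** alpha

# Loss at 100M, 1B, 10B, and 100B parameters:
sizes = [1e8, 1e9, 1e10, 1e11]
losses = [loss_from_params(n) for n in sizes]

# Each 10x increase in size buys a predictable drop in loss.
assert all(a > b for a, b in zip(losses, losses[1:]))
```

The practical appeal of such a law is that small training runs let you extrapolate how a much larger (and much more expensive) run should perform before you commit the compute.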
Several months later, OpenAI showed that prediction to be true with the large language model GPT-3. Introduced in the paper Language Models are Few-Shot Learners, GPT-3 is essentially the same as GPT-2 in all but one way: it is more than a hundred times bigger (175 billion parameters versus GPT-2’s 1.5 billion). This made GPT-3 by far the largest AI model trained up to that point. Its size manifested not only in substantial quantitative performance gains, but also in an important qualitative shift in capabilities: it turned out that scaling to such an extent made GPT-3 capable of performing many NLP tasks (translation, question answering, and more) it was never explicitly trained to do; the model just needed to be presented with several examples of the task as input. This emergent “few-shot learning” behavior was an entirely new discovery and had major implications: what other capabilities might models attain if they were just scaled up more?
“Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.”
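Concretely, the few-shot setting amounts to packing a handful of worked demonstrations into the model’s input text and letting it complete the pattern, with no gradient updates. A minimal sketch of how such a prompt might be assembled (the template below is illustrative, not GPT-3’s exact format; the translation pairs echo examples from the paper):

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot prompt: a task description, several
    worked input/output demonstrations, and the new query. The
    model's continuation of the final 'Output:' line is its answer,
    produced without any fine-tuning.
    """
    lines = [task_description, ""]
    for inp, out in examples:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
print(prompt)
```

The striking part of the GPT-3 result is that this works at all: the task is specified entirely in the input text, and larger models extract it from fewer demonstrations.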
Although scaling had always been a part of the paradigm of Deep Learning, GPT-3 marked a shift in how scaling was perceived: it could not only enable better ‘narrow’ AI systems that master a single task, but also lead to ‘general’ AI systems, like GPT-3 itself, capable of many tasks. As Gwern Branwen wrote in his The Scaling Hypothesis:
“GPT-3, announced by OpenAI in May 2020, is the largest neural network ever trained, by over an order of magnitude. Trained on Internet text data, it is the successor to GPT-2, which had surprised everyone by its natural language understanding & generation ability. To the surprise of most (including myself), this vast increase in size did not run into diminishing or negative returns, as many expected, but the benefits of scale continued to happen as forecasted by OpenAI. These benefits were not merely learning more facts & text than GPT-2, but qualitatively distinct & even more surprising in showing meta-learning: while GPT-2 learned how to do common natural language tasks like text summarization, GPT-3 instead learned how to follow directions and learn new tasks from a few examples.”
Based on this, Branwen coined the notion of the “scaling hypothesis”:
“The strong scaling hypothesis is that, once we find a scalable architecture like self-attention or convolutions, which like the brain can be applied fairly uniformly … we can simply train ever larger NNs and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks & data. More powerful NNs are ‘just’ scaled-up weak NNs, in much the same way that human brains look much like scaled-up primate brains.”
In other words, we can keep improving AI just by making things bigger (the amount of training data and/or the size of our neural nets), possibly all the way to achieving human-level or even superhuman AI. Since then, the popularity of the scaling hypothesis, and of ideas related to it, has only grown.
From Language to Everything Else
GPT-3 was quickly followed by a flurry of activity in training similar massive language models.
In fact, the paradigm of training gigantic models with massive amounts of data scraped from the internet was extended to other modalities besides language:
Facebook’s SEER, a model with 1.3 billion parameters trained on 1 billion images sampled from Instagram, enabled groundbreaking results for computer vision tasks.
Researchers at Amazon Alexa developed scaling laws for acoustic data.
OpenAI trained CLIP to match images with text descriptions, which can be used to select the correct caption among a number of options for an input image.
OpenAI also built the text-to-image models DALL-E and its successor DALL-E 2, which are instead able to generate images based on text: you can ask them to produce an image of an “avocado armchair” or any number of other scenes. Parti, a similar text-to-image model recently released by Google, was even used to visually demonstrate the benefits of scaling, with image quality improving as the model grows.
By August of 2021 this trend was so pronounced researchers at Stanford deemed it necessary to coin a new term: Foundation Models. In the paper “On the Opportunities and Risks of Foundation Models,” released by the newly-dubbed Center for Research on Foundation Models at Stanford, the researchers discussed the emergence, capabilities and applications of these models in addition to technical and societal aspects. They did so under the belief that we are now witnessing “the beginning of a paradigm shift: foundation models have only just begun to transform the way AI systems are built and deployed in the world.”
Since the report, more recent developments have taken the idea of expanding into different modalities even further. So far, all the examples we’ve given have been of passive models (programs that process some data when invoked), not of agents (programs that use a model to continuously process observations of the world and choose actions in order to fulfill a goal). On May 12, DeepMind released the paper “A Generalist Agent.” The paper describes Gato, “a multi-modal, multi-task, multi-embodiment generalist policy”:
The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm, and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.
In other words, this paper presents a single huge model that was trained to perform hundreds of tasks on a variety of modalities, making it far more general than prior models or agents. Notably, this was done with a ‘Decision Transformer’, a method that is effectively based on language modeling and that therefore is strongly related to GPT-3.
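The connection to language modeling is that a Decision Transformer casts reinforcement learning as sequence prediction: trajectories are flattened into ordered (return-to-go, state, action) triples, and a GPT-style model is trained to predict the next action token. A minimal sketch of that trajectory-to-sequence conversion (the function names and tuple encoding here are illustrative, not Gato’s actual tokenization):

```python
def returns_to_go(rewards):
    """Return-to-go at step t is the sum of rewards from t onward;
    conditioning on it lets the model aim for a target outcome."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return list(reversed(rtg))

def to_sequence(states, actions, rewards):
    """Flatten a trajectory into the (return-to-go, state, action)
    token ordering used by Decision Transformer-style models; a
    language-model-style transformer then predicts the action slots.
    """
    seq = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        seq.extend([("rtg", g), ("state", s), ("action", a)])
    return seq

# A toy two-step episode with rewards 1.0 then 2.0:
traj = to_sequence(states=["s0", "s1"], actions=[0, 1], rewards=[1.0, 2.0])
```

Once everything is a token sequence, text, images, button presses, and joint torques can all share one model, which is exactly what makes Gato’s generality possible.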
While Gato itself does not exploit scaling in the way GPT-3 did – the neural network itself is only 1.2 billion parameters – it does assume the scaling hypothesis: the paper’s introduction notes that the 1.2B size was chosen as “the operating point of model scale that allows real-time control of real world robots,” but that with future improvements in hardware and model architecture this operating point would “increase the feasible model size, pushing generalist models higher up the scaling law curve.” So, while the 1.2B parameter Gato model can do many tasks somewhat well (it does not significantly outperform prior results on many tasks), bigger versions of Gato might.
And indeed, the paper includes its own experiment displaying a novel ‘scaling law’ of performance across model sizes.
“Multi-Game Decision Transformers,” also released this May and later covered in the blog post “Training Generalist Agents with Multi-Game Decision Transformers”, does something very similar in developing a generalist agent by “scaling up transformer models and training them on large, diverse datasets.” The researchers likewise found that scaling the Decision Transformer model size substantially and predictably improved the performance of their agent.
And that’s where we are today.
The live question at this point is: what’s next for scaling? The short answer is probably more of it, and in more ways. Naturally, this has revived an important question for the field: will this path lead to AGI (AI as capable as, or more capable than, humans)? Some, such as the EleutherAI team, take this possibility seriously. Others remain skeptical and think scaling is a dead end.
Still, blindly scaling up model size will not be the route taken: Chinchilla has shown us that a smaller model trained on more data can significantly outperform larger models. Furthermore, if these models are to actually be deployed, inference latency concerns will incentivize smaller models. Lastly, research on different facets of scaling laws continues. Besides the papers we already cited, numerous others have come out in recent years:
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments
Scaling Laws and Interpretability of Learning from Repeated Data
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
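The Chinchilla finding mentioned above can be sketched numerically. Under the common approximation that training compute is C ≈ 6·N·D FLOPs (for N parameters and D training tokens), the compute-optimal recipe scales N and D in roughly equal proportion, which works out to a rule of thumb of about 20 tokens per parameter. A toy calculator under those stated approximations (both constants are rough, not the paper’s exact fits):

```python
import math

TOKENS_PER_PARAM = 20  # rough Chinchilla-style ratio, an approximation

def compute_optimal(compute_flops):
    """Split a FLOP budget between parameters N and tokens D,
    using C ~ 6*N*D and D ~ 20*N (both rough approximations).
    Solving gives N = sqrt(C / 120), D = 20 * N.
    """
    n_params = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# A budget near Chinchilla's lands close to its actual recipe of
# ~70B parameters trained on ~1.4T tokens:
n, d = compute_optimal(5.76e23)
```

The takeaway is that for a fixed compute budget, a GPT-3-sized model trained on GPT-3’s token count is far from this optimum, which is why a much smaller Chinchilla could beat much larger models.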
Another trend that is picking up steam is the open-sourcing of large foundation models. Soon after GPT-3 was released, EleutherAI began working to release their own version, which has been available on HuggingFace Transformers for a while. Even more recently, Meta AI has released a suite of OPT (Open Pretrained Transformer) models, including a 175 billion parameter model that users can request access to. As researchers and engineers find more cost- and compute-efficient ways to train and deploy these models, Meta may soon be joined by others. Indeed, the BigScience Research Workshop recently released BLOOM, the world’s largest open-source multilingual language model.
Scaling is an increasingly important facet of today’s deep learning models, and we are likely to see it in more ways as time goes on. It seems that GPT-3 and its ilk are only the beginning. As models continue to scale in different ways, surprising and exciting results will likely abound.
But to conclude this article, let’s remember that there are many important issues in machine learning beyond scaling. Scaling might produce the most exciting results and capabilities today, but that does not mean that the machine learning community should have a narrow-minded focus on it. I remain excited to see what scaling holds for us in the future, but hope that researchers and engineers continue to pursue and invent other paths of inquiry that push the community forward.
About the Author
Daniel Bashir (@spaniel_bashir) is a machine learning engineer at an AI startup in Palo Alto, CA. He graduated with his Bachelor’s in Computer Science and Mathematics from Harvey Mudd College in 2020. He is interested in computer vision and multimodality, ML infrastructure, and information theory.