

How In-Context Learning Emerges
In-context learning is the most exciting capability exhibited by Large Language Models. How does it work and where does it come from?
TL;DR In-Context Learning (ICL) is an emergent capability of Large Language Models (LLMs) that allows them to learn new tasks on the fly without further training. This ability was first observed in GPT-3 and has subsequently been observed in other LLMs as well. Although the origins of ICL were initially mysterious, recent research has shed light on its ingredients and shown how the model architecture, model scale, and training data distribution all play important roles in allowing ICL to emerge.
What is In-Context Learning (ICL)?
AI has come a long way in recent years (and months!), with systems like ChatGPT demonstrating impressive abilities in solving a wide variety of language-based tasks. Before LLMs, most AI models were limited by the data they were trained on - they could only perform tasks they had been explicitly optimized for through training. GPT-3 and subsequent LLMs have been able to do something more powerful: learn new tasks and skills simply from new examples in the input, without any gradient updates or changes to the pretrained model. This ability is known as in-context learning (ICL).
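To make this concrete, here is a minimal sketch of what an ICL prompt looks like. The toy task and the commented-out generate() call are purely illustrative placeholders, not any particular model or API:

```python
# A minimal sketch of an in-context learning prompt: the "training data"
# for the new task lives entirely in the input text, and the model's
# weights are never updated. The task (mapping animals to sounds) and
# the generate() call are illustrative placeholders only.

few_shot_prompt = """\
Input: cow   -> Output: moo
Input: cat   -> Output: meow
Input: dog   -> Output: woof
Input: duck  -> Output:"""

# response = some_pretrained_llm.generate(few_shot_prompt)
# A sufficiently large model completes the pattern (" quack") by
# inferring the rule from the examples, not from any gradient update.
print(few_shot_prompt)
```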

Why is ICL exciting?
In-context learning has enormous potential to unlock more flexible, general, and human-like intelligence in AI systems. Some reasons it is generating so much interest:
Versatility - With ICL, a single model can learn a wide variety of skills at the same time, instead of needing separate training for each one.
Generalization - ICL allows models to learn underlying rules and patterns from just a few examples, and generalize them to new situations.
Efficiency - No lengthy or costly re-training of models is needed. Skills can be acquired instantly.
Accessibility - ICL enables AI systems that can be taught by everyday users through simple demonstrations of the task.
In short, ICL enables LLMs to become powerful systems that can continually learn, reason, and adapt to new tasks. But how does ICL work and where does it come from?
How Does In-Context Learning Work?
Recent research has revealed 3 key factors that enable and enhance in-context learning abilities in large language models, and we’ll go through each one.
1) Model Architecture
The basic model architecture plays a critical role in enabling ICL. In particular, Transformers, a type of neural network architecture that uses a learned attention mechanism, can exhibit ICL, while models using prior architectures like Recurrent Neural Networks or Multilayer Perceptrons do not.

What’s so special about Transformers that leads to the emergence of ICL? Recent research discovered that Transformers can implement learning algorithms, like linear regression and gradient descent, in their forward pass when the right data and scale conditions are met (see below). In other words, Transformers can be trained to learn how to solve new tasks, instead of just learning how to solve a particular task at training time. This ability to “learn to learn” is what enables in-context learning in LLMs that are powered by Transformers.
How is this possible? Recent theoretical analyses have shown that the Transformer’s attention mechanism has a dual form of gradient descent. In simple terms, under certain conditions the mathematics of the attention computation is equivalent to the mathematics of performing gradient descent, the general optimization algorithm that trains neural nets. Calculating attention over the in-context examples is, in this view, the same as performing a step of gradient-based learning on them. So, in theory, it is not surprising that Transformers can learn to learn.
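As a rough illustration of this dual-form argument (in the spirit of the theoretical work described above, not a proof), here is a toy numerical check: unnormalized linear attention over in-context (key, value) pairs computes exactly the same quantity as applying a one-step gradient-descent weight update to the query. The dimensions and random data are arbitrary, and real softmax attention only approximately matches this picture:

```python
# Toy check of the "dual form" idea: linear attention over in-context
# (key, value) pairs equals applying a one-step weight update to the query.
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8                      # feature dim, number of in-context examples
K = rng.normal(size=(n, d))      # keys   ~ in-context inputs x_i
V = rng.normal(size=(n, d))      # values ~ per-example update signals e_i
q = rng.normal(size=d)           # query  ~ test input

# Linear attention: sum_i v_i * (k_i . q)
attn_out = V.T @ (K @ q)

# Gradient-descent view: a weight update dW = sum_i e_i x_i^T,
# applied to the query: dW @ q
dW = V.T @ K
gd_out = dW @ q

print(np.allclose(attn_out, gd_out))   # True: the two computations coincide
```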
But having the right model architecture is not enough - the scale of the Transformer and the type of data it’s trained on also matter.
2) Model Scale
Recent research has shown that bigger models (in terms of how many parameters the neural network has) learn faster, more robustly, and can handle more complex tasks when learning in context. Beyond just observing that bigger LLMs perform better at various ML benchmarks, researchers have observed 4 concrete benefits of increasing model size:

Bigger models give more rule-based generalization instead of exemplar-based generalization. What does this mean? Recall that ICL works by a user giving the model a bunch of example questions and answers at runtime in order for the model to learn the task. Smaller Transformers can still perform ICL, but they tend to answer new questions by finding the closest question(s) among the provided examples and predicting the answer from the answers of those matching questions. This works, but it is a fairly shallow form of learning. By contrast, large Transformers can infer the underlying rule that maps questions to answers. Given a new question, they apply this learned rule to predict the answer, which is much more capable and leads to better performance.
Bigger models can override prior semantic knowledge during ICL. LLMs pick up a lot of knowledge about the world from the data they were trained on. After training, LLMs can leverage their learned semantics of language (i.e. the widely accepted meanings of words) to perform tasks. However, for some tasks, it may be preferable for the LLM to learn updated or new semantics from the in-context examples. Experiments have shown that using in-context examples to override existing semantic knowledge is an emergent property that only exists in bigger models, not smaller ones. This means that bigger models are much more versatile than smaller models when it comes to learning new tasks.
Bigger models allow more complex ICL algorithms. Researchers have “reverse-engineered” the specific machine learning algorithms that Transformers learn to perform in-context. What they found is that bigger models implement more complex learning algorithms in-context, and can thereby solve more complex tasks. In a linear regression setting, Transformers with 1-2 layers can learn to perform one step of gradient descent. With 4-8 layers they can perform a regularized variant of linear regression called Ridge Regression. With 12+ layers they can learn to perform Ordinary Least Squares (see the sketch after this list). I won’t go into these algorithms in detail, but suffice it to say that we have ways of understanding how exactly Transformers do ICL, and concrete evidence that how they do it becomes more sophisticated as model size scales.
Bigger models are more robust to noise. Bigger Transformers, when trained on noisy data, automatically learn learning algorithms that are robust to that noise. In the same work that studied which algorithms Transformers implement for linear regression, the authors found that when bigger models are trained on noisy data, the Transformers learn to behave like Bayesian estimators, which take the uncertainty of the underlying data distribution into account when making predictions. This robustness to noise is much stronger in larger Transformers.
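To give a flavor of the setup behind these reverse-engineering results, here is a sketch of the in-context linear regression task and the three reference algorithms mentioned above. The learning rate, ridge penalty, and problem sizes are arbitrary illustrative choices, and the code only computes the reference solutions - it does not train a Transformer:

```python
# Sketch of the in-context linear regression setting: each prompt is a
# fresh random linear task given as (x_i, y_i) pairs, and the model must
# predict y for a query x. The three reference solutions below are the
# algorithms that small, medium, and large Transformers have been found
# to approximate.
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 20
w_true = rng.normal(size=d)                  # task weights, resampled per prompt
X = rng.normal(size=(n, d))                  # in-context inputs
y = X @ w_true + 0.1 * rng.normal(size=n)    # noisy in-context targets
x_query = rng.normal(size=d)

# 1-2 layers ~ one step of gradient descent from w = 0 (learning rate eta)
eta = 0.01
w_gd = eta * X.T @ (y - X @ np.zeros(d))

# 4-8 layers ~ ridge regression (penalty lambda chosen arbitrarily here)
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# 12+ layers ~ ordinary least squares
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, w in [("1-step GD", w_gd), ("ridge", w_ridge), ("OLS", w_ols)]:
    print(name, float(w @ x_query))
```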
In short, large-scale Transformers can be trained to do ICL. However, this training requires data, and it turns out it can’t be just any data: the distributional properties of the training data are also critical in unlocking these ICL capabilities.
3) Data Distribution
Finally, the model's training data itself needs certain properties for in-context learning to emerge. This is sort of hinted at in the previous point, where training on noisy data enabled ICL that is robust to noise. In this paper, the authors discover 3 specific properties of the underlying data distribution necessary for Transformers to learn ICL:

Data distribution needs to be long-tailed - Specifically, this means the dataset should contain a lot of rare tokens. LLMs work by ingesting and outputting tokens. You can think of each token as a word, although this is not quite accurate. The input to the Transformers in LLMs is a sequence of such tokens, and the output is the predicted most likely next token (e.g. given the start of a sentence, predict the next word). Using this analogy, having a long-tailed distribution of tokens means that the dataset used to train LLMs should contain many words that appear infrequently (the technical term is that the data distribution should be Zipfian). This is easily satisfied by natural language datasets, where there are many rare words that are used very infrequently.
Rare tokens need to appear in clusters - Having a lot of rare tokens is not enough. When these tokens appear in the training data, they need to do so in clusters. The paper calls this property “burstiness.” The opposite of burstiness is having rare tokens uniformly sprinkled throughout the dataset, which is not as helpful in encouraging the model to learn ICL. Again, we got lucky here with natural language datasets, where rare words tend to be used together (the sketch after this list illustrates both the long tail and burstiness).
Tokens need to be dynamic - The authors use “dynamic” to mean that tokens need to have different meanings given different contexts. This is also directly satisfied with natural language data, where words can have different meanings depending on the words that come before. Having dynamic tokens seems very important in forcing the model to not merely memorize the meaning of each token and instead learn to infer from context - hence in-context learning.
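For intuition, here is a rough sketch of what a long-tailed, bursty token stream can look like when generated synthetically. The vocabulary size, Zipf exponent (1 here), burst length, and the "top-100" cutoff for rare tokens are arbitrary choices, not taken from any specific paper:

```python
# Rough sketch of long-tailed ("Zipfian") and bursty token data.
import numpy as np

rng = np.random.default_rng(2)
vocab_size, seq_len, burst_len = 1000, 512, 4

# Long-tailed token frequencies: a few common tokens, many rare ones
ranks = np.arange(1, vocab_size + 1)
probs = 1.0 / ranks          # Zipf with exponent 1
probs /= probs.sum()

# "Bursty" sampling: instead of drawing each token independently, draw a
# token and repeat it for a short burst, so rare tokens cluster together
tokens = []
while len(tokens) < seq_len:
    t = rng.choice(vocab_size, p=probs)
    tokens.extend([t] * burst_len)
tokens = np.array(tokens[:seq_len])

# Token ids are ordered by rank, so ids >= 100 fall outside the top-100
rare = tokens[tokens >= 100]
print(f"share of rare tokens: {len(rare) / seq_len:.2f}")
```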
You can think of these data requirements as requirements on the fundamental complexity of the task being used to train Transformers. If the data distribution is not complex enough, Transformers don’t have to learn ICL to do a good job of predicting that data. It’s only when the underlying data is rich and complex enough that we “force” Transformers to learn ICL, where richness and complexity are expressed through the above properties.
What’s notable is that ICL was never the goal of the body of language modeling research that led to it. We simply got lucky with natural language data, which just happened to exhibit the distributional properties that enable large Transformers to learn ICL. However, now that we understand these fundamental principles, we can look for and apply other types of data to train Transformers, maybe even entirely synthetic data, that could also lead to emergent ICL behaviors.
Conclusion
The ability to learn in context has profound implications for the future of AI. It provides the prospect of designing models that can adapt to new tasks with minimal explicit retraining, greatly expanding the potential applications and effectiveness of AI systems. While the concept of in-context learning is still relatively new, ongoing research continues to deepen our understanding of this fascinating ability, revealing the necessary ingredients in model architecture, model scale, and data distributions for ICL to emerge. The most exciting opportunity now is that given what we’ve recently discovered about ICL, the field can leverage this knowledge to build more capable ICL systems in the future.