Foundation Models and the Future of Multi-Modal AI
Recent advances in combining large language and vision models have produced systems that perform impressive tasks surprisingly well, and this direction holds a lot of promise for the future of AI
TL;DR: Foundation models, large neural networks trained on very large datasets, can be combined with one another to unlock surprising capabilities. This has been a growing trend in AI research over the past couple of years, with researchers combining large language and vision models to create impressive applications like language-conditioned image generation. There is likely much more low-hanging fruit in such large-scale multi-modal AI, where vision grounds models in real-world concepts while language increasingly acts as an interface layer between humans and AI models, as well as among AI models themselves. With these advances, the future of highly flexible AI assistants that can robustly parse information from the visual world and interact with humans through language may arrive sooner than many realize.
The Paradigm Shift of Foundation Models
The term “foundation models” was coined by Stanford researchers last year in the paper On the Opportunities and Risks of Foundation Models. At the time, the release of the paper, with its long list of co-authors, stirred up some controversy: many in the field felt that Stanford was simply renaming a phenomenon that was already widely known. Still, the naming does bring value, as it helped mark a paradigm shift in AI that was only beginning to be recognized. Specifically, training very large deep learning models (hundreds of billions, even trillions, of parameters) on very large datasets (hundreds of billions of language tokens) produced AI models with “emergent capabilities” that are effective across many downstream tasks with little to no additional task-specific training, and this kind of large-scale training has encouraged the homogenization of AI techniques.
This is a paradigm shift because AI models were previously built to perform one task at a time. While task-transfer and multi-task models were actively researched, a single model was not expected to handle tasks that were very different from one another, even with transfer learning. This also meant that research on different tasks and modalities, like vision and language, was often done with different technical approaches: vision researchers studied topics like signal processing, while language researchers studied grammar. Foundation models leverage incredible scale to train one model that can perform many tasks. For example, Google’s latest large language model (LLM), the Pathways Language Model (PaLM), can perform tasks as diverse as code completion, translation, and joke explanation, all within the same model.
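To make the “one model, many tasks” pattern concrete, here is a minimal Python sketch using the Hugging Face transformers library, in which a single text-generation interface handles several tasks purely by changing the prompt. GPT-2 is only a small stand-in here (PaLM is not publicly available), so the outputs will not be impressive, but the interaction pattern is the same one these large models expose.

```python
# One model, many tasks: the same text-generation call is reused for
# different tasks simply by changing the prompt. GPT-2 is a small
# stand-in; the emergent quality described in the text only appears
# at much larger scales (e.g., PaLM).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = {
    "code completion": "def fibonacci(n):\n    ",
    "translation": "Translate English to French:\nsea otter => loutre de mer\ncheese =>",
    "joke explanation": (
        "Explain the joke: I told my wife she was drawing her eyebrows "
        "too high. She looked surprised.\nExplanation:"
    ),
}

for task, prompt in prompts.items():
    output = generator(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"]
    print(f"--- {task} ---")
    print(output[len(prompt):].strip())
```

The key point is that no task-specific head or retraining is involved: the task is selected entirely through natural-language (or code) context in the prompt.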
As impressive as these LLMs are, they are just the beginning. With recent works that incorporate multiple modalities into large models, or directly combine multiple models, we’re beginning to see a form of AI that is much more capable and much easier to “use.” The rest of this editorial explains, in high-level terms, some of these advances from recent months that combine language and vision models. The main takeaway is that vision grounds AI models in the real world, while natural language acts as an interface both between humans and AI models and among AI models themselves. These works also show that deep-learning-based AI shows no signs of slowing down, and we may actually be quite far from the next plateau in AI development.
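As a small preview of what “language as an interface among models” can look like in practice, here is a hedged Python sketch: a captioning model turns an image into text, and that text becomes the prompt for a language model. The specific pipelines, model names, and image path are illustrative assumptions rather than the method of any particular paper discussed below; the point is simply that the two models only ever exchange plain text.

```python
# Language as the glue between models: a vision model describes an image
# in words, and a language model reasons over that description. The two
# models share no weights or embeddings -- only text. Model names and the
# image path are illustrative placeholders.
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
reasoner = pipeline("text-generation", model="gpt2")

# Vision model -> text description of the scene
caption = captioner("photo.jpg")[0]["generated_text"]

# Language model -> answer a question, conditioned only on that text
prompt = f"Scene: {caption}\nQuestion: What is happening in this scene?\nAnswer:"
result = reasoner(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"]
print(result[len(prompt):].strip())
```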