

Foundation Models and the Future of Multi-Modal AI
AI systems that combine large language and vision models can perform impressive tasks surprisingly well, and this direction holds a lot of promise for the future of AI
TL;DR Foundation models, which are large neural networks trained on very big datasets, can be combined with each other to unlock surprising capabilities. This has been a growing trend in AI research over the past couple of years, with researchers combining the power of large language and vision models to create impressive applications like language-conditioned image generation. There is likely a lot more low-hanging fruit in such large-scale multi-modal AI, where vision helps ground AI in real-world concepts while language increasingly acts as an interface layer, both between humans and AI models and among AI models themselves. With these advances, the future of highly flexible AI assistants that can robustly parse information from the visual world and interact with humans through language may be here sooner than many realize.
The Paradigm Shift of Foundation Models
The term “foundation models” was coined by Stanford researchers last year in the paper On the Opportunities and Risks of Foundation Models. At the time, the release of this paper with its long list of co-authors stirred up some controversy, as many in the field felt that Stanford was just renaming a phenomenon that was already widely known. Still, the naming does bring value: it helped mark a paradigm shift in AI that was only beginning to be recognized. Specifically, the rise of large-scale deep learning (models with hundreds of billions, even trillions, of parameters) trained on very large datasets (hundreds of billions of language tokens) created AI models with “emergent capabilities” that are effective across many downstream tasks with little to no additional task-specific training, and this style of large-scale training has also encouraged the homogenization of AI techniques.
This is a paradigm shift because AI models in the past were typically built to perform one task at a time. While task-transfer and multi-task models were actively researched, one model was not expected to perform tasks that are very different from one another, even with transfer learning. This also meant that research on different tasks and modalities, like vision and language, was often done with different technical approaches: vision researchers studied topics like signal processing, while language researchers studied grammar. Foundation models built with deep learning leverage incredible scale to train one model that can perform many tasks. For example, Google’s latest large language model (LLM), the Pathways Language Model (PaLM), can perform tasks as diverse as code completion, translation, and joke explanation, all at the same time:
As impressive as these LLMs are, they are just the beginning. With recent works that incorporate multiple modalities into large models, or that directly combine multiple models, we’re beginning to see a form of AI that is much more capable and much easier to “use.” The rest of this editorial explains, in high-level terms, some recent advances that combine language and vision models. The main takeaway is that vision grounds AI models in the real world, while natural language acts as an interface both between humans and AI models and among AI models themselves. These works also show that AI based on deep learning shows no signs of slowing down, and we may actually be quite far from the next plateau in AI development.
Grounding Large Language Models with Vision
LLMs alone can’t ground language to corresponding concepts in the real world, but combining LLMs with vision models is a promising first step. When we say “grounding” in the context of AI, we mean connecting some representation of external sensory input of a concept (e.g. seeing the image of a corgi) with the internal representation of that concept (e.g. the word “corgi”). Grounding is important for AI because training data is often “disconnected” from reality; it is just a bunch of numbers in a computer. This is an especially common criticism of LLMs, which are trained only to fit statistical patterns among words and their relations in a language, with no access to what those words truly “mean.”
Combining LLMs with vision models is one method of introducing a type of grounding to these models, where a sentence can be mapped to an image representation of that sentence. If the image is realistic, then this is one way of grounding language in real-world concepts. This can be done in a generative fashion where an AI model is used to draw photos that best match a text description. It can also be done in other applications, where a language model can be prompted to describe different aspects of images. We will highlight some examples below.
Language-Conditioned Image Generation
OpenAI’s DALL-E 2 is perhaps the most famous example of language-conditioned image generation. It can produce amazingly detailed and specific images:

The way DALL-E 2 works is summarized in the following diagram from the paper:
It includes four neural networks: a text encoder, an image encoder, a prior, and an image decoder. The text encoder and image encoder are trained on a very large dataset of image-caption pairs to produce “embeddings” (in the context of neural networks, an embedding is just a vector of numbers) such that the text and image embeddings of matching image-caption pairs are close to one another, while the embeddings of pairs that don’t match are farther apart. This system is called CLIP.
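To make the contrastive training idea concrete, here is a minimal PyTorch sketch of a CLIP-style objective. It is an illustration of the technique, not OpenAI’s actual code; the `image_embeds` and `text_embeds` tensors stand in for the outputs of the two encoders on a batch of image-caption pairs.

```python
# A minimal sketch of a CLIP-style contrastive objective (illustrative, not OpenAI's code).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Pull matching image/caption embeddings together, push mismatched pairs apart."""
    # Normalize so similarity is just a dot product (cosine similarity).
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity between every image and every caption in the batch.
    logits = image_embeds @ text_embeds.t() / temperature

    # The i-th image matches the i-th caption, so the correct "class" is the diagonal.
    targets = torch.arange(logits.shape[0])

    # Symmetric cross-entropy: pick the right caption for each image, and vice versa.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2

# Example with random stand-in embeddings for a batch of 8 image-caption pairs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```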
Once these two neural networks are trained, we can train the next two, which actually generate images. Given a text description of a picture, we first run the CLIP text encoder to produce a text embedding. Then the “prior” network converts this text embedding into an image embedding, and the image decoder network turns that image embedding into an actual image.
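Put together, the generation pipeline can be sketched in a few lines of Python. The three networks below are replaced with random stand-ins (this is not OpenAI’s implementation); the point is just how the embeddings flow from caption to image.

```python
# A high-level sketch of the generation pipeline described above, with random
# stand-ins for the trained networks (not OpenAI's implementation).
import torch

def clip_text_encoder(caption: str) -> torch.Tensor:
    return torch.randn(512)          # stand-in for the trained CLIP text encoder

def prior(text_embedding: torch.Tensor) -> torch.Tensor:
    return torch.randn(512)          # stand-in: text embedding -> image embedding

def image_decoder(image_embedding: torch.Tensor) -> torch.Tensor:
    return torch.rand(3, 256, 256)   # stand-in: image embedding -> RGB pixels

def generate_image(caption: str) -> torch.Tensor:
    text_embedding = clip_text_encoder(caption)   # 1. encode the caption
    image_embedding = prior(text_embedding)       # 2. map to an image embedding
    return image_decoder(image_embedding)         # 3. decode into pixels

image = generate_image("a corgi playing a flame-throwing trumpet")
```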
We’re simplifying DALL-E 2 quite a bit in this description, but the takeaway is that we now have the tools to ground language concepts in visual concepts through large-scale deep learning. Image generation isn’t the only task we can do with this type of language-vision fusion, though. We’ll give two more recent examples below:
Language-Conditioned Visual Tasks
DeepMind’s recent paper Tackling multiple tasks with a single visual language model demonstrates that the idea of prompting large models to accomplish different tasks can be done with vision models as well.
Prompting was first demonstrated with LLMs that perform next-word prediction. Given a prompt, which may contain a few examples of the desired language task, the LLM can be queried to complete that task by merely predicting the most likely words that follow. The important thing is that, through prompting, the LLM can accomplish new tasks without any additional training. Here’s an example of prompting from the GPT-3 paper for an English-to-French translation task:
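In code, building that kind of few-shot prompt is just string formatting. The sketch below is illustrative; the commented-out `complete` call at the end is a hypothetical stand-in for whichever LLM completion API you have access to.

```python
# A minimal sketch of few-shot prompting for translation, in the style of the
# GPT-3 example. The `complete` function is a hypothetical stand-in for an LLM.

def build_translation_prompt(examples, query):
    lines = ["Translate English to French:", ""]
    for english, french in examples:
        lines.append(f"{english} => {french}")
    lines.append(f"{query} =>")           # the LLM fills in the blank after "=>"
    return "\n".join(lines)

examples = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]
prompt = build_translation_prompt(examples, "cheese")
print(prompt)
# translation = complete(prompt)   # hypothetical LLM call; should return "fromage"
```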
In the DeepMind paper, this technique of prompting was applied to a model that operates on both language and vision embeddings. For example, the paper demonstrates the model doing the task of identifying, from pictures of animals, the species and their natural habitats:
Again, the thing to note here is that this model is not trained specifically for animal classification, yet through few-shot prompting the large model can be “led” into performing many different downstream tasks without additional training.
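A multi-modal prompt of this kind can be sketched as an interleaved sequence of images and text. The image filenames and the `vlm_complete` call below are hypothetical; what matters is the structure, in which a few in-context examples lead the model into the species-and-habitat task.

```python
# A sketch of a few-shot prompt that interleaves images and text, in the spirit
# of DeepMind's visual language model. Filenames and `vlm_complete` are hypothetical.

prompt = [
    {"image": "chinchilla.jpg"},
    {"text": "This is a chinchilla. They are mainly found in Chile."},
    {"image": "shiba.jpg"},
    {"text": "This is a shiba. They are very popular in Japan."},
    {"image": "flamingo.jpg"},
    {"text": "This is"},   # the model is expected to continue this sentence
]

# completion = vlm_complete(prompt)
# e.g. "a flamingo. They are found in the Caribbean and South America."
```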
Language as a Flexible Output Representation
The previous example used language as a way to prompt vision models. A new paper from Google, Pix2Seq: A New Language Interface for Object Detection, shows how language can also be used as the output of vision tasks. Researchers trained a vision model to detect, localize, and identify objects in a given picture. Traditionally, the output representation of such object detection networks is a set of bounding boxes. In this paper, however, the output is instead a “sentence” of tokens that gives the coordinates of each bounding box and its object type:
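In code, expressing detections as a token sequence might look like the sketch below. The quantization scheme and token ordering here are illustrative assumptions, not Google’s exact vocabulary.

```python
# A rough sketch of turning object detections into a "sentence" of discrete
# tokens, in the spirit of Pix2Seq. The binning and ordering are illustrative.

NUM_BINS = 1000  # coordinates are quantized into a fixed number of discrete bins

def box_to_tokens(box, label, image_size):
    """Turn one bounding box (xmin, ymin, xmax, ymax in pixels) into tokens."""
    width, height = image_size
    xmin, ymin, xmax, ymax = box
    normalized = [ymin / height, xmin / width, ymax / height, xmax / width]
    coord_tokens = [int(c * (NUM_BINS - 1)) for c in normalized]  # quantize to bins
    return coord_tokens + [label]

# Two detected objects in a 640x480 image become one flat output sequence.
sequence = (
    box_to_tokens((10, 20, 200, 300), "dog", (640, 480))
    + box_to_tokens((350, 40, 600, 420), "person", (640, 480))
)
print(sequence)  # four coordinate tokens followed by a class token, per object
```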
It is not hard to see that such language output techniques can be combined with language prompting techniques to build very capable and flexible language-vision models, where the flexibility comes from the richness of language itself. For example, we can imagine prompting a vision-language model with a few examples of object localization, and the model would then go on to perform the localization and identification task on new images without ever being explicitly trained to do so. We can also imagine a prompt-based model that uses both vision and language in its input and output representations. A user could ask the model “a car turning right at this intersection is safe if ___” along with an image of a busy intersection, and the model might respond with “the light is green, or there is no oncoming traffic and no crossing pedestrians,” along with a generated image of what the intersection would look like when it is safe for a car to turn right.
Language as an Interface among AI Models
So far we’ve seen how combining language with vision can help ground language models in real-world concepts, allowing users to interact with AI models by using language as the interface.
Perhaps not so surprisingly, we can also use language as an intermediate interface among large models, leveraging the combination of their capabilities in a way that exceeds what any individual large model can do. This is demonstrated in Google’s recent paper Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, where researchers combined language, vision-language, and audio-language models through clever prompts to perform complex, multi-modal tasks:
To illustrate how one might combine these different large models with language to perform an interesting task that the models were not trained on, consider the relatively simple example of activity recognition. Given a video, we can run a vision-language model (VLM) on individual frames to identify objects in the video; a VLM represents its output as natural language tokens. Say it outputs “grill, people, table, food.” Then an audio-language model (ALM) can take the video’s audio track and tell us the types of sounds that are present. Say it outputs “chatter, fire, music.” Finally, we can prompt the LLM with: “I see grill, people, table, and food. I hear chatter, fire, and music. The activity I am doing is most likely ____”, and through auto-completion the LLM can guess the activity shown in the original video, such as “hanging out with friends at a BBQ.”
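The whole composition is essentially string plumbing. Here is a minimal sketch with canned stand-ins for the three models; the function names and outputs are hypothetical, not the paper’s API.

```python
# A sketch of the Socratic Models idea: natural language strings glue together
# separate models. The three model functions are hypothetical stand-ins that
# return canned outputs, so only the composition pattern is real here.

def vlm_objects(video_frames):
    return ["grill", "people", "table", "food"]   # stand-in for a vision-language model

def alm_sounds(audio):
    return ["chatter", "fire", "music"]           # stand-in for an audio-language model

def llm_complete(prompt):
    return "hanging out with friends at a BBQ"    # stand-in for a large language model

def guess_activity(video_frames, audio):
    objects = ", ".join(vlm_objects(video_frames))
    sounds = ", ".join(alm_sounds(audio))
    # Language is the interface: the other models' outputs are pasted into a prompt.
    prompt = f"I see {objects}. I hear {sounds}. The activity I am doing is most likely"
    return llm_complete(prompt)

print(guess_activity(video_frames=None, audio=None))
```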
This is a powerful paradigm that uses language as a “glue layer” among different large models that use different modalities, and large models can be prompted to accomplish new tasks with the aid of the other large models. We’re only seeing the beginnings of such approaches that treat natural language as an interface for AI, and it will be very exciting to see what applications these approaches can bring in the future.
Conclusion
Foundation models are large neural networks trained on very large datasets, and recent advances show how they can be applied to many different downstream tasks with little to no additional training, especially via prompting for language-based models. Combining language with other modalities, like vision and sound, can unlock new capabilities by 1) grounding language in real-world concepts and 2) using language as a flexible and powerful input and output interface, both between humans and AI models and among the AI models themselves. This direction of research seems very promising and may result in many new AI applications in the immediate future.
About the Author
Jacky Liang (@jackyliang42) is a Ph.D. candidate at Carnegie Mellon University’s Robotics Institute. His research interests are in using learning-based methods to enable robust and generalizable robot manipulation.
Copyright © 2022 Skynet Today, All rights reserved.