Last Week in AI

Share this post
Foundation Models and the Future of Multi-Modal AI
lastweekin.ai
Editorials

Foundation Models and the Future of Multi-Modal AI

Recent advances that combine large language and vision models can perform impressive tasks surprisingly well, and this direction holds a lot of promise for the future of AI

May 1
Comment
Share

Quick thing before we get to the article. Editorials will be exclusive to paid subscribers for the first month and then made available to everyone afterward. Here are two past editorials recently made free:

Last Week in AI
Redefining "Inventor"
TL;DR creators of AI systems have attempted to file for patents/copyrights that give credit to those AI systems, claiming their generative capabilities constitute “inventor” status. They have been met with mixed success: some countries have granted these protections, but US courts have refused. The debate calls into question the current capabilities of AI systems and where they are with respect to “inventiveness…
Read more
2 months ago · Last Week in AI
Last Week in AI
Clearview AI is creating facial recognition at a massive scale, and concerns abound
Quick thing before we get to the article. Editorials will be exclusive to paid subscribers for the first month and then made available to everyone afterward. Here’s the latest one to be made free for all: Summary Clearview AI’s facial recognition platform has been awarded a US patent titled, …
Read more
2 months ago · Andrey Kurenkov

TL;DR Foundation models, which are large neural networks trained on very big datasets, can be combined with each other to unlock surprising capabilities. This is a growing trend in AI research these past couple of years, where researchers combine the power of large language and vision models to create impressive applications like language-conditioned image generation. There are likely a lot more low-hanging fruits in such large-scale multi-modal AIs, where vision helps to ground AI in real-world concepts while language increasingly acts as an interface layer between humans and AI models as well as among AI models themselves. With these advances, the future of highly flexible AI assistants that can robustly parse information from the visual world and interact with humans through language may be here sooner than many realize.

The Paradigm Shift of Foundation Models

The term “foundation models” was coined by Stanford researchers last year in the paper On the Opportunities and Risks of Foundation Models. At the time, the release of this paper with its long list of co-authors stirred up some controversy as many in the field felt like Stanford was just renaming a phenomenon that was already widely known. This naming does bring value as it helped to mark a paradigm shift in AI that was only beginning to be recognized. Specifically, the rise of large-scale deep learning (models with hundreds of billions, even trillions, of parameters) on very large datasets (hundreds of billions of language tokens) created AI models with “emergent capabilities” that were effective across many downstream tasks with little to no additional task-specific training, and that such large-scale AI training encouraged the homogenization of AI techniques.

This is a paradigm shift because AI models in the past were often thought to perform one task at a time. While techniques in task-transfer and multi-task AI models were researched, one model was not expected to perform tasks that are very different from one another, even with transfer learning. This also meant that research in different tasks and modalities, like vision and language, were often done with different technical approaches — vision researchers studied topics like signal processing, while language researchers studied language grammar. Foundation models with deep learning leveraged incredible scaling to train one model that can perform many tasks. For example, Google’s latest large language model (LLM), Pathways Language Model, can perform tasks as diverse as code completion, translation, and joke explanation, at the same time:

As impressive as these LLMs are, they are just the beginning. With recent works that incorporate multiple modalities into large models or directly combine multiple models themselves, we’re beginning to see a form of AI that is much more capable and much easier to “use.” The rest of this editorial will explain in high-level terms some examples of these advances made in recent months that combine language with vision models. The main takeaway is that vision grounds AI models in the real world, and natural language acts as an interface between humans and AI models and among AI models themselves. These works also show that AI based on deep learning shows no signs of slowing down, and we may actually be quite far from the next plateau in AI development.

This post is for paid subscribers

Already a paid subscriber? Sign in
© 2022 Skynet Today
Privacy ∙ Terms ∙ Collection notice
Publish on Substack Get the app
Substack is the home for great writing