

TL;DR: Recent advances have enabled AI models to transform text into other modalities. This article overviews what we’ve seen, where we are now, and what’s next.
Introduction
You’re reading text right now–it’s serving as a medium for me to communicate a sequence of thoughts to you. Ever since humanity became a band of degenerates that actually wrote things down instead of using their memories, we’ve been using sets of signs to transmit information. Under some definitions, you might call all of this “text.”
Today, and over the past centuries, we have encoded our knowledge of the world, our ideas, our fantasies, into writing. That is to say, much of human knowledge is now available in the form of text. We communicate in other ways too–with body language, images, sounds. But text is the most abundant medium we have of recorded communications, thoughts, and ideas because of the ease with which we can produce it.
When GPT-3 was fed the internet, it consumed our observations about the world around us, our vapid drama, our insane arguments with one another, and much more. It learned to predict next words in sequences of the tokenized chaos of human expression. In learning how we form sequences of words to communicate, a large language model learns to mimic (or “parrot”) how we joke, commiserate, command. GPT-3 kicked off something of a “revolution” by being extremely good at “text-to-text”: prompted with examples of a task (like finishing an analogy) or the beginning of a conversation, the generative model can (often) competently learn the task or continue the conversation.
There is almost a “universality” to the ways we use text, and we have only recently reached a point where AI systems can be put together in order to exploit how we use language to describe other modalities. The progress that enabled powerful text generation also enabled text-conditioned multimodal generation. “Text-to-text” became “text-to-X.”
In “text-to-text,” you could ask a model to riff on a description of a dog. In text-to-image, you could turn that description into its visual counterpart. Text-to-image models afforded a new ability not present in earlier image generation systems. Models like GANs were trained to generate realistic images from noise inputs (plus class labels, in the case of class-conditional generation). But those models did not offer the level of controllability that DALL-E 2, Imagen, and their ilk provide users: you could ask for a photo of a kangaroo with sunglasses, standing in front of a particular building, holding a sign bearing a particular phrase. Your wish was the algorithm’s command.

Soon after text-to-image became effective, more followed: text-to-video was one of the first sequels. Text-to-audio already existed, but text-to-motion and text-to-3d are just a few examples of the ways in which text is now being transformed into something else.


This article is about the “Year of Text-to-Everything.” Recent developments have enabled much more effective ways of converting text into other modalities, and at a rapid pace. This is exciting and promises to enable a great number of applications, products, and more over the coming years. But we should also remember that there are limits to the “world of text”–the disembodied musings that merely describe the world without actually interacting with it. I will discuss the advancements that led to today’s moment, and also spend time considering the limitations of text-to-everything if the “representations” of textual information remain in the world of text alone.
Multimodality finally starts to work
Of course, things technically start with GPT-3. I’ll abbreviate the story since it’s been told so many times: OpenAI trains a big language model based on the transformer architecture. That model is much bigger and is trained with much more data than its predecessor, GPT-2 (175 billion parameters vs. about 1.5 billion; 40 TB of data vs. 40 GB), which OpenAI had thought was too dangerous to release at the time. It can do things like write JavaScript code that’s not entirely horrendous. Some people say: “wow, cool.” Some people say: “wow, very not cool.” Some people say: “eh.” Startups are built on the new biggest model ever, news and academic articles are written praising and criticizing the new model, and countries that are not the USA develop their own big language models to compete.
In January 2021, OpenAI introduced a new AI model called CLIP, which boasted zero-shot capabilities similar to those of GPT-3. CLIP was a step towards connecting text and other modalities–it proposed a simple, elegant method to train an image and text model together so that, when queried, the full system could match an image with the corresponding caption among a selection of possible captions.
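To make that matching behavior concrete, here is a minimal sketch of CLIP-style zero-shot caption matching. It assumes the Hugging Face transformers library and the publicly released openai/clip-vit-base-patch32 checkpoint (not OpenAI’s original training code); the image path is just a placeholder.

```python
# Minimal sketch: CLIP-style zero-shot matching of an image to candidate captions.
# Assumes the Hugging Face `transformers` library and the public
# openai/clip-vit-base-patch32 checkpoint; "photo.jpg" is a placeholder path.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
captions = [
    "a photo of a dog",
    "a photo of a cat",
    "a diagram of a neural network",
]

# Embed the image and the captions in a shared space, then score each pairing.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)

for caption, prob in zip(captions, probs[0].tolist()):
    print(f"{prob:.3f}  {caption}")
```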
DALL-E, probably the first system that was “good” at producing images from text, was released on the same day as CLIP. CLIP was not used in DALL-E’s first iteration, but played an important role in its successor. Of course, given its ability to generate plausible images from text prompts, DALL-E made multiple headlines.
Diffusion hits the scene: DALL-E 2 and Co.
While some AI pioneers have lamented that deep learning is not the way to go if we want to achieve “actual” general intelligence, text-to-image is undoubtedly a problem that is amenable to the powers of deep neural networks. A number of complementary advances enabled text-to-image models to make further leaps: diffusion models were shown to achieve impressive image sample quality, as papers such as “Diffusion Models Beat GANs on Image Synthesis” demonstrated.
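At a high level, diffusion models learn to reverse a gradual noising process. The toy sketch below (my own NumPy illustration, not code from any of these papers) shows only the fixed forward process from the DDPM formulation: a training image is progressively corrupted with Gaussian noise, and the generative model is trained to predict and remove that noise, step by step, conditioned on the text prompt.

```python
# Toy illustration of the forward (noising) process in a DDPM-style diffusion model.
# Real systems train a neural network to reverse this corruption; this sketch only
# shows what gets corrupted and what the network is asked to predict.
import numpy as np

T = 1000                                  # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule, as in the DDPM paper
alphas_cumprod = np.cumprod(1.0 - betas)  # cumulative product, abar_t

def noise_image(x0: np.ndarray, t: int, rng: np.random.Generator):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps, eps

rng = np.random.default_rng(0)
x0 = np.zeros((64, 64, 3))                # stand-in for a training image scaled to [-1, 1]
x_t, eps = noise_image(x0, t=500, rng=rng)
# Training objective (conceptually): given (x_t, t, text prompt), predict `eps`,
# so that sampling can walk back from pure noise to an image matching the prompt.
```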
DALL-E 2, released a little over a year after DALL-E, leveraged advances in diffusion models to create images even more photorealistic than DALL-E’s. DALL-E 2 was soon upstaged by Imagen and Parti–the former used diffusion models to achieve state-of-the-art performance on benchmarks, while the latter explored a complementary autoregressive approach to image generation.
This was not the end of the story. Midjourney, a commercial diffusion model for image generation, was released by a research lab of the same name. Stable Diffusion, which built on new research into latent diffusion models that could be trained with comparatively limited computational resources, dominated the scene upon release because Stability AI chose to make the model and its weights publicly available.
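Because the weights are public, anyone with a suitable GPU can run Stable Diffusion locally. Here is a minimal sketch, assuming the Hugging Face diffusers library, a CUDA GPU, and the runwayml/stable-diffusion-v1-5 checkpoint (downloading it requires accepting the model license on the Hugging Face Hub):

```python
# Minimal sketch: text-to-image with publicly released Stable Diffusion weights.
# Assumes the Hugging Face `diffusers` library, a CUDA GPU, and the
# runwayml/stable-diffusion-v1-5 checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of a kangaroo wearing sunglasses, holding a sign that reads 'hello'"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("kangaroo.png")
```

The guidance_scale parameter trades diversity for prompt fidelity: higher values make the sampler follow the text more literally.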
Innovation in neural network architectures was not the only thing that contributed to these improvements. The Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M) was released in 2015 and was, at the time, the largest public multimedia collection ever released. More recently, the Large-scale Artificial Intelligence Open Network (LAION) released datasets that eclipsed YFCC100M in size: LAION-400M (containing 400 million image-text pairs) was released in 2021 and was followed by LAION-5B (containing 5 billion image-text pairs) in 2022. It’s worth noting that while these datasets have enabled training image-text models at large scale, they are not without issues: The Decoder reported that LAION’s datasets contain patient images that were published without consent, and researchers have commented that the dataset’s quality is not pristine. Other ethical issues with such a large dataset are bound to surface, and it appears that the authors and their reviewers had a meaningful exchange over these concerns in open review.
Text-to-... Everything!
If AI models can convert text to images, can they convert text to video? Of course! In October, a number of text-to-video generators were released. Make-a-Video from Meta can generate videos from text and from still images, while Google Brain’s Phenaki can generate a continuous video from a series of prompts that make up a story.
Perhaps more usefully–or worryingly–these generative models can competently write code as well. GPT-3 first started gaining notoriety in headlines when users noticed it could write decent code, and since then the capabilities of code-generating language models have advanced substantially. OpenAI’s Codex translates natural language to code, and many similar models have followed in its wake. DeepMind’s AlphaCode can solve competitive programming problems at a reasonable level as well.
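Codex and AlphaCode themselves are not openly released, but smaller open checkpoints give a feel for text-to-code. A minimal sketch, assuming the Hugging Face transformers library and the Salesforce/codegen-350M-mono checkpoint as a stand-in:

```python
# Minimal sketch: text-to-code with an openly available model.
# Codex and AlphaCode are not public; Salesforce/codegen-350M-mono is used here
# as a small, downloadable stand-in via the Hugging Face `transformers` library.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# A natural-language instruction phrased as a comment, plus the start of a function.
prompt = "# Return the n-th Fibonacci number\ndef fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```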
The speed at which these advances followed one another is impressive, as people like Kevin Roose commented:


And it goes further: text can be morphed into other mediums as well, including audio, motion, and 3d.




And, as our own (Dr!) Jacky Liang has shown, language models can even write robot policy code from natural language instructions.

It seems like the possibilities are endless. We have only seen the beginnings of what AI models can create. I expect that text will be able to guide a panoply of inventions as we develop more and more powerful models. Sequoia’s recent Generative AI Application Landscape already boasts a number of different niches:
Within a given generative modality, there are a number of possibilities and business domains where that type of generation can be applied. Text generation can afford not just article writing but also post copy tuned to a particular platform; image generation and text-to-3d could enable creating varied artifacts for games, messaging apps, and marketing; other applications offer the ability to generate documentation. And, as the above diagram notes, applications in music, audio, and biology/chemistry are yet to come.
ChatGPT and More Text-to-Text
Even within the realm of “text-to-text,” an astounding amount can be done: the recent introduction of ChatGPT has essentially blown up the internet because of the model’s ability to comprehensively answer questions in a conversational format. You can ask it to craft you a simple workout program, write a class syllabus, suggest things to do, tell you about a philosopher’s work, and plenty besides.
It is worth noting important limitations in ChatGPT’s knowledge:
Indeed, if you ask ChatGPT to give more details about a particular topic (e.g. Proust’s thoughts on the nature of time), it begins to walk itself in circles–much like what you might expect of a high school essay. And ChatGPT’s existence might change how we understand certain aspects of the skill of writing:
> Perhaps there are reasons for optimism, if you push all this aside. Maybe every student is now immediately launched into that third category: The rudiments of writing will be considered a given, and every student will have direct access to the finer aspects of the enterprise. Whatever is inimitable within them can be made conspicuous, freed from the troublesome mechanics of comma splices, subject-verb disagreement, and dangling modifiers.
As I’ve mentioned, ChatGPT doesn’t seem to be able to go far beyond surface-level descriptions of the topics it expounds on. It can write in a fluid enough manner and give you some details about what you want to know, but it’s not ready to take your job if you can provide the in-depth analysis and deep understanding that it lacks.
Can Text Escape Itself? Opportunities and Limitations
Training models on multimodal datasets has afforded a way to understand how information encoded in text, in language, maps to images, 3d figures, and other representations of the world around us. Text-to-image showed us that we can generate images that reflect precise descriptions in text. This is not perfect: Stable Diffusion notably had issues endowing the humans in its generated images with the correct number of fingers. But it is notable that improvements came from merely scaling up the language model in a text-to-image system–Imagen, using a T5 encoder (11 billion parameters) trained only on text, produced more photorealistic images than DALL-E 2, whose text encoder had been trained to produce text embeddings similar to those of matching images.
This is to say that the possibilities for transforming text into other modalities–what can be done and how far we can go with current methods–are not obvious. I remain sympathetic to the idea that there are real limitations: although text-image datasets can tell us an awful lot about what the world looks like, they lack the affordances we have from existing in the physical world–being able to interact with objects and with other humans, and to collect visual and non-visual information about the world around us through that interaction.
But, clearly, there is a lot that can be done. Google’s recent RT-1 (Robotics Transformer) shows how transformers can leverage natural language to solve robotic tasks:

As François Chollet pointed out to me in an interview, text-to-image is a problem space where the capabilities of neural networks can shine. I am also excited about potential second-order applications, like text-guided molecule design and other less obvious ideas.
I think to truly harness some of the powers of text-to-X models, however, we do need better interfaces–we need better ways to express our meaning, the concepts and ideas we want to get across, to the models we ask to act and create on our behalf. The fact that prompt engineering has emerged as a discipline points to an inefficiency in the ways we presently communicate with models like GPT-3.
Looking forward, then, I see two driving problems for us to solve as we make “text-to-everything” even more of a reality:
How can we build interfaces that allow us to better communicate our intent to AI models?
What useful generations, actions, etc. can these models enable for us?
But beyond practical problems, I think another question is more interesting: text-to-{text, image, video, etc.} is not perfect, but it is very good. These models are far better at bringing ideas to life in pictorial or video form than the average human, or even humans who are themselves quite skilled at the arts. Just as Daniel Herman asked above about ChatGPT, what does text-to-everything imply about what it means to engage in art, to engage in video-making? Will we enter a period in which the basics of these arts become commoditized and anyone can engage with the finer aspects of conveying meaning through different mediums? Where the skill of watercolor painting is reduced to words in a prompt and the rest is a dance between human and AI system?
As always, we shouldn’t overstate the capabilities of these systems–they can often fail in very obvious ways. But they can do a stunning job when faced with the right problems, and those problems might begin to open up the space for people to do more interesting things and engage with higher-level aspects of writing, art, and other modes of expression.
And, beyond these immediate applications, what are the less obvious, second-order applications of text-to-X models and their underlying technologies? Researchers are already thinking about how to use NLP models to predict amino acid sequences for proteins–a sequence-of-letters prediction task only one step removed from generating text. Investor and State of AI Report author Nathan Benaich, as he mentioned in my recent conversation with him, is excited about how the diffusion models underlying state-of-the-art text-to-image models might be used for biological and chemical applications.
If there is one thing to take away from this year’s stunning developments–the Year of Text-to-Everything–it is that text is becoming more powerful as a medium of command. You no longer need artistic training and a suite of digital art tools or a painting set to turn the idea “floating city” into a visual reality. You can speak (or type) it into existence.
What will you create with your words?