GPT-3 is No Longer the Only Game in Town
GPT-3 was by far the largest AI model of its kind last year. Now? Not so much.
Welcome to the fourth editorial from Last Week in AI!
This is the last of our free editorials, and we hope you will consider subscribing to our Substack to get access to future ones and to support us. We’d really appreciate your support, and will work to make it worth it for you!
TLDR: Organizations face significant challenges in creating a model similar to OpenAI’s GPT-3, but nevertheless a half dozen or so models as big as or bigger than GPT-3 have been announced over the course of 2021.
It's safe to say that OpenAI’s GPT-3 has made a huge impact on the world of AI. Quick recap: GPT-3 is a huge AI model that is really good at many “text-in text-out” tasks (lengthier explanations can be found in dozens of other write-ups). Since being released last year, it has inspired many researchers and hackers to explore how to use and extend it; the paper that introduced GPT-3 is now cited by more than 2,000 papers (that's a LOT), and OpenAI claims more than 300 applications use it.
However, people's ability to build upon GPT-3 was hampered by one major factor: it was not publicly released. Instead, OpenAI opted to commercialize it and only provide access via a paid API (although just this past week it also became available on Microsoft Azure). This made sense given OpenAI’s for-profit nature, but went against the common practice of AI researchers releasing their models for others to build upon. So, since last year, multiple organizations have worked towards creating their own versions of GPT-3, and, as I’ll go over in this article, roughly half a dozen such gigantic GPT-3-esque models have been developed at this point (though, as with GPT-3, not yet publicly released).
Creating your own GPT-3 is nontrivial for several reasons. First, there is the compute power needed. The largest variant of GPT-3 has 175 billion parameters, which take up 350GB of space (consistent with storing each parameter in 2 bytes), meaning that dozens of GPUs would be needed just to run it and many more would be needed to train it. For reference, OpenAI has worked with Microsoft to create a supercomputer with 10,000 GPUs and 400 gigabits per second of network connectivity per server. Even with this sort of compute power, such models reportedly take months to train. Then, there is the massive amount of data required: GPT-3 was trained on about 45 terabytes of text data from all over the internet, which translates to 181,014,683,608 English words and many more in other languages (though this came from filtered publicly available datasets). This further exacerbates the need for expensive compute power to handle it all.
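To make the memory arithmetic above concrete, here is a minimal back-of-the-envelope sketch. It assumes each parameter is stored in 2 bytes (16-bit precision), which is what the 350GB figure implies, and it uses hypothetical per-GPU memory sizes of 16GB and 32GB purely for illustration. It only counts the model weights; activations, optimizer state, and everything else needed for training are ignored, so real requirements would be higher.

```python
import math

# Figure quoted in the article: the largest GPT-3 variant has 175 billion parameters.
GPT3_PARAMS = 175_000_000_000


def model_memory_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Size of the weights alone in gigabytes (1 GB = 1e9 bytes).

    bytes_per_param=2 is an assumption (16-bit weights), not a confirmed detail.
    """
    return num_params * bytes_per_param / 1e9


def min_gpus_to_hold(memory_gb: float, gpu_memory_gb: float) -> int:
    """Lower bound on the number of GPUs needed just to hold the weights."""
    return math.ceil(memory_gb / gpu_memory_gb)


weights_gb = model_memory_gb(GPT3_PARAMS)
print(f"Weights: {weights_gb:.0f} GB")

# Hypothetical GPU memory sizes, chosen only for illustration.
for gpu_gb in (16, 32):
    print(f"{gpu_gb} GB GPUs needed (weights only): {min_gpus_to_hold(weights_gb, gpu_gb)}")
```

Even this lower bound lands in the double digits, which is roughly where the "dozens of GPUs" estimate comes from once the memory needed to actually run the model is added on top.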
Taken together, these factors mean that GPT-3 could have easily cost 10 or 20 million dollars to train (exact numbers are not available). Previous large language models such as GPT-2, T5, Megatron-LM, and Turing-NLG, though not as large as GPT-3, were similarly costly and difficult to train.
Nevertheless, it was only a matter of time before GPT-3 was successfully recreated (with some tweaks) by others. Surprisingly, one of the earlier sets of results came from a grassroots effort of volunteers rather than a company with immense amounts of money like OpenAI. The group in question is EleutherAI, “a grassroots collective of researchers working to open source AI research.” They first released a dataset similar to the one OpenAI used to train GPT-3, which they named The Pile. Next came GPT-Neo 1.3B and 2.7B (B meaning billions of parameters), smaller-scale versions of GPT-3, followed most recently by a 6 billion parameter version called GPT-J-6B. All this was done by volunteers working together over Discord (with the help of some generous donations of credits for cloud computing).
Meanwhile, other groups were also working towards their own versions of GPT-3. A group of Chinese researchers from Tsinghua University and BAAI released the Chinese Pretrained Language Model (CPM) about six months after GPT-3 came out. This is a 2.6 billion parameter model trained on 100GB of Chinese text, still far from the scale of GPT-3 but certainly a step towards it. Notably, GPT-3 was primarily trained on English data, so this represented a model better suited for use in China. Soon after, researchers at Huawei announced the 200 billion parameter PanGu-α, which was trained on 1.1 terabytes of Chinese text.
And so it went on: South Korean company Naver released the 204 billion parameter model HyperCLOVA, Israeli company AI21 Labs released the 178 billion parameter model Jurassic-1, and most recently NVIDIA and Microsoft teamed up to create the 530 billion parameter model Megatron-Turing NLG. These increases in size do not necessarily make these models better than GPT-3, given that various factors affect performance and that notable improvements may not be seen until a model has an order of magnitude more parameters. Nevertheless, the trend is clear: more and more massive models similar in nature to GPT-3 are being created, and they are only likely to grow bigger in the coming years.
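To put these sizes in perspective, here is a quick sketch that compares the parameter counts quoted in this article to GPT-3's 175 billion. The numbers are just the ones mentioned above; the 10x threshold is a simple reading of "an order of magnitude" and is an assumption for illustration, not a rule about when a model actually gets better.

```python
# Parameter counts in billions, as quoted in this article.
GPT3_PARAMS_B = 175
MODELS_B = {
    "PanGu-α": 200,
    "HyperCLOVA": 204,
    "Jurassic-1": 178,
    "Megatron-Turing NLG": 530,
}

# Illustrative threshold: "an order of magnitude" read as 10x the parameters.
ORDER_OF_MAGNITUDE = 10


def size_ratio(params_b: float, baseline_b: float = GPT3_PARAMS_B) -> float:
    """How many times larger a model is than the baseline (GPT-3 by default)."""
    return params_b / baseline_b


# Print the models from smallest to largest, with their size relative to GPT-3.
for name, params_b in sorted(MODELS_B.items(), key=lambda item: item[1]):
    ratio = size_ratio(params_b)
    reached = "yes" if ratio >= ORDER_OF_MAGNITUDE else "no"
    print(f"{name}: {params_b}B parameters, {ratio:.2f}x GPT-3, 10x reached: {reached}")
```

By this measure, even the largest model listed here is only about three times the size of GPT-3, well short of a tenfold jump.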
This trend of investing tens of millions of dollars into training ever more massive AI models appears to be here to stay, at least for now. Given that these models are incredibly powerful, this is very exciting, but the fact that primarily corporations with large monetary resources can create them is worrying, and in general this trend has many implications. So much so that earlier this year a large number of AI researchers at Stanford worked together to release the paper On the Opportunities and Risks of Foundation Models, which gave GPT-3 and other massive models of its kind the name Foundation Models and presented a detailed analysis of their possibilities and implications.
So, this is a big deal, and developments are happening faster and faster. It’s hard to say how long this trend of scaling up language models can go on and whether any major discoveries beyond those of GPT-3 will be made, but for now we are still very much in the middle of this journey, and it will be very interesting to see what happens in the coming years.
About the Author:
Andrey Kurenkov (@andrey_kurenkov) is a PhD student with the Stanford Vision and Learning Lab working on learning techniques for robotic manipulation and search. He is advised by Silvio Savarese and Jeannette Bohg.