Robots That Write Their Own Code
Code-writing language models enable robots to follow language instructions and perform diverse tasks without task-specific learning
Beyond generic code completion, code-writing language models can also write domain-specific code according to natural language instructions. In our recent work, Code as Policies (CaP), we explore this idea in the context of robotics and show how we can prompt language models to directly write code that controls robots to perform tasks according to language instructions. Code allows the language model to perform precise arithmetic, spatial-geometric reasoning, and express a degree of behavioral common sense. We deployed CaP on various robot platforms and showed it can perform diverse tasks, from tabletop object manipulation to 2D drawing, all without any additional model training. CaP represents a new way of programming robots and points to a promising future of using language models to write code to do tasks. For more, see our website, coverage by TechCrunch, and Twitter thread:
Large language models (LLMs) have shown impressive capabilities not just in natural language understanding, but also in “reasoning” tasks, from reading comprehension to answering difficult math questions. Recent advances in robotics also leverage this capability to use LLMs for robot planning. Essentially, we want to build a system that takes as input high-level natural language instructions of a task (e.g. cleaning up a coffee spill) and outputs actions the robot can execute. In PaLM-SayCan (see video above and gif below), this is done by having the LLM generate a sequence of low-level actions, described in natural language, that the robot knows how to do (e.g. go to the trashcan, pick up the sponge, close cabinet drawer, etc.).
This is a powerful paradigm that lets us build robot systems that 1) interface with users through language, something non-experts can do, 2) leverage LLMs’ reasoning capabilities for task planning, and 3) enable the robot to plan new tasks without task-specific model training (everything is done through prompting).
While using natural language as the input is an incredibly expressive way to specify robot tasks, there are some limitations to using natural language as the action output for robot task planners. It’s difficult for LLMs to reliably reason about spatial-geometric relationships, perform vector arithmetic, and use logic structures like if conditions and loops. It’s also hard to provide visual feedback to the language model directly through natural language (e.g. describing object coordinates, bounding boxes, and segmentation masks), and feedback is crucial for robot decision-making. It turns out these types of operations are very naturally expressed in code, and this is the question I explored in my internship this past summer at Google - what if we just get language models to write robot code? This seems like a promising direction, but how exactly do we do that, and how well does it actually work in practice?
Language Model Programs
Here I’ll give a quick background on prompting language models to generate code, and how we can turn this into what we call Language Model Programs. Briefly, what language models try to do is to predict the most likely next token (~a letter/a word) given all the tokens that came before. Once this new token is predicted, it will be appended to the sequence, and we re-run the language model again to predict the next token. In this way, what comes after (e.g. responses to new instructions) depends on what comes before (the prompt). Here we give a very short example of a prompt (highlighted in yellow) that can get a GPT-3 model to start writing Python code:
In our work, we think of prompts as having two parts: 1) Hints, which are pieces of information that only need to be mentioned once, and 2) Examples, which are pairs of instructions (formatted as comments) and code responses that perform the task described by those instructions.
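To make the Hints/Examples split concrete, here is a minimal sketch of how such a prompt could be assembled as plain text. The helper name `build_prompt` and the sample hint, example, and instruction are illustrative assumptions, not the prompts used in the paper.

```python
def build_prompt(hints, examples, instruction):
    """Assemble a few-shot prompt: hints first, then instruction/code pairs."""
    lines = list(hints)
    for example_instruction, example_code in examples:
        # Each instruction is written as a comment, followed by its code response.
        lines.append(f"# {example_instruction}")
        lines.append(example_code)
    # End with the new instruction as a comment; the model continues from here.
    lines.append(f"# {instruction}")
    return "\n".join(lines) + "\n"


# Hypothetical hint and example, chosen only for illustration.
hints = ["# Python script"]
examples = [("get the sum of a list called xs.", "ret_val = sum(xs)")]
prompt = build_prompt(hints, examples, "get the largest number in a list called xs.")
print(prompt)
```

The resulting string is what would be sent to the language model, which is expected to continue it with code for the final instruction.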
If you’re unfamiliar with coding - don’t worry - just imagine the prompt as essentially showing the language model a guide on how to best write code for new instructions.
Even just with the super short 3 line prompt above, we can already get the language model to write new Python code for new tasks (you can try it yourself by using the OpenAI GPT-3 Playground):
Note the role of each part of the prompt. The Hints tell the model that we’re writing in Python, and the Examples tell the language model that the result of such computations should be stored in a variable called ret_val. This latter part is important for code execution - we will directly execute the generated code in Python and look for the value of the variable named ret_val, allowing us to get the result of this Language Model Program (LMP), which is any program that has part of its code generated by a language model.
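The execution step described above can be sketched in a few lines. This is a simplified illustration, not the actual CaP implementation: `run_lmp` is a hypothetical helper, and the "generated" code is a hard-coded string standing in for a model response.

```python
def run_lmp(generated_code, variables):
    """Execute model-written code and return the value it stored in ret_val."""
    # Copy so the generated code cannot mutate the caller's dictionary.
    scope = dict(variables)
    exec(generated_code, scope)
    if "ret_val" not in scope:
        raise KeyError("generated code did not define ret_val")
    return scope["ret_val"]


# Stand-in for text a language model might return; no model is called here.
generated_code = "ret_val = max(xs)"
result = run_lmp(generated_code, {"xs": [3, 7, 2]})
print(result)  # 7
```

Note that `exec` runs whatever the model wrote with no sandboxing, so a sketch like this is only appropriate for trusted, experimental settings.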
Using language models to write code is not new - we’ve known for a while that these models can achieve decent performance on generic code-generation benchmarks, and AI-assisted code-writing has become more and more mainstream. There are a few key additions we make in our work that make domain-specific code generation both sufficiently practical and performant to be deployed on real robots. We describe some of them below:
Code as Policies
We call the application of LMPs to robotics Code as Policies. In robotics, a policy is a function that maps some perception inputs to action outputs, directing a robot to perform the intended task. Doing this in our context means giving the language model the ability to generate code that calls robot perception and action APIs. This is something we can achieve by including the usage of such first-party libraries in our prompts:
Using First-Party Libraries
Here, we show how to use the perception API (getting the position of objects by their name) and an action API (a scripted pick-place motion primitive) to complete some tabletop manipulation tasks. Even though the prompt only has two examples, this already enables the language model to write new code for new tasks (highlighted in green).
I’d like to highlight two aspects of the new tasks that showcase the language model’s generalization capabilities. For the first command, note that we’re asking to move the object to the right when the prompt has only shown what it means to go to the left. The command also says “a little bit”, which the model correctly interpreted by using a movement magnitude smaller than what was shown in the prompt (5cm instead of 10cm). Both of these demonstrate that there is a degree of behavioral common sense imbued in these language models trained on Internet-scale data, and that we can effectively “extract” such knowledge in the form of code.
The second command highlights more of the language model’s capability to do language-based reasoning - getting a bowl of the same color (blue) without being explicitly told what the color is. We will see more examples like this in a second.
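For readers who want to see the shape of this setup in code, here is a minimal, simulated sketch. Everything in it is an assumption made for illustration: the function names `get_pos` and `put_first_on_second`, the module name `robot_utils`, the toy scene, the convention that +x means "right", and the hard-coded "generated" code, which merely mimics the kind of response described above.

```python
# Hypothetical tabletop state: object name -> (x, y) position in meters.
scene = {
    "red block": (0.30, 0.10),
    "blue block": (0.10, 0.20),
    "blue bowl": (0.50, 0.40),
}


def get_pos(name):
    """Perception API stand-in: look up an object's position by name."""
    return scene[name]


def put_first_on_second(obj_name, target):
    """Action API stand-in: place obj_name on a named object or at an (x, y) point."""
    scene[obj_name] = get_pos(target) if isinstance(target, str) else tuple(target)


# A prompt with one hint (the import line) and two examples; sent to the model as text.
PROMPT = """\
from robot_utils import get_pos, put_first_on_second
# move the red block 10cm to the left.
x, y = get_pos('red block')
put_first_on_second('red block', (x - 0.10, y))
# put the red block in the blue bowl.
put_first_on_second('red block', 'blue bowl')
"""

# Code a model might plausibly write for "move the red block a little bit to the right".
GENERATED = """\
x, y = get_pos('red block')
put_first_on_second('red block', (x + 0.05, y))
"""

# Expose only the robot APIs to the generated code, then run it.
api = {"get_pos": get_pos, "put_first_on_second": put_first_on_second}
exec(GENERATED, api)
print(scene["red block"])
```

On a real robot, the two stand-in functions would wrap an object detector and a pick-and-place motion primitive; the generated code itself would not need to change.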
Using Third-Party Libraries
What’s really neat about code generation is that we can get the language model to use popular third-party libraries, like Python’s NumPy (helpful for performing numerical computations and vector math), so it doesn’t have to write all the code itself. Because these libraries are popular, the language model has likely already seen examples of their use in the training data, so we don’t have to explicitly include them in the prompt. The example above showcases this, and using third-party libraries dramatically expands what LMPs can do. In our experiments, our Python code can use Shapely (parsing 2D geometric shapes), SciPy (more scientific computation utilities), and even low-level robot control libraries like rtde (for controlling robots from Universal Robots).
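As a rough illustration of why NumPy is useful here, the snippet below shows the kind of vector math a generated program might contain. The object names, positions, and the two instructions in the comments are made up for this sketch; only standard NumPy calls are used.

```python
import numpy as np

# Hypothetical object positions (x, y) in meters, as a perception API might return.
positions = {
    "red block": np.array([0.10, 0.10]),
    "green block": np.array([0.40, 0.35]),
    "yellow block": np.array([0.55, 0.05]),
}
bowl_pos = np.array([0.50, 0.30])

# "which block is closest to the bowl?"
names = list(positions)
offsets = np.stack([positions[n] for n in names]) - bowl_pos  # shape (3, 2)
dists = np.linalg.norm(offsets, axis=1)                       # one distance per block
closest = names[int(np.argmin(dists))]

# "a point halfway between the red block and the bowl"
midpoint = (positions["red block"] + bowl_pos) / 2

print(closest, midpoint)
```

Expressing the same computations as natural language actions would be awkward, while in code they take one or two lines each.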
Reasoning about Context
Another great example of language-based reasoning is the ability to take context into account. In this gif, we show how the language model correctly interprets “the other blocks” and generates the correct code for that.
This context reasoning can be applied in other ways as well, such as handling commands like “undo that”:
and noun assignment:
Composing Specialized LMPs
You may have noticed from the previous example that the generated code is calling a function parse_position with natural language arguments “a point 5cm below the red block.” This is actually another LMP that is specialized in computing spatial coordinates given natural language descriptions. We call this LMP composition, and it allows us to combine multiple specialized LMPs via function calls to perform more complex tasks.
LMP composition makes applying language models to more diverse tasks more practical, because language models tend to perform better when their prompts are specialized to do one type of task. This also makes prompt engineering a bit more structured and allows reusing LMPs across different task domains.
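The composition idea can be sketched as follows. This is a toy version under several assumptions: the two "language models" are canned lookup tables rather than real models, `make_lmp` is a hypothetical helper, the robot APIs are simulated, and "below" is taken to mean a smaller y coordinate.

```python
def make_lmp(llm, scope):
    """Wrap a code-writing model as a callable: instruction in, ret_val out."""
    def lmp(instruction):
        local_scope = dict(scope)  # share the robot APIs and the other LMPs
        exec(llm(instruction), local_scope)
        return local_scope.get("ret_val")
    return lmp


# Canned responses standing in for two separately prompted language models.
def position_llm(instruction):
    canned = {
        "a point 5cm below the red block":
            "x, y = get_pos('red block')\nret_val = (x, y - 0.05)",
    }
    return canned[instruction]


def planner_llm(instruction):
    canned = {
        "move the blue block 5cm below the red block":
            "target = parse_position('a point 5cm below the red block')\n"
            "put_first_on_second('blue block', target)",
    }
    return canned[instruction]


# Simulated scene and robot APIs (hypothetical names and values).
scene = {"red block": (0.30, 0.20), "blue block": (0.10, 0.40)}
scope = {
    "get_pos": lambda name: scene[name],
    "put_first_on_second": lambda name, target: scene.__setitem__(name, target),
}

# Register the position LMP as an ordinary function the planner's code can call.
scope["parse_position"] = make_lmp(position_llm, scope)
planner = make_lmp(planner_llm, scope)

planner("move the blue block 5cm below the red block")
print(scene["blue block"])
```

Because each LMP is just a function in the shared scope, the planner's prompt only needs to show how to call `parse_position`, not how to compute positions.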
Finally, an important contribution we make to LMPs is Hierarchical Code-Generation. Basically, once the LMP generates a piece of code, we inspect it and check for any functions that are yet to be defined. For these undefined functions, we ask the language model to generate them using the function signatures. Then, the process is repeated recursively, until there are no more undefined functions. In the example above, the generated function get_objs_bigger_than_area_th calls a yet-to-be-defined function, get_obj_bbox_area, which gets defined below.
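The recursive procedure described above can be sketched with Python’s `ast` module. This is a simplified illustration rather than the CaP implementation: the "language model" is a lookup table of canned function definitions, it is queried by function name only (a real system would pass the call signature), and `get_bbox` plus the function bodies are hypothetical.

```python
import ast
import builtins


def undefined_functions(code, known):
    """Names called as plain functions in `code` that are neither defined nor known."""
    tree = ast.parse(code)
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    called = {
        n.func.id
        for n in ast.walk(tree)
        if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
    }
    return sorted(called - defined - set(known) - set(dir(builtins)))


def generate_hierarchically(code, llm, scope):
    """Run `code` after recursively generating every function it calls but lacks."""
    for name in undefined_functions(code, scope):
        generate_hierarchically(llm(name), llm, scope)
    exec(code, scope)


# Canned definitions standing in for language model responses.
CANNED = {
    "get_objs_bigger_than_area_th": (
        "def get_objs_bigger_than_area_th(obj_names, bbox_area_th):\n"
        "    return [name for name in obj_names\n"
        "            if get_obj_bbox_area(name) > bbox_area_th]\n"
    ),
    "get_obj_bbox_area": (
        "def get_obj_bbox_area(obj_name):\n"
        "    x_min, y_min, x_max, y_max = get_bbox(obj_name)\n"
        "    return (x_max - x_min) * (y_max - y_min)\n"
    ),
}

# Hypothetical perception API: bounding boxes as (x_min, y_min, x_max, y_max).
BBOXES = {"bowl": (0, 0, 4, 4), "block": (0, 0, 1, 2)}
scope = {"get_bbox": BBOXES.__getitem__}

top_level = "ret_val = get_objs_bigger_than_area_th(['bowl', 'block'], 5)"
generate_hierarchically(top_level, CANNED.__getitem__, scope)
print(scope["ret_val"])  # ['bowl']
```

The top-level line triggers generation of `get_objs_bigger_than_area_th`, which in turn triggers generation of `get_obj_bbox_area`; recursion stops once every called function is either defined or already available in the scope.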
Hierarchical Code-Generation brings a number of benefits that improve the capabilities of LMPs. For one, it allows the generated code to follow abstraction practices found in the training data (most code is not written as one big flat file but rather as hierarchical compositions of function calls), so the generated code is more likely to be correct. Second, the prompt engineer would no longer have to provide all the low-level implementations of all the functions in the prompt. They can simply provide a “rough sketch” of what the code responses should look like, and the language models can fill in the missing functions at runtime. Lastly, in our experiments, we found that Hierarchical Code-Generation consistently improves code-generation performance and generalization performance to new instructions:
To deploy Code as Policies, we just need to specify the low-level robot APIs and provide the appropriate prompts; no model training is needed. This is what we mean by few-shot prompting and zero-shot training. This allows us to apply CaP to a variety of robot domains and tasks, and I’ll show some of them below (see full videos and more demos at code-as-policies.github.io):
Note that all tasks demonstrated here are new tasks that are not described in the prompt, and in many cases, even the high-level concepts those instructions require are not in the prompt.
Code as Policies is not without its drawbacks and limitations, and we’re excited for these limitations to be addressed in future works:
Lack of feedback to the language model to improve generated code over time. Giving feedback to the language model has been studied in the past, and I think it’s likely that these new language-video models can be used in the future to describe what the robot is doing and what it might be doing wrong, and that this feedback can be used to iteratively improve the generated code.
Prompt engineering is more art than science. While we provide some guidelines on prompt engineering for Code as Policies, there are no hard-and-fast rules, and some exploration is needed to deploy CaP on new domains. Recent works suggest that it may be possible to optimize or automate prompt engineering, and I expect these advances to carry over to code generation.
Hard to make guarantees about outputs. This one has real implications for safety when it comes to deploying language-model-written code in real-world scenarios. Until we can figure out how to reliably test and bound language model outputs, it’s unlikely these systems will be deemed safe and trustworthy enough to be used in real products.
As my co-author puts it, to date, Code as Policies has enabled probably the most diverse set of robot tasks that a single set of neural network weights has been able to accomplish:
Broadly speaking, I’m very optimistic about a near future where improvements in language → code continue to empower language-based user interfaces and code-based actions. Our LMP approach can be applied beyond robotics to any domain where we can reliably provide low-level APIs for the language model to interact with. It’s very hard for me to imagine a future where language models do not play a pivotal role in robotics. I believe, at the very least, language models will act as the interface layer that communicates back and forth with humans and orchestrates low-level software interfaces. If you’re interested in this area, please check out our website where we have open-sourced experiment code and a colab demo.
It’s truly an exciting time in the world of generative AI, especially code generation, and I can’t wait to see where the field is a year from now.
About the Author
Jacky Liang (@jackyliang42) is a Ph.D. candidate at Carnegie Mellon University’s Robotics Institute. His research interests are in using learning-based methods to enable robust and generalizable robot manipulation.
Copyright © 2022 Skynet Today, All rights reserved.