Getting Data for Robot AI — Hard, but Possible
Embodied AI agents need data of interactions with the real world, which is far more difficult to obtain than the data used in other AI domains like vision and language.
This is the third of a series on the challenges and opportunities of applying AI to robotics, specifically in the setting of autonomous service robots that can assist humans in everyday tasks. See the overview here.
TL;DR Deep-learning-driven AI needs a lot of data. Robot AIs need data of robots interacting with the world, which is a lot harder to get than images and texts that can be directly scraped off the Internet. Training future robot AIs will require combining multiple sources of interaction data, from human demonstrations to reinforcement learning.
Getting Robot Data is Hard
Modern successes in deep learning require large amounts of data. Current learning-based solutions to complex tasks in domains like vision and language are built on large and high-quality datasets. ImageNet contains more than a million annotated images. GPT-3 was trained on 45TB of text data. Large datasets are needed to cover the diversity of the real world, which is both high-dimensional (e.g. images) and long-tailed. Collecting such datasets is often the most important and costly part of building a deep learning AI system.
Training AI for robots needs interaction data: records of what happens when a robot physically interacts with things in the world. While large amounts of images and texts can be directly scraped off the Internet, it’s not clear how this can be done for physical interactions. To train the AIs that power intelligent robots, we have to ask: where do robot interaction data come from?
One answer that some might give is simulations. Why can’t we just simulate many robots in many environments doing many tasks in large compute clusters, collecting “infinite” data that way?
There are three problems with this. First, engineering a realistic environment and task requires a lot of work from humans and can’t be easily scaled. Second, simulators are not the real world. In some scenarios the data generated might not be accurate; in others, we don’t have good simulators at all (e.g. simulating fluids, deformable objects, zippers, books, cardboard boxes). Third, and perhaps most importantly, simulators do not magically give us useful interaction data. Somehow we still have to get the simulated robots to interact with the simulated world in meaningful ways, as controlling the robot randomly will just generate useless and uninteresting interaction data most of the time.
Simulator or not, how do we actually get interaction data to train robot AIs?
Sources of Interaction Data and their Trade-offs
The most direct way to generate interaction data is from human demonstrations, where a human performs a task, typically by controlling a robot, and we record what happened between the robot and the environment during the demonstration. There are many nuances here on what kind of data to collect, how to deal with noisy and suboptimal human demonstrations, how to make these demonstrations easy to collect and process, and how to pick what exactly to learn from such data. Learning from demonstrations has been applied to self-driving cars as far back as 1989, and it remains popular to this day.
In theory, any physical task a human controlling a robot can do, the robot can learn to do. In practice, however, there are limitations. Human time is expensive and can’t be easily scaled, especially for complex tasks and for robot morphologies that differ from the human body. Key developments here will likely rely on having a demonstration interface that can be easily used by non-experts, something we have seen progress on in recent years.
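The core idea of learning from demonstrations can be sketched as plain supervised learning: record (observation, action) pairs while a human controls the robot, then fit a policy that maps observations to the demonstrated actions. The snippet below is a minimal sketch with synthetic data and a linear least-squares policy; the data, dimensions, and model are all illustrative assumptions, not any particular system's design.

```python
import numpy as np

# Hypothetical demonstration data: each row pairs a robot observation
# (e.g., flattened sensor readings) with the action the human operator took.
rng = np.random.default_rng(0)
obs = rng.normal(size=(200, 4))                 # 200 recorded observations
true_policy = np.array([[0.5], [-1.0], [0.2], [0.0]])
actions = obs @ true_policy + 0.01 * rng.normal(size=(200, 1))  # noisy demos

# Behavior cloning: fit a policy mapping observations to demonstrated
# actions via supervised regression (ordinary least squares here).
policy, *_ = np.linalg.lstsq(obs, actions, rcond=None)

# At deployment, the learned policy predicts an action for a new observation.
new_obs = np.array([1.0, 0.0, 0.0, 0.0])
predicted_action = new_obs @ policy
```

Real systems replace the linear model with a deep network and must handle the nuances mentioned above (noisy, suboptimal, and multimodal demonstrations), but the supervised structure is the same.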
Even without directly collecting demonstrations, we can still guide data generation with human experts by collecting human feedback. This is similar in spirit to labeling datasets for supervised learning. Here, the robot interacts with the world in some way, and we ask the humans for feedback. For particular tasks, we can ask human experts to label whether or not the interaction was successful, score the interaction, or provide limited corrective actions. This type of feedback is a lot easier to obtain, but it also contains less information than full-on demonstrations.
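One simple way to use such success labels, sketched below with hypothetical rollouts, is to filter: keep only the interactions a human marked as successful and treat them as demonstration-like training data. The data structure and values here are made up for illustration.

```python
# Hypothetical rollouts from a robot attempting a task, each paired with
# a binary success label provided by a human reviewer after the fact.
rollouts = [
    {"trajectory": [0.1, 0.4, 0.9], "success": True},
    {"trajectory": [0.1, 0.2, 0.1], "success": False},
    {"trajectory": [0.0, 0.5, 1.0], "success": True},
]

# Keep only human-approved interactions; these can then be fed to a
# supervised learner much like demonstration data.
positive_data = [r["trajectory"] for r in rollouts if r["success"]]
```

More sophisticated uses of the same labels include training a success classifier to serve as a learned reward signal, but the filtering view captures why sparse feedback carries less information than a full demonstration.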
Another way to learn from human experts is to observe a human performing a task directly (e.g. in YouTube videos), rather than having a human control a robot to do the task. Recent developments suggest this is possible, to some extent (see gif below). However, this type of data is a lot harder to learn from than if we directly had robot interaction data, since we have to figure out which parts of the human-environment interactions are relevant to a robot performing the same task, and which parts are not.
Instead of requiring humans, we can obtain algorithmic demonstrations and algorithmic feedback. Here, an algorithm controls a robot, typically in simulation, to generate useful demonstration data, or provides the simpler type of feedback discussed above. You might ask: if we have an algorithm that can generate useful interactions, why not just use it to control the robot directly? Why collect data and do learning at all?
The trick is that the algorithmic experts have access to privileged abilities, such as precise models of the world, perfect sensing and controls, and generous storage or compute budgets, that the robot might not actually have during deployment in real scenarios. For example, in simulation, we know where everything is, but in reality, we only have access to sensors such as cameras. The learning-based AI will try to mimic the behaviors of the algorithmic expert while only using the non-privileged abilities.
Collecting data with algorithmic experts is much easier to scale than with human experts. The limitation is that we can’t easily write algorithms for all tasks, even if we have privileged information, and sometimes directly computing what to do just takes too long.
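This privileged teacher, non-privileged student setup can be sketched concretely. In the toy example below (all names and dynamics are illustrative assumptions), the expert computes the ideal action from the exact simulated state, while the student must learn the same behavior from a noisy sensor reading, the only input it would have on a real robot.

```python
import numpy as np

rng = np.random.default_rng(1)

# Privileged "algorithmic expert": in simulation it reads the exact object
# position and computes an ideal action directly (a proportional move
# toward the goal, standing in for a full planner or controller).
def expert_action(true_position, goal=1.0):
    return goal - true_position

# Generate training data: the expert acts on the TRUE state, but we record
# only the non-privileged sensor reading the real robot would have.
true_positions = rng.uniform(0.0, 2.0, size=500)
sensor_readings = true_positions + 0.05 * rng.normal(size=500)  # noisy estimate
expert_actions = expert_action(true_positions)

# The student policy learns sensor -> action, so at deployment it can run
# without privileged state. A linear fit with a bias term suffices here.
X = np.stack([sensor_readings, np.ones_like(sensor_readings)], axis=1)
w, *_ = np.linalg.lstsq(X, expert_actions, rcond=None)

def student_action(sensor_reading):
    return w[0] * sensor_reading + w[1]
```

The student ends up approximating the expert's rule (action ≈ goal − position) using only the sensor it will actually have, which is the essence of distilling privileged expertise into a deployable policy.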
The most flexible, powerful, but perhaps the most difficult way to collect interaction data is via Reinforcement Learning (RL), which can be done in both simulations and the real world. In RL, we train an AI that continuously interacts with the environment and learns from these interactions to better complete a task or set of tasks. There are many flavors of RL and many specific challenges for applying RL to robotics. Different RL algorithms have different strategies that determine how to interact with the world when little is known, and how to balance learning new things about the world (exploration) versus improving what’s already learned to complete the task of interest (exploitation).
RL is flexible and powerful because it often requires less domain-specific human effort than the previous methods of collecting interaction data. It’s also really difficult to do, because efficient exploration is difficult, and efficient exploration to generate diverse data that leads to generalization is even more difficult. For a lot of tasks, engineering a “reward” function that speeds up exploration takes a lot of effort, and engineering simulation or real-world environments to do continuous learning also takes a lot of time. For example, in a lot of real-world RL setups, there will be carefully designed mechanisms, like the automated tray flippers in the gif above, that reset the environment, so the robots can continuously interact with the world without human intervention.
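The exploration-exploitation trade-off at the heart of RL can be illustrated with a deliberately tiny example: a bandit-style loop where a robot chooses among a few strategies with unknown success rates. The epsilon-greedy rule below is one of the simplest such strategies; the scenario and numbers are invented for illustration.

```python
import random

random.seed(0)

# Hypothetical stand-in for a robot choosing among three grasp strategies,
# each with an unknown probability of success.
success_prob = [0.2, 0.5, 0.8]
estimates = [0.0, 0.0, 0.0]   # learned estimate of each strategy's value
counts = [0, 0, 0]
epsilon = 0.1                 # fraction of attempts spent exploring

for step in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(3)              # explore: try something new
    else:
        arm = estimates.index(max(estimates))  # exploit: use what works
    reward = 1.0 if random.random() < success_prob[arm] else 0.0
    counts[arm] += 1
    # Incremental average of observed rewards for the chosen strategy.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

best = estimates.index(max(estimates))
```

Even this toy shows the tension the article describes: time spent exploring generates diverse data but costs performance, while pure exploitation can lock the robot into a mediocre strategy. Full robot RL adds the much harder problems of high-dimensional observations, long-horizon tasks, and reward design.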
Training AIs for robots is different from, and more challenging than, doing so in domains like vision and language. On the data collection side, many open questions remain to be answered, such as: How to best leverage data from different robots, environments, and tasks? How to weigh the costs and benefits of different data sources? How to make the best use of such data to enable efficient and generalizable learning?
Looking forward, mature and capable data-driven robot learning systems will likely use multiple sources of interaction data. DeepMind recently outlined one such approach, which combines real-world human demonstrations, human feedback, and real-world RL. Of course, this is not the only way to go about it, but it is a sign of things to come. There is no one best way to obtain robot interaction data, and successful systems will need to combine multiple sources and leverage their different strengths.
About the Author
Jacky Liang (@jackyliang42) is a PhD candidate at Carnegie Mellon University’s Robotics Institute. His research interests are in using learning-based methods to enable robust and generalizable robot manipulation.
Copyright © 2021 Skynet Today, All rights reserved.