Abstract
Deep reinforcement learning (RL) has emerged as a promising approach for autonomously acquiring complex behaviors from low-level sensor observations. Although a large portion of deep RL research has focused on applications in video games and simulated control, which do not reflect the constraints of learning in real environments, deep RL has also demonstrated promise in enabling physical robots to learn complex skills in the real world. At the same time, real-world robotics provides an appealing domain for evaluating such algorithms, as it connects directly to how humans learn: as embodied agents in the real world. Learning to perceive and move in the real world presents numerous challenges, some of which are easier to address than others, and some of which are often not considered in RL research that focuses only on simulated domains. In this review article, we present a number of case studies involving robotic deep RL. Building off of these case studies, we discuss commonly perceived challenges in deep RL and how they have been addressed in these works. We also provide an overview of other outstanding challenges, many of which are unique to the real-world robotics setting and are not often the focus of mainstream RL research. Our goal is to provide a resource for both roboticists and machine learning researchers who are interested in furthering the progress of deep RL in the real world.
1. Introduction
Robotic learning lies at the intersection of machine learning and robotics. From the perspective of a machine learning researcher interested in studying intelligence, robotics is an appealing medium to study, as it provides a lens into the constraints that humans and animals encounter when learning, uncovering aspects of intelligence that might not otherwise be apparent when we restrict ourselves to simulated environments. For example, robots receive streams of raw sensory observations as a consequence of their actions, and cannot practically obtain large amounts of detailed supervision beyond observing these sensor readings. This makes for a challenging but highly realistic learning problem. Further, unlike agents in video games, robots do not readily receive a score or reward function that is shaped for their needs, and instead need to develop their own internal representation of progress towards goals. From the perspective of robotics research, learning-based techniques are appealing because they can enable robots to operate in less-structured environments, to handle unknown objects, and to learn state representations suitable for multiple tasks.
Although robotics is an interesting medium, there is a significant barrier for a machine learning researcher to enter robotics, and vice versa. Beyond the cost of a robot, there are many design choices in how to set up the algorithm and the robot. For example, reinforcement learning (RL) algorithms require learning from experience that the robot autonomously collects itself, opening up many choices in how the learning is initialized, how to prevent unsafe behavior, and how to define the goal or reward. Likewise, machine learning and RL algorithms also involve a number of important design choices and hyperparameters that can be tricky to select.
Motivated by these challenges for researchers in both fields, our goal in this article is to provide a high-level overview of how deep RL can be approached in a robotics context, summarize the ways in which key challenges in RL have been addressed in some of our own previous work, and provide a perspective on major challenges that remain to be solved, many of which are not yet the subject of active research in the RL community.
There have been several high-quality survey articles on applying machine learning to robotics. Deisenroth et al. (2013) focused on policy search techniques for robotics, whereas Kober et al. (2013) focused on RL more broadly. More recently, Kroemer et al. (2019) reviewed learning algorithms for manipulation tasks. Sünderhauf et al. (2018) identified current areas of research in deep learning that are relevant to robotics, and described several challenges in applying deep learning techniques to robotics.
Instead of writing another comprehensive literature review, we first center our discussion around three case studies from our own prior work. We then provide an in-depth discussion of a few topics that we consider especially important given our experience. This article naturally includes numerous opinions. When sharing our opinions, we do our best to ground our recommendations in empirical evidence, while also discussing alternative options. We hope that, by documenting these experiences and our practices, we can provide a useful resource both for roboticists interested in using deep RL and for machine learning researchers interested in working with robots.
2. Background
In this section, we provide a brief, informal introduction to RL by contrasting it with classical techniques for programming robot behavior. A robotics problem is formalized by defining a state space, an action space, and the dynamics that describe how actions influence the state of the system. The state space includes internal states of the robot as well as the state of the world that is intended to be controlled. Quite often, the state is not directly observable; instead, the robot is equipped with sensors, which provide observations that can be used to infer the state. The goal may be defined either as a target state to be achieved, or as a reward function to be maximized. We want to find a controller (known as a policy in RL parlance) that maps states to actions in a way that maximizes the reward when executed.
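To make this objective concrete, it can be written in a standard form; the discount factor and finite horizon below are generic modeling choices rather than specifics of the case studies that follow:

```latex
\max_{\pi} \; J(\pi) \;=\; \mathbb{E}_{\tau \sim p_\pi(\tau)}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right],
\qquad
p_\pi(\tau) \;=\; p(s_0)\prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),
```

where s_t and a_t denote the state and action at time t, r is the reward function, gamma is a discount factor, T is the horizon, and p(s_{t+1} | s_t, a_t) describes the system dynamics.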
If the states can be directly or indirectly observed, and a model of the system dynamics is known, the problem can be solved with classical methods such as planning or optimal control. These methods use knowledge of the dynamics model to search for sequences of actions that, when applied from the start state, take the system to the desired goal state or maximize the achieved reward. However, if the dynamics model is unknown, the problem falls into the realm of RL (Sutton and Barto 2018). In the paradigm of RL, samples of state-action sequences (trajectories) are required in order to learn how to control the robot and maximize the reward. In model-based RL, the samples are used to learn a dynamics model of the environment, which in turn is used in a planning or optimal control algorithm to produce a policy or a sequence of controls. In model-free RL, the dynamics are not explicitly modeled; instead, the optimal policy or value function is learned directly by interacting with the environment. Both model-based and model-free RL have their own strengths and weaknesses, and the choice of algorithm depends heavily on the properties required by the application. These considerations are discussed further in Sections 3 and 4.
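To make the distinction concrete, the following toy sketch (our own illustration, not code from any of the systems discussed later) contrasts the two paradigms on a five-state chain: model-free tabular Q-learning updates a value estimate directly from sampled transitions, whereas the model-based variant first fits a transition and reward model from the same kind of samples and then plans with value iteration.

```python
# Toy comparison of model-free and model-based RL on a 5-state chain:
# action 0 moves left, action 1 moves right, reward 1 only at the rightmost state.
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9
rng = np.random.default_rng(0)

def step(s, a):
    """Toy dynamics used only for this illustration."""
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s_next, float(s_next == N_STATES - 1)

# --- Model-free: tabular Q-learning, updating Q directly from sampled transitions. ---
Q = np.zeros((N_STATES, N_ACTIONS))
for _ in range(2000):
    s, a = rng.integers(N_STATES), rng.integers(N_ACTIONS)   # random exploration
    s_next, r = step(s, a)
    target = r + GAMMA * Q[s_next].max()
    Q[s, a] += 0.1 * (target - Q[s, a])                      # TD update, no dynamics model

# --- Model-based: fit a dynamics/reward model from samples, then plan with value iteration. ---
T_counts = np.zeros((N_STATES, N_ACTIONS, N_STATES))
R_model = np.zeros((N_STATES, N_ACTIONS))
for _ in range(2000):
    s, a = rng.integers(N_STATES), rng.integers(N_ACTIONS)
    s_next, r = step(s, a)
    T_counts[s, a, s_next] += 1
    R_model[s, a] = r
T_model = T_counts / np.maximum(T_counts.sum(axis=2, keepdims=True), 1)

V = np.zeros(N_STATES)
for _ in range(100):                                         # value iteration on the learned model
    V = (R_model + GAMMA * T_model @ V).max(axis=1)

print("Model-free greedy policy: ", Q.argmax(axis=1))
print("Model-based greedy policy:", (R_model + GAMMA * T_model @ V).argmax(axis=1))
```

Both procedures recover the same greedy policy on this toy problem; they differ in how they use the sampled experience, which is exactly the trade-off discussed above.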
3. Case studies in robotic deep RL
In this section, we present a few case studies of applications of deep RL to various robotic tasks that we have studied. The applications span manipulation, grasping, and legged locomotion. The sensory inputs used range from low-dimensional proprioceptive state information to high-dimensional camera pixels, and the action spaces include both continuous and discrete actions.
By consolidating our experiences from these case studies, we seek to derive a common understanding of the kinds of robotic tasks that are tractable to solve with deep RL today. Using these case studies as a backdrop, we point readers to commonly encountered challenges that remain to be solved, which we discuss in Section 4.
3.1. Learning manipulation skills
Reinforcement learning of individual robotic skills has a long history (Daniel et al., 2013; Ijspeert et al., 2002; Kober et al., 2013; Konidaris et al., 2012; Manschitz et al., 2014; Peters et al., 2010; Peters and Schaal, 2008). Deep RL provides some appealing capabilities in this regard: deep neural network policies can alleviate the need to manually design policy classes, provide a moderate amount of generalization to variable initial conditions, and, perhaps most importantly, allow for end-to-end joint training of perception and control, learning to directly map high-dimensional sensory inputs, such as images, to control outputs. Of course, such end-to-end training itself presents a number of challenges, which we also discuss. Below, we present a few case studies on single-task deep robotic learning with a variety of different methods, including model-based and model-free algorithms, and with different starting assumptions.
3.1.1. Guided policy search
Guided policy search methods (Levine et al., 2016) were among the first deep RL methods that could be tractably applied to learn individual neural network skills for image-based manipulation tasks. The basic principle behind these methods is that the neural network policy is “guided” by another RL method, typically a model-based RL algorithm. The neural network policy is referred to as a global policy, and is trained to perform the task successfully from raw sensory observations and under moderate variability in the initial conditions. For example, as shown in Figure 1, the global policy might be required to put the red shape into the shape-sorting cube at different positions. This requires the policy to implicitly determine the position of the hole. However, this is not supervised directly; instead, the perception mechanism is learned end-to-end together with control. Supervision is provided by multiple individual model-based learners that learn separate local policies to insert the shape into the hole at a specific position. In the case of the experiment illustrated in Figure 1, nine local policies were trained for nine different cube positions, and a single global policy was then trained to perform the task from images. Typically, the local policies do not use deep RL, and do not use image inputs. They instead use observations that reflect the low-dimensional, “true” state of the system, such as the position of the shape-sorting cube in the previous example, in order to learn more efficiently. Local policies can be trained with model-based methods such as LQR-FLM (Levine and Abbeel, 2014; Levine et al., 2016), which uses a linear quadratic regulator (LQR) with fitted time-varying linear models, or with model-free techniques such as PI2 (Chebotar et al., 2017a,b).
A full theoretical treatment of the guided policy search algorithm is outside the scope of this article, and we refer the reader to prior work on this topic (Levine and Abbeel, 2014; Levine et al., 2016; Levine and Koltun, 2013).
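The core supervised step can nonetheless be illustrated compactly. The sketch below is a deliberately simplified illustration, not the authors' implementation: the local controllers are stand-ins for trained LQR-FLM or PI2 policies, the "global policy" is a linear regressor standing in for a deep network on images, and the alternating optimization and constraints that keep the local and global policies consistent are omitted.

```python
# Simplified sketch of the supervised phase of guided policy search: several "local"
# controllers, each specialized to one goal position, generate (observation, action)
# pairs; a single "global" policy is then fit by regression on the aggregated data.
import numpy as np

rng = np.random.default_rng(0)
goals = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]  # e.g., cube positions

def local_policy(state, goal, gain=2.0):
    """Stand-in for a trained local controller acting on the low-dimensional 'true' state."""
    return gain * (goal - state)

# Roll out each local controller and collect training data for the global policy.
observations, actions = [], []
for goal in goals:
    state = np.zeros(2)
    for _ in range(20):
        action = local_policy(state, goal)
        # The global policy's observation would be raw pixels in the real system; here we
        # use a simple feature vector containing the state and goal as a placeholder.
        observations.append(np.concatenate([state, goal]))
        actions.append(action)
        state = state + 0.1 * action                      # toy dynamics

X, Y = np.array(observations), np.array(actions)

# "Global policy": linear regression standing in for a deep network trained on the same data.
W, *_ = np.linalg.lstsq(np.hstack([X, np.ones((len(X), 1))]), Y, rcond=None)

def global_policy(obs):
    return np.append(obs, 1.0) @ W

print("Local action: ", local_policy(np.zeros(2), goals[0]))
print("Global action:", global_policy(np.concatenate([np.zeros(2), goals[0]])))
```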
An important point of discussion for this article, however, is the set of assumptions underlying guided policy search methods. Typically, such methods assume that the local policies can be optimized with simple, “shallow” RL methods, such as LQR-FLM or PI2. This assumption is reasonable for robotic manipulation tasks trained in laboratory settings, but can prove difficult in (1) open-world environments where the low-level state of the system cannot be effectively measured and in (2) settings where resetting the environment poses a challenge. For example, in the experiment in Figure 1, the robot is holding the cube in its left arm during training, so that the position of the cube can be provided to the low-level policies and so that the robot can automatically reposition the cube into different positions deterministically. We discuss these challenges in more detail in Sections 4.12 and 4.2.3.
Nonetheless, for learning individual robotic skills, guided policy search methods have been applied widely to a broad range of behaviors, ranging from inserting objects into containers and putting caps on bottles (Levine et al., 2016) to opening doors (Chebotar et al., 2017b) and shooting hockey pucks (Chebotar et al., 2017a). In most cases, guided policy search methods are very efficient in terms of the number of samples, particularly compared to model-free RL algorithms, because the model-based local policy learners can acquire the local solutions quickly and efficiently. Image-based tasks can typically be learned in a few hundred trials, corresponding to 2–3 hours of real-world training, including all resets and network training time (Chebotar et al., 2017a; Levine et al., 2016).
3.1.2. Model-free skill learning
Model-free RL algorithms lift some of the limitations of guided policy search, such as the need to decompose a task into multiple distinct and repeatable initial states or the need for a model-based optimizer that typically operates on a low-dimensional state representation, but at the cost of a substantial increase in the required number of samples. For example, the Lego block stacking experiment reported by Haarnoja et al. (2018a) required a little over 2 hours of interaction, whereas comparable Lego block stacking experiments reported by Levine et al. (2015) required about 10 minutes of training. The gap in training time tends to close somewhat when we consider tasks with more variability: guided policy search generally requires a linear increase in the number of samples with more initial states, whereas model-free algorithms can better integrate experience from multiple initial states and goals, typically with a sub-linear increase in sample requirements. As model-free methods generally do not require a lower-dimensional state for model-based trajectory optimization, they can also be applied to tasks that can only be defined on images, without an explicit representation learning phase.
Although there is a long history of model-free RL in robotics (Daniel et al., 2013; Ijspeert et al., 2002; Kober et al., 2013; Konidaris et al., 2012; Manschitz et al., 2014; Peters et al., 2010; Peters and Schaal, 2008), modern model-free deep RL algorithms have been used more recently for tasks such as door opening (Gu et al., 2017) and assembly and stacking of objects (Haarnoja et al., 2018a) with low-dimensional state observations. These methods were generally based on off-policy actor–critic designs, such as DDPG or NAF (Gu et al., 2016; Lillicrap et al., 2015), soft Q-learning (Haarnoja et al., 2018a,b), and soft actor–critic (SAC; Haarnoja et al., 2019). An illustration of some of these tasks is shown in Figure 2. From our experience, we generally found that simple manipulation tasks, such as opening doors and stacking Lego blocks, either with a single position or some variation in position, can be learned in 2–4 hours of interaction, with either torque control or end-effector position control. Incorporating demonstration data and other sources of supervision can further accelerate some of these methods (Riedmiller et al., 2018; Večerík et al., 2017). Section 4.2 describes other techniques to make these approaches more sample efficient.
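As a rough illustration of how such off-policy methods consume robot experience, the following sketch shows a single actor–critic update in the spirit of DDPG; SAC additionally uses a stochastic policy and entropy regularization, and practical implementations add target networks and other stabilizing details, all of which are omitted here. The batch is random dummy data standing in for transitions sampled from a replay buffer of robot experience.

```python
# Minimal off-policy actor-critic update sketch (our illustration, not the cited implementations).
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

def update(batch):
    obs, act, rew, next_obs, done = batch
    # Critic: regress Q(s, a) toward r + gamma * Q(s', pi(s')) on off-policy replay data.
    with torch.no_grad():
        next_q = critic(torch.cat([next_obs, actor(next_obs)], dim=-1))
        target = rew + gamma * (1.0 - done) * next_q
    q = critic(torch.cat([obs, act], dim=-1))
    critic_loss = ((q - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: adjust the policy so that the critic's value of its actions increases.
    actor_loss = -critic(torch.cat([obs, actor(obs)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return critic_loss.item(), actor_loss.item()

# Dummy replayed batch of transitions (replace with data from the robot's replay buffer).
batch = (torch.randn(32, obs_dim), torch.rand(32, act_dim) * 2 - 1,
         torch.randn(32, 1), torch.randn(32, obs_dim), torch.zeros(32, 1))
print(update(batch))
```

Because the critic target is computed from replayed transitions rather than fresh rollouts, the same robot experience can be reused many times, which is the main source of the sample-efficiency advantage over on-policy methods discussed next.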
Although most model-free deep RL algorithms that have been applied to learn manipulation skills directly from real-world data have used off-policy algorithms based on Q-learning (Gu et al., 2017; Haarnoja et al., 2018a) or actor–critic designs (Haarnoja et al., 2018b), on-policy policy gradient algorithms have also been used. Although standard configurations of these methods can require around 10 times the number of samples of off-policy algorithms, on-policy methods such as TRPO (Schulman et al., 2015), NPG (Kakade, 2002), and PPO (Schulman et al., 2017) can be tuned to be only two or three times less efficient than off-policy algorithms on some tasks (Peng et al., 2019). In some cases, this increased sample requirement may be justified by ease of use, better stability, and better robustness to suboptimal hyperparameter settings. On-policy policy gradient algorithms have been used to learn tasks such as peg insertion (Lee et al., 2019), targeted throwing (Ghadirzadeh et al., 2017), and dexterous manipulation (Zhu et al., 2019) directly on real-world hardware, and can be further accelerated with example demonstrations (Zhu et al., 2019).
Although, in principle, model-free deep RL algorithms should excel at learning directly from raw image observations, in practice this is a particularly difficult training regime, and good real-world results in this setting have only been obtained recently, with accompanying improvements in the efficiency and stability of off-policy model-free RL methods (Fujimoto et al., 2018; Haarnoja et al., 2019, 2018b). The SAC algorithm can learn tasks in the real world directly from images (Haarnoja et al., 2019; Singh et al., 2019), and several other recent works have studied real-world learning from images (Schoettler et al., 2019; Schwab et al., 2019).
All of these experiments were conducted in relatively constrained laboratory environments, and although the learned skills use raw image observations, they generally have limited robustness to realistic visual perturbations and can only handle the specific objects on which they are trained. We discuss in Section 3.2 how image-based deep RL can be scaled up to enable meaningful generalization. Furthermore, a major challenge in learning from raw image observations in the real world is the problem of reward specification: if the robot needs to learn from raw image observations, it also needs to evaluate the reward function from raw image observations, which itself can require a hand-designed perception system, partly defeating the purpose of learning from images in the first place, or otherwise require extensive instrumentation of the environment (Zhu et al., 2019). We discuss this challenge further in Section 4.9.
3.1.3. Learning predictive models for multiple skills with visual foresight
Although there are situations where a single skill is all that a robot needs, this is not sufficient for general-purpose robots, for which learning each skill from scratch is impractical. In such cases, there is a great deal of knowledge that can be shared across tasks to speed up learning. In this section, we discuss one particular case study of scalable multi-task learning of vision-based manipulation skills, with a focus on tasks that require pushing or picking and placing objects. Unlike in the previous section, if our goal is to learn many tasks with many objects, a challenge discussed in detail in Section 4.5, it will be most practical to learn from data that can be collected at scale, without human supervision or even a human attending the robot. As a result, it becomes imperative to remove assumptions such as regular resets of the environment or a carefully instrumented environment for measuring reward.
Motivated by these challenges, the visual foresight approach (Ebert et al., 2018; Finn and Levine, 2017) leverages large batches of off-policy, autonomously collected experience to train an action-conditioned video prediction model, and then uses this model to plan to accomplish tasks. The key intuition of this approach is that knowledge learned about physics and dynamics can be shared across tasks and largely decoupled from goal-centric knowledge. These models are trained using streams of robot experience, consisting of the observed camera images and actions taken, without assumptions about reward information. After training, a human provides a goal, by providing an image of the goal or by indicating that an object corresponding to a specified pixel should be moved to a desired position. Then, the robot performs an optimization over action sequences in an effort to minimize the distance between the predicted future and the desired goal.
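A simplified sketch of this planning loop is shown below (our illustration, not the visual foresight codebase): candidate action sequences are sampled, scored by how closely the learned predictor's output matches the goal, and the sampling distribution is refined over a few iterations in the style of the cross-entropy method; the first action of the resulting sequence is executed and the process repeats.

```python
# Planning with a learned action-conditioned predictor (sampling-based, CEM-style refinement).
import numpy as np

rng = np.random.default_rng(0)
horizon, act_dim, n_samples, n_elite, n_iters = 5, 2, 64, 8, 3

def predict(obs, action_seq):
    """Stand-in for the learned video prediction model: roll the observation forward under actions."""
    pred = obs.copy()
    for a in action_seq:
        pred = pred + a          # toy "dynamics"; the real model predicts future camera images
    return pred

def plan(obs, goal):
    mean, std = np.zeros((horizon, act_dim)), np.ones((horizon, act_dim))
    for _ in range(n_iters):
        seqs = mean + std * rng.standard_normal((n_samples, horizon, act_dim))
        # Cost: distance between the predicted outcome and the user-specified goal
        # (a goal image or a designated pixel's target position in the real system).
        costs = np.array([np.linalg.norm(predict(obs, seq) - goal) for seq in seqs])
        elite = seqs[np.argsort(costs)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean[0]               # execute the first action, then replan (model-predictive control)

print(plan(obs=np.zeros(act_dim), goal=np.array([1.0, -0.5])))
```

Because the predictor is trained without reward labels, the same model can be reused for many goals simply by changing the cost in this loop.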
This algorithm has been used to complete object rearrangement tasks such as grasping an apple and putting it on a plate, reorienting a stapler, and pushing other objects into desired configurations (Ebert et al., 2018; Finn and Levine, 2017). Further, it has been used for visual reaching tasks (Byravan et al., 2018), object pushing and trajectory-following tasks (Yen-Chen et al., 2020), satisfying relative object positioning tasks (Xie et al., 2018), and cloth manipulation tasks such as folding shorts, covering an object with a towel, and rearranging the sleeve of a shirt (Ebert et al., 2018). Importantly, each collection of tasks can be performed using a single learned model and planning approach, rather than having to retrain a policy for each individual task or object. This generalization results precisely from the algorithm's ability to leverage broad, autonomously collected datasets with hundreds of objects, and the ability to train reusable, task-agnostic models from this data.
Despite these successes, there are a number of limitations and challenges that we highlight here. First, although the data collection process does not require human involvement, it uses a specialized set-up with the robot in front of a bin with tilted edges that ensure that objects do not fall out, along with an action space that is constrained within the bin. This allows continuous, unattended data collection, discussed further in Section 4.7. Outside of laboratory settings, however, collecting data in unconstrained, open-world environments introduces a number of important challenges, which we discuss in Section 4.12. Second, inaccuracies in the model and reward function can be exploited by the planner, leading to inconsistencies in performance. We discuss these challenges in Sections 4.6 and 4.9. Finally, finding plans for complex tasks poses a challenging optimization problem for the planner, which can be addressed to some degree using demonstrations (for details, see Section 4.4). This has enabled the models to be used for tool-use tasks such as sweeping trash into a dustpan, wiping objects off a plate with a sponge, and hooking out-of-reach objects with a hook (Xie et al., 2019).
3.2. Learning to grasp with deep RL
Learning to grasp remains one of the most significant open problems in robotics: it requires complex interaction with previously unseen objects and closed-loop vision-based control to react to unforeseen dynamics or situations. Indeed, most object interaction behaviors require grasping the object as the first step. Prior work typically tackles grasping as the problem of identifying suitable grasp locations (Mahler et al., 2018; Morrison et al., 2018b; ten Pas et al., 2017; Zeng et al., 2018), rather than as an explicit control problem. The motivation for this problem definition is that it allows the visual problem to be separated completely from control, which then reduces to an open-loop execution problem. This separation significantly simplifies the problem. The drawback is that this approach cannot adapt to dynamic environments or refine its strategy while executing the grasp. Can deep RL provide us with a mechanism to learn to grasp directly from experience, and as a dynamic and interactive process?
A number of works have studied closed-loop grasping (Hausman et al., 2017; Levine et al., 2018; Viereck et al., 2017; Yu and Rodriguez, 2018). In contrast to these methods, which frame closed-loop grasping as a servoing problem, QT-Opt (Kalashnikov et al., 2018) uses a general-purpose RL algorithm to solve the grasping task, which enables multi-step reasoning; in other words, the policy can be optimized across the entire trajectory. In practice, this enables the method to autonomously acquire complex grasping strategies, some of which we illustrate in Figure 4. This method is also entirely self-supervised, using only grasp outcome labels that are obtained automatically by the robot. Several works have proposed self-supervised grasping systems (Levine et al., 2018; Pinto and Gupta, 2016), but to the best of the authors’ knowledge, this method is the first to incorporate multi-step optimization via RL into a generalizable vision-based system trained on self-supervised real-world data.
Related to this work, Zeng et al. (2018) recently proposed a Q-learning framework for combining grasping and pushing. QT-Opt utilizes a much more flexible action space, directly commanding gripper motion in all degrees of freedom in three dimensions, and exhibits substantially better performance and generalization. Finally, in contrast to many current grasping systems that utilize depth sensing (Mahler et al., 2018; Morrison et al., 2018a) or wrist-mounted cameras (Morrison et al., 2018a; Viereck et al., 2017), QT-Opt operates on raw monocular RGB observations from an over-the-shoulder camera that does not need to be calibrated. The performance of QT-Opt indicates that learning can achieve excellent grasp success rates even with this rudimentary sensing set-up.
In this work, we focus on evaluating the success rate of the policy when grasping objects never seen during training from a bin, using top-down grasps (four degrees of freedom). This task definition simplifies some robot safety challenges, which are discussed further in Section 4.11. However, the problem retains the aspects that have been hard to deal with: unknown object dynamics and geometry, vision-based closed-loop control, self-supervision, and hand–eye coordination learned without calibrating the system (camera and gripper locations, as well as workspace bounds, are not given to the policy).
For this specific task, QT-Opt reaches 86% grasp success when learning entirely from data collected in previous experiments, which we refer to as offline data, and quickly reaches 96% success with an additional 28,000 grasps of online data collected during a joint fine-tuning phase. These results show that RL can be scalable and practical for real robotic applications: previously collected experience (offline data) can be reused, potentially training purely offline with no additional robot interaction, or combined with online data collection (which we call joint fine-tuning). Leveraging offline data makes deep RL a practical approach for robotics because it allows the training dataset to be scaled to a size large enough for generalization, either with a small fleet of seven robots collecting over a period of a few months, or, by leveraging simulation, with a collection effort of just a few days (James et al., 2019; Rao et al., 2020) (see Section 4.3 for more examples of sim-to-real techniques).
Because the policy is learned by optimizing the reward across the entire trajectory (optimizing for long-term reward using Bellman backups), and is constantly replanning its next move with vision as an input, it can learn complex behaviors in a self-supervised manner that would have been hard to program, such as singulation, pregrasp manipulation, dealing with a cluttered scene, retrial behaviors, and handling environment disturbances and dynamic objects (Figure 4). Retrial behaviors emerge because the policy reacts to the visual input at every step: the images may reveal that the object dropped after the gripper lifted it from the bin, and the policy can then reattempt the grasp at the new location where the object fell.
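The following sketch illustrates the core of this idea in simplified form (our illustration, not the actual QT-Opt system, which evaluates the Q-function on camera images and uses a cross-entropy-method optimizer over a much larger network): the maximization inside the Bellman backup is approximated by scoring sampled candidate actions rather than by an explicit actor network, and the terminal reward is the automatically detected grasp outcome.

```python
# Q-learning with continuous actions where argmax_a Q(s, a) is approximated by sampling.
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, n_candidates = 16, 4, 0.9, 64
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def best_action(obs):
    """Approximate argmax_a Q(obs, a) by scoring sampled candidate actions."""
    candidates = torch.rand(n_candidates, act_dim) * 2 - 1        # random proposals in [-1, 1]
    obs_tiled = obs.expand(n_candidates, -1)
    with torch.no_grad():
        scores = q_net(torch.cat([obs_tiled, candidates], dim=-1)).squeeze(-1)
    return candidates[scores.argmax()]

def bellman_target(reward, next_obs, done):
    """Target for the Q regression: grasp outcome reward plus discounted value of replanning."""
    with torch.no_grad():
        next_q = q_net(torch.cat([next_obs, best_action(next_obs)], dim=-1))
    return reward + gamma * (1.0 - done) * next_q

# Example: a terminal transition labeled by the automatically detected grasp outcome (success = 1).
obs, next_obs = torch.randn(obs_dim), torch.randn(obs_dim)
print(bellman_target(reward=torch.tensor([1.0]), next_obs=next_obs, done=torch.tensor([1.0])))
```

At execution time, the same sampling-based maximization selects the next gripper motion from the current camera image, which is what allows the policy to replan at every step.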
Section 4.2 describes some of the design principles we used to obtain good data efficiency. Section 4.5 discusses strategies that allowed us to generalize properly to unseen objects. Section 4.7 describes how we managed to scale to seven robots with one human operator, as well as enable 24/7 operation. Section 4.4 discusses how we side-stepped exploration challenges by leveraging scripted policies.
The lessons from this work are: (1) a large amount of varied data is required to learn generalizable grasping, which means that we need unattended data collection and a scalable RL pipeline; (2) the need for large and varied data means that we must leverage all previously collected data (offline data), and a framework that makes this easy is crucial; (3) to achieve maximal performance, combining offline data with a small amount of online data allowed us to go from 86% to 96% grasp success.
3.3. Learning legged locomotion
Although walking and running seem effortless activities for us, designing locomotion controllers for legged robots is a long-standing challenge (Raibert, 1986). RL holds the promise of automatically designing high-performance locomotion controllers (Ha et al., 2018; Hwangbo et al., 2019; Kohl and Stone, 2004; Lee et al., 2020; Tedrake et al., 2015). In this case study, we apply deep RL techniques on the Minitaur robot (Figure 5), a mechanically simple and low-cost quadruped platform (De, 2017). We have overcome significant challenges and developed various learning-based approaches, with which agile and stable locomotion gaits emerge automatically.
Simulation is an important prototyping tool for robotics, which can help to bypass many challenges of learning on real systems, such as data efficiency and safety. In fact, most of the prior work used simulation (Brockman et al., 2016; Coumans and Bai, 2016) to evaluate and benchmark the learning algorithms (Hämäläinen et al., 2015; Heess et al., 2017; Peng et al., 2018a; Yu et al., 2018). Using general-purpose RL algorithms and a simple reward for walking fast and efficiently, we can train the quadruped robot to walk in simulation within 2–3 hours. However, a policy learned in simulation usually does not work well on the real robot. This performance gap is known as the reality gap. Our research has identified the key causes of this gap and developed various solutions. Please refer to Section 4.3 for more details. With these sim-to-real transfer techniques, we can successfully deploy the controllers learned in simulation on the robots with zero or only a handful of real-world experiments (Tan et al., 2018; Yu et al., 2019). Without much prior knowledge and manual tuning, the learning algorithm automatically finds policies that are more agile and energy efficient than the controllers developed with the traditional approaches.
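For reference, the kind of simple reward referred to above typically rewards forward progress while penalizing energy consumption. The sketch below shows one such formulation; the specific terms and the weight are our illustrative choices rather than the exact reward used in these experiments.

```python
# Illustrative locomotion reward: forward progress minus a weighted energy penalty,
# where energy is approximated as the sum of |torque * joint velocity| over one control step.
import numpy as np

def locomotion_reward(x_before, x_after, torques, joint_velocities, dt, energy_weight=0.005):
    forward_progress = x_after - x_before                      # distance traveled along the x-axis
    energy = np.sum(np.abs(np.asarray(torques) * np.asarray(joint_velocities))) * dt
    return forward_progress - energy_weight * energy

# Example: the robot advanced 2 cm during one 0.01 s control step.
print(locomotion_reward(0.10, 0.12, torques=[1.2, -0.8, 0.5, 0.3],
                        joint_velocities=[2.0, -1.5, 1.0, 0.5], dt=0.01))
```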
Given the initial policies learned in simulation, it is important that the robots can continue their learning process in the real world in a life-long fashion, to adapt their policies to changing dynamics and operating conditions. There are three main challenges for real-world learning of locomotion skills. The first is sample efficiency. Deep RL often needs tens of millions of data samples to learn meaningful locomotion gaits, which can take months of data collection on the robot. This is further exacerbated by the need for extensive hyperparameter tuning. We have developed novel solutions that significantly reduce the sample complexity (Section 4.2) and the need for hyperparameter tuning (Section 4.1).
Robot safety is another bottleneck for real-world training. During the exploration stage of learning, the robot often tries noisy actuation patterns that cause jerky motions and severe wear and tear on the motors. In addition, because the robot has yet to master balancing skills, repeated falls quickly damage the hardware. We discuss in Section 4.11 several techniques that we employ to mitigate safety concerns when learning locomotion with real robots.
The last challenge is asynchronous control. On a physical robot, sensor measurements, neural network inference, and action execution usually happen simultaneously and asynchronously. The observation that the agent receives may not be the latest owing to computation and communication delays. This asynchrony breaks the fundamental assumption of the Markov decision process (MDP). Consequently, the performance of many deep RL algorithms drops dramatically in the presence of asynchronous control. In locomotion tasks, however, asynchronous control is essential to achieve a high control frequency. In other words, to learn to walk, the robot has to think and act at the same time. We discuss our solutions to this challenge in Section 4.8, for both model-free and model-based learning algorithms.
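One common remedy, shown below as an illustration rather than as the specific solution described in Section 4.8, is to augment the observation with a short history of recent observations and actions, so that the policy input carries enough information to compensate for sensing and actuation delays.

```python
# Observation augmentation with a short history of observations and actions,
# a common way to mitigate the non-Markovian effects of latency and asynchronous control.
from collections import deque
import numpy as np

class HistoryWrapper:
    def __init__(self, obs_dim, act_dim, history_len=3):
        self.obs_hist = deque([np.zeros(obs_dim)] * history_len, maxlen=history_len)
        self.act_hist = deque([np.zeros(act_dim)] * history_len, maxlen=history_len)

    def augment(self, obs, last_action):
        """Return the policy input: the most recent observations and actions stacked together."""
        self.obs_hist.append(np.asarray(obs))
        self.act_hist.append(np.asarray(last_action))
        return np.concatenate([*self.obs_hist, *self.act_hist])

wrapper = HistoryWrapper(obs_dim=4, act_dim=2)
policy_input = wrapper.augment(obs=np.ones(4), last_action=np.zeros(2))
print(policy_input.shape)   # (3*4 + 3*2,) = (18,)
```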
With the progress made in overcoming these challenges, we have developed an efficient and autonomous on-robot training system (Haarnoja et al., 2019), in which the robot can learn walking and turning from scratch in the real world, with only 5 minutes of data (Yang et al., 2020) and little human supervision.