Recent successes in robot learning have significantly enhanced autonomous systems across a wide range of tasks. However, learned policies tend to generate similar or identical solutions, limiting the user's ability to control the robot according to their intentions. Such limited robot behaviors may lead to collisions and potential harm to humans. In this paper, we introduce a semi-autonomous teleoperation framework in which the user operates a robot by selecting a high-level command, referred to as an $\textit{option}$, generated by the learned policy. To generate effective and diverse options, we propose a quality-diversity (QD) based sampling method that simultaneously optimizes both the quality and the diversity of options using reinforcement learning (RL). Additionally, we propose a mixture of latent variable models to learn a policy function that represents multiple option distributions. In experiments, we show that the proposed method achieves superior performance in terms of the success rate and diversity of the generated options in simulation environments. We further demonstrate that our method outperforms manual keyboard control in terms of task completion time in cluttered real-world environments.
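To make the quality-diversity idea behind option sampling concrete, the following is a minimal sketch that scores a set of candidate options by a weighted sum of a quality estimate and their mean pairwise distance, then offers the highest-scoring options to the teleoperator. The embedding space, the `value_fn` interface, and the trade-off weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of quality-diversity (QD) option scoring.
# The paper's actual objective and policy architecture may differ;
# value_fn and the distance metric here are illustrative assumptions.
import numpy as np

def qd_scores(options: np.ndarray, value_fn, alpha: float = 0.5) -> np.ndarray:
    """Score each candidate option by quality plus diversity.

    options : (K, D) array of K candidate option embeddings.
    value_fn: callable mapping an option to a scalar quality estimate
              (e.g., a learned value function) -- assumed interface.
    alpha   : trade-off between quality and diversity.
    """
    quality = np.array([value_fn(o) for o in options])  # task quality per option
    # Diversity of each option: mean distance to the other candidates.
    dists = np.linalg.norm(options[:, None, :] - options[None, :, :], axis=-1)
    diversity = dists.sum(axis=1) / max(len(options) - 1, 1)
    return alpha * quality + (1.0 - alpha) * diversity

# Example: present the top-3 options (by QD score) to the user.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(8, 4))  # 8 sampled option embeddings
scores = qd_scores(candidates, value_fn=lambda o: -np.sum(o ** 2))
top3 = np.argsort(scores)[::-1][:3]
print("options offered to the user:", top3)
```

In this sketch, higher `alpha` favors options the value estimate deems effective, while lower `alpha` spreads the offered options apart so the user has meaningfully different behaviors to choose from.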
We compare the performance of MLPG against Proximal Policy Optimization (PPO) [1], Soft Actor-Critic (SAC) [2], and Deep Latent Policy Gradient (DLPG) [3]. To ensure a fair comparison, we use stochastic policies for both PPO and SAC.