Efficient Learning of High Level Plans from Play

Real-world robotic manipulation tasks remain an elusive challenge, since they require both fine-grained environment interaction and the ability to plan for long-horizon goals. Although deep reinforcement learning (RL) methods have shown encouraging results when planning end-to-end in high-dimensional environments, they remain fundamentally limited by poor sample efficiency due to inefficient exploration and by the complexity of credit assignment over long horizons. In this work, we present Efficient Learning of High-Level Plans from Play (ELF-P), a framework for robotic learning that bridges motion planning and deep RL to solve complex long-horizon manipulation tasks. We leverage task-agnostic play data to learn a discrete behavioral prior over object-centric primitives, modeling their feasibility given the current context. We then design a high-level goal-conditioned policy that (1) uses primitives as building blocks to scaffold complex long-horizon tasks and (2) leverages the behavioral prior to accelerate learning. We demonstrate that ELF-P has significantly better sample efficiency than relevant baselines over multiple realistic manipulation tasks and learns policies that can be easily transferred to physical hardware.
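One plausible reading of point (2) is action masking: primitives to which the prior assigns very low probability in the current context are excluded before the goal-conditioned Q-function chooses among the rest. The following is a minimal sketch under that assumption, not the authors' implementation. The function name `select_primitive`, its inputs `q_values` and `prior_probs`, and the fallback rule for an empty feasible set are all hypothetical; the default `rho` borrows the threshold value from the hyperparameter table and assumes it is applied to prior probabilities.

```python
import random

def select_primitive(q_values, prior_probs, rho=0.01, epsilon=0.0, rng=random):
    """Pick a primitive index, considering only those the prior deems feasible.

    q_values:    Q-value estimates, one per primitive (hypothetical input)
    prior_probs: prior probabilities, one per primitive (hypothetical input)
    rho:         feasibility threshold (assumed to act on prior probabilities)
    epsilon:     probability of a uniformly random choice among feasible primitives
    """
    if len(q_values) != len(prior_probs):
        raise ValueError("q_values and prior_probs must have the same length")
    feasible = [a for a, p in enumerate(prior_probs) if p >= rho]
    if not feasible:
        # Fallback (an assumption): if nothing passes the threshold, consider everything.
        feasible = list(range(len(q_values)))
    if rng.random() < epsilon:
        return rng.choice(feasible)
    return max(feasible, key=lambda a: q_values[a])

# Primitive 1 has the highest Q-value but falls below the threshold, so it is skipped.
q = [0.2, 0.9, 0.5, 0.1]
prior = [0.30, 0.001, 0.60, 0.099]
choice = select_primitive(q, prior)
```

Restricting both the greedy and the random choice to the feasible set is what would make exploration cheaper in such a scheme: random actions are only ever drawn from primitives the play data suggests are executable in the current context.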

Skill failure probability	Success rate ELF-P	Success rate DDQN+Prefill
0	1.0 \(\pm\) 0.00	1.0 \(\pm\) 0.00
0.05	0.99 \(\pm\) 0.01	0.96 \(\pm\) 0.01
0.1	0.96 \(\pm\) 0.02	0.92 \(\pm\) 0.00
0.2	0.91 \(\pm\) 0.02	0.84 \(\pm\) 0.03
0.5	0.75 \(\pm\) 0.03	0.54 \(\pm\) 0.04
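
To make the trend in the table above easier to read, the gap between the two methods can be tabulated directly. This is a small illustrative sketch: it uses only the mean success rates copied from the table, ignores the reported \(\pm\) spreads, and `success_gap` is a hypothetical helper name.

```python
# Mean success rates copied from the table, keyed by skill failure probability.
elf_p = {0.0: 1.00, 0.05: 0.99, 0.1: 0.96, 0.2: 0.91, 0.5: 0.75}
ddqn_prefill = {0.0: 1.00, 0.05: 0.96, 0.1: 0.92, 0.2: 0.84, 0.5: 0.54}

def success_gap(a, b):
    """Difference in mean success rate (a minus b) per failure probability."""
    return {p: round(a[p] - b[p], 2) for p in a}

gap = success_gap(elf_p, ddqn_prefill)
for p, g in gap.items():
    print(f"failure prob {p:<4}: ELF-P leads by {g:.2f}")
```

On these means, the gap widens from 0.00 with perfectly reliable skills to 0.21 at a skill failure probability of 0.5.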

Parameter	Value
Q-network architecture	MLP \( [128, 256]\)
Batch size	\(256\)
Exploration technique	\(\epsilon\)-greedy with exponential decay
Initial \(\epsilon\)	\(0.5\)
Decay rate for \(\epsilon\)	\(5 \times 10^{-5}\)
Discount \(\gamma\)	\(0.97\)
Optimizer	Adam (\(\beta_1=0.9, \beta_2=0.999\)) [15]
Learning rate \(\eta\)	\(10^{-4}\)
Episode length \(T\)	\(100\)
Experience replay size	\(10^{6}\)
Initial exploration steps	\(2000\)
Steps before training starts	\(1000\)
Steps between parameter updates	\(50\)
Soft target update parameter \(\mu\)	\(0.995\)
Threshold \(\rho\)	\(0.01\)
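
Two rows of this table describe update rules rather than plain constants: the exploration schedule and the soft target update. The sketch below shows one common way to realize them with the listed values. The functional forms are assumptions, since the table gives only the parameters: the schedule is taken to be \(\epsilon_t = \epsilon_0 \exp(-\lambda t)\) with decay rate \(\lambda\), and the target update is taken to be Polyak averaging in which \(\mu\) weights the old target parameters. The names `epsilon_at` and `soft_update` are hypothetical.

```python
import math

EPS_INITIAL = 0.5   # initial epsilon (from the table)
EPS_DECAY = 5e-5    # decay rate for epsilon (from the table)
MU = 0.995          # soft target update parameter (from the table)

def epsilon_at(step, eps0=EPS_INITIAL, decay=EPS_DECAY):
    """Exploration rate after `step` steps; assumed form eps0 * exp(-decay * step)."""
    return eps0 * math.exp(-decay * step)

def soft_update(target_params, online_params, mu=MU):
    """Polyak averaging (assumed convention): target <- mu * target + (1 - mu) * online."""
    return [mu * t + (1.0 - mu) * o for t, o in zip(target_params, online_params)]

eps_start = epsilon_at(0)
eps_late = epsilon_at(100_000)
new_target = soft_update([0.0, 1.0], [1.0, 1.0])
```

Under this assumed schedule, \(\epsilon\) halves roughly every 13,900 steps (\(\ln 2 / 5 \times 10^{-5}\)), and each soft update moves the target parameters 0.5% of the way toward the online parameters.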

Parameter	Value
Prior architecture	MLP \([200,200]\)
Prior batch size	\(500\)
Prior training steps	\(10^{5}\)
Dataset size	\(10000\)
Prior optimizer	Adam (\(\beta_1=0.9, \beta_2=0.999\)) [15]
Learning rate	\(10^{-3}\)

Abstract

Video

A.1 Environment

A.1.1 Training environment

A.2 Extended Experimental Results

A.2.1 HER

A.2.2 Sample complexity

A.2.3 Robustness to play dataset size

A.2.4 Robustness to imperfect skill execution

A.2.5 Soft prior integration

A.3 Additional experimental details

A.3.1 Evaluation protocol

A.3.2 Play-dataset collection

A.3.3 Hyperparameters and Architectures

A.4 Hardware experiments

A.5 Optimality