RL Algorithms for Robotics¶
This page covers popular reinforcement learning algorithms commonly used in robotics.
Algorithm Comparison¶
| Algorithm | Type | Action Space | Sample Efficiency | Stability | Use Case |
|---|---|---|---|---|---|
| PPO | On-policy | Continuous/Discrete | Medium | High | General purpose |
| SAC | Off-policy | Continuous | High | Medium | Complex tasks |
| TD3 | Off-policy | Continuous | High | Medium | Continuous control |
| DQN | Off-policy | Discrete | Medium | Medium | Discrete actions |
| DDPG | Off-policy | Continuous | High | Low | Continuous control |
Policy Gradient Methods¶
Proximal Policy Optimization (PPO)¶
PPO is currently the most popular RL algorithm for robotics due to its stability and ease of use.
Key Idea: Limit policy updates to stay close to previous policy.
Algorithm:

```python
# Collect trajectories with current policy
for epoch in epochs:
    trajectories = collect_rollouts(policy, num_steps)

    # Compute advantages
    advantages = compute_gae(trajectories, value_function)

    # Multiple epochs of minibatch updates
    for _ in range(K):
        for minibatch in get_minibatches(trajectories):
            # Compute probability ratio
            ratio = policy(actions | states) / old_policy(actions | states)

            # Clipped surrogate objective
            L_clip = min(
                ratio * advantages,
                clip(ratio, 1 - ε, 1 + ε) * advantages
            )

            # Value loss
            L_vf = (value_function(states) - returns)²

            # Total loss
            loss = -L_clip + c₁*L_vf - c₂*entropy

            # Update
            optimizer.step(loss)
```
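For intuition, the clipped surrogate can be evaluated directly on toy numbers; this NumPy sketch uses made-up ratios and advantages:

```python
import numpy as np

def ppo_clip_objective(ratio, advantages, eps=0.2):
    """Clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Toy values: the first ratio (1.5) exceeds 1+eps with a positive advantage,
# so its contribution is clipped to 1.2 * 2.0 = 2.4 instead of 3.0.
ratio = np.array([1.5, 0.9, 1.05])
adv = np.array([2.0, -1.0, 0.5])
objective = ppo_clip_objective(ratio, adv)  # (2.4 - 0.9 + 0.525) / 3 = 0.675
```

The clipping means a single minibatch cannot push the policy arbitrarily far from the one that collected the data.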
Hyperparameters:

```yaml
ppo:
  learning_rate: 3e-4
  n_steps: 2048
  batch_size: 64
  n_epochs: 10
  gamma: 0.99
  gae_lambda: 0.95
  clip_range: 0.2
  vf_coef: 0.5
  ent_coef: 0.01
```
Implementation:

```python
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    verbose=1
)
model.learn(total_timesteps=1_000_000)
```
Pros:

- Very stable
- Easy to tune
- Works well in most scenarios

Cons:

- Sample inefficient (on-policy)
- Requires many environment steps
Trust Region Policy Optimization (TRPO)¶
Predecessor to PPO with theoretical guarantees.
Key Idea: Constrain KL divergence between old and new policy.
Pros:

- Strong theoretical guarantees
- Monotonic improvement

Cons:

- Complex implementation
- Computationally expensive
- PPO often works as well in practice
Q-Learning Methods¶
Soft Actor-Critic (SAC)¶
State-of-the-art off-policy algorithm for continuous control.
Key Idea: Maximum entropy RL - maximize reward while staying as random as possible.
Objective:

```
J(π) = Σ_t E_{(s_t, a_t) ~ ρ_π} [ r(s_t, a_t) + α·H(π(·|s_t)) ]
```

where H is the policy entropy and α the temperature weighting the entropy bonus.
Algorithm:

```
# Q-function update
Q_target = r + γ(min_{i=1,2} Q_{target,i}(s', a') - α log π(a'|s'))
L_Q = (Q(s,a) - Q_target)²

# Policy update
L_π = α log π(a|s) - Q(s,a)

# Temperature update (automatic tuning)
L_α = -α(log π(a|s) + H_target)
```
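The temperature update can be sketched as plain gradient descent on log α; the target-entropy convention (−dim of the action space) and the sampled log-probabilities below are illustrative assumptions:

```python
import numpy as np

def temperature_update(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on L_alpha = -alpha * (log pi(a|s) + H_target).

    With alpha = exp(log_alpha):
    dL/d(log_alpha) = -alpha * mean(log_probs + H_target)
    """
    alpha = np.exp(log_alpha)
    grad = -alpha * np.mean(log_probs + target_entropy)
    return log_alpha - lr * grad

# Common convention: H_target = -dim(action_space).
target_entropy = -2.0                      # e.g. a 2-D action space
log_probs = np.array([-1.0, -1.5, -0.5])   # made-up samples of log pi(a|s)

# Here the policy entropy (-mean(log_probs) = 1.0) is above the target (-2.0),
# so the update nudges log alpha (and hence alpha) downward.
new_log_alpha = temperature_update(0.0, log_probs, target_entropy)
```

When the policy is too deterministic the same update pushes α up, restoring exploration without any manual tuning.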
Implementation:

```python
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    buffer_size=1_000_000,
    learning_starts=10000,
    batch_size=256,
    tau=0.005,
    gamma=0.99,
    train_freq=1,
    gradient_steps=1,
    ent_coef='auto',  # Automatic temperature tuning
    verbose=1
)
model.learn(total_timesteps=1_000_000)
```
Pros:

- Sample efficient (off-policy)
- Very stable
- Automatic temperature tuning

Cons:

- Requires more memory (replay buffer)
- Slower wall-clock time per step
Twin Delayed DDPG (TD3)¶
Improved version of DDPG with several stabilization tricks.
Key Improvements over DDPG:
- Twin Q-networks: Use minimum of two Q-functions
- Delayed policy updates: Update policy less frequently
- Target policy smoothing: Add noise to target actions
Algorithm:

```
# Critic update
a' = clip(π_target(s') + ε, a_low, a_high)
y = r + γ min_{i=1,2} Q_{target,i}(s', a')
L_Q = (Q_1(s,a) - y)² + (Q_2(s,a) - y)²

# Delayed actor update (every d steps)
if step % d == 0:
    L_π = -Q_1(s, π(s))
    update_target_networks()
```
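Target policy smoothing and the twin-critic minimum are straightforward to sketch in NumPy; the noise scales and Q-values below are made-up illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_target_action(target_action, noise_std=0.2, noise_clip=0.5,
                           a_low=-1.0, a_high=1.0):
    """Target policy smoothing: add clipped Gaussian noise, then clip to bounds."""
    eps = np.clip(rng.normal(0.0, noise_std, size=target_action.shape),
                  -noise_clip, noise_clip)
    return np.clip(target_action + eps, a_low, a_high)

# Actions near the bounds stay valid even after noise is added.
a_next = smoothed_target_action(np.array([0.9, -0.95]))

# Twin-critic target: take the smaller of the two Q estimates,
# which counters the overestimation bias of a single critic.
r, gamma, q1, q2 = 1.0, 0.99, 5.0, 4.5
y = r + gamma * min(q1, q2)  # uses 4.5, not 5.0
```

Smoothing the target action regularizes the critic: it cannot exploit sharp peaks in the Q-function that exist only at a single action.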
Pros:

- More stable than DDPG
- Good performance on continuous control

Cons:

- More hyperparameters to tune
- SAC often outperforms it
Deep Q-Networks (DQN)¶
For discrete action spaces.
Key Ideas:

- Experience replay
- Target network
- Huber loss
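A minimal experience replay buffer, the first of these ideas, might look like this (a simplified sketch, not DQN's full machinery):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions drop off the front

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.add(t, 0, 1.0, t + 1, False)  # dummy transitions
batch = buf.sample(8)
```

Sampling uniformly from past experience decorrelates updates and lets each transition be reused many times, which is what makes the method off-policy.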
Variants:

Rainbow combines multiple improvements:

- Double Q-learning
- Prioritized replay
- Dueling networks
- Multi-step returns
- Distributional RL
- Noisy networks
Model-Based RL¶
Dreamer¶
Learn world model and use it for planning in latent space.
Components:

1. World model: Predicts next latent states and rewards
2. Actor: Policy in latent space
3. Critic: Value function in latent space
Workflow:

```python
# Learn world model from experience
world_model.train(replay_buffer)

# Imagine trajectories in latent space
imagined_trajectories = world_model.imagine(policy, horizon=15)

# Train policy on imagined data
policy.train(imagined_trajectories)
```
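The imagination step can be illustrated with a toy linear latent model; every component below (the matrices A, B, the reward head w, and the placeholder actor) is a made-up stand-in for the learned networks:

```python
import numpy as np

# Toy stand-ins for the learned components (invented for illustration):
A = np.array([[0.9, 0.1], [0.0, 0.95]])  # latent transition matrix
B = np.array([[0.1], [0.05]])            # effect of the action on the latent
w = np.array([1.0, -0.5])                # linear reward head

def imagine(z0, actor, horizon=15):
    """Roll the model forward in latent space; no environment steps needed."""
    z, trajectory = z0, []
    for _ in range(horizon):
        a = actor(z)
        z = A @ z + B @ a                       # predicted next latent state
        trajectory.append((z.copy(), float(w @ z)))  # (state, predicted reward)
    return trajectory

actor = lambda z: np.array([np.tanh(z.sum())])  # placeholder latent policy
trajectory = imagine(np.ones(2), actor, horizon=15)
```

The actor and critic train entirely on such imagined rollouts, which is why Dreamer needs far fewer real environment interactions than model-free methods.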
Pros:

- Very sample efficient
- Can plan ahead
- Learns reusable world models

Cons:

- Complex to implement
- Model errors can compound
MBPO (Model-Based Policy Optimization)¶
Combines model-based and model-free RL.
Algorithm:

1. Collect data with current policy
2. Train dynamics model
3. Generate synthetic rollouts with model
4. Train policy with both real and synthetic data
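Step 4's mixing of real and synthetic data can be sketched as sampling from two buffers; the 5% real fraction below is an illustrative choice, not a prescribed value:

```python
import random

def sample_mixed_batch(real_buffer, model_buffer, batch_size=256, real_frac=0.05):
    """Draw a training batch mostly from model rollouts plus a slice of real data."""
    n_real = max(1, int(batch_size * real_frac))
    n_model = batch_size - n_real
    batch = random.sample(real_buffer, min(n_real, len(real_buffer)))
    batch += random.sample(model_buffer, min(n_model, len(model_buffer)))
    random.shuffle(batch)  # avoid ordering effects between the two sources
    return batch

real = [("real", i) for i in range(100)]            # expensive environment data
synthetic = [("model", i) for i in range(10_000)]   # cheap model rollouts
batch = sample_mixed_batch(real, synthetic)
```

Keeping some real transitions in every batch anchors the policy to the true dynamics even when the model's rollouts drift.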
Pros:

- More sample efficient than model-free
- More robust than pure model-based
Robotics-Specific Algorithms¶
Hindsight Experience Replay (HER)¶
For sparse reward tasks.
Key Idea: After failing to reach goal, pretend you wanted to reach where you ended up.
```
# Original trajectory (failed)
s₀ →a₀ s₁ →a₁ s₂ ... → s_T (didn't reach g)

# Hindsight trajectory (success!)
s₀ →a₀ s₁ →a₁ s₂ ... → s_T (reached s_T!)

# Relabel goal: g' = s_T
```
Implementation:

```python
from stable_baselines3 import HerReplayBuffer, SAC

model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy='future'
    )
)
```
Residual RL¶
Learn correction on top of classical controller.
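A minimal sketch of the idea, assuming a hypothetical PD base controller and a learned correction term (all gains and bounds are illustrative):

```python
import numpy as np

def pd_controller(state, target, kp=2.0, kd=0.4):
    """Hypothetical PD base controller used for illustration."""
    pos, vel = state
    return kp * (target - pos) - kd * vel

def residual_policy(state, correction, target=1.0, a_low=-2.0, a_high=2.0):
    """Residual RL: final action = classical controller + learned correction."""
    base = pd_controller(state, target)
    return float(np.clip(base + correction(state), a_low, a_high))

# Before training, the learned correction is near zero, so the system
# behaves exactly like the base controller from the first step.
zero_correction = lambda state: 0.0
action = residual_policy((0.5, 0.0), zero_correction)  # kp*(1.0-0.5) = 1.0
```

Because exploration happens around an already-working controller, the early policy is safe to run on hardware, and the RL problem reduces to learning a small correction rather than the whole behavior.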
Pros:

- Safer (starts from working controller)
- Faster learning
- Better interpretability
Asymmetric Actor-Critic¶
Use privileged information during training.
```python
# Training: Critic has access to privileged info
Q(s_privileged, a)  # s_privileged includes ground truth state

# Deployment: Actor only uses sensors
π(s_sensors)
```
Use Case: Training in simulation with perfect state, deploying with noisy sensors.
Algorithm Selection Guide¶
Choose PPO if:¶
- General purpose robotics task
- You want stability and ease of use
- Sample efficiency is not critical
- You have parallel simulation
Choose SAC if:¶
- Continuous control task
- You need sample efficiency
- You have a replay buffer
- You want automatic exploration tuning
Choose TD3 if:¶
- Continuous control
- You need stability
- SAC is not performing well
Choose Model-Based (MBPO, Dreamer) if:¶
- Sample efficiency is critical
- Real-world data is expensive
- Task has learnable dynamics
Choose HER if:¶
- Sparse rewards
- Goal-conditioned task
- Reaching/manipulation task
Hyperparameter Tuning¶
General Tips¶
- Learning rate: Start with 3e-4, adjust if unstable
- Discount factor (γ): 0.99 for long horizon, 0.9 for short
- Batch size: Larger is more stable but slower
- Network architecture: [256, 256] is a good default
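A quick rule of thumb behind the γ advice: the effective horizon 1/(1−γ) approximates how many steps ahead the agent effectively looks:

```python
def effective_horizon(gamma):
    """Rough number of steps over which discounted rewards still matter: 1/(1-gamma)."""
    return 1.0 / (1.0 - gamma)

# gamma = 0.99 looks ~100 steps ahead; gamma = 0.9 only ~10.
```

So pick γ so that the effective horizon comfortably covers the number of steps between an action and the reward it eventually earns.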
PPO Specific¶
```yaml
# Conservative (stable)
clip_range: 0.1
ent_coef: 0.01
vf_coef: 0.5

# Aggressive (faster learning)
clip_range: 0.3
ent_coef: 0.001
vf_coef: 1.0
```
SAC Specific¶
```yaml
# Sample efficient
buffer_size: 1_000_000
batch_size: 256
learning_starts: 10000

# Memory constrained
buffer_size: 100_000
batch_size: 128
learning_starts: 1000
```
Next Steps¶
- Training Guide - Train these algorithms
- Evaluation - Evaluate and compare algorithms
- Simulators - Set up training environments