Off-Policy Reinforcement Learning¶
Off-policy algorithms learn from data collected by different policies, enabling superior sample efficiency for robotics applications.
Overview¶
Off-policy means the agent can learn from experiences collected by any policy: - Collect data with behavior policy \(\mu\) - Update target policy \(\pi\) using this data - Reuse old data → much better sample efficiency - Store experiences in replay buffer
Key algorithms: - SAC (Soft Actor-Critic) - Best for continuous control - TD3 (Twin Delayed DDPG) - Stable and robust - DDPG (Deep Deterministic Policy Gradient) - Foundation
Soft Actor-Critic (SAC)¶
SAC is the state-of-the-art off-policy algorithm for continuous control in robotics.
Papers: - Haarnoja et al., "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", ICML 2018 - Haarnoja et al., "Soft Actor-Critic Algorithms and Applications", arXiv 2019
Mathematical Foundation¶
SAC maximizes the entropy-regularized objective:
Where: - \(\mathcal{H}(\pi(\cdot|s_t)) = -\log \pi(a_t|s_t)\) (entropy) - \(\alpha\) = temperature parameter (controls exploration)
Key innovations: 1. Maximum entropy RL: Encourages exploration by maximizing entropy 2. Twin Q-networks: Reduces overestimation bias 3. Automatic temperature tuning: Learns \(\alpha\) during training
Complete Implementation¶
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Normal
import numpy as np
class SAC:
"""
Soft Actor-Critic for continuous control
Reference: Haarnoja et al., 2018-2019
"""
def __init__(
self,
state_dim,
action_dim,
hidden_dim=256,
actor_lr=3e-4,
critic_lr=3e-4,
alpha_lr=3e-4,
gamma=0.99,
tau=0.005,
alpha=0.2,
automatic_entropy_tuning=True,
target_entropy=None,
buffer_size=1_000_000,
batch_size=256
):
self.gamma = gamma
self.tau = tau
self.batch_size = batch_size
self.action_dim = action_dim
# Actor network (stochastic policy)
self.actor = GaussianPolicy(state_dim, action_dim, hidden_dim)
self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=actor_lr)
# Twin Q-networks (critics)
self.critic1 = QNetwork(state_dim, action_dim, hidden_dim)
self.critic2 = QNetwork(state_dim, action_dim, hidden_dim)
# Target Q-networks
self.critic1_target = QNetwork(state_dim, action_dim, hidden_dim)
self.critic2_target = QNetwork(state_dim, action_dim, hidden_dim)
# Copy parameters to targets
self.critic1_target.load_state_dict(self.critic1.state_dict())
self.critic2_target.load_state_dict(self.critic2.state_dict())
self.critic1_optimizer = optim.Adam(self.critic1.parameters(), lr=critic_lr)
self.critic2_optimizer = optim.Adam(self.critic2.parameters(), lr=critic_lr)
# Automatic entropy tuning
self.automatic_entropy_tuning = automatic_entropy_tuning
if automatic_entropy_tuning:
# Target entropy = -dim(A)
if target_entropy is None:
self.target_entropy = -action_dim
else:
self.target_entropy = target_entropy
# Learnable temperature parameter
self.log_alpha = torch.zeros(1, requires_grad=True)
self.alpha_optimizer = optim.Adam([self.log_alpha], lr=alpha_lr)
self.alpha = self.log_alpha.exp()
else:
self.alpha = alpha
# Replay buffer
self.replay_buffer = ReplayBuffer(state_dim, action_dim, buffer_size)
def select_action(self, state, deterministic=False):
"""Sample action from policy"""
state = torch.FloatTensor(state).unsqueeze(0)
if deterministic:
_, _, action = self.actor.sample(state)
else:
action, _, _ = self.actor.sample(state)
return action.detach().cpu().numpy()[0]
def update(self):
"""SAC update step"""
if len(self.replay_buffer) < self.batch_size:
return {}
# Sample batch
states, actions, rewards, next_states, dones = self.replay_buffer.sample(self.batch_size)
# ================== Update Critics ==================
with torch.no_grad():
# Sample next actions from current policy
next_actions, next_log_probs, _ = self.actor.sample(next_states)
# Compute target Q-values (minimum of twin Q-networks)
q1_next = self.critic1_target(next_states, next_actions)
q2_next = self.critic2_target(next_states, next_actions)
min_q_next = torch.min(q1_next, q2_next)
# Add entropy term
if self.automatic_entropy_tuning:
alpha = self.log_alpha.exp()
else:
alpha = self.alpha
target_q = rewards + (1 - dones) * self.gamma * (min_q_next - alpha * next_log_probs)
# Current Q-values
q1 = self.critic1(states, actions)
q2 = self.critic2(states, actions)
# Critic loss (MSE)
critic1_loss = F.mse_loss(q1, target_q)
critic2_loss = F.mse_loss(q2, target_q)
# Update critic 1
self.critic1_optimizer.zero_grad()
critic1_loss.backward()
self.critic1_optimizer.step()
# Update critic 2
self.critic2_optimizer.zero_grad()
critic2_loss.backward()
self.critic2_optimizer.step()
# ================== Update Actor ==================
# Sample actions from current policy
new_actions, log_probs, _ = self.actor.sample(states)
# Compute Q-values for new actions
q1_new = self.critic1(states, new_actions)
q2_new = self.critic2(states, new_actions)
min_q_new = torch.min(q1_new, q2_new)
# Actor loss (maximize Q - α*log_prob)
actor_loss = (alpha * log_probs - min_q_new).mean()
# Update actor
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()
# ================== Update Temperature ==================
if self.automatic_entropy_tuning:
# α loss (equation 18 in SAC paper)
alpha_loss = -(self.log_alpha * (log_probs + self.target_entropy).detach()).mean()
self.alpha_optimizer.zero_grad()
alpha_loss.backward()
self.alpha_optimizer.step()
self.alpha = self.log_alpha.exp()
alpha_value = self.alpha.item()
else:
alpha_loss = torch.tensor(0.0)
alpha_value = self.alpha
# ================== Update Target Networks ==================
self._soft_update(self.critic1, self.critic1_target)
self._soft_update(self.critic2, self.critic2_target)
return {
'critic1_loss': critic1_loss.item(),
'critic2_loss': critic2_loss.item(),
'actor_loss': actor_loss.item(),
'alpha': alpha_value,
'alpha_loss': alpha_loss.item() if self.automatic_entropy_tuning else 0.0
}
def _soft_update(self, source, target):
"""Soft update target network parameters"""
for target_param, param in zip(target.parameters(), source.parameters()):
target_param.data.copy_(
target_param.data * (1.0 - self.tau) + param.data * self.tau
)
class GaussianPolicy(nn.Module):
"""Gaussian policy network for SAC"""
def __init__(self, state_dim, action_dim, hidden_dim=256, log_std_min=-20, log_std_max=2):
super().__init__()
self.log_std_min = log_std_min
self.log_std_max = log_std_max
# Shared layers
self.fc1 = nn.Linear(state_dim, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, hidden_dim)
# Mean and log_std heads
self.mean = nn.Linear(hidden_dim, action_dim)
self.log_std = nn.Linear(hidden_dim, action_dim)
def forward(self, state):
x = F.relu(self.fc1(state))
x = F.relu(self.fc2(x))
mean = self.mean(x)
log_std = self.log_std(x)
log_std = torch.clamp(log_std, self.log_std_min, self.log_std_max)
return mean, log_std
def sample(self, state):
"""Sample action using reparameterization trick"""
mean, log_std = self.forward(state)
std = log_std.exp()
# Reparameterization trick
normal = Normal(mean, std)
x_t = normal.rsample() # rsample() for reparameterization
# Apply tanh squashing
action = torch.tanh(x_t)
# Compute log probability with change of variables formula
log_prob = normal.log_prob(x_t)
# Enforcing action bounds (tanh)
log_prob -= torch.log(1 - action.pow(2) + 1e-6)
log_prob = log_prob.sum(dim=-1, keepdim=True)
mean = torch.tanh(mean)
return action, log_prob, mean
class QNetwork(nn.Module):
"""Q-network (critic) for SAC"""
def __init__(self, state_dim, action_dim, hidden_dim=256):
super().__init__()
self.fc1 = nn.Linear(state_dim + action_dim, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, hidden_dim)
self.fc3 = nn.Linear(hidden_dim, 1)
def forward(self, state, action):
x = torch.cat([state, action], dim=-1)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
q = self.fc3(x)
return q
class ReplayBuffer:
"""Experience replay buffer"""
def __init__(self, state_dim, action_dim, max_size=1_000_000):
self.max_size = max_size
self.ptr = 0
self.size = 0
self.states = np.zeros((max_size, state_dim))
self.actions = np.zeros((max_size, action_dim))
self.rewards = np.zeros((max_size, 1))
self.next_states = np.zeros((max_size, state_dim))
self.dones = np.zeros((max_size, 1))
def add(self, state, action, reward, next_state, done):
self.states[self.ptr] = state
self.actions[self.ptr] = action
self.rewards[self.ptr] = reward
self.next_states[self.ptr] = next_state
self.dones[self.ptr] = done
self.ptr = (self.ptr + 1) % self.max_size
self.size = min(self.size + 1, self.max_size)
def sample(self, batch_size):
indices = np.random.randint(0, self.size, size=batch_size)
return (
torch.FloatTensor(self.states[indices]),
torch.FloatTensor(self.actions[indices]),
torch.FloatTensor(self.rewards[indices]),
torch.FloatTensor(self.next_states[indices]),
torch.FloatTensor(self.dones[indices])
)
def __len__(self):
return self.size
def train_sac(env, num_timesteps=1_000_000, start_timesteps=10_000):
"""Training loop for SAC"""
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
sac = SAC(state_dim, action_dim)
state = env.reset()
episode_reward = 0
episode_length = 0
for timestep in range(num_timesteps):
# Select action
if timestep < start_timesteps:
# Random exploration
action = env.action_space.sample()
else:
action = sac.select_action(state)
# Environment step
next_state, reward, done, info = env.step(action)
# Store transition
sac.replay_buffer.add(state, action, reward, next_state, done)
episode_reward += reward
episode_length += 1
# Update policy
if timestep >= start_timesteps:
metrics = sac.update()
if timestep % 1000 == 0:
print(f"Timestep {timestep}:")
print(f" Critic Loss: {metrics['critic1_loss']:.4f}")
print(f" Actor Loss: {metrics['actor_loss']:.4f}")
print(f" Alpha: {metrics['alpha']:.4f}")
if done:
print(f"Episode: Reward={episode_reward:.2f}, Length={episode_length}")
state = env.reset()
episode_reward = 0
episode_length = 0
else:
state = next_state
return sac
Hyperparameter Tuning¶
Critical hyperparameters:
| Parameter | Typical Range | Robotics Default | Notes |
|---|---|---|---|
actor_lr |
1e-5 to 1e-3 | 3e-4 | Same as critic usually |
critic_lr |
1e-5 to 1e-3 | 3e-4 | Can be higher than actor |
alpha (initial) |
0.1 to 0.5 | 0.2 | Auto-tuned during training |
target_entropy |
-dim(A) to -0.5*dim(A) | -dim(A) | More negative = less exploration |
gamma |
0.95 to 0.999 | 0.99 | Higher for long horizons |
tau |
0.001 to 0.01 | 0.005 | Soft update rate |
batch_size |
64 to 512 | 256 | Larger = more stable |
buffer_size |
100k to 10M | 1M | Depends on memory |
Tuning tips:
# For manipulation tasks (precise control)
sac_manipulation = SAC(
actor_lr=3e-4,
critic_lr=3e-4,
alpha=0.2,
target_entropy=-action_dim, # Moderate exploration
tau=0.005,
gamma=0.99
)
# For locomotion tasks (robust exploration)
sac_locomotion = SAC(
actor_lr=1e-4,
critic_lr=3e-4,
alpha=0.5, # More exploration
target_entropy=-0.5 * action_dim,
tau=0.005,
gamma=0.995
)
# For sparse reward tasks
sac_sparse = SAC(
actor_lr=3e-4,
critic_lr=3e-4,
alpha=1.0, # High exploration
target_entropy=-action_dim,
tau=0.005,
gamma=0.99,
buffer_size=5_000_000 # Large buffer for rare successes
)
Twin Delayed DDPG (TD3)¶
TD3 is a deterministic policy algorithm with strong stability guarantees.
Paper: Fujimoto et al., "Addressing Function Approximation Error in Actor-Critic Methods", ICML 2018
Mathematical Foundation¶
TD3 builds on DDPG with three key improvements:
- Twin Q-networks: Use minimum of two Q-networks to reduce overestimation
- Delayed policy updates: Update actor less frequently than critics
- Target policy smoothing: Add noise to target actions
Critic update: $$ y = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \pi_{\phi'}(s') + \epsilon) $$
Where \(\epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, c)\) is clipped noise.
Complete Implementation¶
class TD3:
"""
Twin Delayed Deep Deterministic Policy Gradient
Reference: Fujimoto et al., 2018
"""
def __init__(
self,
state_dim,
action_dim,
max_action,
hidden_dim=256,
actor_lr=3e-4,
critic_lr=3e-4,
gamma=0.99,
tau=0.005,
policy_noise=0.2,
noise_clip=0.5,
policy_freq=2,
buffer_size=1_000_000,
batch_size=256
):
self.gamma = gamma
self.tau = tau
self.policy_noise = policy_noise
self.noise_clip = noise_clip
self.policy_freq = policy_freq
self.batch_size = batch_size
self.max_action = max_action
# Deterministic actor
self.actor = DeterministicPolicy(state_dim, action_dim, hidden_dim, max_action)
self.actor_target = DeterministicPolicy(state_dim, action_dim, hidden_dim, max_action)
self.actor_target.load_state_dict(self.actor.state_dict())
self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=actor_lr)
# Twin critics
self.critic1 = QNetwork(state_dim, action_dim, hidden_dim)
self.critic2 = QNetwork(state_dim, action_dim, hidden_dim)
self.critic1_target = QNetwork(state_dim, action_dim, hidden_dim)
self.critic2_target = QNetwork(state_dim, action_dim, hidden_dim)
self.critic1_target.load_state_dict(self.critic1.state_dict())
self.critic2_target.load_state_dict(self.critic2.state_dict())
self.critic1_optimizer = optim.Adam(self.critic1.parameters(), lr=critic_lr)
self.critic2_optimizer = optim.Adam(self.critic2.parameters(), lr=critic_lr)
# Replay buffer
self.replay_buffer = ReplayBuffer(state_dim, action_dim, buffer_size)
# Update counter
self.total_updates = 0
def select_action(self, state, noise=0.1):
"""Select action with optional exploration noise"""
state = torch.FloatTensor(state).unsqueeze(0)
action = self.actor(state).detach().cpu().numpy()[0]
if noise > 0:
action += np.random.normal(0, noise, size=action.shape)
action = np.clip(action, -self.max_action, self.max_action)
return action
def update(self):
"""TD3 update step"""
if len(self.replay_buffer) < self.batch_size:
return {}
self.total_updates += 1
# Sample batch
states, actions, rewards, next_states, dones = self.replay_buffer.sample(self.batch_size)
# ================== Update Critics ==================
with torch.no_grad():
# Select next action with target policy
next_actions = self.actor_target(next_states)
# Add clipped noise (target policy smoothing)
noise = torch.randn_like(next_actions) * self.policy_noise
noise = torch.clamp(noise, -self.noise_clip, self.noise_clip)
next_actions = next_actions + noise
next_actions = torch.clamp(next_actions, -self.max_action, self.max_action)
# Compute target Q-values (minimum of twin Q-networks)
q1_next = self.critic1_target(next_states, next_actions)
q2_next = self.critic2_target(next_states, next_actions)
min_q_next = torch.min(q1_next, q2_next)
target_q = rewards + (1 - dones) * self.gamma * min_q_next
# Current Q-values
q1 = self.critic1(states, actions)
q2 = self.critic2(states, actions)
# Critic losses
critic1_loss = F.mse_loss(q1, target_q)
critic2_loss = F.mse_loss(q2, target_q)
# Update critics
self.critic1_optimizer.zero_grad()
critic1_loss.backward()
self.critic1_optimizer.step()
self.critic2_optimizer.zero_grad()
critic2_loss.backward()
self.critic2_optimizer.step()
# ================== Delayed Actor Update ==================
actor_loss = torch.tensor(0.0)
if self.total_updates % self.policy_freq == 0:
# Actor loss (maximize Q-value)
actor_loss = -self.critic1(states, self.actor(states)).mean()
# Update actor
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()
# Update target networks
self._soft_update(self.actor, self.actor_target)
self._soft_update(self.critic1, self.critic1_target)
self._soft_update(self.critic2, self.critic2_target)
return {
'critic1_loss': critic1_loss.item(),
'critic2_loss': critic2_loss.item(),
'actor_loss': actor_loss.item()
}
def _soft_update(self, source, target):
"""Soft update target network"""
for target_param, param in zip(target.parameters(), source.parameters()):
target_param.data.copy_(
target_param.data * (1.0 - self.tau) + param.data * self.tau
)
class DeterministicPolicy(nn.Module):
"""Deterministic policy network for TD3"""
def __init__(self, state_dim, action_dim, hidden_dim=256, max_action=1.0):
super().__init__()
self.max_action = max_action
self.fc1 = nn.Linear(state_dim, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, hidden_dim)
self.fc3 = nn.Linear(hidden_dim, action_dim)
def forward(self, state):
x = F.relu(self.fc1(state))
x = F.relu(self.fc2(x))
action = self.max_action * torch.tanh(self.fc3(x))
return action
TD3 vs SAC vs DDPG Comparison¶
| Feature | SAC | TD3 | DDPG |
|---|---|---|---|
| Policy Type | Stochastic | Deterministic | Deterministic |
| Exploration | Entropy maximization | Action noise | Action noise |
| Q-networks | 2 (twin) | 2 (twin) | 1 |
| Stability | Very high | High | Moderate |
| Sample Efficiency | Excellent | Excellent | Good |
| Hyperparameter Sensitivity | Low | Low | High |
| Best For | General continuous control | Stable learning | Simple tasks |
Recommendation: - SAC for most robotics tasks (best exploration, most stable) - TD3 when deterministic policy is required or computational efficiency matters - DDPG only for simple, well-behaved environments
Deep Deterministic Policy Gradient (DDPG)¶
DDPG is the foundation for TD3 - a deterministic actor-critic for continuous control.
Paper: Lillicrap et al., "Continuous Control with Deep Reinforcement Learning", ICLR 2016
Core Concepts¶
Deterministic Policy Gradient Theorem:
Key components: 1. Deterministic actor: \(\mu_\theta(s)\) 2. Critic: \(Q^{\mu}(s,a)\) 3. Target networks for stability 4. Replay buffer for off-policy learning
Implementation Highlights¶
class DDPG:
"""Deep Deterministic Policy Gradient"""
def update(self):
"""DDPG update (simplified - similar to TD3 but single Q-network)"""
states, actions, rewards, next_states, dones = self.replay_buffer.sample(self.batch_size)
# Critic update
with torch.no_grad():
next_actions = self.actor_target(next_states)
target_q = rewards + (1 - dones) * self.gamma * self.critic_target(next_states, next_actions)
current_q = self.critic(states, actions)
critic_loss = F.mse_loss(current_q, target_q)
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()
# Actor update
actor_loss = -self.critic(states, self.actor(states)).mean()
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()
# Update targets every step (unlike TD3's delayed updates)
self._soft_update(self.actor, self.actor_target)
self._soft_update(self.critic, self.critic_target)
Problems with DDPG (solved by TD3): 1. Overestimation bias → TD3 uses twin Q-networks 2. Unstable learning → TD3 delays policy updates 3. Noisy targets → TD3 smooths target policy
Practical Comparison¶
Sample Efficiency¶
For 1M timesteps on typical manipulation tasks:
| Algorithm | Success Rate | Wall Time | GPU Memory |
|---|---|---|---|
| SAC | 95% | 4 hours | 2GB |
| TD3 | 92% | 3.5 hours | 1.5GB |
| DDPG | 75% | 3 hours | 1GB |
| PPO (on-policy) | 85% | 12 hours | 1GB |
Conclusion: Off-policy methods are 2-3x more sample efficient than on-policy.
When to Use Each Algorithm¶
Use SAC when: - ✓ You need maximum sample efficiency - ✓ Working with sparse rewards - ✓ Task requires exploration - ✓ Stochastic policy is acceptable - ✓ You have sufficient compute for twin critics
Use TD3 when: - ✓ You need a deterministic policy (planning, safety) - ✓ Computational efficiency is critical - ✓ Environment is relatively smooth - ✓ SAC is overkill for your task
Use DDPG when: - ✓ Simple, well-behaved environment - ✓ Educational purposes (learning the basics) - ✗Generally prefer TD3 over DDPG for real applications
Hyperparameter Tuning Guide¶
Learning Rates¶
# Conservative (stable but slow)
conservative_config = {
'actor_lr': 1e-4,
'critic_lr': 1e-4,
'tau': 0.001
}
# Standard (balanced)
standard_config = {
'actor_lr': 3e-4,
'critic_lr': 3e-4,
'tau': 0.005
}
# Aggressive (fast but risky)
aggressive_config = {
'actor_lr': 1e-3,
'critic_lr': 1e-3,
'tau': 0.01
}
Buffer Size Trade-offs¶
# Small buffer (recent data, faster iteration)
small_buffer = 100_000 # Good for: rapidly changing tasks
# Medium buffer (balanced)
medium_buffer = 1_000_000 # Good for: most robotics tasks
# Large buffer (diverse data, sparse rewards)
large_buffer = 10_000_000 # Good for: sparse rewards, long-horizon tasks
Batch Size Impact¶
| Batch Size | Training Stability | Convergence Speed | Memory |
|---|---|---|---|
| 64 | Low | Fast | Low |
| 128 | Medium | Medium | Low |
| 256 | High | Medium | Medium |
| 512 | Very High | Slow | High |
Recommendation: Start with 256, decrease if memory-constrained.
Advanced Techniques¶
Prioritized Experience Replay (PER)¶
Improves sample efficiency by replaying important transitions more often.
class PrioritizedReplayBuffer:
"""Prioritized Experience Replay (Schaul et al., 2016)"""
def __init__(self, state_dim, action_dim, max_size=1_000_000, alpha=0.6):
self.buffer = ReplayBuffer(state_dim, action_dim, max_size)
self.priorities = np.zeros(max_size)
self.alpha = alpha # Priority exponent
def add(self, state, action, reward, next_state, done):
# New transitions get max priority
max_priority = self.priorities.max() if len(self.buffer) > 0 else 1.0
self.buffer.add(state, action, reward, next_state, done)
self.priorities[self.buffer.ptr - 1] = max_priority
def sample(self, batch_size, beta=0.4):
"""Sample with prioritization"""
# Compute sampling probabilities
priorities = self.priorities[:len(self.buffer)]
probs = priorities ** self.alpha
probs /= probs.sum()
# Sample indices
indices = np.random.choice(len(self.buffer), batch_size, p=probs)
# Compute importance sampling weights
weights = (len(self.buffer) * probs[indices]) ** (-beta)
weights /= weights.max()
# Get transitions
states = torch.FloatTensor(self.buffer.states[indices])
actions = torch.FloatTensor(self.buffer.actions[indices])
rewards = torch.FloatTensor(self.buffer.rewards[indices])
next_states = torch.FloatTensor(self.buffer.next_states[indices])
dones = torch.FloatTensor(self.buffer.dones[indices])
weights = torch.FloatTensor(weights).unsqueeze(1)
return states, actions, rewards, next_states, dones, weights, indices
def update_priorities(self, indices, td_errors):
"""Update priorities based on TD errors"""
for idx, error in zip(indices, td_errors):
self.priorities[idx] = abs(error) + 1e-6
Hindsight Experience Replay (HER)¶
Enables learning from failures in sparse reward environments.
def add_hindsight_experience(replay_buffer, episode, k=4):
"""
Add hindsight experiences from episode
For each transition, create k additional transitions
where the goal is replaced with a future achieved state
"""
T = len(episode)
for t in range(T):
state, action, reward, next_state, done, goal = episode[t]
# Store original transition
replay_buffer.add(state, action, reward, next_state, done)
# Add k hindsight transitions
for _ in range(k):
# Sample future state as new goal
future_t = np.random.randint(t, T)
future_state = episode[future_t][0]
# Recompute reward with new goal
new_reward = compute_reward(next_state, future_state)
new_done = (new_reward == 0) # Success
# Store hindsight transition
replay_buffer.add(state, action, new_reward, next_state, new_done)
Common Issues & Solutions¶
Problem: Training Divergence¶
Symptoms: Q-values exploding, nan losses
Solutions:
# 1. Reduce learning rates
actor_lr = 1e-4 # Instead of 3e-4
critic_lr = 1e-4
# 2. Gradient clipping
nn.utils.clip_grad_norm_(critic.parameters(), max_norm=1.0)
# 3. Smaller tau (slower target updates)
tau = 0.001 # Instead of 0.005
# 4. Normalize observations
from gym.wrappers import NormalizeObservation
env = NormalizeObservation(env)
Problem: Poor Exploration¶
Symptoms: Agent gets stuck, doesn't discover rewards
Solutions:
# For SAC: Increase initial alpha
alpha = 0.5 # More exploration
# For TD3: Increase exploration noise
noise = 0.3 # Instead of 0.1 during training
# Add noise schedule
def get_noise(timestep, max_timesteps):
return 0.3 * (1 - timestep / max_timesteps) + 0.1
Problem: Slow Convergence¶
Symptoms: Learning plateaus early
Solutions:
# 1. Increase network capacity
hidden_dim = 512 # Instead of 256
# 2. Increase batch size
batch_size = 512 # More stable gradients
# 3. Update more frequently
updates_per_step = 4 # Multiple updates per env step
# 4. Start training earlier
start_timesteps = 1000 # Instead of 10000
References¶
Papers¶
- SAC: Haarnoja et al., "Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor", ICML 2018 (arXiv)
- SAC Applications: Haarnoja et al., "Soft Actor-Critic Algorithms and Applications", arXiv 2019 (arXiv)
- TD3: Fujimoto et al., "Addressing Function Approximation Error in Actor-Critic Methods", ICML 2018 (arXiv)
- DDPG: Lillicrap et al., "Continuous Control with Deep RL", ICLR 2016 (arXiv)
- PER: Schaul et al., "Prioritized Experience Replay", ICLR 2016 (arXiv)
- HER: Andrychowicz et al., "Hindsight Experience Replay", NeurIPS 2017 (arXiv)
Books¶
- Sutton & Barto, "Reinforcement Learning: An Introduction", 2nd Edition, 2018
- Chapter 13: Policy Gradient Methods
-
Free online: http://incompleteideas.net/book/the-book-2nd.html
-
Bertsekas, "Reinforcement Learning and Optimal Control", 2019
-
Chapter 6: Approximate Dynamic Programming
-
Sutton, Barto & Williams, "Policy Gradient Methods for RL with Function Approximation", NeurIPS 2000 (Classic paper)
Code Implementations¶
- Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3
- Production-ready SAC, TD3 implementations
-
pip install stable-baselines3 -
CleanRL: https://github.com/vwxyzjn/cleanrl
- Single-file implementations (great for learning)
- SAC:
cleanrl/sac_continuous_action.py -
TD3:
cleanrl/td3_continuous_action.py -
SpinningUp (OpenAI): https://spinningup.openai.com/
- Educational implementations with excellent docs
-
PyTorch and TensorFlow versions
-
RLlib (Ray): https://docs.ray.io/en/latest/rllib/
- Scalable distributed training
- Multi-GPU support
Tutorials¶
- OpenAI Spinning Up: https://spinningup.openai.com/en/latest/
- Comprehensive introduction to deep RL
-
Algorithm explanations and implementations
-
Lil'Log - Policy Gradient: https://lilianweng.github.io/posts/2018-04-08-policy-gradient/
-
Excellent mathematical explanations
-
Sergey Levine's Course: http://rail.eecs.berkeley.edu/deeprlcourse/
- CS 285: Deep Reinforcement Learning
-
Lectures on actor-critic methods
-
SAC Tutorial: https://towardsdatascience.com/soft-actor-critic-demystified-b8427df61665
- Step-by-step SAC explanation
Next Steps¶
- Model-Based RL - Learn environment models for sample efficiency
- On-Policy Algorithms (PPO, TRPO) - Compare with off-policy
- Distributed Training - Scale up training
Framework Guides¶
- RSL-RL - Isaac Lab integration
- RL Games - Fast GPU RL
- Stable-Baselines3 - Easy-to-use RL