RSL-RL: Isaac Lab Integration¶
RSL-RL is a lightweight RL library designed for Isaac Lab (the successor to Isaac Gym), optimized for massively parallel, GPU-accelerated training.
Overview¶
RSL-RL (Robotic Systems Lab - Reinforcement Learning) is developed by ETH Zurich's Robotic Systems Lab for:
- GPU-accelerated RL with thousands of parallel environments
- Isaac Lab integration out-of-the-box
- Legged robot locomotion (ANYmal, quadrupeds)
- Fast iteration for robotics research
Key Features:
- ✓ Optimized PPO implementation
- ✓ Vectorized environments (10,000+ parallel on a single GPU)
- ✓ Privileged training (teacher-student)
- ✓ Curriculum learning support
- ✓ Domain randomization utilities
- ✓ WandB integration
Official Repository: https://github.com/leggedrobotics/rsl_rl
Installation¶
With Isaac Lab¶
If you already have Isaac Lab installed:
# Clone RSL-RL
git clone https://github.com/leggedrobotics/rsl_rl.git
cd rsl_rl
# Install in development mode
pip install -e .
# Verify installation
python -c "import rsl_rl; print(rsl_rl.__version__)"
Standalone Installation¶
# Install dependencies
pip install torch numpy
# Install RSL-RL
pip install rsl-rl
# Or from source
git clone https://github.com/leggedrobotics/rsl_rl.git
cd rsl_rl
pip install -e .
With Isaac Lab (Recommended)¶
Isaac Lab includes RSL-RL by default:
# Isaac Lab installation
git clone https://github.com/isaac-sim/IsaacLab.git
cd IsaacLab
# Run install script
./isaaclab.sh --install
# RSL-RL is included automatically
Quick Start¶
Training Example¶
from rsl_rl.runners import OnPolicyRunner
from rsl_rl.env import VecEnv
import torch
# Create vectorized environment (shown schematically; VecEnv is the abstract
# interface in rsl_rl, and in practice the Isaac Lab task is wrapped to satisfy it)
env = VecEnv(
    task="Isaac-Velocity-Flat-Anymal-D-v0",
    num_envs=4096,  # 4096 parallel environments!
    device="cuda:0"
)
# Configure training
runner = OnPolicyRunner(
    env=env,
    train_cfg={
        "algorithm": {
            "class_name": "PPO",
            "clip_param": 0.2,
            "desired_kl": 0.01,
            "entropy_coef": 0.01,
            "gamma": 0.99,
            "lam": 0.95,
            "learning_rate": 1e-3,
            "max_grad_norm": 1.0,
            "num_learning_epochs": 5,
            "num_mini_batches": 4,
            "schedule": "adaptive",
            "use_clipped_value_loss": True,
            "value_loss_coef": 1.0
        },
        "policy": {
            "class_name": "ActorCritic",
            "activation": "elu",
            "actor_hidden_dims": [512, 256, 128],
            "critic_hidden_dims": [512, 256, 128],
            "init_noise_std": 1.0
        },
        "runner": {
            "algorithm_class_name": "PPO",
            "max_iterations": 1500,
            "num_steps_per_env": 24,
            "save_interval": 50
        }
    }
)
# Train
runner.learn(num_learning_iterations=1500, init_at_random_ep_len=True)
# Save policy
policy_path = "logs/anymal_ppo/model_1500.pt"
runner.save(policy_path)
# Load and evaluate
runner.load(policy_path)
runner.eval_mode()
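At evaluation time the policy is usually run deterministically: take the Gaussian mean and skip sampling. A minimal sketch, using a plain MLP as a stand-in for the trained actor (the layer sizes mirror the `actor_hidden_dims` above; `NUM_OBS` and `NUM_ACTIONS` are placeholder dimensions, not values from the real task):

```python
import torch
import torch.nn as nn

NUM_OBS, NUM_ACTIONS = 48, 12  # placeholder dimensions

# Stand-in for the trained actor network ([512, 256, 128] as configured above)
actor = nn.Sequential(
    nn.Linear(NUM_OBS, 512), nn.ELU(),
    nn.Linear(512, 256), nn.ELU(),
    nn.Linear(256, 128), nn.ELU(),
    nn.Linear(128, NUM_ACTIONS),
)

@torch.no_grad()
def inference_policy(obs: torch.Tensor) -> torch.Tensor:
    """Deterministic policy: return the Gaussian mean, no sampling."""
    return actor(obs)

obs = torch.zeros(4096, NUM_OBS)   # one observation per parallel env
actions = inference_policy(obs)    # shape: (4096, NUM_ACTIONS)
```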
Core Components¶
1. OnPolicyRunner¶
The main training loop manager.
from rsl_rl.runners import OnPolicyRunner
class OnPolicyRunner:
    """
    Manages the PPO training loop with Isaac Lab integration.

    Key features:
    - Automatic logging (TensorBoard, WandB)
    - Checkpoint management
    - Evaluation episodes
    - Adaptive learning rate scheduling
    """

    def __init__(self, env, train_cfg, log_dir=None, device='cuda:0'):
        self.env = env
        self.device = device
        # Create PPO algorithm
        self.alg = PPO(
            actor_critic=ActorCritic(...),
            num_learning_epochs=train_cfg['algorithm']['num_learning_epochs'],
            num_mini_batches=train_cfg['algorithm']['num_mini_batches'],
            clip_param=train_cfg['algorithm']['clip_param'],
            gamma=train_cfg['algorithm']['gamma'],
            lam=train_cfg['algorithm']['lam'],
            value_loss_coef=train_cfg['algorithm']['value_loss_coef'],
            entropy_coef=train_cfg['algorithm']['entropy_coef'],
            learning_rate=train_cfg['algorithm']['learning_rate'],
            max_grad_norm=train_cfg['algorithm']['max_grad_norm'],
            device=device
        )

    def learn(self, num_learning_iterations, init_at_random_ep_len=False):
        """Main training loop."""
        obs = self.env.get_observations()
        for it in range(num_learning_iterations):
            # Rollout
            for step in range(self.num_steps_per_env):
                # Get actions from policy
                actions = self.alg.act(obs)
                # Step environment (all parallel envs at once!)
                obs, rewards, dones, infos = self.env.step(actions)
                # Store transition
                self.alg.process_env_step(rewards, dones, infos)
            # Update policy: bootstrap returns from the value of the last observation
            last_values = self.alg.actor_critic.critic(obs).squeeze(-1)
            self.alg.compute_returns(last_values)
            mean_value_loss, mean_surrogate_loss = self.alg.update()
            # Logging
            self.log({'value_loss': mean_value_loss, 'policy_loss': mean_surrogate_loss})
2. PPO Algorithm¶
Optimized PPO implementation for GPU.
from rsl_rl.algorithms import PPO
import torch
import torch.nn as nn

class PPO:
    """
    Proximal Policy Optimization.

    Optimizations:
    - Vectorized advantage computation
    - GPU-accelerated batch processing
    - Adaptive KL penalty
    """

    def __init__(
        self,
        actor_critic,
        num_learning_epochs=5,
        num_mini_batches=4,
        clip_param=0.2,
        gamma=0.99,
        lam=0.95,
        value_loss_coef=1.0,
        entropy_coef=0.01,
        learning_rate=1e-3,
        max_grad_norm=1.0,
        use_clipped_value_loss=True,
        schedule="adaptive",
        desired_kl=0.01,
        device='cuda:0'
    ):
        self.actor_critic = actor_critic
        self.num_learning_epochs = num_learning_epochs
        self.num_mini_batches = num_mini_batches
        self.clip_param = clip_param
        self.gamma = gamma
        self.lam = lam
        self.value_loss_coef = value_loss_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm
        self.use_clipped_value_loss = use_clipped_value_loss
        self.device = device
        # Optimizer
        self.optimizer = torch.optim.Adam(actor_critic.parameters(), lr=learning_rate)
        # Adaptive schedule
        self.schedule = schedule
        self.desired_kl = desired_kl
        self.learning_rate = learning_rate

    def act(self, obs):
        """Sample actions from the policy."""
        return self.actor_critic.act(obs)

    def compute_returns(self, last_values):
        """
        Compute returns and advantages using GAE.

        The rollout tensors (self.rewards, self.values, self.dones) are filled
        by process_env_step during the rollout; the backward pass below is
        fully vectorized across environments for GPU acceleration.
        """
        advantage = torch.zeros_like(self.rewards)
        last_gae_lam = 0
        for step in reversed(range(self.num_transitions_per_env)):
            if step == self.num_transitions_per_env - 1:
                next_values = last_values
            else:
                next_values = self.values[step + 1]
            next_is_not_terminal = 1.0 - self.dones[step].float()
            delta = (
                self.rewards[step]
                + self.gamma * next_values * next_is_not_terminal
                - self.values[step]
            )
            advantage[step] = last_gae_lam = (
                delta + self.gamma * self.lam * next_is_not_terminal * last_gae_lam
            )
        self.returns = advantage + self.values
        # Normalize advantages
        self.advantages = (advantage - advantage.mean()) / (advantage.std() + 1e-8)

    def update(self):
        """
        Update the policy with mini-batch gradient descent.

        Returns:
            mean_value_loss, mean_surrogate_loss
        """
        mean_value_loss = 0
        mean_surrogate_loss = 0
        # Flatten batch dimensions
        batch = self.storage.get_batch()
        for epoch in range(self.num_learning_epochs):
            # Mini-batch updates
            for indices in batch.mini_batch_generator(self.num_mini_batches):
                # Evaluate actions
                actions_log_prob, values, entropy = self.actor_critic.evaluate(
                    batch.observations[indices],
                    batch.actions[indices]
                )
                # Clipped surrogate (policy) loss
                ratio = torch.exp(actions_log_prob - batch.old_actions_log_prob[indices])
                surrogate = -batch.advantages[indices] * ratio
                surrogate_clipped = -batch.advantages[indices] * torch.clamp(
                    ratio, 1.0 - self.clip_param, 1.0 + self.clip_param
                )
                surrogate_loss = torch.max(surrogate, surrogate_clipped).mean()
                # Value loss
                if self.use_clipped_value_loss:
                    value_pred_clipped = batch.values[indices] + (
                        values - batch.values[indices]
                    ).clamp(-self.clip_param, self.clip_param)
                    value_losses = (values - batch.returns[indices]).pow(2)
                    value_losses_clipped = (value_pred_clipped - batch.returns[indices]).pow(2)
                    value_loss = torch.max(value_losses, value_losses_clipped).mean()
                else:
                    value_loss = (batch.returns[indices] - values).pow(2).mean()
                # Total loss
                loss = (
                    surrogate_loss
                    + self.value_loss_coef * value_loss
                    - self.entropy_coef * entropy.mean()
                )
                # Gradient step
                self.optimizer.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(self.actor_critic.parameters(), self.max_grad_norm)
                self.optimizer.step()
                mean_value_loss += value_loss.item()
                mean_surrogate_loss += surrogate_loss.item()
        num_updates = self.num_learning_epochs * self.num_mini_batches
        mean_value_loss /= num_updates
        mean_surrogate_loss /= num_updates
        # Adaptive learning rate (approximate KL from the last mini-batch)
        if self.schedule == "adaptive":
            with torch.no_grad():
                kl = (batch.old_actions_log_prob[indices] - actions_log_prob).mean()
            if kl > self.desired_kl * 2.0:
                self.learning_rate = max(1e-5, self.learning_rate / 1.5)
            elif kl < self.desired_kl / 2.0 and self.learning_rate < 1e-2:
                self.learning_rate = min(1e-2, self.learning_rate * 1.5)
            for param_group in self.optimizer.param_groups:
                param_group['lr'] = self.learning_rate
        return mean_value_loss, mean_surrogate_loss
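The GAE recursion in `compute_returns` can be sanity-checked on a tiny hand-built rollout. This standalone sketch reimplements the same backward pass outside the class, on made-up numbers:

```python
import torch

gamma, lam = 0.99, 0.95
# Toy rollout: 3 steps, 2 parallel envs (all numbers are illustrative)
rewards = torch.tensor([[1.0, 0.5], [0.0, 1.0], [1.0, 0.0]])
values  = torch.tensor([[0.5, 0.4], [0.6, 0.3], [0.2, 0.1]])
dones   = torch.tensor([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # env 1 resets at step 1
last_values = torch.tensor([0.3, 0.2])  # bootstrap value after the rollout

advantages = torch.zeros_like(rewards)
last_gae = torch.zeros(2)
for step in reversed(range(3)):
    next_values = last_values if step == 2 else values[step + 1]
    not_terminal = 1.0 - dones[step]          # zero out bootstrap across resets
    delta = rewards[step] + gamma * next_values * not_terminal - values[step]
    last_gae = delta + gamma * lam * not_terminal * last_gae
    advantages[step] = last_gae
returns = advantages + values
```

Note how the `done` flag at step 1 for env 1 cuts the recursion: its advantage there reduces to the plain TD error `reward - value`, with no bootstrapped tail.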
3. ActorCritic Network¶
from rsl_rl.modules import ActorCritic
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """
    Actor-critic network with separate actor and critic MLPs.

    Features:
    - ELU activations (smooth gradients, a common choice in robotics RL)
    - Orthogonal initialization
    - Separate actor and critic networks (common in robotics)
    """

    def __init__(
        self,
        num_obs,
        num_actions,
        actor_hidden_dims=[256, 256, 256],
        critic_hidden_dims=[256, 256, 256],
        activation='elu',
        init_noise_std=1.0
    ):
        super().__init__()
        # Activation
        activation = get_activation(activation)

        # Policy (actor)
        actor_layers = [nn.Linear(num_obs, actor_hidden_dims[0]), activation]
        for i in range(len(actor_hidden_dims) - 1):
            actor_layers.append(nn.Linear(actor_hidden_dims[i], actor_hidden_dims[i + 1]))
            actor_layers.append(activation)
        actor_layers.append(nn.Linear(actor_hidden_dims[-1], num_actions))
        self.actor = nn.Sequential(*actor_layers)

        # Value function (critic)
        critic_layers = [nn.Linear(num_obs, critic_hidden_dims[0]), activation]
        for i in range(len(critic_hidden_dims) - 1):
            critic_layers.append(nn.Linear(critic_hidden_dims[i], critic_hidden_dims[i + 1]))
            critic_layers.append(activation)
        critic_layers.append(nn.Linear(critic_hidden_dims[-1], 1))
        self.critic = nn.Sequential(*critic_layers)

        # Action noise (state-independent log standard deviation)
        self.log_std = nn.Parameter(torch.log(torch.ones(num_actions) * init_noise_std))

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        """Orthogonal initialization."""
        if isinstance(module, nn.Linear):
            nn.init.orthogonal_(module.weight, gain=1.0)
            nn.init.constant_(module.bias, 0.0)

    def forward(self, observations):
        """Forward pass (not typically used directly)."""
        return self.actor(observations), self.critic(observations)

    def act(self, observations):
        """Sample actions from the policy's Gaussian distribution."""
        mean = self.actor(observations)
        std = torch.exp(self.log_std)
        dist = torch.distributions.Normal(mean, std)
        # Return the raw sample: clamping it here would bias the PPO update
        # unless the stored log-probabilities were adjusted to match.
        return dist.sample()

    def evaluate(self, observations, actions):
        """Evaluate actions (for the PPO update)."""
        mean = self.actor(observations)
        std = torch.exp(self.log_std)
        dist = torch.distributions.Normal(mean, std)
        actions_log_prob = dist.log_prob(actions).sum(dim=-1)
        entropy = dist.entropy().sum(dim=-1)
        value = self.critic(observations).squeeze(-1)
        return actions_log_prob, value, entropy

def get_activation(act_name):
    """Map an activation name to its module."""
    if act_name == "elu":
        return nn.ELU()
    elif act_name == "relu":
        return nn.ReLU()
    elif act_name == "tanh":
        return nn.Tanh()
    elif act_name == "selu":
        return nn.SELU()
    else:
        raise ValueError(f"Unknown activation: {act_name}")
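A quick shape check of the act/evaluate interface is a useful habit before wiring a network into the runner. This condensed Gaussian actor-critic has the same structure as above, with smaller hidden sizes for brevity:

```python
import torch
import torch.nn as nn

class TinyActorCritic(nn.Module):
    """Condensed version of the ActorCritic above, for shape checks."""
    def __init__(self, num_obs=8, num_actions=3, hidden=32):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(num_obs, hidden), nn.ELU(),
                                   nn.Linear(hidden, num_actions))
        self.critic = nn.Sequential(nn.Linear(num_obs, hidden), nn.ELU(),
                                    nn.Linear(hidden, 1))
        self.log_std = nn.Parameter(torch.zeros(num_actions))

    def act(self, obs):
        dist = torch.distributions.Normal(self.actor(obs), self.log_std.exp())
        return dist.sample()

    def evaluate(self, obs, actions):
        dist = torch.distributions.Normal(self.actor(obs), self.log_std.exp())
        return (dist.log_prob(actions).sum(-1),   # per-sample log-prob
                self.critic(obs).squeeze(-1),     # per-sample value estimate
                dist.entropy().sum(-1))           # per-sample entropy

ac = TinyActorCritic()
obs = torch.randn(16, 8)              # batch of 16 observations
actions = ac.act(obs)                 # (16, 3)
logp, value, ent = ac.evaluate(obs, actions)
```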
Advanced Features¶
Privileged Training (Teacher-Student)¶
Train with privileged information (ground truth) then distill to student policy.
# Train teacher with privileged info
teacher_runner = OnPolicyRunner(
    env=env,
    train_cfg={
        "policy": {
            "class_name": "ActorCriticPrivileged",
            "num_obs": 48,              # Normal observations
            "num_privileged_obs": 187,  # + privileged info (ground truth)
            "actor_hidden_dims": [512, 256, 128],
            "critic_hidden_dims": [512, 256, 128]
        }
    }
)
teacher_runner.learn(num_learning_iterations=1500)

# Distill to student (deployment policy)
student_runner = OnPolicyRunner(
    env=env,
    train_cfg={
        "policy": {
            "class_name": "ActorCritic",
            "num_obs": 48,  # Only normal observations
            "actor_hidden_dims": [512, 256, 128]
        }
    }
)

# Train student to mimic teacher
student_runner.learn_from_teacher(
    teacher_policy=teacher_runner.alg.actor_critic,
    num_learning_iterations=500
)
When to use:
- ✓ Sim-to-real transfer (privileged = perfect sim state)
- ✓ Partial observability (privileged = full state)
- ✓ Terrain adaptation (privileged = terrain map)
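The distillation step above is shown schematically; at its core, teacher-student training is a supervised regression of student actions onto teacher actions. A minimal, self-contained sketch (plain MLPs stand in for the two policies, the dimensions are illustrative, and random tensors stand in for environment observations):

```python
import torch
import torch.nn as nn

NUM_OBS, NUM_PRIV, NUM_ACT = 48, 187, 12  # illustrative dimensions

teacher = nn.Sequential(nn.Linear(NUM_OBS + NUM_PRIV, 64), nn.ELU(),
                        nn.Linear(64, NUM_ACT))   # sees privileged state
student = nn.Sequential(nn.Linear(NUM_OBS, 64), nn.ELU(),
                        nn.Linear(64, NUM_ACT))   # deployment observations only

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(100):
    obs = torch.randn(256, NUM_OBS)    # stand-in for env observations
    priv = torch.randn(256, NUM_PRIV)  # privileged sim state (teacher only)
    with torch.no_grad():
        target = teacher(torch.cat([obs, priv], dim=-1))
    loss = nn.functional.mse_loss(student(obs), target)  # behavior cloning loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice the student is usually trained on rollouts it generates itself (DAgger-style) rather than on random states, so that it sees its own visitation distribution.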
Curriculum Learning¶
Progressively increase task difficulty.
from rsl_rl.env import VecEnv
class CurriculumEnv(VecEnv):
    """Environment with curriculum learning."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.curriculum_level = 0
        self.success_threshold = 0.8

    def update_curriculum(self, success_rate):
        """Update curriculum based on performance."""
        if success_rate > self.success_threshold:
            self.curriculum_level += 1
            print(f"Curriculum advanced to level {self.curriculum_level}")
            # Increase difficulty
            if self.curriculum_level == 1:
                # Add obstacles
                self.spawn_obstacles(density=0.1)
            elif self.curriculum_level == 2:
                # Increase obstacle density
                self.spawn_obstacles(density=0.3)
            elif self.curriculum_level == 3:
                # Add terrain roughness
                self.enable_rough_terrain(roughness=0.1)
            # etc.

# Use in training loop
runner = OnPolicyRunner(env=curriculum_env, train_cfg=cfg)
for iteration in range(max_iterations):
    runner.learn(num_learning_iterations=100)
    # Evaluate
    success_rate = runner.evaluate(num_episodes=100)
    # Update curriculum
    curriculum_env.update_curriculum(success_rate)
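The gating logic can also be isolated into a small, environment-independent helper that is easy to unit-test; a sketch (the threshold and level cap are illustrative):

```python
class CurriculumTracker:
    """Advance difficulty only when success stays above a threshold."""

    def __init__(self, success_threshold=0.8, max_level=3):
        self.level = 0
        self.success_threshold = success_threshold
        self.max_level = max_level

    def update(self, success_rate: float) -> bool:
        """Return True if the curriculum level was advanced."""
        if success_rate > self.success_threshold and self.level < self.max_level:
            self.level += 1
            return True
        return False

tracker = CurriculumTracker()
tracker.update(0.9)   # advances to level 1
tracker.update(0.5)   # below threshold: stays at level 1
```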
Domain Randomization¶
RSL-RL provides utilities for domain randomization:
from rsl_rl.utils import RandomizationCfg
# Configure randomization
randomization_cfg = RandomizationCfg(
    # Physics randomization
    friction_range=[0.5, 1.25],
    restitution_range=[0.0, 0.5],
    mass_range=[0.9, 1.1],  # ±10%
    # Motor randomization
    motor_strength_range=[0.9, 1.1],
    motor_offset_range=[-0.02, 0.02],
    # Sensor randomization
    imu_noise_std=0.01,
    joint_pos_noise_std=0.01,
    # Visual randomization (if using vision)
    light_randomization=True,
    texture_randomization=True,
    # External forces
    push_force_range=[0, 50],  # Random pushes
    push_interval_s=5.0
)

# Apply to environment
env.set_randomization(randomization_cfg)
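Under the hood, domain randomization amounts to resampling physical parameters independently per environment at each reset, plus additive sensor noise every step. A minimal sketch of that sampling (the ranges mirror the config above; the joint count is illustrative):

```python
import torch

num_envs = 4096
friction_range = (0.5, 1.25)
mass_scale_range = (0.9, 1.1)  # ±10% around nominal mass

def uniform(lo, hi, n):
    """Sample n values uniformly from [lo, hi)."""
    return lo + (hi - lo) * torch.rand(n)

# One independent draw per parallel environment (done at each reset)
friction = uniform(*friction_range, num_envs)
mass_scale = uniform(*mass_scale_range, num_envs)

# Additive Gaussian sensor noise, applied every simulation step
joint_pos_noise = 0.01 * torch.randn(num_envs, 12)  # 12 joints (illustrative)
```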
Isaac Lab Integration¶
Complete Training Script¶
"""
train_anymal_velocity.py

Train the ANYmal quadruped for velocity tracking.
"""
import isaaclab  # Import first to initialize Isaac Sim
from isaaclab.envs import ManagerBasedRLEnv
from rsl_rl.runners import OnPolicyRunner
import torch

def main():
    # Environment configuration
    env_cfg = {
        "scene": {
            "robot": "anymal_d",
            "terrain": "flat",
            "num_envs": 4096
        },
        "observations": {
            "policy": [
                "base_lin_vel",
                "base_ang_vel",
                "projected_gravity",
                "commands",
                "joint_pos",
                "joint_vel",
                "actions"
            ]
        },
        "actions": {
            "joint_positions": {
                "class": "JointPositionAction",
                "asset_name": "robot",
                "joint_names": [".*"],
                "scale": 0.5,
                "offset": 0.0
            }
        },
        "rewards": {
            "track_lin_vel_xy": {"weight": 1.0, "params": {"std": 0.5}},
            "track_ang_vel_z": {"weight": 0.5, "params": {"std": 0.5}},
            "lin_vel_z": {"weight": -2.0},
            "ang_vel_xy": {"weight": -0.05},
            "orientation": {"weight": -1.0},
            "torques": {"weight": -0.0001},
            "joint_acc": {"weight": -2.5e-7},
            "action_rate": {"weight": -0.01},
            "feet_air_time": {"weight": 1.0, "params": {"threshold": 0.5}}
        },
        "terminations": {
            "base_contact": True,
            "timeout": {"time": 20.0}
        }
    }

    # Create environment
    env = ManagerBasedRLEnv(cfg=env_cfg)

    # Training configuration
    train_cfg = {
        "algorithm": {
            "class_name": "PPO",
            "value_loss_coef": 1.0,
            "use_clipped_value_loss": True,
            "clip_param": 0.2,
            "entropy_coef": 0.01,
            "num_learning_epochs": 5,
            "num_mini_batches": 4,
            "learning_rate": 1e-3,
            "schedule": "adaptive",
            "gamma": 0.99,
            "lam": 0.95,
            "desired_kl": 0.01,
            "max_grad_norm": 1.0
        },
        "policy": {
            "class_name": "ActorCritic",
            "init_noise_std": 1.0,
            "actor_hidden_dims": [512, 256, 128],
            "critic_hidden_dims": [512, 256, 128],
            "activation": "elu"
        },
        "runner": {
            "algorithm_class_name": "PPO",
            "max_iterations": 1500,
            "num_steps_per_env": 24,
            "save_interval": 50,
            "experiment_name": "anymal_velocity",
            "run_name": "",
            "logger": "wandb",
            "neptune_project": "isaaclab",
            "wandb_project": "isaaclab",
            "resume": False,
            "load_run": -1,
            "checkpoint": -1
        }
    }

    # Create runner
    runner = OnPolicyRunner(env=env, train_cfg=train_cfg, log_dir="logs", device="cuda:0")

    # Train
    runner.learn(
        num_learning_iterations=train_cfg["runner"]["max_iterations"],
        init_at_random_ep_len=True
    )

    # Close environment
    env.close()

if __name__ == "__main__":
    main()
Running Training¶
# Train
python train_anymal_velocity.py
# Train with specific device
CUDA_VISIBLE_DEVICES=0 python train_anymal_velocity.py
# Resume from checkpoint
python train_anymal_velocity.py --resume --load_run 50
# Evaluate trained policy
python train_anymal_velocity.py --play --checkpoint logs/anymal_velocity/model_1500.pt
Tips & Best Practices¶
Hyperparameter Tuning¶
For legged locomotion:
locomotion_cfg = {
    "num_steps_per_env": 24,  # Short rollouts for fast feedback
    "gamma": 0.99,
    "lam": 0.95,
    "learning_rate": 1e-3,    # Can go higher with a large batch
    "clip_param": 0.2,
    "num_learning_epochs": 5,
    "num_mini_batches": 4,
    "entropy_coef": 0.01      # Encourage exploration
}
For manipulation:
manipulation_cfg = {
    "num_steps_per_env": 32,  # Longer rollouts
    "gamma": 0.995,           # Higher discount for precision
    "lam": 0.97,
    "learning_rate": 3e-4,    # Lower LR for stability
    "clip_param": 0.15,       # Smaller clip range
    "num_learning_epochs": 8,
    "num_mini_batches": 8,
    "entropy_coef": 0.001     # Less exploration
}
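These settings determine the effective batch size, which is worth computing before launching since memory use and gradient noise both scale with it:

```python
num_envs = 4096
num_steps_per_env = 24
num_mini_batches = 4

batch_size = num_envs * num_steps_per_env          # transitions collected per iteration
mini_batch_size = batch_size // num_mini_batches   # samples per gradient step

print(batch_size, mini_batch_size)  # 98304 transitions, 24576 per mini-batch
```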
Debugging Tips¶
# 1. Check observation/action spaces
print(f"Obs space: {env.observation_space}")
print(f"Action space: {env.action_space}")

# 2. Visualize the first environment
env.render()

# 3. Check the reward distribution under random actions
rewards = []
for step in range(100):
    actions = env.action_space.sample()
    obs, rew, done, info = env.step(actions)
    rewards.append(rew[0].item())

import matplotlib.pyplot as plt
plt.hist(rewards, bins=50)
plt.xlabel("Reward")
plt.show()

# 4. Monitor GPU usage (run in a separate terminal):
#    nvidia-smi -l 1

# 5. Profile training
import torch.profiler
with torch.profiler.profile() as prof:
    runner.learn(num_learning_iterations=10)
print(prof.key_averages().table())
Common Issues¶
Problem: OOM (Out of Memory)
# Solution: Reduce num_envs or batch size
env_cfg["scene"]["num_envs"] = 2048 # Instead of 4096
train_cfg["algorithm"]["num_mini_batches"] = 8 # More batches = smaller batches
Problem: Training unstable
# Solution: Reduce learning rate, add gradient clipping
train_cfg["algorithm"]["learning_rate"] = 5e-4
train_cfg["algorithm"]["max_grad_norm"] = 0.5
Problem: Policy not improving
# Solution: Check reward scale, increase entropy
# Scale rewards to be roughly [-1, 1]
reward_scale = 0.1
env_cfg["rewards"]["track_lin_vel_xy"]["weight"] *= reward_scale
# Increase exploration
train_cfg["algorithm"]["entropy_coef"] = 0.05
References¶
Official Resources¶
- RSL-RL GitHub: https://github.com/leggedrobotics/rsl_rl
- Isaac Lab: https://isaac-sim.github.io/IsaacLab/
- ETH RSL: https://rsl.ethz.ch/
Papers¶
- Learning to Walk: Lee et al., "Learning Quadrupedal Locomotion over Challenging Terrain", Science Robotics 2020
- Privileged Learning: Kumar et al., "RMA: Rapid Motor Adaptation for Legged Robots", RSS 2021
- Isaac Gym: Makoviychuk et al., "Isaac Gym: High Performance GPU-Based Physics Simulation", NeurIPS 2021
Tutorials¶
- Isaac Lab Tutorials: https://isaac-sim.github.io/IsaacLab/source/tutorials/index.html
- Legged Gym: https://github.com/leggedrobotics/legged_gym (companion environment suite built on RSL-RL)
Next Steps¶
- RL Games - Alternative fast RL library
- Stable-Baselines3 - More general-purpose RL
- Isaac Lab Advanced - Deep dive into Isaac Lab RL