RSL-RL: Isaac Lab Integration¶
RSL-RL is a lightweight RL library designed for Isaac Lab (the successor to Isaac Gym), optimized for massively parallel, GPU-accelerated training.
Overview¶
RSL-RL (Robotic Systems Lab - Reinforcement Learning) is developed by ETH Zurich's Robotic Systems Lab for:
- GPU-accelerated RL with thousands of parallel environments
- Isaac Lab integration out-of-the-box
- Legged robot locomotion (ANYmal, quadrupeds)
- Fast iteration for robotics research
Key Features:
- ✓ Optimized PPO implementation
- ✓ Vectorized environments (10,000+ parallel on a single GPU)
- ✓ Privileged training (teacher-student)
- ✓ Curriculum learning support
- ✓ Domain randomization utilities
- ✓ WandB integration
Official Repository: https://github.com/leggedrobotics/rsl_rl
Installation¶
With Isaac Lab¶
If you already have Isaac Lab installed:
# Clone RSL-RL
git clone https://github.com/leggedrobotics/rsl_rl.git
cd rsl_rl
# Install in development mode
pip install -e .
# Verify installation
python -c "import rsl_rl; print(rsl_rl.__version__)"
Standalone Installation¶
# Install dependencies
pip install torch numpy
# Install RSL-RL
pip install rsl-rl
# Or from source
git clone https://github.com/leggedrobotics/rsl_rl.git
cd rsl_rl
pip install -e .
With Isaac Lab (Recommended)¶
Isaac Lab includes RSL-RL by default:
# Isaac Lab installation
git clone https://github.com/isaac-sim/IsaacLab.git
cd IsaacLab
# Run install script
./isaaclab.sh --install
# RSL-RL is included automatically
Quick Start¶
Training Example¶
from rsl_rl.runners import OnPolicyRunner
from rsl_rl.env import VecEnv
import torch
# Create vectorized environment (shown schematically; VecEnv is the abstract
# interface in rsl_rl, and in practice the Isaac Lab task is wrapped to satisfy it)
env = VecEnv(
    task="Isaac-Velocity-Flat-Anymal-D-v0",
    num_envs=4096,  # 4096 parallel environments!
    device="cuda:0"
)
# Configure training
runner = OnPolicyRunner(
    env=env,
    train_cfg={
        "algorithm": {
            "class_name": "PPO",
            "clip_param": 0.2,
            "desired_kl": 0.01,
            "entropy_coef": 0.01,
            "gamma": 0.99,
            "lam": 0.95,
            "learning_rate": 1e-3,
            "max_grad_norm": 1.0,
            "num_learning_epochs": 5,
            "num_mini_batches": 4,
            "schedule": "adaptive",
            "use_clipped_value_loss": True,
            "value_loss_coef": 1.0
        },
        "policy": {
            "class_name": "ActorCritic",
            "activation": "elu",
            "actor_hidden_dims": [512, 256, 128],
            "critic_hidden_dims": [512, 256, 128],
            "init_noise_std": 1.0
        },
        "runner": {
            "algorithm_class_name": "PPO",
            "max_iterations": 1500,
            "num_steps_per_env": 24,
            "save_interval": 50
        }
    }
)
# Train
runner.learn(num_learning_iterations=1500, init_at_random_ep_len=True)
# Save policy
policy_path = "logs/anymal_ppo/model_1500.pt"
runner.save(policy_path)
# Load and evaluate
runner.load(policy_path)
runner.eval_mode()
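At evaluation time the policy is usually run deterministically: take the Gaussian mean and skip sampling. A minimal sketch, using a plain MLP as a stand-in for the trained actor (the layer sizes mirror the `actor_hidden_dims` above; `NUM_OBS` and `NUM_ACTIONS` are placeholder dimensions, not values from the real task):

```python
import torch
import torch.nn as nn

NUM_OBS, NUM_ACTIONS = 48, 12  # placeholder dimensions

# Stand-in for the trained actor network ([512, 256, 128] as configured above)
actor = nn.Sequential(
    nn.Linear(NUM_OBS, 512), nn.ELU(),
    nn.Linear(512, 256), nn.ELU(),
    nn.Linear(256, 128), nn.ELU(),
    nn.Linear(128, NUM_ACTIONS),
)

@torch.no_grad()
def inference_policy(obs: torch.Tensor) -> torch.Tensor:
    """Deterministic policy: return the Gaussian mean, no sampling."""
    return actor(obs)

obs = torch.zeros(4096, NUM_OBS)   # one observation per parallel env
actions = inference_policy(obs)    # shape: (4096, NUM_ACTIONS)
```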
Core Components¶
1. OnPolicyRunner¶
The main training loop manager.
from rsl_rl.runners import OnPolicyRunner
class OnPolicyRunner:
    """
    Manages the PPO training loop with Isaac Lab integration.

    Key features:
    - Automatic logging (TensorBoard, WandB)
    - Checkpoint management
    - Evaluation episodes
    - Adaptive learning rate scheduling
    """

    def __init__(self, env, train_cfg, log_dir=None, device='cuda:0'):
        self.env = env
        self.device = device
        # Create PPO algorithm
        self.alg = PPO(
            actor_critic=ActorCritic(...),
            num_learning_epochs=train_cfg['algorithm']['num_learning_epochs'],
            num_mini_batches=train_cfg['algorithm']['num_mini_batches'],
            clip_param=train_cfg['algorithm']['clip_param'],
            gamma=train_cfg['algorithm']['gamma'],
            lam=train_cfg['algorithm']['lam'],
            value_loss_coef=train_cfg['algorithm']['value_loss_coef'],
            entropy_coef=train_cfg['algorithm']['entropy_coef'],
            learning_rate=train_cfg['algorithm']['learning_rate'],
            max_grad_norm=train_cfg['algorithm']['max_grad_norm'],
            device=device
        )

    def learn(self, num_learning_iterations, init_at_random_ep_len=False):
        """Main training loop."""
        obs = self.env.get_observations()
        for it in range(num_learning_iterations):
            # Rollout
            for step in range(self.num_steps_per_env):
                # Get actions from policy
                actions = self.alg.act(obs)
                # Step environment (all parallel envs at once!)
                obs, rewards, dones, infos = self.env.step(actions)
                # Store transition
                self.alg.process_env_step(rewards, dones, infos)
            # Update policy: bootstrap returns from the value of the last observation
            last_values = self.alg.actor_critic.critic(obs).squeeze(-1)
            self.alg.compute_returns(last_values)
            mean_value_loss, mean_surrogate_loss = self.alg.update()
            # Logging
            self.log({'value_loss': mean_value_loss, 'policy_loss': mean_surrogate_loss})
2. PPO Algorithm¶
Optimized PPO implementation for GPU.
from rsl_rl.algorithms import PPO
import torch
import torch.nn as nn

class PPO:
    """
    Proximal Policy Optimization.

    Optimizations:
    - Vectorized advantage computation
    - GPU-accelerated batch processing
    - Adaptive KL penalty
    """

    def __init__(
        self,
        actor_critic,
        num_learning_epochs=5,
        num_mini_batches=4,
        clip_param=0.2,
        gamma=0.99,
        lam=0.95,
        value_loss_coef=1.0,
        entropy_coef=0.01,
        learning_rate=1e-3,
        max_grad_norm=1.0,
        use_clipped_value_loss=True,
        schedule="adaptive",
        desired_kl=0.01,
        device='cuda:0'
    ):
        self.actor_critic = actor_critic
        self.num_learning_epochs = num_learning_epochs
        self.num_mini_batches = num_mini_batches
        self.clip_param = clip_param
        self.gamma = gamma
        self.lam = lam
        self.value_loss_coef = value_loss_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm
        self.use_clipped_value_loss = use_clipped_value_loss
        self.device = device
        # Optimizer
        self.optimizer = torch.optim.Adam(actor_critic.parameters(), lr=learning_rate)
        # Adaptive schedule
        self.schedule = schedule
        self.desired_kl = desired_kl
        self.learning_rate = learning_rate

    def act(self, obs):
        """Sample actions from the policy."""
        return self.actor_critic.act(obs)

    def compute_returns(self, last_values):
        """
        Compute returns and advantages using GAE.

        The rollout tensors (self.rewards, self.values, self.dones) are filled
        by process_env_step during the rollout; the backward pass below is
        fully vectorized across environments for GPU acceleration.
        """
        advantage = torch.zeros_like(self.rewards)
        last_gae_lam = 0
        for step in reversed(range(self.num_transitions_per_env)):
            if step == self.num_transitions_per_env - 1:
                next_values = last_values
            else:
                next_values = self.values[step + 1]
            next_is_not_terminal = 1.0 - self.dones[step].float()
            delta = (
                self.rewards[step]
                + self.gamma * next_values * next_is_not_terminal
                - self.values[step]
            )
            advantage[step] = last_gae_lam = (
                delta + self.gamma * self.lam * next_is_not_terminal * last_gae_lam
            )
        self.returns = advantage + self.values
        # Normalize advantages
        self.advantages = (advantage - advantage.mean()) / (advantage.std() + 1e-8)

    def update(self):
        """
        Update the policy with mini-batch gradient descent.

        Returns:
            mean_value_loss, mean_surrogate_loss
        """
        mean_value_loss = 0
        mean_surrogate_loss = 0
        # Flatten batch dimensions
        batch = self.storage.get_batch()
        for epoch in range(self.num_learning_epochs):
            # Mini-batch updates
            for indices in batch.mini_batch_generator(self.num_mini_batches):
                # Evaluate actions
                actions_log_prob, values, entropy = self.actor_critic.evaluate(
                    batch.observations[indices],
                    batch.actions[indices]
                )
                # Clipped surrogate (policy) loss
                ratio = torch.exp(actions_log_prob - batch.old_actions_log_prob[indices])
                surrogate = -batch.advantages[indices] * ratio
                surrogate_clipped = -batch.advantages[indices] * torch.clamp(
                    ratio, 1.0 - self.clip_param, 1.0 + self.clip_param
                )
                surrogate_loss = torch.max(surrogate, surrogate_clipped).mean()
                # Value loss
                if self.use_clipped_value_loss:
                    value_pred_clipped = batch.values[indices] + (
                        values - batch.values[indices]
                    ).clamp(-self.clip_param, self.clip_param)
                    value_losses = (values - batch.returns[indices]).pow(2)
                    value_losses_clipped = (value_pred_clipped - batch.returns[indices]).pow(2)
                    value_loss = torch.max(value_losses, value_losses_clipped).mean()
                else:
                    value_loss = (batch.returns[indices] - values).pow(2).mean()
                # Total loss
                loss = (
                    surrogate_loss
                    + self.value_loss_coef * value_loss
                    - self.entropy_coef * entropy.mean()
                )
                # Gradient step
                self.optimizer.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(self.actor_critic.parameters(), self.max_grad_norm)
                self.optimizer.step()
                mean_value_loss += value_loss.item()
                mean_surrogate_loss += surrogate_loss.item()
        num_updates = self.num_learning_epochs * self.num_mini_batches
        mean_value_loss /= num_updates
        mean_surrogate_loss /= num_updates
        # Adaptive learning rate (approximate KL from the last mini-batch)
        if self.schedule == "adaptive":
            with torch.no_grad():
                kl = (batch.old_actions_log_prob[indices] - actions_log_prob).mean()
            if kl > self.desired_kl * 2.0:
                self.learning_rate = max(1e-5, self.learning_rate / 1.5)
            elif kl < self.desired_kl / 2.0 and self.learning_rate < 1e-2:
                self.learning_rate = min(1e-2, self.learning_rate * 1.5)
            for param_group in self.optimizer.param_groups:
                param_group['lr'] = self.learning_rate
        return mean_value_loss, mean_surrogate_loss
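The GAE recursion in `compute_returns` can be sanity-checked on a tiny hand-built rollout. This standalone sketch reimplements the same backward pass outside the class, on made-up numbers:

```python
import torch

gamma, lam = 0.99, 0.95
# Toy rollout: 3 steps, 2 parallel envs (all numbers are illustrative)
rewards = torch.tensor([[1.0, 0.5], [0.0, 1.0], [1.0, 0.0]])
values  = torch.tensor([[0.5, 0.4], [0.6, 0.3], [0.2, 0.1]])
dones   = torch.tensor([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # env 1 resets at step 1
last_values = torch.tensor([0.3, 0.2])  # bootstrap value after the rollout

advantages = torch.zeros_like(rewards)
last_gae = torch.zeros(2)
for step in reversed(range(3)):
    next_values = last_values if step == 2 else values[step + 1]
    not_terminal = 1.0 - dones[step]          # zero out bootstrap across resets
    delta = rewards[step] + gamma * next_values * not_terminal - values[step]
    last_gae = delta + gamma * lam * not_terminal * last_gae
    advantages[step] = last_gae
returns = advantages + values
```

Note how the `done` flag at step 1 for env 1 cuts the recursion: its advantage there reduces to the plain TD error `reward - value`, with no bootstrapped tail.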
3. ActorCritic Network¶
from rsl_rl.modules import ActorCritic
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """
    Actor-critic network with separate actor and critic MLPs.

    Features:
    - ELU activations (smooth gradients, a common choice in robotics RL)
    - Orthogonal initialization
    - Separate actor and critic networks (common in robotics)
    """

    def __init__(
        self,
        num_obs,
        num_actions,
        actor_hidden_dims=[256, 256, 256],
        critic_hidden_dims=[256, 256, 256],
        activation='elu',
        init_noise_std=1.0
    ):
        super().__init__()
        # Activation
        activation = get_activation(activation)

        # Policy (actor)
        actor_layers = [nn.Linear(num_obs, actor_hidden_dims[0]), activation]
        for i in range(len(actor_hidden_dims) - 1):
            actor_layers.append(nn.Linear(actor_hidden_dims[i], actor_hidden_dims[i + 1]))
            actor_layers.append(activation)
        actor_layers.append(nn.Linear(actor_hidden_dims[-1], num_actions))
        self.actor = nn.Sequential(*actor_layers)

        # Value function (critic)
        critic_layers = [nn.Linear(num_obs, critic_hidden_dims[0]), activation]
        for i in range(len(critic_hidden_dims) - 1):
            critic_layers.append(nn.Linear(critic_hidden_dims[i], critic_hidden_dims[i + 1]))
            critic_layers.append(activation)
        critic_layers.append(nn.Linear(critic_hidden_dims[-1], 1))
        self.critic = nn.Sequential(*critic_layers)

        # Action noise (state-independent log standard deviation)
        self.log_std = nn.Parameter(torch.log(torch.ones(num_actions) * init_noise_std))

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        """Orthogonal initialization."""
        if isinstance(module, nn.Linear):
            nn.init.orthogonal_(module.weight, gain=1.0)
            nn.init.constant_(module.bias, 0.0)

    def forward(self, observations):
        """Forward pass (not typically used directly)."""
        return self.actor(observations), self.critic(observations)

    def act(self, observations):
        """Sample actions from the policy's Gaussian distribution."""
        mean = self.actor(observations)
        std = torch.exp(self.log_std)
        dist = torch.distributions.Normal(mean, std)
        # Return the raw sample: clamping it here would bias the PPO update
        # unless the stored log-probabilities were adjusted to match.
        return dist.sample()

    def evaluate(self, observations, actions):
        """Evaluate actions (for the PPO update)."""
        mean = self.actor(observations)
        std = torch.exp(self.log_std)
        dist = torch.distributions.Normal(mean, std)
        actions_log_prob = dist.log_prob(actions).sum(dim=-1)
        entropy = dist.entropy().sum(dim=-1)
        value = self.critic(observations).squeeze(-1)
        return actions_log_prob, value, entropy

def get_activation(act_name):
    """Map an activation name to its module."""
    if act_name == "elu":
        return nn.ELU()
    elif act_name == "relu":
        return nn.ReLU()
    elif act_name == "tanh":
        return nn.Tanh()
    elif act_name == "selu":
        return nn.SELU()
    else:
        raise ValueError(f"Unknown activation: {act_name}")
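A quick shape check of the act/evaluate interface is a useful habit before wiring a network into the runner. This condensed Gaussian actor-critic has the same structure as above, with smaller hidden sizes for brevity:

```python
import torch
import torch.nn as nn

class TinyActorCritic(nn.Module):
    """Condensed version of the ActorCritic above, for shape checks."""
    def __init__(self, num_obs=8, num_actions=3, hidden=32):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(num_obs, hidden), nn.ELU(),
                                   nn.Linear(hidden, num_actions))
        self.critic = nn.Sequential(nn.Linear(num_obs, hidden), nn.ELU(),
                                    nn.Linear(hidden, 1))
        self.log_std = nn.Parameter(torch.zeros(num_actions))

    def act(self, obs):
        dist = torch.distributions.Normal(self.actor(obs), self.log_std.exp())
        return dist.sample()

    def evaluate(self, obs, actions):
        dist = torch.distributions.Normal(self.actor(obs), self.log_std.exp())
        return (dist.log_prob(actions).sum(-1),   # per-sample log-prob
                self.critic(obs).squeeze(-1),     # per-sample value estimate
                dist.entropy().sum(-1))           # per-sample entropy

ac = TinyActorCritic()
obs = torch.randn(16, 8)              # batch of 16 observations
actions = ac.act(obs)                 # (16, 3)
logp, value, ent = ac.evaluate(obs, actions)
```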
Advanced Features¶
Privileged Training (Teacher-Student)¶
Train with privileged information (ground truth) then distill to student policy.
# Train teacher with privileged info
teacher_runner = OnPolicyRunner(
    env=env,
    train_cfg={
        "policy": {
            "class_name": "ActorCriticPrivileged",
            "num_obs": 48,              # Normal observations
            "num_privileged_obs": 187,  # + privileged info (ground truth)
            "actor_hidden_dims": [512, 256, 128],
            "critic_hidden_dims": [512, 256, 128]
        }
    }
)
teacher_runner.learn(num_learning_iterations=1500)

# Distill to student (deployment policy)
student_runner = OnPolicyRunner(
    env=env,
    train_cfg={
        "policy": {
            "class_name": "ActorCritic",
            "num_obs": 48,  # Only normal observations
            "actor_hidden_dims": [512, 256, 128]
        }
    }
)

# Train student to mimic teacher
student_runner.learn_from_teacher(
    teacher_policy=teacher_runner.alg.actor_critic,
    num_learning_iterations=500
)
When to use:
- ✓ Sim-to-real transfer (privileged = perfect sim state)
- ✓ Partial observability (privileged = full state)
- ✓ Terrain adaptation (privileged = terrain map)
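The distillation step above is shown schematically; at its core, teacher-student training is a supervised regression of student actions onto teacher actions. A minimal, self-contained sketch (plain MLPs stand in for the two policies, the dimensions are illustrative, and random tensors stand in for environment observations):

```python
import torch
import torch.nn as nn

NUM_OBS, NUM_PRIV, NUM_ACT = 48, 187, 12  # illustrative dimensions

teacher = nn.Sequential(nn.Linear(NUM_OBS + NUM_PRIV, 64), nn.ELU(),
                        nn.Linear(64, NUM_ACT))   # sees privileged state
student = nn.Sequential(nn.Linear(NUM_OBS, 64), nn.ELU(),
                        nn.Linear(64, NUM_ACT))   # deployment observations only

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(100):
    obs = torch.randn(256, NUM_OBS)    # stand-in for env observations
    priv = torch.randn(256, NUM_PRIV)  # privileged sim state (teacher only)
    with torch.no_grad():
        target = teacher(torch.cat([obs, priv], dim=-1))
    loss = nn.functional.mse_loss(student(obs), target)  # behavior cloning loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice the student is usually trained on rollouts it generates itself (DAgger-style) rather than on random states, so that it sees its own visitation distribution.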
Curriculum Learning¶
Progressively increase task difficulty.
from rsl_rl.env import VecEnv
class CurriculumEnv(VecEnv):
    """Environment with curriculum learning."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.curriculum_level = 0
        self.success_threshold = 0.8

    def update_curriculum(self, success_rate):
        """Update curriculum based on performance."""
        if success_rate > self.success_threshold:
            self.curriculum_level += 1
            print(f"Curriculum advanced to level {self.curriculum_level}")
            # Increase difficulty
            if self.curriculum_level == 1:
                # Add obstacles
                self.spawn_obstacles(density=0.1)
            elif self.curriculum_level == 2:
                # Increase obstacle density
                self.spawn_obstacles(density=0.3)
            elif self.curriculum_level == 3:
                # Add terrain roughness
                self.enable_rough_terrain(roughness=0.1)
            # etc.

# Use in training loop
runner = OnPolicyRunner(env=curriculum_env, train_cfg=cfg)
for iteration in range(max_iterations):
    runner.learn(num_learning_iterations=100)
    # Evaluate
    success_rate = runner.evaluate(num_episodes=100)
    # Update curriculum
    curriculum_env.update_curriculum(success_rate)
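The gating logic can also be isolated into a small, environment-independent helper that is easy to unit-test; a sketch (the threshold and level cap are illustrative):

```python
class CurriculumTracker:
    """Advance difficulty only when success stays above a threshold."""

    def __init__(self, success_threshold=0.8, max_level=3):
        self.level = 0
        self.success_threshold = success_threshold
        self.max_level = max_level

    def update(self, success_rate: float) -> bool:
        """Return True if the curriculum level was advanced."""
        if success_rate > self.success_threshold and self.level < self.max_level:
            self.level += 1
            return True
        return False

tracker = CurriculumTracker()
tracker.update(0.9)   # advances to level 1
tracker.update(0.5)   # below threshold: stays at level 1
```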
Domain Randomization¶
RSL-RL provides utilities for domain randomization:
from rsl_rl.utils import RandomizationCfg
# Configure randomization
randomization_cfg = RandomizationCfg(
    # Physics randomization
    friction_range=[0.5, 1.25],
    restitution_range=[0.0, 0.5],
    mass_range=[0.9, 1.1],  # ±10%
    # Motor randomization
    motor_strength_range=[0.9, 1.1],
    motor_offset_range=[-0.02, 0.02],
    # Sensor randomization
    imu_noise_std=0.01,
    joint_pos_noise_std=0.01,
    # Visual randomization (if using vision)
    light_randomization=True,
    texture_randomization=True,
    # External forces
    push_force_range=[0, 50],  # Random pushes
    push_interval_s=5.0
)

# Apply to environment
env.set_randomization(randomization_cfg)
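Under the hood, domain randomization amounts to resampling physical parameters independently per environment at each reset, plus additive sensor noise every step. A minimal sketch of that sampling (the ranges mirror the config above; the joint count is illustrative):

```python
import torch

num_envs = 4096
friction_range = (0.5, 1.25)
mass_scale_range = (0.9, 1.1)  # ±10% around nominal mass

def uniform(lo, hi, n):
    """Sample n values uniformly from [lo, hi)."""
    return lo + (hi - lo) * torch.rand(n)

# One independent draw per parallel environment (done at each reset)
friction = uniform(*friction_range, num_envs)
mass_scale = uniform(*mass_scale_range, num_envs)

# Additive Gaussian sensor noise, applied every simulation step
joint_pos_noise = 0.01 * torch.randn(num_envs, 12)  # 12 joints (illustrative)
```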
Isaac Lab Integration¶
Complete Training Script¶
"""
train_anymal_velocity.py

Train the ANYmal quadruped for velocity tracking.
"""
import isaaclab  # Import first to initialize Isaac Sim
from isaaclab.envs import ManagerBasedRLEnv
from rsl_rl.runners import OnPolicyRunner
import torch

def main():
    # Environment configuration
    env_cfg = {
        "scene": {
            "robot": "anymal_d",
            "terrain": "flat",
            "num_envs": 4096
        },
        "observations": {
            "policy": [
                "base_lin_vel",
                "base_ang_vel",
                "projected_gravity",
                "commands",
                "joint_pos",
                "joint_vel",
                "actions"
            ]
        },
        "actions": {
            "joint_positions": {
                "class": "JointPositionAction",
                "asset_name": "robot",
                "joint_names": [".*"],
                "scale": 0.5,
                "offset": 0.0
            }
        },
        "rewards": {
            "track_lin_vel_xy": {"weight": 1.0, "params": {"std": 0.5}},
            "track_ang_vel_z": {"weight": 0.5, "params": {"std": 0.5}},
            "lin_vel_z": {"weight": -2.0},
            "ang_vel_xy": {"weight": -0.05},
            "orientation": {"weight": -1.0},
            "torques": {"weight": -0.0001},
            "joint_acc": {"weight": -2.5e-7},
            "action_rate": {"weight": -0.01},
            "feet_air_time": {"weight": 1.0, "params": {"threshold": 0.5}}
        },
        "terminations": {
            "base_contact": True,
            "timeout": {"time": 20.0}
        }
    }

    # Create environment
    env = ManagerBasedRLEnv(cfg=env_cfg)

    # Training configuration
    train_cfg = {
        "algorithm": {
            "class_name": "PPO",
            "value_loss_coef": 1.0,
            "use_clipped_value_loss": True,
            "clip_param": 0.2,
            "entropy_coef": 0.01,
            "num_learning_epochs": 5,
            "num_mini_batches": 4,
            "learning_rate": 1e-3,
            "schedule": "adaptive",
            "gamma": 0.99,
            "lam": 0.95,
            "desired_kl": 0.01,
            "max_grad_norm": 1.0
        },
        "policy": {
            "class_name": "ActorCritic",
            "init_noise_std": 1.0,
            "actor_hidden_dims": [512, 256, 128],
            "critic_hidden_dims": [512, 256, 128],
            "activation": "elu"
        },
        "runner": {
            "algorithm_class_name": "PPO",
            "max_iterations": 1500,
            "num_steps_per_env": 24,
            "save_interval": 50,
            "experiment_name": "anymal_velocity",
            "run_name": "",
            "logger": "wandb",
            "neptune_project": "isaaclab",
            "wandb_project": "isaaclab",
            "resume": False,
            "load_run": -1,
            "checkpoint": -1
        }
    }

    # Create runner
    runner = OnPolicyRunner(env=env, train_cfg=train_cfg, log_dir="logs", device="cuda:0")

    # Train
    runner.learn(
        num_learning_iterations=train_cfg["runner"]["max_iterations"],
        init_at_random_ep_len=True
    )

    # Close environment
    env.close()

if __name__ == "__main__":
    main()
Running Training¶
# Train
python train_anymal_velocity.py
# Train with specific device
CUDA_VISIBLE_DEVICES=0 python train_anymal_velocity.py
# Resume from checkpoint
python train_anymal_velocity.py --resume --load_run 50
# Evaluate trained policy
python train_anymal_velocity.py --play --checkpoint logs/anymal_velocity/model_1500.pt
Tips & Best Practices¶
Hyperparameter Tuning¶
For legged locomotion:
locomotion_cfg = {
    "num_steps_per_env": 24,  # Short rollouts for fast feedback
    "gamma": 0.99,
    "lam": 0.95,
    "learning_rate": 1e-3,    # Can go higher with a large batch
    "clip_param": 0.2,
    "num_learning_epochs": 5,
    "num_mini_batches": 4,
    "entropy_coef": 0.01      # Encourage exploration
}
For manipulation:
manipulation_cfg = {
    "num_steps_per_env": 32,  # Longer rollouts
    "gamma": 0.995,           # Higher discount for precision
    "lam": 0.97,
    "learning_rate": 3e-4,    # Lower LR for stability
    "clip_param": 0.15,       # Smaller clip range
    "num_learning_epochs": 8,
    "num_mini_batches": 8,
    "entropy_coef": 0.001     # Less exploration
}
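These settings determine the effective batch size, which is worth computing before launching since memory use and gradient noise both scale with it:

```python
num_envs = 4096
num_steps_per_env = 24
num_mini_batches = 4

batch_size = num_envs * num_steps_per_env          # transitions collected per iteration
mini_batch_size = batch_size // num_mini_batches   # samples per gradient step

print(batch_size, mini_batch_size)  # 98304 transitions, 24576 per mini-batch
```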
Debugging Tips¶
# 1. Check observation/action spaces
print(f"Obs space: {env.observation_space}")
print(f"Action space: {env.action_space}")

# 2. Visualize the first environment
env.render()

# 3. Check the reward distribution under random actions
rewards = []
for step in range(100):
    actions = env.action_space.sample()
    obs, rew, done, info = env.step(actions)
    rewards.append(rew[0].item())

import matplotlib.pyplot as plt
plt.hist(rewards, bins=50)
plt.xlabel("Reward")
plt.show()

# 4. Monitor GPU usage (run in a separate terminal):
#    nvidia-smi -l 1

# 5. Profile training
import torch.profiler
with torch.profiler.profile() as prof:
    runner.learn(num_learning_iterations=10)
print(prof.key_averages().table())
Common Issues¶
Problem: OOM (Out of Memory)
# Solution: Reduce num_envs or batch size
env_cfg["scene"]["num_envs"] = 2048 # Instead of 4096
train_cfg["algorithm"]["num_mini_batches"] = 8 # More batches = smaller batches
Problem: Training unstable
# Solution: Reduce learning rate, add gradient clipping
train_cfg["algorithm"]["learning_rate"] = 5e-4
train_cfg["algorithm"]["max_grad_norm"] = 0.5
Problem: Policy not improving
# Solution: Check reward scale, increase entropy
# Scale rewards to be roughly [-1, 1]
reward_scale = 0.1
env_cfg["rewards"]["track_lin_vel_xy"]["weight"] *= reward_scale
# Increase exploration
train_cfg["algorithm"]["entropy_coef"] = 0.05
References¶
Official Resources¶
- RSL-RL GitHub: https://github.com/leggedrobotics/rsl_rl
- Isaac Lab: https://isaac-sim.github.io/IsaacLab/
- ETH RSL: https://rsl.ethz.ch/
Papers¶
- Learning to Walk: Lee et al., "Learning Quadrupedal Locomotion over Challenging Terrain", Science Robotics 2020
- Privileged Learning: Kumar et al., "RMA: Rapid Motor Adaptation for Legged Robots", RSS 2021
- Isaac Gym: Makoviychuk et al., "Isaac Gym: High Performance GPU-Based Physics Simulation", NeurIPS 2021
Tutorials¶
- Isaac Lab Tutorials: https://isaac-sim.github.io/IsaacLab/source/tutorials/index.html
- Legged Gym: https://github.com/leggedrobotics/legged_gym (companion environment suite built on RSL-RL)
Next Steps¶
- RL Games - Alternative fast RL library
- Stable-Baselines3 - More general-purpose RL
- Isaac Lab Advanced - Deep dive into Isaac Lab RL