Introduction to Imitation Learning¶
What is Imitation Learning?¶
Imitation Learning (IL) is a paradigm where agents learn to perform tasks by observing and mimicking expert behavior.
Motivation¶
The Reward Specification Problem¶
Defining good reward functions for RL is hard:
```python
# Too sparse
reward = 1 if task_complete else 0

# Too dense (might encourage shortcuts)
reward = -distance_to_goal + action_smoothness - collision_penalty + ...
```
IL sidesteps this by learning directly from demonstrations.
Formal Framework¶
Problem Setup¶
Given:

- Expert policy \(\pi^*\) (unknown)
- Demonstrations \(\mathcal{D} = \{(s_1, a_1), (s_2, a_2), \ldots, (s_N, a_N)\}\) sampled from \(\pi^*\)

Goal:

- Learn a policy \(\hat{\pi}\) that mimics \(\pi^*\)
Objective¶
Minimize the expected cost of the learned policy under the expert's state distribution:

\[
\hat{\pi} = \arg\min_{\pi} \; \mathbb{E}_{s \sim d^{\pi^*}} \left[ c\big(\pi(s), \pi^*(s)\big) \right]
\]

where \(d^{\pi^*}\) is the state distribution under the expert policy, and \(c\) is a cost function (e.g., a 0-1 loss on action mismatch).
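For intuition, the objective can be estimated numerically. Below is a minimal sketch with discrete states and actions, a 0-1 cost, and made-up policies and state distribution (all values are illustrative assumptions, not from the text):

```python
import numpy as np

# Expert and learned actions per state (assumed toy values)
expert_policy = np.array([0, 1, 0, 1, 1])
learned_policy = np.array([0, 1, 1, 1, 0])

# d^{pi*}: state visitation distribution under the expert (assumed)
d_expert = np.array([0.4, 0.3, 0.1, 0.1, 0.1])

# E_{s ~ d^{pi*}}[ c(pi_hat(s), pi*(s)) ] with a 0-1 cost
cost = (learned_policy != expert_policy).astype(float)
objective = float(d_expert @ cost)
print(objective)  # mass on the two mismatched states: 0.1 + 0.1 = 0.2
```

Note that mistakes are weighted by how often the expert visits a state, so errors in rarely visited states contribute little to the objective.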
The Distribution Shift Problem¶
Compounding Errors¶
Behavioral cloning suffers from distribution shift:
1. The policy is trained on the expert's state distribution
2. At deployment, the policy makes small mistakes
3. It reaches states never seen in training
4. It makes larger mistakes there (no training data)
5. Errors compound over time
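The failure cascade above can be illustrated with a toy simulation: a cloned policy that errs by a small, uncorrected amount each step drifts linearly away from the expert's trajectory over the horizon (all numbers below are hypothetical):

```python
# Hypothetical 1D illustration: the expert stays at position 0; the cloned
# policy adds a small biased error every step and never gets corrected.
T = 100
per_step_error = 0.05  # small per-step mistake (assumed)

position = 0.0
drift = []
for t in range(T):
    # Off-distribution states are never corrected, so deviation accumulates.
    position += per_step_error
    drift.append(abs(position))

print(drift[0], drift[-1])  # tiny error at t=0, large drift by t=T
```

With interactive corrections (as in DAgger below) the expert would relabel the off-distribution states, breaking this accumulation.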
graph TD
A[Expert States] --> B[Train Policy]
B --> C[Deploy Policy]
C --> D[Small Error]
D --> E[Novel State]
E --> F[Large Error]
F --> G[Catastrophic Failure]
Mathematical Analysis¶
Error bound for behavioral cloning (Ross & Bagnell, 2010): if the learned policy errs with probability \(\epsilon\) per step under the expert's state distribution, then

\[
J(\hat{\pi}) \leq J(\pi^*) + \epsilon T^2
\]

- \(\epsilon\): per-step error rate
- \(T\): episode length (horizon)
- The bound is quadratic in the horizon!
IL Algorithm Families¶
1. Behavioral Cloning¶
Approach: Supervised learning
Algorithm:

```
Input: Expert demonstrations D = {(s, a)}
Output: Policy π

1. Collect dataset D from expert
2. Train policy π(a|s) to maximize likelihood:
   max_π Σ log π(a|s) for (s, a) in D
```
When to use:

- Demonstrations are on-policy (cover the deployment distribution)
- Short-horizon tasks
- Expert is consistent
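For discrete states and actions, the maximum-likelihood step above reduces to an empirical count per state. Here is a minimal tabular sketch (the demonstration data is made up for illustration):

```python
import numpy as np

def behavioral_cloning(demos, n_states, n_actions):
    """Tabular BC: pi(a|s) proportional to empirical expert action counts."""
    counts = np.zeros((n_states, n_actions))
    for s, a in demos:
        counts[s, a] += 1
    # Fall back to a uniform policy for states never seen in the demos
    counts[counts.sum(axis=1) == 0] = 1.0
    return counts / counts.sum(axis=1, keepdims=True)

# Toy demonstrations: (state, action) pairs from the "expert"
demos = [(0, 1), (0, 1), (0, 0), (1, 2), (1, 2)]
pi = behavioral_cloning(demos, n_states=3, n_actions=3)
print(pi[0])  # state 0: expert chose action 1 twice, action 0 once
```

With continuous states one would replace the table with a function approximator (e.g., a neural network trained with cross-entropy), but the objective is the same likelihood maximization.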
2. Interactive IL (DAgger)¶
Approach: Query expert during training
Algorithm:

```
Initialize: Dataset D = D_expert
for iteration in 1..N:
    1. Train policy π on D
    2. Execute π, collect visited states S
    3. Query expert for actions A on states S
    4. Add (S, A) to D
    5. Retrain
```
When to use:

- Expert available during training
- Need to handle distribution shift
- Can afford interactive learning
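The loop above can be sketched on a toy chain environment. The "expert" here is a known lookup rule and the learner is tabular, both assumptions for illustration; the key point is that the expert relabels the states the *learner* actually visits:

```python
import numpy as np

n_states, n_actions, horizon = 6, 2, 5
expert = lambda s: 1 if s < n_states - 1 else 0  # assumed expert rule

def train(dataset):
    counts = np.ones((n_states, n_actions))  # Laplace smoothing
    for s, a in dataset:
        counts[s, a] += 1
    return counts.argmax(axis=1)  # greedy tabular policy

def rollout(policy):
    s, visited = 0, []
    for _ in range(horizon):
        visited.append(s)
        s = min(s + 1, n_states - 1) if policy[s] == 1 else s
    return visited

dataset = [(s, expert(s)) for s in range(2)]      # initial expert demos
for _ in range(3):
    policy = train(dataset)                      # 1. train on D
    states = rollout(policy)                     # 2. execute learner
    dataset += [(s, expert(s)) for s in states]  # 3-4. expert relabels
policy = train(dataset)
print(policy)  # matches the expert rule on every reachable state
```

Because each iteration adds labels exactly where the current policy goes, the dataset grows to cover the learner's own state distribution rather than only the expert's.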
3. Inverse Reinforcement Learning¶
Approach: Learn reward, then optimize it
Algorithm:

```
Input: Expert demonstrations D
1. Learn reward R̂ under which the expert appears (near-)optimal
2. Run RL to find policy π maximizing R̂
3. (Optionally) iterate until π matches the expert
```
When to use:

- Want an interpretable reward
- Need generalization to new scenarios
- Have compute for the RL step
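A toy sketch of the learn-reward-then-optimize loop, using feature-expectation matching with a linear reward \(R(s) = w \cdot \phi(s)\). The chain MDP, features, and expert below are all assumptions for illustration, and the inner RL step is brute-forced:

```python
import numpy as np

n_states, horizon = 3, 4
phi = np.eye(n_states)  # one-hot state features

def rollout_features(policy):
    """Feature counts of a rollout from state 0 (action 1 = move right)."""
    s, mu = 0, np.zeros(n_states)
    for _ in range(horizon):
        mu += phi[s]
        s = min(s + 1, n_states - 1) if policy[s] == 1 else s
    return mu

def best_policy(w):
    """Inner RL step: brute-force the best deterministic tabular policy."""
    policies = [np.array([a, b, c])
                for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    return max(policies, key=lambda p: w @ rollout_features(p))

mu_expert = rollout_features(np.array([1, 1, 0]))  # expert rushes to state 2

w = np.zeros(n_states)
for _ in range(10):
    pi = best_policy(w)                            # optimize current reward
    w += 0.1 * (mu_expert - rollout_features(pi))  # nudge reward toward expert
print(best_policy(w))  # recovers the expert's policy [1, 1, 0]
```

The learned \(w\) assigns low reward to states the expert avoids and high reward to the state it seeks, which is what makes the recovered reward transferable to new scenarios.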
Theoretical Foundations¶
Performance Bounds¶
DAgger Theorem (Ross et al., 2011):

With DAgger, the error grows only linearly in the horizon:

\[
J(\hat{\pi}) \leq J(\pi^*) + O(\epsilon T)
\]

Much better than BC's quadratic bound!
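To see the gap concretely, here is the arithmetic for one illustrative setting (the numbers \(\epsilon = 0.01\), \(T = 100\) are assumptions, not from the text):

```python
# Extra expected cost over the expert under each bound (constants dropped)
eps, T = 0.01, 100
bc_bound = eps * T**2    # behavioral cloning: O(eps * T^2)
dagger_bound = eps * T   # DAgger: O(eps * T)
print(bc_bound, dagger_bound)  # 100.0 vs 1.0
```

With a 1% per-step error rate, the BC bound permits a hundredfold larger regret than DAgger's at this horizon, and the gap widens linearly as \(T\) grows.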
Sample Complexity¶
Number of demonstrations needed scales with:
- State space dimensionality
- Action space complexity
- Task horizon
- Desired performance level
Practical Considerations¶
Multi-Modal Actions¶
Demonstrations may show multiple valid strategies:
```
# Example: two valid grasps for the same object
demonstration 1: approach_from_top()
demonstration 2: approach_from_side()

# BC will average these actions (bad!)
learned action:  approach_from_middle()  # Fails!
```
Solutions:

- Mixture of Experts
- Action discretization
- Diffusion policies
- Energy-based models
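The averaging failure is easy to reproduce in one dimension. Below, two valid grasp angles (toy values) straddle the object; a mean-squared-error regressor lands between them, while a mode-seeking choice (the simplest stand-in for the mixture/discretization solutions above) stays on a valid mode:

```python
import numpy as np

# Two demonstrated modes (assumed toy action values)
top_grasps = np.full(50, +1.0)    # approach from the top
side_grasps = np.full(50, -1.0)   # approach from the side
actions = np.concatenate([top_grasps, side_grasps])

# MSE-trained BC regresses to the mean of the modes:
mse_action = actions.mean()       # 0.0 -> "approach from the middle", fails

# A mode-seeking policy (e.g., via discretization) keeps modes separate:
modes, counts = np.unique(actions, return_counts=True)
mode_action = modes[counts.argmax()]  # one of the two valid grasps
print(mse_action, mode_action)
```

Diffusion and energy-based policies address the same problem by representing the full multi-modal action distribution rather than a single point estimate.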
Suboptimal Demonstrations¶
Real demonstrations are rarely perfect:
```python
def train_with_confidence_weights(demonstrations, policy):
    """Handle suboptimal data by down-weighting low-quality demos."""
    total_loss = 0.0
    for demo in demonstrations:
        # Weight by estimated demonstration quality
        weight = estimate_optimality(demo)
        total_loss += weight * bc_loss(policy, demo)
    return total_loss
```
Modern Extensions¶
Generative Behavioral Cloning¶
Use generative models for multi-modal distributions:
- VAE: Learn latent space of actions
- Diffusion: Iteratively denoise to action
- Flow Matching: Transform noise to action
One-Shot Imitation¶
Learn from single demonstration:
- Meta-learning approaches
- Context-conditioned policies
- Vision-language models
Learning from Videos¶
Learn without action labels:
- Video prediction models
- Inverse models
- Time-contrastive learning
Next Steps¶
- Methods - Specific IL algorithms
- Data Collection - Gathering demonstrations
- Training - Training IL policies