Introduction to Imitation Learning¶
What is Imitation Learning?¶
Imitation Learning (IL) is a paradigm where agents learn to perform tasks by observing and mimicking expert behavior.
Motivation¶
The Reward Specification Problem¶
Defining good reward functions for RL is hard:
```python
# Too sparse
reward = 1 if task_complete else 0

# Too dense (might encourage shortcuts)
reward = -distance_to_goal + action_smoothness - collision_penalty + ...
```
IL sidesteps this by learning directly from demonstrations.
Formal Framework¶
Problem Setup¶
Given:

- Expert policy \(\pi^*\) (unknown)
- Demonstrations \(\mathcal{D} = \{(s_1, a_1), (s_2, a_2), \ldots, (s_N, a_N)\}\) sampled from \(\pi^*\)

Goal:

- Learn a policy \(\hat{\pi}\) that mimics \(\pi^*\)
Objective¶
Minimize the expected cost of the learned policy under the expert's state distribution:

\[
\hat{\pi} = \arg\min_{\pi} \; \mathbb{E}_{s \sim d^{\pi^*}} \left[ c\big(\pi(s), \pi^*(s)\big) \right]
\]

where \(d^{\pi^*}\) is the state distribution under the expert policy, and \(c\) is a cost function (e.g., a 0-1 loss on action mismatch).
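For intuition, the objective can be estimated numerically. Below is a minimal sketch with discrete states and actions, a 0-1 cost, and made-up policies and state distribution (all values are illustrative assumptions, not from the text):

```python
import numpy as np

# Expert and learned actions per state (assumed toy values)
expert_policy = np.array([0, 1, 0, 1, 1])
learned_policy = np.array([0, 1, 1, 1, 0])

# d^{pi*}: state visitation distribution under the expert (assumed)
d_expert = np.array([0.4, 0.3, 0.1, 0.1, 0.1])

# E_{s ~ d^{pi*}}[ c(pi_hat(s), pi*(s)) ] with a 0-1 cost
cost = (learned_policy != expert_policy).astype(float)
objective = float(d_expert @ cost)
print(objective)  # mass on the two mismatched states: 0.1 + 0.1 = 0.2
```

Note that mistakes are weighted by how often the expert visits a state, so errors in rarely visited states contribute little to the objective.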
The Distribution Shift Problem¶
Compounding Errors¶
Behavioral cloning suffers from distribution shift:
1. The policy is trained on the expert's state distribution
2. At deployment, the policy makes small mistakes
3. It reaches states never seen in training
4. It makes larger mistakes there (no training data)
5. Errors compound over time
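The failure cascade above can be illustrated with a toy simulation: a cloned policy that errs by a small, uncorrected amount each step drifts linearly away from the expert's trajectory over the horizon (all numbers below are hypothetical):

```python
# Hypothetical 1D illustration: the expert stays at position 0; the cloned
# policy adds a small biased error every step and never gets corrected.
T = 100
per_step_error = 0.05  # small per-step mistake (assumed)

position = 0.0
drift = []
for t in range(T):
    # Off-distribution states are never corrected, so deviation accumulates.
    position += per_step_error
    drift.append(abs(position))

print(drift[0], drift[-1])  # tiny error at t=0, large drift by t=T
```

With interactive corrections (as in DAgger below) the expert would relabel the off-distribution states, breaking this accumulation.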
graph TD
A[Expert States] --> B[Train Policy]
B --> C[Deploy Policy]
C --> D[Small Error]
D --> E[Novel State]
E --> F[Large Error]
F --> G[Catastrophic Failure]
Mathematical Analysis¶
Error bound for behavioral cloning (Ross & Bagnell, 2010): if the learned policy errs with probability \(\epsilon\) per step under the expert's state distribution, then

\[
J(\hat{\pi}) \leq J(\pi^*) + \epsilon T^2
\]

- \(\epsilon\): per-step error rate
- \(T\): episode length (horizon)
- The bound is quadratic in the horizon!
IL Algorithm Families¶
1. Behavioral Cloning¶
Approach: Supervised learning
Algorithm:

```
Input: Expert demonstrations D = {(s, a)}
Output: Policy π

1. Collect dataset D from expert
2. Train policy π(a|s) to maximize likelihood:
   max_π Σ log π(a|s) for (s, a) in D
```
When to use:

- Demonstrations are on-policy (cover the deployment distribution)
- Short-horizon tasks
- Expert is consistent
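For discrete states and actions, the maximum-likelihood step above reduces to an empirical count per state. Here is a minimal tabular sketch (the demonstration data is made up for illustration):

```python
import numpy as np

def behavioral_cloning(demos, n_states, n_actions):
    """Tabular BC: pi(a|s) proportional to empirical expert action counts."""
    counts = np.zeros((n_states, n_actions))
    for s, a in demos:
        counts[s, a] += 1
    # Fall back to a uniform policy for states never seen in the demos
    counts[counts.sum(axis=1) == 0] = 1.0
    return counts / counts.sum(axis=1, keepdims=True)

# Toy demonstrations: (state, action) pairs from the "expert"
demos = [(0, 1), (0, 1), (0, 0), (1, 2), (1, 2)]
pi = behavioral_cloning(demos, n_states=3, n_actions=3)
print(pi[0])  # state 0: expert chose action 1 twice, action 0 once
```

With continuous states one would replace the table with a function approximator (e.g., a neural network trained with cross-entropy), but the objective is the same likelihood maximization.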
2. Interactive IL (DAgger)¶
Approach: Query expert during training
Algorithm:

```
Initialize: Dataset D = D_expert
for iteration in 1..N:
    1. Train policy π on D
    2. Execute π, collect visited states S
    3. Query expert for actions A on states S
    4. Add (S, A) to D
    5. Retrain
```
When to use:

- Expert available during training
- Need to handle distribution shift
- Can afford interactive learning
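The loop above can be sketched on a toy chain environment. The "expert" here is a known lookup rule and the learner is tabular, both assumptions for illustration; the key point is that the expert relabels the states the *learner* actually visits:

```python
import numpy as np

n_states, n_actions, horizon = 6, 2, 5
expert = lambda s: 1 if s < n_states - 1 else 0  # assumed expert rule

def train(dataset):
    counts = np.ones((n_states, n_actions))  # Laplace smoothing
    for s, a in dataset:
        counts[s, a] += 1
    return counts.argmax(axis=1)  # greedy tabular policy

def rollout(policy):
    s, visited = 0, []
    for _ in range(horizon):
        visited.append(s)
        s = min(s + 1, n_states - 1) if policy[s] == 1 else s
    return visited

dataset = [(s, expert(s)) for s in range(2)]      # initial expert demos
for _ in range(3):
    policy = train(dataset)                      # 1. train on D
    states = rollout(policy)                     # 2. execute learner
    dataset += [(s, expert(s)) for s in states]  # 3-4. expert relabels
policy = train(dataset)
print(policy)  # matches the expert rule on every reachable state
```

Because each iteration adds labels exactly where the current policy goes, the dataset grows to cover the learner's own state distribution rather than only the expert's.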
3. Inverse Reinforcement Learning¶
Approach: Learn reward, then optimize it
Algorithm:

```
Input: Expert demonstrations D
1. Learn reward R̂ under which the expert appears (near-)optimal
2. Run RL to find policy π maximizing R̂
3. (Optionally) iterate until π matches the expert
```
When to use:

- Want an interpretable reward
- Need generalization to new scenarios
- Have compute for the RL step
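A toy sketch of the learn-reward-then-optimize loop, using feature-expectation matching with a linear reward \(R(s) = w \cdot \phi(s)\). The chain MDP, features, and expert below are all assumptions for illustration, and the inner RL step is brute-forced:

```python
import numpy as np

n_states, horizon = 3, 4
phi = np.eye(n_states)  # one-hot state features

def rollout_features(policy):
    """Feature counts of a rollout from state 0 (action 1 = move right)."""
    s, mu = 0, np.zeros(n_states)
    for _ in range(horizon):
        mu += phi[s]
        s = min(s + 1, n_states - 1) if policy[s] == 1 else s
    return mu

def best_policy(w):
    """Inner RL step: brute-force the best deterministic tabular policy."""
    policies = [np.array([a, b, c])
                for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    return max(policies, key=lambda p: w @ rollout_features(p))

mu_expert = rollout_features(np.array([1, 1, 0]))  # expert rushes to state 2

w = np.zeros(n_states)
for _ in range(10):
    pi = best_policy(w)                            # optimize current reward
    w += 0.1 * (mu_expert - rollout_features(pi))  # nudge reward toward expert
print(best_policy(w))  # recovers the expert's policy [1, 1, 0]
```

The learned \(w\) assigns low reward to states the expert avoids and high reward to the state it seeks, which is what makes the recovered reward transferable to new scenarios.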
Theoretical Foundations¶
Performance Bounds¶
DAgger Theorem (Ross et al., 2011):

With DAgger, the error grows only linearly in the horizon:

\[
J(\hat{\pi}) \leq J(\pi^*) + O(\epsilon T)
\]

Much better than BC's quadratic bound!
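To see the gap concretely, here is the arithmetic for one illustrative setting (the numbers \(\epsilon = 0.01\), \(T = 100\) are assumptions, not from the text):

```python
# Extra expected cost over the expert under each bound (constants dropped)
eps, T = 0.01, 100
bc_bound = eps * T**2    # behavioral cloning: O(eps * T^2)
dagger_bound = eps * T   # DAgger: O(eps * T)
print(bc_bound, dagger_bound)  # 100.0 vs 1.0
```

With a 1% per-step error rate, the BC bound permits a hundredfold larger regret than DAgger's at this horizon, and the gap widens linearly as \(T\) grows.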
Sample Complexity¶
Number of demonstrations needed scales with:
- State space dimensionality
- Action space complexity
- Task horizon
- Desired performance level
Practical Considerations¶
Multi-Modal Actions¶
Demonstrations may show multiple valid strategies:
```
# Example: two valid grasps for the same object
demonstration 1: approach_from_top()
demonstration 2: approach_from_side()

# BC will average these actions (bad!)
learned action:  approach_from_middle()  # Fails!
```
Solutions:

- Mixture of Experts
- Action discretization
- Diffusion policies
- Energy-based models
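The averaging failure is easy to reproduce in one dimension. Below, two valid grasp angles (toy values) straddle the object; a mean-squared-error regressor lands between them, while a mode-seeking choice (the simplest stand-in for the mixture/discretization solutions above) stays on a valid mode:

```python
import numpy as np

# Two demonstrated modes (assumed toy action values)
top_grasps = np.full(50, +1.0)    # approach from the top
side_grasps = np.full(50, -1.0)   # approach from the side
actions = np.concatenate([top_grasps, side_grasps])

# MSE-trained BC regresses to the mean of the modes:
mse_action = actions.mean()       # 0.0 -> "approach from the middle", fails

# A mode-seeking policy (e.g., via discretization) keeps modes separate:
modes, counts = np.unique(actions, return_counts=True)
mode_action = modes[counts.argmax()]  # one of the two valid grasps
print(mse_action, mode_action)
```

Diffusion and energy-based policies address the same problem by representing the full multi-modal action distribution rather than a single point estimate.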
Suboptimal Demonstrations¶
Real demonstrations are rarely perfect:
```python
def train_with_confidence_weights(demonstrations, policy):
    """Handle suboptimal data by down-weighting low-quality demos."""
    total_loss = 0.0
    for demo in demonstrations:
        # Weight by estimated demonstration quality
        weight = estimate_optimality(demo)
        total_loss += weight * bc_loss(policy, demo)
    return total_loss
```
Modern Extensions¶
Generative Behavioral Cloning¶
Use generative models for multi-modal distributions:
- VAE: Learn latent space of actions
- Diffusion: Iteratively denoise to action
- Flow Matching: Transform noise to action
One-Shot Imitation¶
Learn from single demonstration:
- Meta-learning approaches
- Context-conditioned policies
- Vision-language models
Learning from Videos¶
Learn without action labels:
- Video prediction models
- Inverse models
- Time-contrastive learning
Next Steps¶
- Methods - Specific IL algorithms
- Data Collection - Gathering demonstrations
- Training - Training IL policies