Imitation Learning Theory¶
Understanding the theoretical foundations of imitation learning helps make informed decisions about algorithms and deployment.
Problem Formulation¶
Standard RL Setting¶
Markov Decision Process (MDP): \(\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma)\)
- \(\mathcal{S}\): State space
- \(\mathcal{A}\): Action space
- \(\mathcal{T}(s' | s, a)\): Transition dynamics
- \(R(s, a)\): Reward function
- \(\gamma\): Discount factor
Goal: Find policy \(\pi^*\) that maximizes expected return:

$$ \pi^* = \arg\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right] $$
Imitation Learning Setting¶
Given: Expert demonstrations \(\mathcal{D} = \{(s_i, a_i)\}_{i=1}^N\) from expert policy \(\pi^E\)
Goal: Learn policy \(\pi_\theta\) that imitates expert behavior
Key difference: No explicit reward function!
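The code examples on this page assume demonstrations are stored as a flat list of \((s, a)\) pairs collected from a Gym-style environment (old four-tuple `step` API). Below is a minimal sketch of the collection helper used by the DAgger example further down; its exact signature is an assumption:

```python
def collect_expert_demonstrations(expert_policy, env, num_demos=50):
    """Roll out the expert and record (state, action) pairs."""
    demonstrations = []
    for _ in range(num_demos):
        state = env.reset()
        done = False
        while not done:
            action = expert_policy(state)
            demonstrations.append((state, action))
            state, _, done, _ = env.step(action)  # old Gym 4-tuple API
    return demonstrations
```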
Behavioral Cloning (BC)¶
Supervised Learning Formulation¶
Treat IL as supervised learning:

$$ \min_\theta \; \mathbb{E}_{(s, a^E) \sim \mathcal{D}}\left[\ell(\pi_\theta(s), a^E)\right] $$

Where \(\ell\) is a loss function:

- Continuous actions: MSE loss \(\ell(a', a) = \|a' - a\|^2\)
- Discrete actions: Cross-entropy \(\ell(a', a) = -\log \pi_\theta(a|s)\)
```python
import torch
import torch.nn.functional as F

def behavioral_cloning_loss(policy, demonstrations, continuous=True):
    """
    BC minimizes the distance between policy and expert actions.

    For continuous actions: MSE
    For discrete actions: cross-entropy
    """
    total_loss = 0.0
    for state, expert_action in demonstrations:
        # Predict an action for the demonstrated state
        predicted_action = policy(state)
        if continuous:
            loss = F.mse_loss(predicted_action, expert_action)
        else:
            # predicted_action holds logits; expert_action is a class index
            loss = F.cross_entropy(predicted_action, expert_action)
        total_loss += loss
    return total_loss / len(demonstrations)
```
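A minimal training loop around this loss; the MLP policy and the synthetic demonstrations are placeholders for illustration:

```python
import torch

state_dim, action_dim = 10, 4
policy = torch.nn.Sequential(
    torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, action_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Synthetic (state, action) pairs standing in for real demonstrations
demos = [(torch.randn(state_dim), torch.randn(action_dim)) for _ in range(256)]

for epoch in range(10):
    loss = behavioral_cloning_loss(policy, demos, continuous=True)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```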
Theoretical Guarantee¶
Under some assumptions, BC has bounded performance:

$$ J(\pi^E) - J(\pi_{BC}) \leq C \, T^2 \, \epsilon $$

Where:

- \(J(\pi)\): Expected return of policy \(\pi\)
- \(T\): Horizon length
- \(C\): Constant depending on dynamics
- \(\epsilon = \mathbb{E}_{s \sim d_{\pi^E}}[\text{KL}(\pi^E(\cdot|s) \,\|\, \pi_{BC}(\cdot|s))]\): Average KL divergence
Key insight: Performance gap grows quadratically with horizon \(T\)!
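To make the quadratic dependence concrete, take \(C = 1\) and \(\epsilon = 0.01\) for illustration:

$$ T = 10: \;\; J(\pi^E) - J(\pi_{BC}) \leq 10^2 \cdot 0.01 = 1, \qquad T = 100: \;\; J(\pi^E) - J(\pi_{BC}) \leq 100^2 \cdot 0.01 = 100 $$

A 10× longer horizon loosens the bound by 100×, even though the per-state imitation error is unchanged.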
The Compounding Error Problem¶
Why BC fails for long horizons:
- Small errors accumulate: Even if \(\epsilon\) is small, errors compound over \(T\) steps
- Distribution shift: The policy visits different states than the expert did
```python
# Visualization of compounding errors
import matplotlib.pyplot as plt
import numpy as np

def simulate_compounding_errors(expert_policy, learned_policy, dynamics, s0, T=100):
    """Show how small per-step errors compound into large state divergence.

    dynamics(s, a) -> s' and the initial state s0 are assumed to be provided.
    """
    expert_states = [s0]
    learned_states = [s0]
    for t in range(T):
        # Expert trajectory
        a_expert = expert_policy(expert_states[-1])
        expert_states.append(dynamics(expert_states[-1], a_expert))
        # Learned policy trajectory
        a_learned = learned_policy(learned_states[-1])
        learned_states.append(dynamics(learned_states[-1], a_learned))
    # Plot the distance between the two trajectories over time
    distances = [
        np.linalg.norm(expert_states[t] - learned_states[t])
        for t in range(T + 1)
    ]
    plt.plot(distances)
    plt.xlabel('Timestep')
    plt.ylabel('State Distance')
    plt.title('Compounding Errors in Behavioral Cloning')
    plt.show()
    # The distance typically grows over time -- linearly or even exponentially!
```
Distribution Mismatch¶
Definitions¶
Expert state distribution: $$ d_{\pi^E}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s | \pi^E) $$
Learned policy state distribution: $$ d_{\pi_\theta}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s | \pi_\theta) $$
Problem: BC minimizes loss under \(d_{\pi^E}\), but the policy executes under \(d_{\pi_\theta}\)!

But we care about: $$ \mathbb{E}_{s \sim d_{\pi_\theta}}[\ell(\pi_\theta(s), \pi^E(s))] $$
Covariate Shift¶
When \(d_{\pi^E} \neq d_{\pi_\theta}\), we have covariate shift:
```python
import numpy as np
from scipy.stats import wasserstein_distance

def measure_distribution_shift(expert_policy, learned_policy, env, num_episodes=100):
    """Quantify distribution shift between expert and learned policy."""

    def collect_states(policy):
        states = []
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                states.append(state)
                action = policy(state)
                state, _, done, _ = env.step(action)
        return np.array(states)

    expert_states = collect_states(expert_policy)
    learned_states = collect_states(learned_policy)

    # Project states to 1D for simplicity (a crude summary of each state)
    expert_proj = expert_states.mean(axis=1)
    learned_proj = learned_states.mean(axis=1)

    # Wasserstein distance (earth mover's distance) between the projections
    return wasserstein_distance(expert_proj, learned_proj)
```
DAgger: Addressing Distribution Shift¶
Dataset Aggregation (DAgger) fixes distribution shift by collecting data under the learned policy:
Algorithm¶
1. Initialize \(\mathcal{D} \leftarrow \mathcal{D}_{\text{expert}}\)
2. For \(i = 1, \ldots, N\):
    - Train \(\pi_i\) on \(\mathcal{D}\)
    - Execute \(\pi_i\) to collect states
    - Query the expert for actions at those states
    - Add the \((s, a^E)\) pairs to \(\mathcal{D}\)
3. Return the final policy \(\pi_N\)
```python
def dagger(expert_policy, env, num_iterations=10, samples_per_iteration=1000):
    """
    DAgger algorithm
    Iteratively collect data under the learned policy, label it with the expert.

    Policy, train_behavioral_cloning, and evaluate_policy are assumed helpers.
    """
    # Initialize with expert demonstrations
    dataset = collect_expert_demonstrations(expert_policy, env, num_demos=50)
    learned_policy = Policy()

    for iteration in range(num_iterations):
        print(f"DAgger Iteration {iteration + 1}/{num_iterations}")

        # 1. Train policy on the current (aggregated) dataset
        learned_policy = train_behavioral_cloning(learned_policy, dataset)

        # 2. Collect states by executing the *learned* policy
        states_visited = []
        while len(states_visited) < samples_per_iteration:
            state = env.reset()
            done = False
            while not done and len(states_visited) < samples_per_iteration:
                states_visited.append(state)
                action = learned_policy(state)
                state, _, done, _ = env.step(action)

        # 3. Query the expert for the correct action at each visited state
        expert_labels = [expert_policy(s) for s in states_visited]

        # 4. Aggregate the relabeled data into the dataset
        dataset.extend(zip(states_visited, expert_labels))

        # Evaluate
        success_rate = evaluate_policy(learned_policy, env)
        print(f"  Success rate: {success_rate:.2%}")

    return learned_policy
```
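Note that the original DAgger analysis (Ross et al., 2011) executes a mixture of expert and learner while collecting data, annealing the expert's share \(\beta_i\) toward zero across iterations. A minimal sketch of such a mixed policy:

```python
import random

def mixed_policy(state, expert_policy, learned_policy, beta):
    """With probability beta act like the expert, otherwise use the learner."""
    if random.random() < beta:
        return expert_policy(state)
    return learned_policy(state)
```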
Theoretical Guarantee¶
DAgger achieves a better bound than BC:

$$ J(\pi^E) - J(\pi_{\text{DAgger}}) \leq C \, T \, \epsilon $$

Linear (not quadratic!) in the horizon \(T\).
Inverse Reinforcement Learning (IRL)¶
Problem Formulation¶
Given: Expert demonstrations \(\mathcal{D}\)
Goal: Infer a reward function \(R\) such that the expert is optimal:

$$ \text{find } R \quad \text{such that} \quad \pi^E = \pi^*(R) = \arg\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right] $$

Where \(\pi^*(R)\) is the optimal policy for reward \(R\).
Maximum Entropy IRL¶
Assume the expert maximizes an entropy-regularized objective:

$$ \pi^E = \arg\max_\pi \; \mathbb{E}_\pi\left[\sum_t R(s_t, a_t)\right] + \mathcal{H}(\pi) $$
Where \(\mathcal{H}(\pi) = \mathbb{E}_{\pi}[-\log \pi(a|s)]\) is policy entropy.
Key idea: The expert is noisy-rational; trajectories with higher reward are exponentially more likely, \(P(\tau) \propto \exp\left(\sum_t R(s_t, a_t)\right)\).
```python
import numpy as np

def maximum_entropy_irl(demonstrations, env, state_dim,
                        num_iterations=100, learning_rate=0.01):
    """
    Maximum Entropy IRL
    Paper: Ziebart et al., "Maximum Entropy Inverse Reinforcement Learning", AAAI 2008

    solve_mdp_with_reward and collect_trajectories are assumed helpers.
    """
    # Initialize a linear reward function over state features
    reward_weights = np.random.randn(state_dim)

    for iteration in range(num_iterations):
        # 1. Compute the optimal policy under the current reward
        R = lambda s: np.dot(reward_weights, s)
        optimal_policy = solve_mdp_with_reward(env, R)

        # 2. Compute feature expectations
        # Expert feature expectation
        expert_features = compute_feature_expectation(demonstrations, state_dim)
        # Learned policy feature expectation
        learned_features = compute_feature_expectation(
            collect_trajectories(optimal_policy, env), state_dim
        )

        # 3. Gradient ascent on the demonstration likelihood
        gradient = expert_features - learned_features
        reward_weights += learning_rate * gradient

    return reward_weights

def compute_feature_expectation(trajectories, state_dim):
    """
    Compute expected feature counts:
    μ(π) = E_{τ~π}[∑_t φ(s_t)]
    """
    total_features = np.zeros(state_dim)
    for trajectory in trajectories:
        for state in trajectory:
            total_features += state  # Using the raw state as the feature φ(s)
    return total_features / len(trajectories)
```
IRL → RL Pipeline¶
Once the reward is learned, solve the resulting RL problem:

$$ \pi^* = \arg\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t \hat{R}(s_t, a_t)\right] $$
Advantage: Can transfer reward to new environments
Disadvantage: Computationally expensive (nested optimization)
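Putting the two stages together, a sketch of the full pipeline using the functions defined above (demonstrations, env, and solve_mdp_with_reward are assumed to be in scope):

```python
import numpy as np

# Stage 1: infer the reward from demonstrations (IRL)
reward_weights = maximum_entropy_irl(demonstrations, env, state_dim=10)

# Stage 2: solve the MDP under the learned reward (RL)
learned_reward = lambda s: np.dot(reward_weights, s)
final_policy = solve_mdp_with_reward(env, learned_reward)
```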
Generative Adversarial Imitation Learning (GAIL)¶
Idea: Combine IRL + GAN
- Generator: Policy \(\pi_\theta\) (tries to fool discriminator)
- Discriminator: Distinguishes expert from policy trajectories
Objective¶
With a discriminator \(D(s, a)\) trained to output 1 on expert pairs and 0 on policy pairs, GAIL solves:

$$ \min_\pi \max_D \; \mathbb{E}_{(s,a) \sim \pi^E}[\log D(s, a)] + \mathbb{E}_{(s,a) \sim \pi}[\log(1 - D(s, a))] - \lambda \mathcal{H}(\pi) $$
```python
import torch

class GAIL:
    """
    Generative Adversarial Imitation Learning
    Paper: Ho & Ermon, "Generative Adversarial Imitation Learning", NIPS 2016

    collect_trajectories (returning batches of (s, a) pairs) is an assumed helper.
    """
    def __init__(self, policy, discriminator):
        self.policy = policy
        self.discriminator = discriminator
        self.policy_optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
        self.discriminator_optimizer = torch.optim.Adam(discriminator.parameters(), lr=3e-4)

    def train_step(self, expert_trajectories, env):
        """Single GAIL training step."""
        # 1. Collect trajectories with the current policy
        policy_trajectories = self.collect_trajectories(env)

        # 2. Update the discriminator
        # Expert pairs should get label 1
        expert_loss = -torch.log(self.discriminator(expert_trajectories)).mean()
        # Policy pairs should get label 0
        fake_loss = -torch.log(1 - self.discriminator(policy_trajectories)).mean()
        disc_loss = expert_loss + fake_loss
        self.discriminator_optimizer.zero_grad()
        disc_loss.backward()
        self.discriminator_optimizer.step()

        # 3. Update the policy, using the discriminator output as a reward signal
        rewards = -torch.log(1 - self.discriminator(policy_trajectories))
        # Policy update (heavily simplified; real GAIL uses a policy-gradient
        # method such as TRPO/PPO with these rewards)
        policy_loss = -rewards.mean()
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()

        return {
            'disc_loss': disc_loss.item(),
            'policy_loss': policy_loss.item(),
        }
```
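A hedged sketch of how the class might be driven; the network shapes, expert_trajectories, env, and the collect_trajectories helper are assumptions:

```python
import torch

state_dim, action_dim = 10, 4
policy = torch.nn.Sequential(
    torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, action_dim),
)
# Discriminator scores (state, action) pairs: 1 = expert-like, 0 = policy-like
discriminator = torch.nn.Sequential(
    torch.nn.Linear(state_dim + action_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1), torch.nn.Sigmoid(),
)

gail = GAIL(policy, discriminator)
for step in range(1000):
    stats = gail.train_step(expert_trajectories, env)
```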
Theoretical Connection¶
GAIL minimizes the Jensen-Shannon divergence between the policy's and the expert's state-action distributions:

$$ \min_\pi \; D_{\text{JS}}(\rho_\pi \,\|\, \rho_{\pi^E}) - \lambda \mathcal{H}(\pi) $$
Where \(\rho_\pi\) is the occupancy measure (state-action visitation distribution).
Sample Complexity¶
How many demonstrations are needed?
Behavioral Cloning¶
In the tabular case, roughly

$$ N = \tilde{O}\!\left(\frac{|\mathcal{S}|\,|\mathcal{A}|}{\epsilon^2}\right) $$

demonstrations suffice for an \(\epsilon\)-accurate policy.

Practical: 10-1000 demos for simple tasks, 10K+ for complex manipulation
DAgger¶
Reduces sample complexity by collecting on-policy data:

$$ N = \tilde{O}\!\left(\frac{|\mathcal{S}|\,|\mathcal{A}|}{\epsilon}\right) $$

Better dependence on \(\epsilon\) (linear vs. quadratic).
IRL/GAIL¶
Higher sample complexity due to nested optimization: each reward update requires (approximately) solving an RL problem in the inner loop, so environment-interaction cost typically dominates the demonstration count.
Key Theoretical Insights¶
1. BC vs RL Sample Complexity¶
BC: Sample complexity depends on the size of the state-action space.

RL: Also depends on the horizon and discount factor, since the agent must explore.

For long-horizon tasks, BC is often more sample-efficient than RL.
2. Expert Optimality Assumption¶
Most IL theory assumes the expert is optimal. In practice:

- Experts are suboptimal
- Need to handle expert imperfection
3. Compounding Errors¶
Fundamental challenge in IL: errors compound over time
Solutions:

- DAgger (more data)
- Receding horizon control (action chunking; see the sketch below)
- Closed-loop execution (frequent re-planning)
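A minimal sketch of receding-horizon execution with action chunks, assuming a policy that predicts a short sequence of actions from the current state (as in ACT, linked below):

```python
def receding_horizon_rollout(chunk_policy, env, chunk_size=8, max_steps=200):
    """Predict a chunk of actions, execute it, then re-plan from the new state."""
    state = env.reset()
    steps = 0
    done = False
    while not done and steps < max_steps:
        # Re-plan: predict the next chunk_size actions from the current state
        action_chunk = chunk_policy(state)  # assumed shape: (chunk_size, action_dim)
        for action in action_chunk[:chunk_size]:
            state, _, done, _ = env.step(action)
            steps += 1
            if done or steps >= max_steps:
                break
    return state
```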
Practical Takeaways¶
- BC works when:
    - Expert is nearly optimal
    - Horizon is short
    - Have lots of diverse data
- Use DAgger when:
    - Can query expert online
    - Horizon is long
    - Distribution shift is severe
- Use IRL/GAIL when:
    - Need reward for transfer
    - Want to exceed expert performance
    - Have computational budget
Resources¶
Foundational Papers:

- Abbeel & Ng, "Apprenticeship Learning via Inverse Reinforcement Learning", ICML 2004
- Ross et al., "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning", AISTATS 2011 (DAgger)
- Ho & Ermon, "Generative Adversarial Imitation Learning", NIPS 2016
Surveys:

- Osa et al., "An Algorithmic Perspective on Imitation Learning", Foundations and Trends in Robotics, 2018
Next Steps¶
- Behavioral Cloning - Practical BC implementation
- DAgger - Interactive IL
- Diffusion Policies - Modern IL with diffusion models
- ACT - Action chunking to handle compounding errors