Imitation Learning Theory¶
Understanding the theoretical foundations of imitation learning helps make informed decisions about algorithms and deployment.
Problem Formulation¶
Standard RL Setting¶
Markov Decision Process (MDP): \(\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma)\)
- \(\mathcal{S}\): State space
- \(\mathcal{A}\): Action space
- \(\mathcal{T}(s' | s, a)\): Transition dynamics
- \(R(s, a)\): Reward function
- \(\gamma\): Discount factor
Goal: Find policy \(\pi^*\) that maximizes expected return:

$$ \pi^* = \arg\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right] $$
Imitation Learning Setting¶
Given: Expert demonstrations \(\mathcal{D} = \{(s_i, a_i)\}_{i=1}^N\) from expert policy \(\pi^E\)
Goal: Learn policy \(\pi_\theta\) that imitates expert behavior
Key difference: No explicit reward function!
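The code examples on this page assume demonstrations are stored as a flat list of \((s, a)\) pairs collected from a Gym-style environment (old four-tuple `step` API). Below is a minimal sketch of the collection helper used by the DAgger example further down; its exact signature is an assumption:

```python
def collect_expert_demonstrations(expert_policy, env, num_demos=50):
    """Roll out the expert and record (state, action) pairs."""
    demonstrations = []
    for _ in range(num_demos):
        state = env.reset()
        done = False
        while not done:
            action = expert_policy(state)
            demonstrations.append((state, action))
            state, _, done, _ = env.step(action)  # old Gym 4-tuple API
    return demonstrations
```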
Behavioral Cloning (BC)¶
Supervised Learning Formulation¶
Treat IL as supervised learning:

$$ \min_\theta \; \mathbb{E}_{(s, a^E) \sim \mathcal{D}}\left[\ell(\pi_\theta(s), a^E)\right] $$

Where \(\ell\) is a loss function:

- Continuous actions: MSE loss \(\ell(a', a) = \|a' - a\|^2\)
- Discrete actions: Cross-entropy \(\ell(a', a) = -\log \pi_\theta(a|s)\)
```python
import torch
import torch.nn.functional as F

def behavioral_cloning_loss(policy, demonstrations, continuous=True):
    """
    BC minimizes the distance between policy and expert actions.

    For continuous actions: MSE
    For discrete actions: cross-entropy
    """
    total_loss = 0.0
    for state, expert_action in demonstrations:
        # Predict an action for the demonstrated state
        predicted_action = policy(state)
        if continuous:
            loss = F.mse_loss(predicted_action, expert_action)
        else:
            # predicted_action holds logits; expert_action is a class index
            loss = F.cross_entropy(predicted_action, expert_action)
        total_loss += loss
    return total_loss / len(demonstrations)
```
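A minimal training loop around this loss; the MLP policy and the synthetic demonstrations are placeholders for illustration:

```python
import torch

state_dim, action_dim = 10, 4
policy = torch.nn.Sequential(
    torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, action_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Synthetic (state, action) pairs standing in for real demonstrations
demos = [(torch.randn(state_dim), torch.randn(action_dim)) for _ in range(256)]

for epoch in range(10):
    loss = behavioral_cloning_loss(policy, demos, continuous=True)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```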
Theoretical Guarantee¶
Under some assumptions, BC has bounded performance:

$$ J(\pi^E) - J(\pi_{BC}) \leq C \, T^2 \, \epsilon $$

Where:

- \(J(\pi)\): Expected return of policy \(\pi\)
- \(T\): Horizon length
- \(C\): Constant depending on dynamics
- \(\epsilon = \mathbb{E}_{s \sim d_{\pi^E}}[\text{KL}(\pi^E(\cdot|s) \,\|\, \pi_{BC}(\cdot|s))]\): Average KL divergence
Key insight: Performance gap grows quadratically with horizon \(T\)!
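To make the quadratic dependence concrete, take \(C = 1\) and \(\epsilon = 0.01\) for illustration:

$$ T = 10: \;\; J(\pi^E) - J(\pi_{BC}) \leq 10^2 \cdot 0.01 = 1, \qquad T = 100: \;\; J(\pi^E) - J(\pi_{BC}) \leq 100^2 \cdot 0.01 = 100 $$

A 10× longer horizon loosens the bound by 100×, even though the per-state imitation error is unchanged.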
The Compounding Error Problem¶
Why BC fails for long horizons:
- Small errors accumulate: Even if \(\epsilon\) is small, errors compound over \(T\) steps
- Distribution shift: The policy visits different states than the expert did
```python
# Visualization of compounding errors
import matplotlib.pyplot as plt
import numpy as np

def simulate_compounding_errors(expert_policy, learned_policy, dynamics, s0, T=100):
    """Show how small per-step errors compound into large state divergence.

    dynamics(s, a) -> s' and the initial state s0 are assumed to be provided.
    """
    expert_states = [s0]
    learned_states = [s0]
    for t in range(T):
        # Expert trajectory
        a_expert = expert_policy(expert_states[-1])
        expert_states.append(dynamics(expert_states[-1], a_expert))
        # Learned policy trajectory
        a_learned = learned_policy(learned_states[-1])
        learned_states.append(dynamics(learned_states[-1], a_learned))
    # Plot the distance between the two trajectories over time
    distances = [
        np.linalg.norm(expert_states[t] - learned_states[t])
        for t in range(T + 1)
    ]
    plt.plot(distances)
    plt.xlabel('Timestep')
    plt.ylabel('State Distance')
    plt.title('Compounding Errors in Behavioral Cloning')
    plt.show()
    # The distance typically grows over time -- linearly or even exponentially!
```
Distribution Mismatch¶
Definitions¶
Expert state distribution: $$ d_{\pi^E}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s | \pi^E) $$
Learned policy state distribution: $$ d_{\pi_\theta}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s | \pi_\theta) $$
Problem: BC minimizes loss under \(d_{\pi^E}\), but the policy executes under \(d_{\pi_\theta}\)!

But we care about: $$ \mathbb{E}_{s \sim d_{\pi_\theta}}[\ell(\pi_\theta(s), \pi^E(s))] $$
Covariate Shift¶
When \(d_{\pi^E} \neq d_{\pi_\theta}\), we have covariate shift:
```python
import numpy as np
from scipy.stats import wasserstein_distance

def measure_distribution_shift(expert_policy, learned_policy, env, num_episodes=100):
    """Quantify distribution shift between expert and learned policy."""

    def collect_states(policy):
        states = []
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                states.append(state)
                action = policy(state)
                state, _, done, _ = env.step(action)
        return np.array(states)

    expert_states = collect_states(expert_policy)
    learned_states = collect_states(learned_policy)

    # Project states to 1D for simplicity (a crude summary of each state)
    expert_proj = expert_states.mean(axis=1)
    learned_proj = learned_states.mean(axis=1)

    # Wasserstein distance (earth mover's distance) between the projections
    return wasserstein_distance(expert_proj, learned_proj)
```
DAgger: Addressing Distribution Shift¶
Dataset Aggregation (DAgger) fixes distribution shift by collecting data under the learned policy:
Algorithm¶
1. Initialize \(\mathcal{D} \leftarrow \mathcal{D}_{\text{expert}}\)
2. For \(i = 1, \ldots, N\):
    - Train \(\pi_i\) on \(\mathcal{D}\)
    - Execute \(\pi_i\) to collect states
    - Query the expert for actions at those states
    - Add the \((s, a^E)\) pairs to \(\mathcal{D}\)
3. Return the final policy \(\pi_N\)
```python
def dagger(expert_policy, env, num_iterations=10, samples_per_iteration=1000):
    """
    DAgger algorithm
    Iteratively collect data under the learned policy, label it with the expert.

    Policy, train_behavioral_cloning, and evaluate_policy are assumed helpers.
    """
    # Initialize with expert demonstrations
    dataset = collect_expert_demonstrations(expert_policy, env, num_demos=50)
    learned_policy = Policy()

    for iteration in range(num_iterations):
        print(f"DAgger Iteration {iteration + 1}/{num_iterations}")

        # 1. Train policy on the current (aggregated) dataset
        learned_policy = train_behavioral_cloning(learned_policy, dataset)

        # 2. Collect states by executing the *learned* policy
        states_visited = []
        while len(states_visited) < samples_per_iteration:
            state = env.reset()
            done = False
            while not done and len(states_visited) < samples_per_iteration:
                states_visited.append(state)
                action = learned_policy(state)
                state, _, done, _ = env.step(action)

        # 3. Query the expert for the correct action at each visited state
        expert_labels = [expert_policy(s) for s in states_visited]

        # 4. Aggregate the relabeled data into the dataset
        dataset.extend(zip(states_visited, expert_labels))

        # Evaluate
        success_rate = evaluate_policy(learned_policy, env)
        print(f"  Success rate: {success_rate:.2%}")

    return learned_policy
```
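Note that the original DAgger analysis (Ross et al., 2011) executes a mixture of expert and learner while collecting data, annealing the expert's share \(\beta_i\) toward zero across iterations. A minimal sketch of such a mixed policy:

```python
import random

def mixed_policy(state, expert_policy, learned_policy, beta):
    """With probability beta act like the expert, otherwise use the learner."""
    if random.random() < beta:
        return expert_policy(state)
    return learned_policy(state)
```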
Theoretical Guarantee¶
DAgger achieves a better bound than BC:

$$ J(\pi^E) - J(\pi_{\text{DAgger}}) \leq C \, T \, \epsilon $$

Linear (not quadratic!) in the horizon \(T\).
Inverse Reinforcement Learning (IRL)¶
Problem Formulation¶
Given: Expert demonstrations \(\mathcal{D}\)
Goal: Infer a reward function \(R\) such that the expert is optimal:

$$ \text{find } R \quad \text{such that} \quad \pi^E = \pi^*(R) = \arg\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right] $$

Where \(\pi^*(R)\) is the optimal policy for reward \(R\).
Maximum Entropy IRL¶
Assume the expert maximizes an entropy-regularized objective:

$$ \pi^E = \arg\max_\pi \; \mathbb{E}_\pi\left[\sum_t R(s_t, a_t)\right] + \mathcal{H}(\pi) $$
Where \(\mathcal{H}(\pi) = \mathbb{E}_{\pi}[-\log \pi(a|s)]\) is policy entropy.
Key idea: The expert is noisy-rational; trajectories with higher reward are exponentially more likely, \(P(\tau) \propto \exp\left(\sum_t R(s_t, a_t)\right)\).
```python
import numpy as np

def maximum_entropy_irl(demonstrations, env, state_dim,
                        num_iterations=100, learning_rate=0.01):
    """
    Maximum Entropy IRL
    Paper: Ziebart et al., "Maximum Entropy Inverse Reinforcement Learning", AAAI 2008

    solve_mdp_with_reward and collect_trajectories are assumed helpers.
    """
    # Initialize a linear reward function over state features
    reward_weights = np.random.randn(state_dim)

    for iteration in range(num_iterations):
        # 1. Compute the optimal policy under the current reward
        R = lambda s: np.dot(reward_weights, s)
        optimal_policy = solve_mdp_with_reward(env, R)

        # 2. Compute feature expectations
        # Expert feature expectation
        expert_features = compute_feature_expectation(demonstrations, state_dim)
        # Learned policy feature expectation
        learned_features = compute_feature_expectation(
            collect_trajectories(optimal_policy, env), state_dim
        )

        # 3. Gradient ascent on the demonstration likelihood
        gradient = expert_features - learned_features
        reward_weights += learning_rate * gradient

    return reward_weights

def compute_feature_expectation(trajectories, state_dim):
    """
    Compute expected feature counts:
    μ(π) = E_{τ~π}[∑_t φ(s_t)]
    """
    total_features = np.zeros(state_dim)
    for trajectory in trajectories:
        for state in trajectory:
            total_features += state  # Using the raw state as the feature φ(s)
    return total_features / len(trajectories)
```
IRL → RL Pipeline¶
Once the reward is learned, solve the resulting RL problem:

$$ \pi^* = \arg\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t \hat{R}(s_t, a_t)\right] $$
Advantage: Can transfer reward to new environments
Disadvantage: Computationally expensive (nested optimization)
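Putting the two stages together, a sketch of the full pipeline using the functions defined above (demonstrations, env, and solve_mdp_with_reward are assumed to be in scope):

```python
import numpy as np

# Stage 1: infer the reward from demonstrations (IRL)
reward_weights = maximum_entropy_irl(demonstrations, env, state_dim=10)

# Stage 2: solve the MDP under the learned reward (RL)
learned_reward = lambda s: np.dot(reward_weights, s)
final_policy = solve_mdp_with_reward(env, learned_reward)
```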
Generative Adversarial Imitation Learning (GAIL)¶
Idea: Combine IRL + GAN
- Generator: Policy \(\pi_\theta\) (tries to fool discriminator)
- Discriminator: Distinguishes expert from policy trajectories
Objective¶
With a discriminator \(D(s, a)\) trained to output 1 on expert pairs and 0 on policy pairs, GAIL solves:

$$ \min_\pi \max_D \; \mathbb{E}_{(s,a) \sim \pi^E}[\log D(s, a)] + \mathbb{E}_{(s,a) \sim \pi}[\log(1 - D(s, a))] - \lambda \mathcal{H}(\pi) $$
```python
import torch

class GAIL:
    """
    Generative Adversarial Imitation Learning
    Paper: Ho & Ermon, "Generative Adversarial Imitation Learning", NIPS 2016

    collect_trajectories (returning batches of (s, a) pairs) is an assumed helper.
    """
    def __init__(self, policy, discriminator):
        self.policy = policy
        self.discriminator = discriminator
        self.policy_optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
        self.discriminator_optimizer = torch.optim.Adam(discriminator.parameters(), lr=3e-4)

    def train_step(self, expert_trajectories, env):
        """Single GAIL training step."""
        # 1. Collect trajectories with the current policy
        policy_trajectories = self.collect_trajectories(env)

        # 2. Update the discriminator
        # Expert pairs should get label 1
        expert_loss = -torch.log(self.discriminator(expert_trajectories)).mean()
        # Policy pairs should get label 0
        fake_loss = -torch.log(1 - self.discriminator(policy_trajectories)).mean()
        disc_loss = expert_loss + fake_loss
        self.discriminator_optimizer.zero_grad()
        disc_loss.backward()
        self.discriminator_optimizer.step()

        # 3. Update the policy, using the discriminator output as a reward signal
        rewards = -torch.log(1 - self.discriminator(policy_trajectories))
        # Policy update (heavily simplified; real GAIL uses a policy-gradient
        # method such as TRPO/PPO with these rewards)
        policy_loss = -rewards.mean()
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()

        return {
            'disc_loss': disc_loss.item(),
            'policy_loss': policy_loss.item(),
        }
```
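A hedged sketch of how the class might be driven; the network shapes, expert_trajectories, env, and the collect_trajectories helper are assumptions:

```python
import torch

state_dim, action_dim = 10, 4
policy = torch.nn.Sequential(
    torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, action_dim),
)
# Discriminator scores (state, action) pairs: 1 = expert-like, 0 = policy-like
discriminator = torch.nn.Sequential(
    torch.nn.Linear(state_dim + action_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1), torch.nn.Sigmoid(),
)

gail = GAIL(policy, discriminator)
for step in range(1000):
    stats = gail.train_step(expert_trajectories, env)
```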
Theoretical Connection¶
GAIL minimizes the Jensen-Shannon divergence between the policy's and the expert's state-action distributions:

$$ \min_\pi \; D_{\text{JS}}(\rho_\pi \,\|\, \rho_{\pi^E}) - \lambda \mathcal{H}(\pi) $$
Where \(\rho_\pi\) is the occupancy measure (state-action visitation distribution).
Sample Complexity¶
How many demonstrations are needed?
Behavioral Cloning¶
In the tabular case, roughly

$$ N = \tilde{O}\!\left(\frac{|\mathcal{S}|\,|\mathcal{A}|}{\epsilon^2}\right) $$

demonstrations suffice for an \(\epsilon\)-accurate policy.

Practical: 10-1000 demos for simple tasks, 10K+ for complex manipulation
DAgger¶
Reduces sample complexity by collecting on-policy data:

$$ N = \tilde{O}\!\left(\frac{|\mathcal{S}|\,|\mathcal{A}|}{\epsilon}\right) $$

Better dependence on \(\epsilon\) (linear vs. quadratic).
IRL/GAIL¶
Higher sample complexity due to nested optimization: each reward update requires (approximately) solving an RL problem in the inner loop, so environment-interaction cost typically dominates the demonstration count.
Key Theoretical Insights¶
1. BC vs RL Sample Complexity¶
BC: Sample complexity depends on the size of the state-action space.

RL: Also depends on the horizon and discount factor, since the agent must explore.

For long-horizon tasks, BC is often more sample-efficient than RL.
2. Expert Optimality Assumption¶
Most IL theory assumes the expert is optimal. In practice:

- Experts are suboptimal
- Need to handle expert imperfection
3. Compounding Errors¶
Fundamental challenge in IL: errors compound over time
Solutions:

- DAgger (more data)
- Receding horizon control (action chunking; see the sketch below)
- Closed-loop execution (frequent re-planning)
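A minimal sketch of receding-horizon execution with action chunks, assuming a policy that predicts a short sequence of actions from the current state (as in ACT, linked below):

```python
def receding_horizon_rollout(chunk_policy, env, chunk_size=8, max_steps=200):
    """Predict a chunk of actions, execute it, then re-plan from the new state."""
    state = env.reset()
    steps = 0
    done = False
    while not done and steps < max_steps:
        # Re-plan: predict the next chunk_size actions from the current state
        action_chunk = chunk_policy(state)  # assumed shape: (chunk_size, action_dim)
        for action in action_chunk[:chunk_size]:
            state, _, done, _ = env.step(action)
            steps += 1
            if done or steps >= max_steps:
                break
    return state
```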
Practical Takeaways¶
- BC works when:
    - Expert is nearly optimal
    - Horizon is short
    - Have lots of diverse data
- Use DAgger when:
    - Can query expert online
    - Horizon is long
    - Distribution shift is severe
- Use IRL/GAIL when:
    - Need reward for transfer
    - Want to exceed expert performance
    - Have computational budget
Resources¶
Foundational Papers:

- Abbeel & Ng, "Apprenticeship Learning via Inverse Reinforcement Learning", ICML 2004
- Ross et al., "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning", AISTATS 2011 (DAgger)
- Ho & Ermon, "Generative Adversarial Imitation Learning", NIPS 2016
Surveys:

- Osa et al., "An Algorithmic Perspective on Imitation Learning", Foundations and Trends in Robotics, 2018
Next Steps¶
- Behavioral Cloning - Practical BC implementation
- DAgger - Interactive IL
- Diffusion Policies - Modern IL with diffusion models
- ACT - Action chunking to handle compounding errors