
Imitation Learning Theory

Understanding the theoretical foundations of imitation learning helps make informed decisions about algorithms and deployment.

Problem Formulation

Standard RL Setting

Markov Decision Process (MDP): \(\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma)\)

  • \(\mathcal{S}\): State space
  • \(\mathcal{A}\): Action space
  • \(\mathcal{T}(s' | s, a)\): Transition dynamics
  • \(R(s, a)\): Reward function
  • \(\gamma\): Discount factor

Goal: Find policy \(\pi^*\) that maximizes expected return:

\[ \pi^* = \arg\max_\pi \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right] \]
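For a finite trajectory, the quantity inside this expectation is just a discounted sum; a minimal sketch (the `discounted_return` helper name is ours, not from the text):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute the discounted return sum_t gamma^t * r_t of one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three steps of reward 1.0 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
```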

Imitation Learning Setting

Given: Expert demonstrations \(\mathcal{D} = \{(s_i, a_i)\}_{i=1}^N\) from expert policy \(\pi^E\)

Goal: Learn policy \(\pi_\theta\) that imitates expert behavior

Key difference: No explicit reward function!

Behavioral Cloning (BC)

Supervised Learning Formulation

Treat IL as supervised learning:

\[ \min_\theta \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ \ell(\pi_\theta(s), a) \right] \]

Where \(\ell\) is a loss function:

  • Continuous actions: MSE loss \(\ell(a', a) = \|a' - a\|^2\)
  • Discrete actions: Cross-entropy \(\ell(a', a) = -\log \pi_\theta(a|s)\)

import torch
import torch.nn.functional as F

def behavioral_cloning_loss(policy, demonstrations, continuous=True):
    """
    BC minimizes the distance between policy predictions and expert actions.

    For continuous actions: MSE
    For discrete actions: Cross-entropy
    """
    total_loss = 0.0

    for state, expert_action in demonstrations:
        # Predict action (continuous) or action logits (discrete)
        predicted_action = policy(state)

        if continuous:
            loss = F.mse_loss(predicted_action, expert_action)
        else:
            loss = F.cross_entropy(predicted_action, expert_action)

        total_loss += loss

    return total_loss / len(demonstrations)

Theoretical Guarantee

Under standard assumptions, BC's performance gap is bounded:

\[ J(\pi^E) - J(\pi_{BC}) \leq \frac{2\gamma T^2 C}{(1-\gamma)^2} \cdot \epsilon \]

Where:

  • \(J(\pi)\): Expected return of policy \(\pi\)
  • \(T\): Horizon length
  • \(C\): Constant depending on dynamics
  • \(\epsilon = \mathbb{E}_{s \sim d_{\pi^E}}[\text{KL}(\pi^E(\cdot|s) || \pi_{BC}(\cdot|s))]\): Average KL divergence

Key insight: Performance gap grows quadratically with horizon \(T\)!

The Compounding Error Problem

Why BC fails for long horizons:

  1. Small errors accumulate: Even if \(\epsilon\) is small, errors compound over \(T\) steps
  2. Distribution shift: Policy visits different states than expert
# Visualization of compounding errors
import numpy as np
import matplotlib.pyplot as plt

def simulate_compounding_errors(expert_policy, learned_policy, dynamics, s0, T=100):
    """Show how small per-step errors compound into large state divergence"""

    expert_states = [s0]
    learned_states = [s0]

    for t in range(T):
        # Expert trajectory
        a_expert = expert_policy(expert_states[-1])
        expert_states.append(dynamics(expert_states[-1], a_expert))

        # Learned policy trajectory
        a_learned = learned_policy(learned_states[-1])
        learned_states.append(dynamics(learned_states[-1], a_learned))

    # Plot state distance over time
    distances = [
        np.linalg.norm(expert_states[t] - learned_states[t])
        for t in range(T + 1)
    ]

    plt.plot(distances)
    plt.xlabel('Timestep')
    plt.ylabel('State Distance')
    plt.title('Compounding Errors in Behavioral Cloning')
    plt.show()

    # Distance typically grows linearly or faster with t

Distribution Mismatch

Definitions

Expert state distribution: \[ d_{\pi^E}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s | \pi^E) \]

Learned policy state distribution: \[ d_{\pi_\theta}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s | \pi_\theta) \]

Problem: BC minimizes loss under \(d_{\pi^E}\), but policy executes under \(d_{\pi_\theta}\)!

\[ \min_\theta \mathbb{E}_{s \sim d_{\pi^E}} [\ell(\pi_\theta(s), \pi^E(s))] \]

But we care about the loss under the learned policy's own state distribution: \[ \mathbb{E}_{s \sim d_{\pi_\theta}} [\ell(\pi_\theta(s), \pi^E(s))] \]
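
These discounted state distributions can be estimated from rollouts in the tabular case; a small sketch (assumes states are integer indices):

```python
import numpy as np

def discounted_visitation(trajectories, num_states, gamma=0.99):
    """Estimate d_pi(s), proportional to sum_t gamma^t P(s_t = s),
    from sampled trajectories of integer state indices."""
    d = np.zeros(num_states)
    for traj in trajectories:
        for t, s in enumerate(traj):
            d[s] += gamma ** t
    return d / d.sum()  # normalize to a probability distribution
```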

Covariate Shift

When \(d_{\pi^E} \neq d_{\pi_\theta}\), we have covariate shift:

import numpy as np
from scipy.stats import wasserstein_distance

def measure_distribution_shift(expert_policy, learned_policy, env, num_episodes=100):
    """Quantify distribution shift between expert and learned policy"""

    def collect_states(policy):
        states = []
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                states.append(state)
                action = policy(state)
                state, _, done, _ = env.step(action)
        return np.array(states)

    expert_states = collect_states(expert_policy)
    learned_states = collect_states(learned_policy)

    # Project to 1D for simplicity (mean over state dimensions)
    expert_proj = expert_states.mean(axis=1)
    learned_proj = learned_states.mean(axis=1)

    # Wasserstein distance (earth mover's distance) between projections
    return wasserstein_distance(expert_proj, learned_proj)

DAgger: Addressing Distribution Shift

Dataset Aggregation (DAgger) fixes distribution shift by collecting data under learned policy:

Algorithm

  1. Initialize \(\mathcal{D} \leftarrow \mathcal{D}_{\text{expert}}\)
  2. For \(i = 1, \ldots, N\):
     a. Train \(\pi_i\) on \(\mathcal{D}\)
     b. Execute \(\pi_i\) to collect states
     c. Query expert for actions at those states
     d. Add \((s, a^E)\) to \(\mathcal{D}\)
  3. Return final policy \(\pi_N\)

def dagger(expert_policy, env, num_iterations=10, episodes_per_iteration=10):
    """
    DAgger algorithm

    Iteratively collect data under the learned policy, label it with the expert
    """
    # Initialize with expert demonstrations
    dataset = collect_expert_demonstrations(expert_policy, env, num_demos=50)

    learned_policy = Policy()

    for iteration in range(num_iterations):
        print(f"DAgger Iteration {iteration+1}/{num_iterations}")

        # 1. Train policy on current dataset
        learned_policy = train_behavioral_cloning(learned_policy, dataset)

        # 2. Collect states under learned policy
        states_visited = []
        for _ in range(episodes_per_iteration):
            state = env.reset()
            done = False

            while not done:
                # Execute learned policy
                action = learned_policy(state)
                states_visited.append(state)

                # Step environment
                state, _, done, _ = env.step(action)

        # 3. Query expert for actions at visited states
        expert_labels = []
        for state in states_visited:
            expert_action = expert_policy(state)
            expert_labels.append(expert_action)

        # 4. Add to dataset
        for state, action in zip(states_visited, expert_labels):
            dataset.append((state, action))

        # Evaluate
        success_rate = evaluate_policy(learned_policy, env)
        print(f"  Success rate: {success_rate:.2%}")

    return learned_policy

Theoretical Guarantee

DAgger achieves a better bound than BC:

\[ J(\pi^E) - J(\pi_{\text{DAgger}}) \leq \frac{2T C}{1-\gamma} \cdot \epsilon \]

Linear (not quadratic!) in horizon \(T\).
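To see what the linear-vs-quadratic difference means in practice, the two bounds can be evaluated directly; a quick numeric sketch (the constant \(C\) and \(\epsilon\) are set to illustrative values):

```python
def bc_bound(T, epsilon, gamma=0.99, C=1.0):
    """BC performance-gap bound: 2*gamma*T^2*C*eps / (1-gamma)^2."""
    return 2 * gamma * T ** 2 * C * epsilon / (1 - gamma) ** 2

def dagger_bound(T, epsilon, gamma=0.99, C=1.0):
    """DAgger performance-gap bound: 2*T*C*eps / (1-gamma)."""
    return 2 * T * C * epsilon / (1 - gamma)

# Doubling the horizon quadruples the BC bound but only doubles the DAgger bound
```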

Inverse Reinforcement Learning (IRL)

Problem Formulation

Given: Expert demonstrations \(\mathcal{D}\)

Goal: Infer a reward function \(R\) under which the expert is optimal. One classic formulation maximizes the margin by which the expert outperforms alternative policies:

\[ R^* = \arg\max_R \left[ J(\pi^E; R) - \max_{\pi \neq \pi^E} J(\pi; R) \right] \]

Where \(J(\pi; R)\) is the expected return of policy \(\pi\) under reward \(R\).

Maximum Entropy IRL

Assume expert maximizes entropy-regularized objective:

\[ \pi^E = \arg\max_\pi \mathbb{E}_{\tau \sim \pi} \left[ \sum_t R(s_t, a_t) \right] + \alpha \mathcal{H}(\pi) \]

Where \(\mathcal{H}(\pi) = \mathbb{E}_{\pi}[-\log \pi(a|s)]\) is policy entropy.
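The entropy term can be computed per state from the action distribution; a minimal sketch:

```python
import numpy as np

def policy_entropy(action_probs):
    """H = -sum_a pi(a|s) log pi(a|s) for one state's action distribution."""
    p = np.asarray(action_probs, dtype=float)
    p = p[p > 0]  # 0 * log 0 = 0 by convention
    return float(-(p * np.log(p)).sum())

# A uniform policy over 4 actions has maximal entropy log(4)
```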

Key idea: Expert is noisy-rational

import numpy as np

def maximum_entropy_irl(demonstrations, env, state_dim, num_iterations=100, learning_rate=0.01):
    """
    Maximum Entropy IRL

    Paper: Ziebart et al., "Maximum Entropy Inverse Reinforcement Learning", AAAI 2008
    """
    # Initialize linear reward weights
    reward_weights = np.random.randn(state_dim)

    for iteration in range(num_iterations):
        # 1. Compute optimal policy under current reward
        # (solve_mdp_with_reward and collect_trajectories are placeholder helpers)
        R = lambda s: np.dot(reward_weights, s)
        optimal_policy = solve_mdp_with_reward(env, R)

        # 2. Compute feature expectations
        expert_features = compute_feature_expectation(demonstrations, state_dim)
        learned_features = compute_feature_expectation(
            collect_trajectories(optimal_policy, env), state_dim
        )

        # 3. Gradient ascent on the demonstration likelihood
        gradient = expert_features - learned_features
        reward_weights += learning_rate * gradient

    return reward_weights


def compute_feature_expectation(trajectories, state_dim):
    """
    Compute expected feature counts

    μ(π) = E_{τ~π}[∑_t φ(s_t)]
    """
    total_features = np.zeros(state_dim)

    for trajectory in trajectories:
        for state in trajectory:
            total_features += state  # Using the state itself as the feature φ(s)

    return total_features / len(trajectories)

IRL → RL Pipeline

Once the reward is learned, solve the standard RL problem:

Demonstrations → IRL → Reward → RL → Policy

Advantage: Can transfer reward to new environments

Disadvantage: Computationally expensive (nested optimization)

Generative Adversarial Imitation Learning (GAIL)

Idea: Combine IRL + GAN

  • Generator: Policy \(\pi_\theta\) (tries to fool discriminator)
  • Discriminator: Distinguishes expert from policy trajectories

Objective

\[ \min_\theta \max_\phi \mathbb{E}_{\tau \sim \pi^E}[\log D_\phi(\tau)] + \mathbb{E}_{\tau \sim \pi_\theta}[\log(1 - D_\phi(\tau))] \]
import torch

class GAIL:
    """
    Generative Adversarial Imitation Learning

    Paper: Ho & Ermon, "Generative Adversarial Imitation Learning", NIPS 2016
    """
    def __init__(self, policy, discriminator):
        self.policy = policy
        self.discriminator = discriminator

        self.policy_optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
        self.discriminator_optimizer = torch.optim.Adam(discriminator.parameters(), lr=3e-4)

    def train_step(self, expert_trajectories, env):
        """Single GAIL training step"""

        # 1. Collect policy trajectories
        policy_trajectories = self.collect_trajectories(env)

        # 2. Update discriminator
        # Expert trajectories should get label 1
        expert_loss = -torch.log(
            self.discriminator(expert_trajectories)
        ).mean()

        # Policy trajectories should get label 0
        fake_loss = -torch.log(
            1 - self.discriminator(policy_trajectories)
        ).mean()

        disc_loss = expert_loss + fake_loss

        self.discriminator_optimizer.zero_grad()
        disc_loss.backward()
        self.discriminator_optimizer.step()

        # 3. Update policy using the discriminator output as a reward signal
        # (simplified; a full implementation uses a policy-gradient method
        # such as TRPO/PPO on these rewards)
        rewards = -torch.log(1 - self.discriminator(policy_trajectories))

        policy_loss = -rewards.mean()

        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()

        return {
            'disc_loss': disc_loss.item(),
            'policy_loss': policy_loss.item()
        }

Theoretical Connection

GAIL minimizes Jensen-Shannon divergence between policy and expert state-action distributions:

\[ \min_\theta JS(\rho_{\pi^E}, \rho_{\pi_\theta}) \]

Where \(\rho_\pi\) is the occupancy measure (state-action visitation distribution).
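For discrete occupancy measures this divergence is easy to compute directly; a small sketch (representing \(\rho\) as a flat probability vector over state-action pairs):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return float((a[mask] * np.log(a[mask] / b[mask])).sum())

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical occupancy measures give 0; disjoint ones give log(2)
```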

Sample Complexity

How many demonstrations needed?

Behavioral Cloning

\[ N = O\left( \frac{|\mathcal{S}| |\mathcal{A}|}{\epsilon^2} \right) \]

For \(\epsilon\)-accurate policy in tabular case.

In practice: roughly 10-1,000 demonstrations for simple tasks, 10,000+ for complex manipulation

DAgger

Reduces sample complexity by collecting on-policy data:

\[ N = O\left( \frac{|\mathcal{S}| |\mathcal{A}| T}{\epsilon} \right) \]

Better dependence on \(\epsilon\) (linear vs quadratic).

IRL/GAIL

Higher sample complexity due to nested optimization:

\[ N = O\left( \frac{|\mathcal{S}|^2 |\mathcal{A}|}{\epsilon^2} \right) \]

Key Theoretical Insights

1. BC vs RL Sample Complexity

BC: Sample complexity depends on state-action space

RL: Also depends on horizon and discount factor

For long-horizon tasks, BC often more sample-efficient than RL.

2. Expert Optimality Assumption

Most IL theory assumes the expert is optimal. In practice:

  • Experts are suboptimal
  • Algorithms need to handle expert imperfection

3. Compounding Errors

Fundamental challenge in IL: errors compound over time

Solutions:

  • DAgger (more data)
  • Receding horizon control (action chunking)
  • Closed-loop execution (frequent re-planning)
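One of these mitigations, action chunking with receding-horizon execution, can be sketched in a few lines (the `policy_chunk` and `env_step` callables are illustrative placeholders, not from the text):

```python
def execute_with_chunking(policy_chunk, env_step, s0, T=100, H=10):
    """Receding-horizon execution: query the policy for H actions at once,
    replay them open-loop, then re-plan from the resulting state.
    Fewer policy queries means fewer opportunities to drift off-distribution."""
    state = s0
    states = [s0]
    t = 0
    while t < T:
        actions = policy_chunk(state, H)[:T - t]  # chunk of up to H actions
        for a in actions:
            state = env_step(state, a)
            states.append(state)
        t += len(actions)
    return states
```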

Practical Takeaways

  1. BC works when:
     • Expert is nearly optimal
     • Horizon is short
     • You have lots of diverse data

  2. Use DAgger when:
     • You can query the expert online
     • Horizon is long
     • Distribution shift is severe

  3. Use IRL/GAIL when:
     • You need a reward for transfer
     • You want to exceed expert performance
     • You have the computational budget

Resources

Foundational Papers:

  • Abbeel & Ng, "Apprenticeship Learning via Inverse Reinforcement Learning", ICML 2004
  • Ross et al., "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning", AISTATS 2011 (DAgger)
  • Ho & Ermon, "Generative Adversarial Imitation Learning", NIPS 2016

Surveys:

  • Osa et al., "An Algorithmic Perspective on Imitation Learning", Foundations and Trends in Robotics 2018

Next Steps