Introduction to Reinforcement Learning

What is Reinforcement Learning?

Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment to maximize cumulative reward.

Mathematical Formulation

Markov Decision Process (MDP)

An RL problem is formally defined as an MDP with:

  • State space \(\mathcal{S}\): Set of all possible states
  • Action space \(\mathcal{A}\): Set of all possible actions
  • Transition function \(P(s'|s,a)\): Probability of transitioning to state \(s'\) given state \(s\) and action \(a\)
  • Reward function \(R(s,a,s')\): Immediate reward for transition
  • Discount factor \(\gamma \in [0,1]\): Importance of future rewards
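The five components above can be written down directly as data. A minimal sketch of a hypothetical two-state MDP (all states, actions, probabilities, and rewards here are invented for illustration):

```python
# A toy two-state MDP, encoding S, A, P, R, and gamma as plain data.
# P[s][a] is a list of (next_state, probability, reward) triples.
mdp = {
    "states": ["s0", "s1"],
    "actions": ["stay", "go"],
    "gamma": 0.9,
    "P": {
        "s0": {"stay": [("s0", 1.0, 0.0)],
               "go":   [("s1", 0.8, 1.0), ("s0", 0.2, 0.0)]},
        "s1": {"stay": [("s1", 1.0, 2.0)],
               "go":   [("s0", 1.0, 0.0)]},
    },
}

# Sanity check: transition probabilities for each (s, a) must sum to 1.
for s, acts in mdp["P"].items():
    for a, outcomes in acts.items():
        assert abs(sum(p for _, p, _ in outcomes) - 1.0) < 1e-9
```

Encoding an MDP this explicitly is mostly useful for small didactic examples; real environments expose only samples from \(P\), not the table itself.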

Objective

Find optimal policy \(\pi^*(a|s)\) that maximizes expected return:

\[ J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right] \]

where \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)\) is a trajectory.
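For a finite trajectory, the discounted return inside the expectation is a plain weighted sum. A quick sketch with hypothetical reward values:

```python
# Discounted return: sum of gamma^t * r_t along one trajectory.
def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 10.0]  # sparse reward arriving at t = 3
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9**3 * 10 ≈ 8.29
```

Note how the final reward of 10 is worth only 7.29 from the start state: the discount factor is exactly what trades off immediate against delayed reward.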

Value Functions

State Value Function

\[ V^\pi(s) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s \right] \]

Expected return starting from state \(s\) and following policy \(\pi\).

Action Value Function (Q-Function)

\[ Q^\pi(s,a) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a \right] \]

Expected return taking action \(a\) in state \(s\) and following policy \(\pi\) thereafter.

Bellman Equations

Value function: $$ V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')] $$

Q-function: $$ Q^\pi(s,a) = \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')] $$
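The Bellman equation for \(V^\pi\) can be solved by fixed-point iteration (iterative policy evaluation): start from zeros and repeatedly apply the right-hand side. A self-contained sketch on a hypothetical two-state MDP with a uniform-random policy (all numbers invented):

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation for V^pi.
gamma = 0.9
states = [0, 1]
# P[s][a] = list of (next_state, prob, reward); two actions per state.
P = {
    0: {0: [(0, 1.0, 0.0)], 1: [(1, 1.0, 1.0)]},
    1: {0: [(1, 1.0, 2.0)], 1: [(0, 1.0, 0.0)]},
}
pi = {s: {0: 0.5, 1: 0.5} for s in states}  # uniform-random policy

V = {s: 0.0 for s in states}
for _ in range(1000):
    V = {
        s: sum(
            pi[s][a] * sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a])
            for a in P[s]
        )
        for s in states
    }
print(V)  # converges to V(0) = 7.25, V(1) = 7.75 for these numbers
```

Because the Bellman operator is a \(\gamma\)-contraction, the iteration converges geometrically to the unique fixed point \(V^\pi\).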

RL Algorithm Families

Value-Based Methods

Learn a value function and derive the policy from it.

Q-Learning:

$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] $$
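A single application of this update rule, with hypothetical Q-values and an observed transition:

```python
# One tabular Q-learning update.
alpha, gamma = 0.1, 0.9
Q = {("s0", "a0"): 0.0, ("s0", "a1"): 0.0,
     ("s1", "a0"): 5.0, ("s1", "a1"): 3.0}

# Observed transition: take a0 in s0, receive r = 1, land in s1.
s, a, r, s_next = "s0", "a0", 1.0, "s1"
td_target = r + gamma * max(Q[(s_next, ap)] for ap in ("a0", "a1"))
Q[(s, a)] += alpha * (td_target - Q[(s, a)])
print(Q[("s0", "a0")])  # 0.1 * (1 + 0.9*5 - 0) ≈ 0.55
```

The bracketed quantity is the temporal-difference (TD) error; the learning rate \(\alpha\) controls how far each estimate moves toward the target.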

Deep Q-Networks (DQN):

  • Use a neural network to approximate the Q-function
  • Experience replay for stability
  • Target network for stable targets

Policy-Based Methods

Directly parameterize and optimize the policy.

Policy Gradient Theorem: $$ \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \right] $$

Advantages:

  • Can handle continuous action spaces
  • Can learn stochastic policies
  • Better convergence properties
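The simplest instance of the theorem is REINFORCE on a one-step problem. A minimal sketch on a hypothetical two-armed bandit (rewards and learning rate invented), using the fact that for a softmax policy \(\nabla_{\theta_k} \log \pi(a) = \mathbf{1}[k = a] - \pi(k)\):

```python
import math
import random

# REINFORCE on a two-armed bandit with a softmax policy over logits theta.
random.seed(0)
theta = [0.0, 0.0]        # one logit per action
true_reward = [0.0, 1.0]  # arm 1 is strictly better
lr = 0.1

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    z = sum(e)
    return [v / z for v in e]

for _ in range(2000):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]  # sample from the policy
    r = true_reward[a]
    # Gradient ascent step: grad log pi(a) * R
    for k in range(2):
        theta[k] += lr * ((1.0 if k == a else 0.0) - probs[k]) * r

print(softmax(theta))  # probability of arm 1 ends up close to 1
```

Because the update scales with the sampled return, high-variance returns translate directly into high-variance gradients, which motivates the baselines and critics discussed next.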

Actor-Critic Methods

Combine value and policy learning.

  • Actor: Policy network \(\pi_\theta(a|s)\)
  • Critic: Value network \(V_\phi(s)\) or \(Q_\phi(s,a)\)

Advantage: Lower variance than pure policy gradient.
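The variance reduction comes from replacing the full Monte Carlo return with a critic-based advantage estimate. A one-step sketch with hypothetical critic outputs:

```python
# One-step advantage estimate (the TD error) used by actor-critic methods.
gamma = 0.99
V = {"s": 1.0, "s_next": 2.0}  # critic's value estimates for the two states

r = 0.5  # reward observed on the transition s -> s_next
advantage = r + gamma * V["s_next"] - V["s"]
print(advantage)  # 0.5 + 0.99*2.0 - 1.0 ≈ 1.48
```

The actor's log-probability gradient is then weighted by this advantage instead of the raw return, so only the critic's one-step estimation noise enters the gradient.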

Exploration vs. Exploitation

The Exploration-Exploitation Tradeoff

  • Exploitation: Choose actions that maximize known reward
  • Exploration: Try new actions to discover better strategies

Exploration Strategies

ε-Greedy: with probability \(\epsilon\) take a random action, otherwise the greedy one.

    import random

    if random.random() < epsilon:
        action = random.choice(actions)  # Explore
    else:
        action = argmax(Q[state])  # Exploit

Boltzmann (Softmax): sample actions in proportion to their exponentiated values; lower temperature means greedier behavior.

    probabilities = softmax(Q[state] / temperature)
    action = sample(probabilities)

Gaussian Noise: for continuous actions, perturb the deterministic policy output.

    # For continuous actions
    action = policy(state) + N(0, σ²)

On-Policy vs. Off-Policy

On-Policy

Learn about the policy currently being executed.

  • Examples: SARSA, PPO, A3C
  • Advantage: More stable
  • Disadvantage: Less sample efficient

Off-Policy

Learn about a different policy than the one being executed.

  • Examples: Q-Learning, DQN, SAC, TD3
  • Advantage: Can reuse old data
  • Disadvantage: Can be unstable
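The distinction is visible in the bootstrap targets themselves. A sketch with a hypothetical Q-table: SARSA (on-policy) bootstraps from the action the behavior policy actually takes next, while Q-learning (off-policy) bootstraps from the greedy action regardless of what was executed.

```python
# On-policy vs. off-policy targets for the same observed transition.
gamma = 0.9
Q = {("s1", "left"): 1.0, ("s1", "right"): 4.0}

r = 0.0
a_next = "left"  # action the (exploratory) behavior policy happens to pick

sarsa_target = r + gamma * Q[("s1", a_next)]                              # on-policy
qlearning_target = r + gamma * max(Q[("s1", a)] for a in ("left", "right"))  # off-policy
print(sarsa_target, qlearning_target)  # 0.9 vs. 3.6
```

Because Q-learning's target ignores the behavior policy, it can learn the greedy policy from any logged data; SARSA's target accounts for the exploration actually performed, which tends to make it more conservative.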

Robotics-Specific Considerations

Continuous Action Spaces

Most robot control problems involve continuous actions (joint torques, velocities).

Solutions:

  • Policy gradient methods (PPO, SAC)
  • Continuous Q-learning (DDPG, TD3)

High-Dimensional Observations

Robots have complex sensors (cameras, LiDAR, force/torque).

Solutions:

  • Deep neural networks for function approximation
  • CNNs for image processing
  • Recurrent networks for temporal dependencies

Sample Efficiency

Real-world robot data is expensive.

Solutions:

  • Model-based RL
  • Learning from demonstrations
  • Sim-to-real transfer

Partial Observability

Sensors may not capture full environment state.

Solutions:

  • Recurrent policies (LSTM, GRU)
  • Frame stacking
  • Belief state estimation
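Frame stacking is the simplest of these to implement: keep the last \(k\) observations in a bounded buffer and feed their concatenation to the policy. A minimal sketch (the observation strings are placeholders for real sensor frames):

```python
from collections import deque

# Keep the last k observations; pad with copies on the first frame.
k = 4
frames = deque(maxlen=k)

def observe(obs):
    """Push a new observation and return the stacked k-frame observation."""
    if not frames:
        frames.extend([obs] * k)  # initial padding so the stack is always full
    else:
        frames.append(obs)        # deque drops the oldest frame automatically
    return list(frames)

print(observe("f1"))  # ['f1', 'f1', 'f1', 'f1']
print(observe("f2"))  # ['f1', 'f1', 'f1', 'f2']
```

The stack gives the policy access to short-term history (e.g. velocities implied by consecutive camera frames) without the cost of a recurrent network, but it cannot capture dependencies longer than \(k\) steps.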

Common Challenges

| Challenge | Description | Solutions |
| --- | --- | --- |
| Reward sparsity | Reward only given at task completion | Reward shaping, HER, curiosity |
| Credit assignment | Which actions led to the reward? | Value functions, eligibility traces |
| Non-stationarity | Environment changes over time | Adaptive policies, meta-learning |
| Sim-to-real gap | Policies don't transfer from simulation | Domain randomization, system ID |

Next Steps