# Introduction to Reinforcement Learning

## What is Reinforcement Learning?
Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment to maximize cumulative reward.
## Mathematical Formulation

### Markov Decision Process (MDP)
An RL problem is formally defined as an MDP with:
- State space \(\mathcal{S}\): Set of all possible states
- Action space \(\mathcal{A}\): Set of all possible actions
- Transition function \(P(s'|s,a)\): Probability of transitioning to state \(s'\) given state \(s\) and action \(a\)
- Reward function \(R(s,a,s')\): Immediate reward for transition
- Discount factor \(\gamma \in [0,1]\): Importance of future rewards
### Objective

Find the optimal policy \(\pi^*(a|s)\) that maximizes the expected return:

$$ J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right], \qquad \pi^* = \arg\max_\pi J(\pi) $$

where \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)\) is a trajectory.
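As a minimal sketch, the discounted return for a single finite trajectory can be computed with a backward pass over its rewards (the reward list and discount value below are illustrative):

```python
# Discounted return G = sum_t gamma^t * r_t for one trajectory.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # Iterate backwards so each step folds in the already-discounted future.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```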
## Value Functions

### State Value Function

Expected return starting from state \(s\) and following policy \(\pi\):

$$ V^\pi(s) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s \right] $$
### Action Value Function (Q-Function)

Expected return from taking action \(a\) in state \(s\) and following policy \(\pi\) thereafter:

$$ Q^\pi(s,a) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\, a_0 = a \right] $$
## Bellman Equations
Value function: $$ V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')] $$
Q-function: $$ Q^\pi(s,a) = \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')] $$
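The Bellman equation for \(V^\pi\) can be turned directly into an algorithm: iterative policy evaluation applies the backup until the values converge. A minimal sketch on a made-up 2-state MDP with a uniform random policy:

```python
# Iterative policy evaluation: repeatedly apply the Bellman backup for V^pi.
# The 2-state MDP, rewards, and uniform policy below are made-up examples.
S, A, gamma = 2, 2, 0.9
# P[s][a] = [(next_state, probability), ...]; R[s][a][s2] = reward
P = [[[(0, 1.0)], [(1, 1.0)]],
     [[(0, 1.0)], [(1, 1.0)]]]
R = [[{0: 0.0}, {1: 1.0}],
     [{0: 0.0}, {1: 1.0}]]
pi = [[0.5, 0.5], [0.5, 0.5]]          # pi[s][a]: uniform random policy

V = [0.0] * S
for _ in range(500):                    # enough synchronous sweeps for gamma = 0.9
    V = [sum(pi[s][a] * sum(p * (R[s][a][s2] + gamma * V[s2])
                            for (s2, p) in P[s][a])
             for a in range(A))
         for s in range(S)]

print([round(v, 3) for v in V])  # → [5.0, 5.0]
```

By symmetry both states satisfy \(V = 0.5 + \gamma V\), so the fixed point is \(0.5 / (1 - 0.9) = 5\).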
## RL Algorithm Families

### Value-Based Methods

Learn a value function and derive the policy from it.
Q-Learning update rule:

$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] $$

Deep Q-Networks (DQN):

- Use a neural network to approximate the Q-function
- Experience replay for stability
- Target network for stable targets
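A minimal tabular Q-learning sketch on a made-up 4-state chain (action 1 moves right, action 0 moves left, reward 1 for reaching the last state). The behavior policy is uniformly random, which also illustrates off-policy learning:

```python
import random

# Tabular Q-learning on a tiny deterministic chain MDP (illustrative values).
N_STATES, N_ACTIONS = 4, 2
alpha, gamma = 0.1, 0.9
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward, s2 == N_STATES - 1

random.seed(0)
for _ in range(500):                       # episodes
    s, done = 0, False
    while not done:
        a = random.randrange(N_ACTIONS)    # random behavior policy (off-policy)
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])   # TD update toward the max backup
        s = s2

greedy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy)  # → [1, 1, 1]: the learned greedy policy moves right everywhere
```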
### Policy-Based Methods

Directly parameterize and optimize the policy.

Policy Gradient Theorem: $$ \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \right] $$

Advantages:

- Can handle continuous action spaces
- Can learn stochastic policies
- Often smoother convergence, since small parameter changes produce small policy changes
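The score-function estimator behind the policy gradient theorem can be sketched as REINFORCE on a made-up 2-armed bandit. The softmax policy, Bernoulli reward means, learning rate, and constant baseline below are all illustrative choices:

```python
import math
import random

# REINFORCE on a 2-armed bandit with a softmax policy over two preferences.
random.seed(0)
theta = [0.0, 0.0]              # action preferences (policy parameters)
true_mean = [0.2, 0.8]          # assumed Bernoulli reward means; arm 1 is better
lr = 0.1

def probs(th):
    z = [math.exp(t) for t in th]
    s = sum(z)
    return [x / s for x in z]

for _ in range(2000):
    p = probs(theta)
    a = 0 if random.random() < p[0] else 1
    r = 1.0 if random.random() < true_mean[a] else 0.0
    adv = r - 0.5               # crude constant baseline to reduce variance
    # grad of log pi(a) wrt theta_k is 1[k == a] - p[k] for a softmax policy
    for k in range(2):
        theta[k] += lr * ((1.0 if k == a else 0.0) - p[k]) * adv

print(round(probs(theta)[1], 2))  # close to 1: the policy prefers the better arm
```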
### Actor-Critic Methods

Combine value and policy learning.
- Actor: Policy network \(\pi_\theta(a|s)\)
- Critic: Value network \(V_\phi(s)\) or \(Q_\phi(s,a)\)
Advantage: Lower variance than pure policy gradient.
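The variance reduction comes from replacing the full return with an advantage estimate from the critic. A minimal sketch of the common one-step TD advantage (the value estimates below are made-up numbers standing in for critic outputs):

```python
# One-step advantage estimate used by many actor-critic methods:
# A(s, a) ≈ r + gamma * V(s') - V(s), i.e. the TD error of the critic.
def td_advantage(r, v_s, v_s_next, gamma=0.99, done=False):
    target = r if done else r + gamma * v_s_next
    return target - v_s

print(td_advantage(1.0, v_s=0.5, v_s_next=0.6, gamma=0.9))  # 1 + 0.54 - 0.5 ≈ 1.04
```

A positive advantage means the action did better than the critic expected, so the actor's log-probability of that action is pushed up.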
## Exploration vs. Exploitation

### The Exploration-Exploitation Tradeoff
- Exploitation: Choose actions that maximize known reward
- Exploration: Try new actions to discover better strategies
### Exploration Strategies

Common approaches include:

- ε-greedy: act greedily, but pick a random action with probability ε
- Boltzmann (softmax) exploration: sample actions in proportion to their estimated values
- Entropy regularization: add a policy-entropy bonus to the objective
- Optimism and noise: optimistic initial values, UCB-style bonuses, or parameter noise
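As a minimal sketch, ε-greedy action selection (the Q-values and ε below are made-up):

```python
import random

# epsilon-greedy: explore with probability epsilon, otherwise exploit
# the action with the highest current Q-value estimate.
def epsilon_greedy(q_values, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

random.seed(0)
q = [0.1, 0.5, 0.2]
picks = [epsilon_greedy(q, epsilon=0.1) for _ in range(1000)]
print(picks.count(1) / len(picks))  # mostly arm 1, roughly 1 - epsilon + epsilon/3
```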
## On-Policy vs. Off-Policy

### On-Policy
Learn about the policy currently being executed.
- Examples: SARSA, PPO, A3C
- Advantage: More stable
- Disadvantage: Less sample efficient
### Off-Policy
Learn about a different policy than the one being executed.
- Examples: Q-Learning, DQN, SAC, TD3
- Advantage: Can reuse old data
- Disadvantage: Can be unstable
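The data-reuse advantage is usually realized with a replay buffer: old transitions are stored and resampled for later updates. A minimal sketch (capacity and batch size are illustrative):

```python
import random
from collections import deque

# Minimal replay buffer: off-policy methods can resample old transitions.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)   # oldest entries evicted first

    def add(self, s, a, r, s2, done):
        self.buf.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

random.seed(0)
rb = ReplayBuffer(capacity=100)
for t in range(250):                        # entries beyond capacity are dropped
    rb.add(t, 0, 0.0, t + 1, False)
batch = rb.sample(32)
print(len(rb.buf), len(batch))  # → 100 32
```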
## Robotics-Specific Considerations

### Continuous Action Spaces

Most robot control problems involve continuous actions (joint torques, velocities).

Solutions:

- Policy gradient methods (PPO, SAC)
- Continuous-action Q-learning (DDPG, TD3)
### High-Dimensional Observations

Robots have complex sensors (cameras, LiDAR, force/torque).

Solutions:

- Deep neural networks for function approximation
- CNNs for image processing
- Recurrent networks for temporal dependencies
### Sample Efficiency

Real-world robot data is expensive.

Solutions:

- Model-based RL
- Learning from demonstrations
- Sim-to-real transfer
### Partial Observability

Sensors may not capture the full environment state.

Solutions:

- Recurrent policies (LSTM, GRU)
- Frame stacking
- Belief state estimation
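Frame stacking is the simplest of these: concatenate the last k observations so a feedforward policy sees short-term history. A sketch with 1-D "frames" (real use typically stacks image observations):

```python
from collections import deque

# Keep the last k observations and expose them as one flat observation.
class FrameStack:
    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        self.frames.clear()
        for _ in range(self.k):      # pad history with the first frame
            self.frames.append(frame)
        return self.obs()

    def step(self, frame):
        self.frames.append(frame)    # oldest frame is dropped automatically
        return self.obs()

    def obs(self):
        return [x for f in self.frames for x in f]

fs = FrameStack(k=3)
print(fs.reset([0.0]))  # → [0.0, 0.0, 0.0]
print(fs.step([1.0]))   # → [0.0, 0.0, 1.0]
```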
## Common Challenges
| Challenge | Description | Solutions |
|---|---|---|
| Reward Sparsity | Reward only given at task completion | Reward shaping, HER, curiosity |
| Credit Assignment | Which actions led to reward? | Value functions, eligibility traces |
| Non-stationarity | Environment changes over time | Adaptive policies, meta-learning |
| Sim-to-Real Gap | Policies don't transfer from simulation | Domain randomization, system ID |
## Next Steps
- Algorithms - Deep dive into specific RL algorithms
- Training Guide - Practical training tips
- Evaluation - How to evaluate RL agents