RT-1 and RT-2: Robotics Transformers¶
Google DeepMind's Robotics Transformer models are landmark vision-language-action systems for real-world robot control.
RT-1: Robotics Transformer 1¶
Paper: "RT-1: Robotics Transformer for Real-World Control at Scale" (Brohan et al., 2022)
Architecture Overview¶
RT-1 combines efficient vision encoding with transformer-based action prediction:
```mermaid
graph LR
    A[Image 224x224x3] --> B[EfficientNet-B3]
    C[Language Instruction] --> D[USE Encoder]
    B --> E[TokenLearner]
    E --> F[Transformer<br/>8 tokens x 512d]
    D --> F
    F --> G[Action Head]
    G --> H[7-DoF Action]
```
Key Innovations¶
1. Token Learner for Efficiency¶
Instead of using all image patches, TokenLearner selects the most informative tokens:
```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Learns to select a small set of informative visual tokens."""
    def __init__(self, num_tokens=8, dim=512):
        super().__init__()
        self.num_tokens = num_tokens
        # 1x1 conv produces one spatial attention map per output token
        self.token_wts = nn.Conv2d(dim, num_tokens, 1)

    def forward(self, x):
        # x: (B, H, W, C)
        B, H, W, C = x.shape
        # Attention logits for each token: (B, num_tokens, H, W)
        logits = self.token_wts(x.permute(0, 3, 1, 2))
        # Normalize each token's weights over spatial positions (not across tokens)
        weights = logits.view(B, self.num_tokens, H * W).softmax(dim=-1)
        # Weighted sum of spatial features: (B, num_tokens, C)
        tokens = torch.bmm(weights, x.reshape(B, H * W, C))
        return tokens
```
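The shape bookkeeping can be sanity-checked with a functional sketch of the same weighted-pooling step. The sizes below are illustrative (a 9x9 backbone feature map is assumed, not taken from the paper):

```python
import torch
import torch.nn.functional as F

B, H, W, C, K = 2, 9, 9, 512, 8             # batch, spatial dims, channels, learned tokens
x = torch.randn(B, H, W, C)                 # feature map from the vision backbone
w = torch.randn(B, K, H * W)                # per-token spatial logits (from the 1x1 conv)
w = F.softmax(w, dim=-1)                    # normalize over spatial positions
tokens = torch.bmm(w, x.view(B, H * W, C))  # weighted pooling -> (B, K, C)
print(tokens.shape)                         # torch.Size([2, 8, 512])
```

Regardless of image resolution, the transformer downstream always sees exactly `K` tokens per frame, which is what makes RT-1 fast enough for real-time control.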
2. Action Discretization¶
RT-1 discretizes continuous actions into 256 bins per dimension:
```python
import torch

def discretize_actions(actions, num_bins=256, action_bounds=(-1, 1)):
    """Convert continuous actions to discrete bin indices."""
    # Normalize to [0, 1]
    actions_normalized = (actions - action_bounds[0]) / (action_bounds[1] - action_bounds[0])
    # Discretize and clamp to the valid range
    actions_discrete = (actions_normalized * (num_bins - 1)).long()
    return torch.clamp(actions_discrete, 0, num_bins - 1)

def undiscretize_actions(actions_discrete, num_bins=256, action_bounds=(-1, 1)):
    """Convert discrete bin indices back to continuous actions."""
    # To [0, 1]
    actions_normalized = actions_discrete.float() / (num_bins - 1)
    # To the original range
    return actions_normalized * (action_bounds[1] - action_bounds[0]) + action_bounds[0]
```
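A quick round trip shows the scheme in action and bounds its reconstruction error (one quantization bin at most):

```python
import torch

num_bins = 256
lo, hi = -1.0, 1.0
actions = torch.tensor([[-1.0, 0.0, 0.5, 1.0]])

# Discretize: map [-1, 1] onto {0, ..., 255}
idx = ((actions - lo) / (hi - lo) * (num_bins - 1)).long().clamp(0, num_bins - 1)
# Undiscretize: map bin indices back into [-1, 1]
recon = idx.float() / (num_bins - 1) * (hi - lo) + lo

print(idx.tolist())                            # [[0, 127, 191, 255]]
print((recon - actions).abs().max().item())    # small: at most one bin width
```

The endpoints round-trip exactly; interior values are recovered to within one bin width (about 0.008 for 256 bins over a range of 2).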
3. FiLM-Conditioned EfficientNet¶
Language conditioning is injected into the image encoder via FiLM (Feature-wise Linear Modulation) layers:
```python
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation."""
    def __init__(self, feature_dim, condition_dim):
        super().__init__()
        self.gamma = nn.Linear(condition_dim, feature_dim)
        self.beta = nn.Linear(condition_dim, feature_dim)

    def forward(self, features, conditioning):
        # features: (B, N, feature_dim); conditioning: (B, condition_dim)
        gamma = self.gamma(conditioning)
        beta = self.beta(conditioning)
        # Broadcast over the token dimension and modulate
        return gamma.unsqueeze(1) * features + beta.unsqueeze(1)
```
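A quick shape check of the modulation step, with the two linear layers standing in for a trained FiLM generator (dimensions match the 8-token, 512-d setup above):

```python
import torch
import torch.nn as nn

feature_dim, condition_dim = 512, 512
film_gamma = nn.Linear(condition_dim, feature_dim)
film_beta = nn.Linear(condition_dim, feature_dim)

features = torch.randn(2, 8, feature_dim)   # e.g. 8 visual tokens per image
cond = torch.randn(2, condition_dim)        # sentence embedding of the instruction

gamma = film_gamma(cond).unsqueeze(1)       # (B, 1, feature_dim), broadcast over tokens
beta = film_beta(cond).unsqueeze(1)
out = gamma * features + beta               # same shape as the input features
print(out.shape)                            # torch.Size([2, 8, 512])
```

Because FiLM only scales and shifts feature maps, the same image encoder computes different features for different instructions at negligible extra cost.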
Training Details¶
Dataset: 130k demonstrations from 700+ tasks
Training Configuration:
```python
config = {
    'batch_size': 256,
    'learning_rate': 1e-4,
    'weight_decay': 1e-4,
    'epochs': 100,
    'gradient_clip': 1.0,
    # Action space
    'action_bins': 256,
    'action_dim': 7,  # x, y, z, roll, pitch, yaw, gripper
    # Vision
    'image_size': (224, 224),
    'tokens_per_image': 8,
    # Language
    'max_instruction_length': 77,
    'language_embedding_dim': 512,
}
```
Results¶
- Success rate: 97% on seen tasks, 76% on novel tasks
- Real-time capable: 3 Hz control frequency
- Generalization: Handles novel objects and scenarios
RT-2: Vision-Language-Action Model from Web Data¶
Paper: "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control" (Brohan et al., 2023)
Revolutionary Approach¶
RT-2 co-fine-tunes a pre-trained vision-language model (VLM) for robotics:
```mermaid
graph TD
    A[Pre-trained VLM<br/>PaLI or PaLM-E] --> B[Add Action Tokens]
    B --> C[Co-fine-tune on<br/>Robot Data]
    C --> D[VLA Model]
    D --> E[Understands:<br/>- Vision<br/>- Language<br/>- Actions]
```
Key Insight¶
Instead of training from scratch, RT-2 leverages internet-scale vision-language models:
- PaLI: 5B parameter vision-language model
- PaLM-E: 562B parameter embodied model
- Pre-trained on billions of web images with text
Architecture¶
RT-2 uses the VLM architecture directly with minimal modifications:
```python
import torch
import torch.nn as nn

class RT2Model(nn.Module):
    """RT-2: a pre-trained VLM adapted for robot control (illustrative sketch)."""
    def __init__(self, base_vlm, action_vocab_size=256):
        super().__init__()
        # Use the pre-trained VLM (PaLI or PaLM-E) as the backbone
        self.vlm = base_vlm
        # Extend the vocabulary with action tokens: one bin set per action dimension
        self.action_token_embedding = nn.Embedding(
            action_vocab_size * 7,  # 7 action dimensions
            self.vlm.config.hidden_size
        )

    def forward(self, image, instruction):
        # Encode image and instruction with the VLM
        vlm_output = self.vlm.encode(image, instruction)
        # Generate action tokens autoregressively, one per action dimension
        actions = self.vlm.generate(
            vlm_output,
            max_new_tokens=7,  # 7-DoF action
            use_cache=True
        )
        return self.decode_actions(actions)

    def decode_actions(self, action_tokens):
        """Convert action tokens to continuous actions."""
        # Each action dimension is a separate token
        actions = []
        for dim_tokens in action_tokens.chunk(7, dim=-1):
            # Undiscretize each dimension back to its continuous range
            actions.append(self.undiscretize(dim_tokens))
        return torch.cat(actions, dim=-1)

    def undiscretize(self, tokens, num_bins=256):
        # Map bin indices back to [-1, 1]
        return tokens.float() / (num_bins - 1) * 2 - 1
```
Training Strategy¶
Two-stage approach:
1. Pre-training: the VLM is trained on web-scale image-text data (done upstream)
2. Co-fine-tuning: fine-tune on robot data while mixing in the original VLM objectives, preserving the model's web-scale capabilities
```python
import torch
import torch.nn.functional as F

def co_finetune_rt2(vlm_model, robot_dataset, config):
    """Co-fine-tune a VLM for robotics (illustrative sketch)."""
    # Freeze all layers initially
    for param in vlm_model.parameters():
        param.requires_grad = False
    # Unfreeze the last N transformer layers
    for layer in vlm_model.layers[-config.unfreeze_layers:]:
        for param in layer.parameters():
            param.requires_grad = True
    # Unfreeze the action token embeddings
    for param in vlm_model.action_token_embedding.parameters():
        param.requires_grad = True

    # Optimize only the trainable parameters
    optimizer = torch.optim.AdamW(
        (p for p in vlm_model.parameters() if p.requires_grad),
        lr=config.learning_rate
    )

    for batch in robot_dataset:
        # Robot control loss (the paper trains with cross-entropy over
        # action tokens; MSE on decoded actions is shown here for brevity)
        predicted_actions = vlm_model(batch['image'], batch['instruction'])
        robot_loss = F.mse_loss(predicted_actions, batch['action'])
        # Optionally mix in the original VLM objectives to avoid forgetting
        if config.use_vlm_loss:
            vlm_loss = vlm_model.compute_vlm_loss(batch)
            total_loss = robot_loss + config.vlm_loss_weight * vlm_loss
        else:
            total_loss = robot_loss

        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
```
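The freeze-then-unfreeze pattern is easy to get subtly wrong, so it is worth verifying by counting trainable parameters. A toy stack of linear layers (sizes arbitrary, standing in for transformer blocks) makes the check concrete:

```python
import torch.nn as nn

# Three 8x8 linear layers stand in for a deep transformer stack
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8))

# Freeze everything, then unfreeze only the last layer ("last N blocks")
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(trainable, total)   # 72 216
```

Each `nn.Linear(8, 8)` holds 72 parameters (64 weights + 8 biases), so exactly one third of the stack remains trainable, as intended.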
Emergent Capabilities¶
RT-2 exhibits remarkable capabilities inherited from web pre-training:
1. Reasoning from Visual Observations¶
```
Instruction: "Pick up the object used for cutting paper"
RT-2: [Identifies and picks up the scissors among multiple objects]
```
2. Mathematical/Symbolic Reasoning¶
```
Instruction: "Move the banana to the sum of two plus one"
RT-2: [Places the banana next to the numeral 3]
```
3. Multi-lingual Understanding¶
```
Instruction (Spanish): "Recoge el objeto rojo" ("Pick up the red object")
RT-2: [Picks up the red object, despite robot training data being English-only]
```
4. Chain-of-Thought Planning¶
```
Instruction: "Clear the table"
RT-2: [Generates plan:
  1. Identify objects on the table
  2. Pick up the closest object
  3. Move to the bin
  4. Drop
  5. Repeat]
```
Performance Comparison¶
| Metric | RT-1 | RT-2-PaLI | RT-2-PaLM-E |
|---|---|---|---|
| Seen Tasks | 97% | 93% | 90% |
| Novel Tasks | 76% | 89% | 93% |
| Emergent Skills | Limited | Good | Excellent |
| Zero-shot Symbols | 0% | 67% | 84% |
| Model Size | 35M | 5B | 562B |
Implementation Details¶
Tokenization Strategy:
```python
import torch

class ActionTokenizer:
    """Tokenize actions for RT-2."""
    def __init__(self, action_dim=7, bins_per_dim=256):
        self.action_dim = action_dim
        self.bins_per_dim = bins_per_dim
        # Flat vocabulary: [ACTION_0_BIN_0, ACTION_0_BIN_1, ..., ACTION_6_BIN_255]
        self.vocab_size = action_dim * bins_per_dim
        # Special tokens
        self.ACTION_START = self.vocab_size
        self.ACTION_END = self.vocab_size + 1

    def tokenize(self, actions):
        """Convert continuous actions in [-1, 1] to flat token IDs."""
        # actions: (batch, action_dim)
        actions_normalized = (actions + 1) / 2  # to [0, 1]
        actions_discrete = (actions_normalized * (self.bins_per_dim - 1)).long()
        # Offset each dimension into its own slice of the vocabulary
        tokens = []
        for i in range(self.action_dim):
            tokens.append(i * self.bins_per_dim + actions_discrete[:, i])
        return torch.stack(tokens, dim=1)  # (batch, action_dim)

    def detokenize(self, tokens):
        """Convert token IDs back to continuous actions."""
        # tokens: (batch, action_dim)
        actions = []
        for i in range(self.action_dim):
            # Strip the per-dimension offset to recover the bin index
            bin_idx = tokens[:, i] % self.bins_per_dim
            # Bin index -> [0, 1] -> [-1, 1]
            actions.append(bin_idx.float() / (self.bins_per_dim - 1) * 2 - 1)
        return torch.stack(actions, dim=1)
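The offset scheme is easy to verify end to end with a functional round trip (random actions, illustrative batch size):

```python
import torch

action_dim, bins = 7, 256
actions = torch.rand(4, action_dim) * 2 - 1          # continuous actions in [-1, 1]

# Tokenize: per-dimension bin index, offset into a flat vocabulary
bin_idx = (((actions + 1) / 2) * (bins - 1)).long().clamp(0, bins - 1)
offsets = torch.arange(action_dim) * bins            # dim i owns IDs [i*bins, (i+1)*bins)
tokens = bin_idx + offsets

# Detokenize: strip the offset, map bin indices back to [-1, 1]
recon = (tokens % bins).float() / (bins - 1) * 2 - 1

print((recon - actions).abs().max().item() < 0.01)   # True: within one bin width
```

Every dimension lands in its own disjoint slice of the vocabulary, so a single flat token ID unambiguously identifies both the action dimension and its bin.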
Training Loop:
```python
import torch
import torch.nn.functional as F

def train_rt2(model, dataloader, config):
    """Training loop for RT-2 (illustrative sketch)."""
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)

    for epoch in range(config.num_epochs):
        for batch in dataloader:
            images = batch['images'].cuda()
            instructions = batch['instructions']
            actions = batch['actions'].cuda()
            # Tokenize the target actions
            action_tokens = model.tokenizer.tokenize(actions)
            # Forward pass: logits over the action-token vocabulary
            logits = model(images, instructions, target_tokens=action_tokens)
            # Cross-entropy over the action-token vocabulary
            loss = F.cross_entropy(
                logits.view(-1, model.vocab_size),
                action_tokens.view(-1)
            )
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
```
Deployment Considerations¶
Inference Speed:
- RT-1: ~3 Hz (real-time capable)
- RT-2-PaLI: ~1 Hz
- RT-2-PaLM-E: ~0.2 Hz (too slow for real-time)
Solutions for RT-2:

- Model Distillation:

```python
import torch
import torch.nn.functional as F

# Distill RT-2-PaLM-E -> RT-2-PaLI -> an RT-1-sized student
teacher_model = RT2_PaLM_E()
student_model = RT1()
temperature = 2.0
optimizer = torch.optim.AdamW(student_model.parameters(), lr=1e-4)

for batch in dataset:
    # Teacher predictions, softened by the temperature
    with torch.no_grad():
        teacher_probs = F.softmax(teacher_model(batch) / temperature, dim=-1)
    # Train the student to match the teacher's distribution
    student_log_probs = F.log_softmax(student_model(batch) / temperature, dim=-1)
    # KL divergence loss (expects log-probabilities for the student)
    distill_loss = F.kl_div(student_log_probs, teacher_probs, reduction='batchmean')

    optimizer.zero_grad()
    distill_loss.backward()
    optimizer.step()
```

- Quantization: INT8 or INT4 quantization for faster inference
- Pruning: remove redundant weights while maintaining performance
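For the quantization route, PyTorch's built-in dynamic INT8 quantization is a minimal starting point. The module below is a toy stand-in for a policy network, not the actual RT-2 weights:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block's linear layers
policy = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

# Dynamic INT8 quantization: weights stored as int8,
# activations quantized on the fly at inference time
quantized = torch.ao.quantization.quantize_dynamic(
    policy, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)   # torch.Size([1, 512])
```

Dynamic quantization needs no calibration data, which makes it a low-effort first experiment before moving to static INT8 or INT4 schemes.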
Practical Implementation¶
Complete RT-2 Style Model¶
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class RT2LikeModel(nn.Module):
    """RT-2 style model built on a pre-trained VLM (illustrative sketch)."""
    def __init__(self, vlm_name='google/paligemma-3b-pt-224', action_dim=7, bins=256):
        super().__init__()
        # Load the pre-trained VLM and its text tokenizer
        self.vlm = AutoModel.from_pretrained(vlm_name)
        self.tokenizer = AutoTokenizer.from_pretrained(vlm_name)
        # Action tokenizer (see ActionTokenizer above)
        self.action_tokenizer = ActionTokenizer(action_dim, bins)
        # Embeddings for the extended action-token vocabulary
        self.action_embeddings = nn.Embedding(
            action_dim * bins,
            self.vlm.config.hidden_size
        )
        # Action prediction head
        self.action_head = nn.Linear(
            self.vlm.config.hidden_size,
            action_dim * bins
        )

    def forward(self, images, instructions, actions=None):
        # Tokenize the instruction text
        text_tokens = self.tokenizer(
            instructions, return_tensors='pt', padding=True
        ).to(images.device)
        # VLM encoding of image + instruction
        outputs = self.vlm(
            pixel_values=images,
            input_ids=text_tokens['input_ids'],
            attention_mask=text_tokens['attention_mask']
        )
        hidden_states = outputs.last_hidden_state
        # Predict action logits from the final position
        action_logits = self.action_head(hidden_states[:, -1, :])
        # Reshape to (batch, action_dim, bins)
        action_logits = action_logits.view(
            -1, self.action_tokenizer.action_dim, self.action_tokenizer.bins_per_dim
        )
        if actions is not None:
            # Training: cross-entropy against per-dimension bin indices
            # (strip the vocabulary offset so targets index into the bins)
            action_tokens = self.action_tokenizer.tokenize(actions)
            bin_targets = action_tokens % self.action_tokenizer.bins_per_dim
            return F.cross_entropy(
                action_logits.view(-1, self.action_tokenizer.bins_per_dim),
                bin_targets.view(-1)
            )
        # Inference: greedy decode, then map bins back to continuous actions
        action_tokens = torch.argmax(action_logits, dim=-1)
        return self.action_tokenizer.detokenize(action_tokens)
```
Key Takeaways¶
RT-1¶
- ✓ Real-time capable (3 Hz)
- ✓ Efficient architecture
- ✓ Proven on real robots
- ✗ Limited generalization to novel concepts
RT-2¶
- ✓ Exceptional generalization (inherited from web data)
- ✓ Emergent reasoning capabilities
- ✓ Multi-lingual support
- ✗ Slower inference (requires optimization)
- ✗ Requires large-scale pre-trained VLMs
When to Use¶
Use RT-1 when:
- Real-time control is critical
- Compute resources are limited
- The task distribution is well-defined

Use RT-2 when:
- Generalization to novel tasks is important
- Reasoning/planning capabilities are needed
- Slower inference is tolerable, or you can invest in optimization
References¶
- Brohan et al., "RT-1: Robotics Transformer for Real-World Control at Scale", arXiv 2022
- Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", arXiv 2023
- Open X-Embodiment Collaboration, "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", arXiv 2023
Next Steps¶
- OpenVLA - Open-source VLA model
- Octo - Generalist robot policy
- Custom Architectures - Build your own VLA