OpenVLA: Open-Source Vision-Language-Action Model

OpenVLA is the first fully open-source VLA model achieving strong performance across multiple robot embodiments.

Paper: "OpenVLA: An Open-Source Vision-Language-Action Model" (Kim et al., 2024)

GitHub: https://github.com/openvla/openvla

Overview

OpenVLA addresses critical limitations of closed-source models (RT-2, PaLM-E):

  • Fully open-source: Weights, code, and training data
  • Strong performance: Competitive with RT-2
  • Multi-robot: Works across different embodiments
  • Efficient: 7B parameters (vs 562B for PaLM-E)
  • Reproducible: Complete training pipeline available

Architecture

OpenVLA builds on Prismatic VLMs with robotics-specific adaptations:

graph TD
    A[Image 224x224] --> B[DINOv2 ViT]
    C[SigLIP ViT] --> D[Vision Fusion]
    B --> D
    D --> E[Llama-2 7B<br/>Language Backbone]
    F[Instruction] --> E
    E --> G[Action Tokens<br/>Prediction]
    G --> H[7-DoF Actions]

Key Components

1. Dual Vision Encoders

OpenVLA uses two complementary vision encoders:

import torch
import torch.nn as nn
import open_clip  # provides the SigLIP encoder

class DualVisionEncoder(nn.Module):
    """Fuse DINOv2 and SigLIP features for robust vision encoding.

    Simplified sketch: `CrossAttention` stands in for a generic fusion
    module, and the feature shapes in the comments are illustrative.
    """
    def __init__(self):
        super().__init__()

        # DINOv2: Self-supervised, strong spatial features
        self.dino = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')

        # SigLIP: Language-aligned vision features
        self.siglip = open_clip.create_model('ViT-SO400M-14-SigLIP')

        # Projection to common dimension
        self.proj_dino = nn.Linear(1024, 1024)  # DINOv2 ViT-L
        self.proj_siglip = nn.Linear(1152, 1024)  # SigLIP SO400M

        # Fusion mechanism
        self.fusion = CrossAttention(dim=1024, num_heads=16)

    def forward(self, images):
        # Extract features from both encoders
        with torch.no_grad():
            dino_features = self.dino(images)  # (B, 257, 1024)
            siglip_features = self.siglip(images)  # (B, 257, 1152)

        # Project
        dino_proj = self.proj_dino(dino_features)
        siglip_proj = self.proj_siglip(siglip_features)

        # Fuse with cross-attention
        fused = self.fusion(dino_proj, siglip_proj)

        return fused

Why Two Encoders?

  • DINOv2: Excellent spatial understanding, object segmentation
  • SigLIP: Language-aligned, better for instruction following
  • Fusion: Combines strengths of both

2. Llama-2 Backbone

Uses Llama-2 7B as the language model backbone:

import torch
import torch.nn as nn
from transformers import LlamaModel, LlamaTokenizer

class OpenVLABackbone(nn.Module):
    def __init__(self):
        super().__init__()

        # Load Llama-2 7B (gated model: requires accepting the license on HF)
        self.llama = LlamaModel.from_pretrained('meta-llama/Llama-2-7b-hf')
        self.tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
        # Llama has no pad token by default; reuse EOS so batch padding works
        self.tokenizer.pad_token = self.tokenizer.eos_token

        # Vision-language adapter
        self.vision_adapter = nn.Sequential(
            nn.Linear(1024, 4096),  # Vision dim -> Llama hidden dim
            nn.LayerNorm(4096),
            nn.GELU()
        )

        # Action prediction head
        self.action_head = nn.Linear(4096, 7 * 256)  # 7 dims, 256 bins each

    def forward(self, vision_features, instruction):
        # Adapt vision features to Llama dimension
        vision_tokens = self.vision_adapter(vision_features)

        # Tokenize instruction
        text_tokens = self.tokenizer(
            instruction,
            return_tensors='pt',
            padding=True
        ).input_ids

        # Get text embeddings
        text_embeds = self.llama.get_input_embeddings()(text_tokens)

        # Concatenate vision and text tokens
        combined = torch.cat([vision_tokens, text_embeds], dim=1)

        # Forward through Llama
        outputs = self.llama(inputs_embeds=combined)

        # Predict actions from last token
        action_logits = self.action_head(outputs.last_hidden_state[:, -1, :])

        return action_logits.view(-1, 7, 256)

Action Representation

OpenVLA uses discrete action tokens like RT-2:

import torch

class ActionTokenizer:
    """Discretize continuous actions into tokens (and back)"""
    def __init__(self, action_dim=7, bins=256, action_ranges=None):
        self.action_dim = action_dim
        self.bins = bins

        # Default ranges: [-1, 1] for pose, [0, 1] for gripper
        if action_ranges is None:
            self.action_ranges = [(-1, 1)] * 6 + [(0, 1)]
        else:
            self.action_ranges = action_ranges

    def encode(self, actions):
        """Continuous actions -> discrete tokens"""
        tokens = []

        for i in range(self.action_dim):
            # Normalize to [0, 1]
            low, high = self.action_ranges[i]
            normalized = (actions[:, i] - low) / (high - low)

            # Discretize
            discrete = (normalized * (self.bins - 1)).long()
            discrete = torch.clamp(discrete, 0, self.bins - 1)

            tokens.append(discrete)

        return torch.stack(tokens, dim=1)

    def decode(self, tokens):
        """Discrete tokens -> continuous actions"""
        actions = []

        for i in range(self.action_dim):
            # To [0, 1]
            normalized = tokens[:, i].float() / (self.bins - 1)

            # To original range
            low, high = self.action_ranges[i]
            continuous = normalized * (high - low) + low

            actions.append(continuous)

        return torch.stack(actions, dim=1)
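A quick round-trip sanity check of the scheme above, reusing the same normalize-discretize-decode math outside the class (names here are illustrative). The reconstruction error is bounded by one bin width per dimension:

```python
import torch

bins = 256
ranges = [(-1.0, 1.0)] * 6 + [(0.0, 1.0)]  # pose dims + gripper

def encode(actions):
    toks = []
    for i, (low, high) in enumerate(ranges):
        norm = (actions[:, i] - low) / (high - low)          # to [0, 1]
        toks.append(torch.clamp((norm * (bins - 1)).long(), 0, bins - 1))
    return torch.stack(toks, dim=1)

def decode(tokens):
    acts = []
    for i, (low, high) in enumerate(ranges):
        acts.append(tokens[:, i].float() / (bins - 1) * (high - low) + low)
    return torch.stack(acts, dim=1)

a = torch.tensor([[0.3, -0.5, 0.0, 0.9, -0.9, 0.1, 1.0]])
err = (decode(encode(a)) - a).abs().max().item()
assert err < 2.0 / (bins - 1)  # quantization error < one bin width of [-1, 1]
```

With 256 bins over a [-1, 1] range, the worst-case quantization error is about 0.008, which is well below typical controller noise.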

Training Pipeline

Dataset: Open-X Embodiment

OpenVLA is trained on the Open-X Embodiment dataset:

  • Size: 1M+ trajectories across 22 robot embodiments
  • Tasks: 150+ task families
  • Diversity: Lab environments, kitchens, warehouses, outdoor

from openvla.data import OpenXEmbodimentDataset

# Load dataset
dataset = OpenXEmbodimentDataset(
    data_dir='/path/to/openx',
    robots=['franka', 'google_robot', 'taco_play'],  # Select embodiments
    tasks=['pick', 'place', 'drawer', 'wipe'],  # Select task families
    image_size=224
)

# Dataset returns
sample = dataset[0]
# {
#   'image': (3, 224, 224),
#   'instruction': "pick up the blue block",
#   'action': (7,),
#   'robot_id': 'franka'
# }

Training Configuration

model:
  vision_encoder:
    dino: 'dinov2_vitl14'
    siglip: 'ViT-SO400M-14-SigLIP'
  language_backbone: 'meta-llama/Llama-2-7b-hf'
  action_dim: 7
  action_bins: 256

training:
  batch_size: 256
  gradient_accumulation: 4
  learning_rate: 1e-4
  warmup_steps: 2000
  max_steps: 100000
  weight_decay: 0.01

  # LoRA for efficient fine-tuning
  use_lora: true
  lora_r: 16
  lora_alpha: 32
  lora_dropout: 0.05

optimization:
  optimizer: 'adamw'
  beta1: 0.9
  beta2: 0.95
  epsilon: 1e-8
  gradient_clip: 1.0

  # Mixed precision
  fp16: false
  bf16: true  # Better for large models

data:
  num_workers: 16
  prefetch_factor: 4
  pin_memory: true

  # Data augmentation
  augmentation:
    random_crop: true
    color_jitter: true
    random_flip: false  # Don't flip for robotics!
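A minimal sketch of consuming such a config with PyYAML (assumed installed; the snippet below abbreviates the training section). One gotcha worth knowing: PyYAML parses a bare `1e-4` as a string, not a float, so either write `1.0e-4` or cast numeric fields defensively:

```python
import yaml

# Abbreviated copy of the training section above (illustrative)
config_text = """
training:
  batch_size: 256
  gradient_accumulation: 4
  learning_rate: 1e-4
"""

config = yaml.safe_load(config_text)
train = config['training']

# Effective batch size per optimizer step
effective_batch = train['batch_size'] * train['gradient_accumulation']

# PyYAML reads bare `1e-4` as a string (its float resolver requires a dot)
lr = float(train['learning_rate'])
```

With the values above, each optimizer step sees 256 × 4 = 1024 samples.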

Training Script

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from transformers import get_cosine_schedule_with_warmup
from openvla import OpenVLAModel
from openvla.data import OpenXEmbodimentDataset
from peft import LoraConfig, get_peft_model

def train_openvla(config):
    # Initialize model
    model = OpenVLAModel(config)

    # Apply LoRA for efficient training
    if config.use_lora:
        lora_config = LoraConfig(
            r=config.lora_r,
            lora_alpha=config.lora_alpha,
            target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
            lora_dropout=config.lora_dropout,
            bias="none"
        )
        model = get_peft_model(model, lora_config)

    # Multi-GPU
    model = torch.nn.DataParallel(model)
    model = model.cuda()

    # Optimizer
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config.learning_rate,
        betas=(config.beta1, config.beta2),
        eps=config.epsilon,
        weight_decay=config.weight_decay
    )

    # Learning rate scheduler with warmup
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=config.warmup_steps,
        num_training_steps=config.max_steps
    )

    # Load dataset
    dataset = OpenXEmbodimentDataset(config.data_dir)
    dataloader = DataLoader(
        dataset,
        batch_size=config.batch_size,
        num_workers=config.num_workers,
        shuffle=True,
        pin_memory=True
    )

    # Training loop (with gradient accumulation, matching the config)
    model.train()
    global_step = 0

    for epoch in range(config.num_epochs):
        for step, batch in enumerate(dataloader):
            # Forward
            logits = model(
                images=batch['image'].cuda(),
                instructions=batch['instruction']
            )

            # Loss: cross-entropy over discretized action tokens
            targets = model.action_tokenizer.encode(batch['action'].cuda())
            loss = F.cross_entropy(
                logits.view(-1, config.action_bins),
                targets.view(-1)
            )

            # Backward (scale so accumulated gradients average correctly)
            (loss / config.gradient_accumulation).backward()

            # Only step the optimizer every `gradient_accumulation` batches
            if (step + 1) % config.gradient_accumulation != 0:
                continue

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(
                model.parameters(),
                config.gradient_clip
            )

            # Update
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

            global_step += 1

            # Logging
            if global_step % 100 == 0:
                print(f"Step {global_step}: Loss = {loss.item():.4f}, LR = {scheduler.get_last_lr()[0]:.2e}")

            # Checkpointing
            if global_step % 5000 == 0:
                torch.save({
                    'step': global_step,
                    'model_state_dict': model.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                    'scheduler_state_dict': scheduler.state_dict(),
                }, f'checkpoint_step{global_step}.pt')

Fine-tuning for Your Robot

OpenVLA excels at few-shot adaptation to new robots:

from openvla import OpenVLAModel

# Load pre-trained OpenVLA
model = OpenVLAModel.from_pretrained('openvla/openvla-7b')

# Freeze base model, train adapter only
for name, param in model.named_parameters():
    if 'adapter' not in name and 'action_head' not in name:
        param.requires_grad = False

# Fine-tune on your robot data (as few as 50-100 demos).
# `YourRobotDataset` and `finetune` are placeholders for your own
# dataset class and training loop.
your_dataset = YourRobotDataset(num_demos=100)
finetune(model, your_dataset, num_epochs=10)

Few-Shot Performance

Demonstrations   Success Rate
0 (zero-shot)    45%
10               62%
50               78%
100              85%
500              92%

Deployment

Inference Optimization

import torch
from openvla import OpenVLAModel

# Load model
model = OpenVLAModel.from_pretrained('openvla/openvla-7b')
model.eval()
model.half()  # FP16 for speed
model.cuda()

# Compile for faster inference (PyTorch 2.0+)
model = torch.compile(model, mode='reduce-overhead')

# Inference loop
@torch.no_grad()
def predict_action(model, image, instruction):
    # Preprocess
    image_tensor = preprocess(image).cuda()

    # Forward pass
    action_logits = model(image_tensor, instruction)

    # Decode action
    action_tokens = torch.argmax(action_logits, dim=-1)
    action = model.action_tokenizer.decode(action_tokens)

    return action.cpu().numpy()
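Since a single forward pass dominates the control period, deployment typically wraps `predict_action` in a fixed-rate control loop. A minimal sketch with stubbed hooks (`policy`, `get_image`, and `apply_action` are placeholders for your model and hardware interface):

```python
import time

def control_loop(policy, get_image, apply_action, instruction, hz=10, max_steps=100):
    """Run a policy at a fixed control rate."""
    period = 1.0 / hz
    for _ in range(max_steps):
        start = time.monotonic()
        action = policy(get_image(), instruction)
        apply_action(action)
        # Sleep off the remainder of the control period
        time.sleep(max(0.0, period - (time.monotonic() - start)))

# Exercise the loop timing with trivial stubs
applied = []
t0 = time.monotonic()
control_loop(lambda img, instr: [0.0] * 7, lambda: None, applied.append,
             "pick up the block", hz=50, max_steps=5)
elapsed = time.monotonic() - t0  # ~0.1 s for 5 steps at 50 Hz
```

If inference takes longer than the period, the loop simply runs as fast as the model allows, which is why the latency optimizations in the next table matter.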

Real-Time Performance

Optimization        Latency   Throughput
Base (FP32)         200 ms    5 Hz
FP16                120 ms    8 Hz
FP16 + compile      80 ms     12 Hz
INT8 quantization   50 ms     20 Hz
TensorRT            30 ms     33 Hz

Comparison with Other VLA Models

Model     Parameters   Open Source   Multi-Robot   Performance
RT-1      35M          ✓             ✗             Good
RT-2      5B-562B      ✗             ✗             Excellent
OpenVLA   7B           ✓             ✓             Very Good
Octo      93M          ✓             ✓             Good

Advanced: Multi-Embodiment Training

OpenVLA's key innovation is effective multi-robot training:

import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbodimentAdapter(nn.Module):
    """Adapt OpenVLA to a specific robot embodiment"""
    def __init__(self, num_embodiments=22, hidden_dim=4096, adapter_dim=256):
        super().__init__()

        # Learnable embodiment embeddings
        self.embodiment_embeddings = nn.Embedding(num_embodiments, adapter_dim)

        # Adapter layers
        self.adapter_down = nn.Linear(hidden_dim, adapter_dim)
        self.adapter_up = nn.Linear(adapter_dim, hidden_dim)

        # Initialize near-identity
        nn.init.zeros_(self.adapter_up.weight)
        nn.init.zeros_(self.adapter_up.bias)

    def forward(self, hidden_states, embodiment_id):
        # Get embodiment embedding
        emb = self.embodiment_embeddings(embodiment_id)

        # Adapter forward
        down = F.relu(self.adapter_down(hidden_states))
        down = down + emb.unsqueeze(1)  # Add embodiment info
        up = self.adapter_up(down)

        # Residual connection
        return hidden_states + up
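The near-identity initialization above is what makes it safe to bolt the adapter onto a pre-trained backbone: with the up-projection zeroed, the residual branch contributes nothing, so the module is an exact identity at step 0. A self-contained check with toy dimensions (this class is a minimal copy, not the real implementation):

```python
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    """Miniature embodiment adapter with the same zero-init scheme."""
    def __init__(self, hidden_dim=32, adapter_dim=8, num_embodiments=4):
        super().__init__()
        self.emb = nn.Embedding(num_embodiments, adapter_dim)
        self.down = nn.Linear(hidden_dim, adapter_dim)
        self.up = nn.Linear(adapter_dim, hidden_dim)
        # Zero the up-projection: residual branch outputs exactly 0 at init
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h, embodiment_id):
        d = torch.relu(self.down(h)) + self.emb(embodiment_id).unsqueeze(1)
        return h + self.up(d)

adapter = TinyAdapter()
h = torch.randn(2, 5, 32)
out = adapter(h, torch.tensor([0, 1]))
assert torch.equal(out, h)  # untrained adapter is a no-op
```

Training then gradually moves the up-projection away from zero, letting each embodiment's embedding steer the hidden states without ever having destabilized the pre-trained policy.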

Key Takeaways

Advantages:

  • ✓ Open-source and reproducible
  • ✓ Strong cross-embodiment generalization
  • ✓ Efficient (7B parameters vs 562B for PaLM-E)
  • ✓ Easy to fine-tune for new robots

Limitations:

  • ✗ Slower than RT-1 (requires inference optimization)
  • ✗ Slightly lower performance than RT-2-PaLM-E
  • ✗ Requires significant compute for training

Best For:

  • Research and development
  • Custom robot platforms
  • Projects where open-source is required
  • Multi-robot deployments

Resources

  • Code: https://github.com/openvla/openvla
  • Models: https://huggingface.co/openvla
  • Paper: Kim et al., "OpenVLA: An Open-Source Vision-Language-Action Model", 2024
  • Dataset: Open-X Embodiment

Next Steps