Dexora in Depth: A Complete Guide to the First 36-DoF Bimanual Dexterous Manipulation VLA Model, from the ICRA 2026 Open-Source Release to Production Robot Deployment (2026)

2026-06-02 13:53:45 +0800 CST

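Abstract: In June 2026, at ICRA in Vienna, Dexora was released as the first open VLA (Vision-Language-Action) model with native support for 36-DoF bimanual dexterous manipulation. It is not another paper-only project: it ships runnable model weights, a complete teleoperation data-collection pipeline, and a deployment guide aimed at production environments. This article starts from the evolution of VLA architectures, breaks down Dexora's four-layer design, walks through training and inference code from scratch, and discusses how it pushes past the industry ceiling of "VLA can only do simple grasping".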

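1. Background: Three Years of VLA Evolution and What Dexora Signals for the Industry

1.1 How VLA got to where it is today

In 2023, Google DeepMind's RT-2 connected the three spaces of vision, language, and action end to end for the first time. The field saw that internet-scale vision-language pretraining could be used to output robot control commands directly.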

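After RT-2, however, the field got stuck in an awkward place:

DoF bottleneck: most VLA models support only 6-7 DoF single-arm manipulation (roughly a human arm with a shoulder, elbow, and wrist but no fingers). They can do pick-and-place, but not "unscrew a bottle cap" or "thread a needle".
Missing bimanual coordination: real-world dexterous tasks almost always need two hands working together (one holds, the other manipulates), yet existing VLA models barely model the two arms jointly.
Data scarcity: collecting high-DoF dexterous manipulation data is very expensive. It takes costly hardware and trained teleoperation operators.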

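By the end of 2025, when the ICRA 2026 review results came out, VLA-related papers accounted for more than 15% of the total, but the reviewers' common complaint was: "The demo videos look great, but the code is not open, the data is not public, and nothing can be reproduced."

This is the industry backdrop against which Dexora appeared.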

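1.2 What Dexora is and why it matters

Dexora was open-sourced by the 深蓝具身智能 (Shenlan Embodied Intelligence) team at the ICRA 2026 Workshop on Physical Intelligence. Its core advances fall along four dimensions: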

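Dimension	Previous SOTA	Dexora
Degrees of freedom	6-7 DoF, single arm	36 DoF, bimanual (2×6 arm + 2×12 dexterous hand)
Bimanual coordination	Modeled independently, no coordination mechanism	Joint inference over both arms, supports two-handed tasks
Openness	Partial weights, no data	Weights + teleoperation tools + training pipeline, all open
Real-world deployment	Research settings only	Isaac Sim deployment setup + real-hardware adaptation guide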

In short: Dexora is the first high-DoF VLA model you can actually run on your own robot, rather than a concept that only runs inside a paper.

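2. Core Concepts: VLA Architecture and Dexora's Technical Choices

2.1 The three-layer VLA architecture, revisited

To understand Dexora's technical contribution, it helps to first lay out the standard three-layer VLA architecture (encoders, fusion, action decoder):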

[Vision encoder]  [Language encoder]    [Action decoder]
    ↓                  ↓                      ↓
  ViT /           LLM embedding         Diffusion /
CLIP ViT               ↓                Transformer
    ↓          [Multimodal fusion]           ↓
    └────────→    Fusion tokens    →   Robot action sequence

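Vision encoder: turns camera images into a token sequence, usually with a ViT or CLIP ViT.
Language encoder: encodes the natural-language instruction into semantic vectors, usually with the embedding layers of a pretrained LLM.
Multimodal fusion layer: the core of a VLA. It lets visual tokens and language tokens interact in a shared latent space, typically through cross-attention or token concatenation.
Action decoder: decodes the fused latent into a sequence of robot joint angles. The mainstream choices are a diffusion policy or an autoregressive transformer.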
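
The cross-attention variant of the fusion layer can be sketched in a few lines of NumPy. This is a minimal single-head version with toy sizes (5 visual tokens, 3 language tokens, width 8) and random projection matrices; all of these values are assumptions chosen for illustration and none of them come from Dexora.

```python
import numpy as np

def cross_attention(queries, keys_values, w_q, w_k, w_v):
    """Single-head cross-attention: each query token mixes the value tokens."""
    q = queries @ w_q                                  # [N_q, d]
    k = keys_values @ w_k                              # [N_kv, d]
    v = keys_values @ w_v                              # [N_kv, d]
    scores = q @ k.T / np.sqrt(q.shape[-1])            # [N_q, N_kv]
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over language tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8
visual_tokens = rng.normal(size=(5, d))      # toy stand-in for image patch tokens
language_tokens = rng.normal(size=(3, d))    # toy stand-in for instruction tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

fused, attn = cross_attention(visual_tokens, language_tokens, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Each visual token comes out as a weighted mix of the language tokens, so the output keeps the visual token count while carrying instruction content.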

2.2 Dexora's architectural innovation: why 36 DoF is not just a matter of adding parameters

Dexora's core architecture (simplified):

Input:
  RGB images (2 wrist cameras + 1 external third-person camera)
      ↓ ViT-H/14 (frozen DINOv2 weights) + adapter
  Visual Tokens [N_vis × 768]
  
  Language instruction ("unscrew the bottle and pour it into the cup")
      ↓ Llama-3-8B (language encoder, LoRA fine-tuned)
  Language Tokens [N_txt × 4096]

Multimodal fusion (Dexora-Fusion):
  Visual Tokens ──┐
                   ├──→ Cross-Attention (8 heads) → Fusion Tokens [256 × 768]
  Language Tokens ┘                                        ↓
                                                      Time-step embedding
                                                           ↓
Action decoding (Dexora-Policy):
  Fusion Tokens + Time Embedding
      ↓ Diffusion Transformer (DiT)
  36-DoF action sequence: [T × 36]
  ├── Left arm 6 DoF   (joint angles)
  ├── Right arm 6 DoF
  ├── Left hand 12 DoF (finger joint angles)
  └── Right hand 12 DoF
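
To make the [T × 36] layout concrete, here is a small sketch that splits an action chunk into its four limb groups. The index order (left arm, right arm, left hand, right hand) is an assumption taken from the order in the diagram; `ACTION_LAYOUT` and `split_action` are hypothetical names, not part of the Dexora code base.

```python
import numpy as np

# Assumed index ranges, following the order listed in the diagram
ACTION_LAYOUT = {
    "left_arm": slice(0, 6),
    "right_arm": slice(6, 12),
    "left_hand": slice(12, 24),
    "right_hand": slice(24, 36),
}

def split_action(actions: np.ndarray) -> dict:
    """Split a [T, 36] action sequence into per-limb arrays."""
    if actions.shape[-1] != 36:
        raise ValueError(f"expected 36 DoF, got {actions.shape[-1]}")
    return {name: actions[..., idx] for name, idx in ACTION_LAYOUT.items()}

chunk = np.zeros((16, 36))           # one predicted chunk: 16 steps x 36 DoF
parts = split_action(chunk)
print({name: part.shape for name, part in parts.items()})
```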

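Key design decisions:

Why DINOv2 rather than CLIP ViT? CLIP's visual representation is optimized for image-text matching and is weaker at capturing spatial geometry. The dexterous tasks Dexora targets (such as "insert the key into the lock") demand very fine spatial localization, and DINOv2's self-supervised objective (self-distillation combined with masked image modeling, MIM) yields stronger patch-level features for that.
Why diffusion rather than autoregressive? Robot actions follow a multimodal distribution in a continuous space (the same scene admits several reasonable actions). A diffusion policy models multimodal distributions naturally, whereas a decoder trained with a plain regression loss tends to average the modes into a single prediction, and an autoregressive decoder has to discretize the actions to avoid that.
How is bimanual coordination modeled? Dexora predicts a single 36-dimensional action vector (rather than two independent 18-dimensional ones), and in the last layer the fusion tokens attend to the visual receptive fields of both arms at once. This lets the model learn coordinated strategies such as "left hand holds the bottle, right hand unscrews the cap".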
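
The multimodality argument can be illustrated with a toy example. The sketch below assumes a one-dimensional action with two demonstrated modes (pass an obstacle on the left at -1 or on the right at +1); the data is synthetic and only meant to show why a mean-squared-error fit lands between the modes while a sampler does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy demonstrations for one scene: half go left (-1), half go right (+1)
modes = rng.choice([-1.0, 1.0], size=1000)
demos = modes + 0.05 * rng.normal(size=1000)

# The single value that minimizes mean squared error is the sample mean
mse_prediction = demos.mean()

# A sampler that draws from the demonstrated distribution keeps both modes
sampled_action = rng.choice(demos)

print(f"MSE-optimal action: {mse_prediction:+.2f}")   # near 0, a path nobody demonstrated
print(f"sampled action:     {sampled_action:+.2f}")   # near -1 or +1
```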

2.3 Dexora's training data: a hybrid teleoperation collection system

Dexora's training data comes from three sources:

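Data source	Scale	Collection method	Purpose
AIRBOT teleoperation data	~8,000 trajectories	Human-in-the-loop hybrid teleoperation (see below)	Main training set
Open X-Embodiment	~1M trajectories	Public dataset, used for pretraining	Vision-language pretraining
Synthetic data (Isaac Sim)	~50,000 trajectories	Domain randomization	Domain-adaptation augmentation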

The hybrid teleoperation system is the part of Dexora's engineering most worth a closer look:

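[Human operator]
      ↓ controller + data glove
[Real-time motion mapping] ←──── [Force-feedback device]
      ↓
[Robot execution]  ←──── [State logging (100 Hz)]
      ↓
[Post-processing] → remove abnormal frames → time alignment → store in LeRobot format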

Core code snippet (teleoperation data collection):

# dexora/data/acquisition.py (simplified; shows the core logic)
import time

import numpy as np
import airbot_py  # AIRBOT SDK


class DexoraTeleopCollector:
    def __init__(self, robot_ip="192.168.1.100"):
        self.robot = airbot_py.AIRBOT(wait_on_connect=True)
        self.robot.connect(robot_ip)

        # 6 DoF per arm + 12 DoF per hand, two of each = 36 DoF
        self.dof_names = [
            # Left arm
            "left_arm_shoulder_pan", "left_arm_shoulder_lift",
            "left_arm_elbow_flex", "left_arm_wrist_pitch",
            "left_arm_wrist_roll", "left_arm_gripper",
            # Right arm (same 6 joints)
            "right_arm_shoulder_pan", "right_arm_shoulder_lift",
            "right_arm_elbow_flex", "right_arm_wrist_pitch",
            "right_arm_wrist_roll", "right_arm_gripper",
            # Left hand (12 joints)
            "left_hand_thumb_1", "left_hand_thumb_2",
            "left_hand_index_1", "left_hand_index_2",
            "left_hand_middle_1", "left_hand_middle_2",
            "left_hand_ring_1", "left_hand_ring_2",
            "left_hand_pinky_1", "left_hand_pinky_2",
            "left_hand_wrist_side", "left_hand_wrist_tilt",
            # Right hand (same 12 joints)
            "right_hand_thumb_1", "right_hand_thumb_2",
            "right_hand_index_1", "right_hand_index_2",
            "right_hand_middle_1", "right_hand_middle_2",
            "right_hand_ring_1", "right_hand_ring_2",
            "right_hand_pinky_1", "right_hand_pinky_2",
            "right_hand_wrist_side", "right_hand_wrist_tilt",
        ]
        self.trajectory = []
        self.task_description = ""
        self.recording = False

    def start_recording(self, task_description: str):
        """Start recording a teleoperation trajectory."""
        self.trajectory = []
        self.task_description = task_description
        self.recording = True
        print(f"[Teleop] recording started: {task_description}")

    def step(self):
        """Read the robot state for one frame; the caller invokes this at 100 Hz."""
        if not self.recording:
            return
        state = self.robot.get_joint_positions(self.dof_names)
        # Read the wrist camera images for the same frame
        left_wrist_img = self.robot.get_camera_image("left_wrist")
        right_wrist_img = self.robot.get_camera_image("right_wrist")

        frame = {
            "timestamp": time.time(),
            "joint_positions": np.array(state),  # [36]
            "left_wrist_image": left_wrist_img,   # [480, 640, 3]
            "right_wrist_image": right_wrist_img,
            "task_description": self.task_description,
        }
        self.trajectory.append(frame)

    def stop_recording(self) -> dict:
        """Stop recording and return the trajectory, ready for conversion to LeRobot format."""
        self.recording = False
        return {
            "task": self.task_description,
            "trajectory": self.trajectory,
            "dof_names": self.dof_names,
            "fps": 100,
            "total_frames": len(self.trajectory),
        }
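
The post-processing stage from the pipeline diagram (remove abnormal frames, then align timestamps) is not shown in the collector. The sketch below is one possible way to do it with NumPy, not Dexora's implementation: `postprocess`, the `max_jump` threshold, and the synthetic trajectory are all hypothetical, and it assumes the first frame is valid.

```python
import numpy as np

def postprocess(timestamps, joints, max_jump=0.5, fps=100):
    """Drop frames with implausible joint jumps, then resample onto a uniform grid.

    timestamps: [N] seconds, increasing; joints: [N, D] joint positions in radians.
    max_jump is a hypothetical per-frame threshold in radians.
    """
    timestamps = np.asarray(timestamps, dtype=float)
    joints = np.asarray(joints, dtype=float)

    # 1. Outlier removal: compare each frame with the last frame that was kept
    keep = [0]
    for i in range(1, len(joints)):
        if np.max(np.abs(joints[i] - joints[keep[-1]])) <= max_jump:
            keep.append(i)
    timestamps, joints = timestamps[keep], joints[keep]

    # 2. Time alignment: linear interpolation onto a fixed-rate grid
    grid = np.arange(timestamps[0], timestamps[-1] + 1e-9, 1.0 / fps)
    aligned = np.stack(
        [np.interp(grid, timestamps, joints[:, d]) for d in range(joints.shape[1])],
        axis=1,
    )
    return grid, aligned

# Synthetic trajectory: 36 joints ramping slowly, jittered timestamps, one corrupted frame
rng = np.random.default_rng(0)
t = np.cumsum(np.full(200, 0.01) + rng.uniform(-0.002, 0.002, size=200))
q = np.linspace(0.0, 1.0, 200)[:, None] * np.ones((1, 36))
q[50] += 3.0                                    # simulated sensor glitch
grid, aligned = postprocess(t, q)
print(aligned.shape, float(np.abs(aligned).max()))
```

The glitch at frame 50 is dropped, and the remaining frames are interpolated onto an evenly spaced 100 Hz grid, which is the form a fixed-fps dataset expects.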

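3. Architecture Deep Dive: Dexora's Four-Layer Design

3.1 Layer 1: Visual perception (Perception Backbone)

Dexora's vision encoder combines a frozen ViT-Huge/14 backbone carrying DINOv2 weights with a learnable adapter: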

# dexora/model/perception.py
import torch
import torch.nn as nn
from transformers import AutoModel  # loads the DINOv2 backbone

class DexoraPerception(nn.Module):
    def __init__(self, backbone_name="facebook/dinov2-vit-huge", freeze_backbone=True):
        super().__init__()
        # Checkpoint name as given in this article (ViT-H/14, 634M parameters, token width 1280).
        # The public DINOv2 checkpoints on the Hugging Face Hub are facebook/dinov2-small,
        # -base, -large and -giant; substitute one of those if this name does not resolve.
        self.backbone = AutoModel.from_pretrained(backbone_name)
        backbone_dim = self.backbone.config.hidden_size  # 1280 for ViT-H
        
        if freeze_backbone:
            for param in self.backbone.parameters():
                param.requires_grad = False  # frozen: no gradients, saves GPU memory
        
        # Learnable adapter: maps the backbone width to the internal model width of 768
        self.adapter = nn.Sequential(
            nn.Linear(backbone_dim, 768),
            nn.LayerNorm(768),
            nn.GELU(),
            nn.Linear(768, 768),
        )
        
        # One learnable embedding per camera view
        self.view_embeddings = nn.Parameter(torch.randn(3, 768))

    def forward(self, images: dict) -> torch.Tensor:
        """
        images: {
            "left_wrist": [B, 3, 224, 224],
            "right_wrist": [B, 3, 224, 224],
            "external": [B, 3, 224, 224],
        }
        Returns: visual_tokens [B, N_tokens, 768]
        """
        all_tokens = []
        for i, key in enumerate(["left_wrist", "right_wrist", "external"]):
            img = images[key]
            # DINOv2 output: last_hidden_state [B, 257, backbone_dim] (256 patches + CLS)
            feat = self.backbone(img).last_hidden_state
            feat = self.adapter(feat)  # [B, 257, 768]
            # Add the view embedding so the model can tell the three cameras apart
            feat = feat + self.view_embeddings[i]
            all_tokens.append(feat)
        
        # Each view is encoded independently, then concatenated: [B, 3×257, 768] = [B, 771, 768]
        visual_tokens = torch.cat(all_tokens, dim=1)
        return visual_tokens

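Design note: why freeze the backbone and train only an adapter? Fine-tuning a 634M-parameter ViT on 8,000 teleoperation trajectories is likely to overfit badly. With the adapter, only about 1.6M parameters are trained to carry DINOv2's representation over to the robot domain.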
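
The token count and the adapter size follow directly from the layer shapes in the listing above, and a few lines of arithmetic confirm them. The sketch assumes 224×224 inputs, patch size 14, a backbone width of 1280, and bias terms on both linear layers, as in the listing.

```python
def linear_params(d_in: int, d_out: int) -> int:
    return d_in * d_out + d_out           # weight + bias

def layernorm_params(d: int) -> int:
    return 2 * d                           # scale + shift

# Tokens per view: (224 / 14)^2 patches + 1 CLS token
patches_per_side = 224 // 14
tokens_per_view = patches_per_side ** 2 + 1
visual_tokens = 3 * tokens_per_view

# Trainable parameters: adapter MLP + 3 view embeddings
adapter = linear_params(1280, 768) + layernorm_params(768) + linear_params(768, 768)
view_embeddings = 3 * 768
trainable = adapter + view_embeddings

print(tokens_per_view, visual_tokens)     # 257 771
print(trainable)                           # 1578240
```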

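3.2 Layer 2: Language understanding (Language Encoder)

Dexora uses Llama-3-8B as its language encoder but runs only the embedding layer and the first 4 transformer layers. The base weights stay frozen; the LoRA adapters are omitted from this simplified listing: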

# dexora/model/language.py
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DexoraLanguageEncoder(nn.Module):
    def __init__(self, model_name="meta-llama/Meta-Llama-3-8B-Instruct", num_layers=4):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # The Llama 3 tokenizer ships without a pad token; reuse EOS so padding=True works
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.llm = AutoModel.from_pretrained(model_name)
        
        # Keep only the first `num_layers` transformer layers and freeze all base weights
        self.llm.layers = nn.ModuleList(self.llm.layers[:num_layers])
        self.llm.config.num_hidden_layers = num_layers
        for param in self.llm.parameters():
            param.requires_grad = False

    def forward(self, text_instructions: list[str]) -> torch.Tensor:
        """
        text_instructions: List of strings, e.g.
            ["open the bottle and pour water into the cup", ...]
        Returns: lang_tokens [B, N_txt, 4096]
        """
        inputs = self.tokenizer(
            text_instructions,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=77,  # CLIP-style length limit
        ).to(self.llm.device)
        
        # Embeddings + the kept layers; the model's final RMSNorm is applied to the output
        outputs = self.llm(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            use_cache=False,
        )
        return outputs.last_hidden_state  # [B, seq_len, 4096]

3.3 Layer 3: Multimodal fusion (Dexora-Fusion)

This is Dexora's central innovation, the fusion mechanism behind bimanual coordination:

# dexora/model/fusion.py
import torch
import torch.nn as nn

class DexoraFusion(nn.Module):
    def __init__(self, vis_dim=768, lang_dim=4096, hidden_dim=768, num_heads=8):
        super().__init__()
        # Project language tokens into the visual token width
        self.lang_proj = nn.Linear(lang_dim, hidden_dim)
        
        # Cross-attention: visual tokens are the query, language tokens the key/value
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=hidden_dim,
            num_heads=num_heads,
            batch_first=True,
        )
        
        # Self-attention: lets the visual tokens exchange information with each other
        self.self_attn = nn.TransformerEncoderLayer(
            d_model=hidden_dim,
            nhead=num_heads,
            dim_feedforward=hidden_dim * 4,
            batch_first=True,
        )
        
        # Learnable output tokens: 1 task token + 36 joint tokens (one per DoF)
        self.task_token = nn.Parameter(torch.randn(1, 1, hidden_dim))
        self.joint_tokens = nn.Parameter(torch.randn(1, 36, hidden_dim))

    def forward(self, visual_tokens, lang_tokens):
        """
        visual_tokens: [B, 771, 768]
        lang_tokens:   [B, N_txt, 4096]
        """
        B = visual_tokens.shape[0]
        
        # Step 1: project the language tokens
        lang_proj = self.lang_proj(lang_tokens)  # [B, N_txt, 768]
        
        # Step 2: cross-attention (visual perception ← language semantics)
        fused = self.cross_attn(
            query=visual_tokens,
            key=lang_proj,
            value=lang_proj,
        )[0]  # [B, 771, 768]
        
        # Step 3: self-attention (interaction among the visual tokens)
        fused = self.self_attn(fused)  # [B, 771, 768]
        
        # Step 4: append the learnable task token and joint tokens
        task_tok = self.task_token.expand(B, -1, -1)   # [B, 1, 768]
        joint_tok = self.joint_tokens.expand(B, -1, -1) # [B, 36, 768]
        
        fusion_output = torch.cat([fused, task_tok, joint_tok], dim=1)
        # [B, 771+1+36, 768] = [B, 808, 768]
        
        return fusion_output

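3.4 Layer 4: Action decoding (Dexora-Policy / Diffusion Transformer)

# dexora/model/policy.py
# A DiT-style denoiser sketched with plain PyTorch layers
# (assumption: the block design of the released model may differ)
import math

import torch
import torch.nn as nn


def timestep_embedding(timesteps: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of diffusion timesteps: [B] -> [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=timesteps.device) / half)
    args = timesteps.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


class DexoraPolicy(nn.Module):
    def __init__(self, fusion_dim=768, action_dim=36, horizon=16, num_layers=6, num_heads=8):
        """
        horizon: predict the next 16 time steps (about 160 ms at 100 Hz)
        action_dim: 36 DoF
        """
        super().__init__()
        self.horizon = horizon
        self.action_dim = action_dim

        # One token per time step of the noisy action sequence
        self.action_in = nn.Linear(action_dim, fusion_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, horizon, fusion_dim))

        # Timestep embedding
        self.time_embed = nn.Sequential(
            nn.Linear(256, fusion_dim),
            nn.SiLU(),
            nn.Linear(fusion_dim, fusion_dim),
        )

        # Denoising transformer: action tokens cross-attend to the fusion tokens
        layer = nn.TransformerDecoderLayer(
            d_model=fusion_dim,
            nhead=num_heads,
            dim_feedforward=fusion_dim * 4,
            batch_first=True,
        )
        self.denoiser = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.action_out = nn.Linear(fusion_dim, action_dim)

    def forward(self, noise_actions, timesteps, fusion_output):
        """
        noise_actions: [B, horizon, action_dim]  noisy actions (diffusion input)
        timesteps:     [B]                       diffusion timesteps
        fusion_output: [B, 808, 768]             fusion output (the condition)
        """
        # Encode the timestep and add it to every action token
        t_emb = self.time_embed(timestep_embedding(timesteps, 256))  # [B, 768]
        x = self.action_in(noise_actions) + self.pos_embed + t_emb[:, None, :]

        # Denoise, conditioned on the fusion tokens through cross-attention
        x = self.denoiser(tgt=x, memory=fusion_output)
        return self.action_out(x)  # [B, horizon, action_dim]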
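
At inference time the policy is called once per denoising step, starting from pure noise. The article does not say which sampler Dexora uses, so the sketch below assumes a deterministic DDIM sampler (eta = 0) with a linear beta schedule and a noise-predicting denoiser. The `oracle_denoiser` is a stand-in for the trained network: it already knows the clean actions, which makes the loop easy to check.

```python
import numpy as np

NUM_TRAIN_STEPS = 1000
# Cumulative product of (1 - beta) for an assumed linear beta schedule
ALPHA_BAR = np.cumprod(1.0 - np.linspace(1e-4, 0.02, NUM_TRAIN_STEPS))

def ddim_sample(denoiser, cond, horizon=16, action_dim=36, num_inference_steps=10, seed=0):
    """Deterministic DDIM sampling (eta = 0) with a noise-predicting denoiser."""
    steps = np.linspace(NUM_TRAIN_STEPS - 1, 0, num_inference_steps).round().astype(int)
    x = np.random.default_rng(seed).normal(size=(horizon, action_dim))  # pure noise
    for i, t in enumerate(steps):
        ab_t = ALPHA_BAR[t]
        ab_prev = ALPHA_BAR[steps[i + 1]] if i + 1 < len(steps) else 1.0
        eps = denoiser(x, t, cond)                              # predicted noise
        x0 = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)    # implied clean chunk
        x = np.sqrt(ab_prev) * x0 + np.sqrt(1.0 - ab_prev) * eps
    return x

# Stand-in for the trained network: an oracle that knows the clean action chunk
target = np.full((16, 36), 0.25)

def oracle_denoiser(x, t, cond):
    return (x - np.sqrt(ALPHA_BAR[t]) * cond) / np.sqrt(1.0 - ALPHA_BAR[t])

actions = ddim_sample(oracle_denoiser, cond=target, num_inference_steps=2)
print(actions.shape, np.allclose(actions, target))
```

With a perfect denoiser even 2 steps recover the target exactly; a learned denoiser is imperfect, which is why fewer steps normally trade accuracy for latency.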

4. Hands-On Code: From Environment Setup to the First Inference

4.1 Installing the environment

# Python 3.10 + CUDA 12.1 recommended
conda create -n dexora python=3.10 -y
conda activate dexora

# PyTorch 2.1+ (CUDA 12.1)
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121

# Dexora dependencies (Llama 3 checkpoints need transformers 4.40 or newer)
pip install transformers==4.40.0
pip install diffusers==0.24.0
pip install gym==0.26.2
pip install einops wandb tqdm peft huggingface_hub

# Clone the Dexora repository (assuming it is hosted on GitHub)
git clone https://github.com/dexora-team/dexora.git
cd dexora
pip install -e .

4.2 Downloading the pretrained weights

# scripts/download_weights.py
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="dexora/dexora-36dof-v1",
    local_dir="./checkpoints/dexora-36dof-v1",
    local_dir_use_symlinks=False,
)
# About 8.2 GB, containing:
#   - perception_backbone/   (DINOv2 adapters)
#   - language_encoder/      (Llama-3-8B first 4 layers, LoRA)
#   - fusion_module/         (Dexora-Fusion weights)
#   - policy_module/         (DiT denoising network)

4.3 First inference: having the robot unscrew a bottle

# examples/first_inference.py
import time

import torch
from dexora.model.pipeline import DexoraPipeline
from dexora.robot.airbot_interface import AIRBOTController

# 1. Load the model
pipeline = DexoraPipeline.from_pretrained(
    "./checkpoints/dexora-36dof-v1",
    device="cuda:0",
    torch_dtype=torch.bfloat16,
)
pipeline.eval()

# 2. Connect to the real robot (or the Isaac Sim simulation)
robot = AIRBOTController(robot_ip="192.168.1.100")
robot.connect()

# 3. Build the inputs
task = "Grasp the bottle with the left hand, unscrew the cap with the right hand, and place the cap on the table."

images = {
    "left_wrist": robot.get_camera_image("left_wrist"),   # [224, 224, 3]
    "right_wrist": robot.get_camera_image("right_wrist"),
    "external": robot.get_external_camera_image(),        # [224, 224, 3]
}

# 4. Inference (diffusion denoising takes several steps)
with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    action_sequence = pipeline.predict(
        images=images,
        text_instruction=task,
        num_inference_steps=10,  # number of diffusion denoising steps
        horizon=16,              # predict the next 16 frames
    )
# action_sequence: [horizon, 36] = [16, 36]

# 5. Execute the chunk (10 ms between frames = 100 Hz)
for t in range(action_sequence.shape[0]):
    target_joints = action_sequence[t]  # [36]
    robot.set_joint_positions(target_joints, blocking=False)
    time.sleep(0.01)  # 100 Hz

# One chunk covers only about 160 ms; a full task repeats steps 3-5 until it is done
print("[Dexora] action chunk executed")
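
A single 16-step chunk covers only about 160 ms, so a complete task needs a loop that re-observes and re-plans. The sketch below shows one common pattern, receding-horizon execution, with stand-ins for the model and the robot. `run_receding_horizon`, `fake_predict`, `fake_send`, and the choice to execute 8 of 16 steps are all hypothetical, not part of Dexora's API.

```python
import numpy as np

def run_receding_horizon(predict, read_state, send, horizon=16, execute_steps=8, max_chunks=5):
    """Predict a chunk, execute only its first steps, then re-observe and re-plan."""
    sent = 0
    for _ in range(max_chunks):
        chunk = predict(read_state(), horizon)         # [horizon, 36]
        for action in chunk[:execute_steps]:
            send(action)
            sent += 1
    return sent

# Hypothetical stand-ins for the model and the robot interface
state = np.zeros(36)
goal = np.ones(36)

def fake_predict(current, horizon):
    # Straight-line interpolation from the current state towards the goal
    fractions = np.linspace(0.0, 1.0, horizon + 1)[1:, None]
    return current + fractions * (goal - current)

def fake_send(action):
    state[:] = action

steps = run_receding_horizon(fake_predict, lambda: state.copy(), fake_send)
print(steps, float(np.abs(goal - state).max()))
```

Executing only part of each chunk lets the policy react to new observations before the old plan runs out; the right split between prediction horizon and executed steps depends on the task and the inference latency.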

4.4 Training/evaluating in Isaac Sim without hardware

No real robot? Dexora provides a complete Isaac Sim simulation environment:

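# examples/train_in_isaac_sim.py
from omni.isaac.kit import SimulationApp

# Start Isaac Sim (headless mode, no GUI); create the app before importing
# modules that depend on Isaac Sim
sim = SimulationApp({"headless": True})

import torch
from dexora.env.dexora_env import DexoraIsaacEnv
from dexora.model.pipeline import DexoraPipeline

pipeline = DexoraPipeline.from_pretrained(
    "./checkpoints/dexora-36dof-v1",
    device="cuda:0",
    torch_dtype=torch.bfloat16,
)
pipeline.eval()

env = DexoraIsaacEnv(
    num_envs=4,           # 4 simulation environments in parallel
    use_gpu_pipeline=True,
    task="bottle_opening",  # built-in task: opening a bottle
)

obs = env.reset()
for step in range(1000):
    # Predict actions with the Dexora model
    actions = pipeline.predict(
        images=obs["images"],
        text_instruction=env.get_current_task_description(),
        num_inference_steps=5,  # fewer denoising steps are acceptable in simulation, for speed
    )
    obs, reward, done, info = env.step(actions)
    
    if done.any():
        obs = env.reset()
        print(f"[IsaacSim] Step {step}: success rate = {info['success_rate']:.2%}")

sim.close()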

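5. Performance Optimization: Getting Dexora to Run on a Real Robot

5.1 Inference latency is the Achilles' heel of robot deployment

Robot control needs a control frequency of 100 Hz or more (one action every 10 ms). The original Dexora model takes about 80 ms per inference on an A100 (10 diffusion denoising steps). That is 12.5 inferences per second, far from real time if every control tick waits for a fresh prediction.

Optimization 1: Compressing the number of diffusion steps

Use progressive distillation to compress 10 denoising steps into 2: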

# Original (10 steps, 80 ms)
action = pipeline.predict(..., num_inference_steps=10)

# Optimized (2 steps, 18 ms)
action = pipeline.predict(..., num_inference_steps=2)

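The Dexora team provides distilled weights, dexora-36dof-v1-distilled-2step, which can be downloaded and used directly with an accuracy loss of < 3%.

Optimization 2: Making the vision encoder asynchronous

Visual encoding (DINOv2) accounts for 60% of inference time. The fix is to decouple visual encoding from action decoding: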

# dexora/optimization/async_perception.py
import threading

import torch

class AsyncPerception:
    def __init__(self, perception_model, policy_model, robot):
        self.perception = perception_model
        self.policy = policy_model
        self.robot = robot  # must provide get_latest_images()
        self.latest_tokens = None
        self.lock = threading.Lock()
        self.stop_event = threading.Event()
        self.thread = None

    def _encode_loop(self):
        """Background thread: keep encoding the newest images."""
        while not self.stop_event.is_set():
            images = self.robot.get_latest_images()
            with torch.no_grad():
                tokens = self.perception(images)
            with self.lock:
                self.latest_tokens = tokens

    def start(self):
        self.thread = threading.Thread(target=self._encode_loop, daemon=True)
        self.thread.start()

    def stop(self):
        self.stop_event.set()
        if self.thread is not None:
            self.thread.join()

    def predict_async(self, text_instruction):
        """Predict actions from the newest visual tokens without waiting for encoding."""
        with self.lock:
            visual_tokens = self.latest_tokens
        if visual_tokens is None:
            return None  # the first frame has not been encoded yet
        with torch.no_grad():
            action = self.policy(visual_tokens, text_instruction)
        return action

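End-to-end latency after optimization: 18 ms (2-step diffusion) + roughly 0 ms on the critical path (asynchronous vision, at the cost of tokens that may be up to one encoding cycle old) = 18 ms, or about 55 Hz, which approaches real-time control requirements. Combined with TensorRT quantization on an NVIDIA Jetson Orin, this can be pushed further to about 80 Hz.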
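
The latency figures in this section convert to control rates with simple arithmetic. The helper names below are hypothetical; the latencies and the 16-step horizon are the ones quoted in this article.

```python
def control_rate_hz(latency_ms: float) -> float:
    """Rate achievable if every control tick waits for one full inference."""
    return 1000.0 / latency_ms

def chunk_duration_ms(horizon: int, control_hz: float) -> float:
    """Wall-clock time covered by one predicted action chunk."""
    return horizon * 1000.0 / control_hz

for label, latency in [
    ("10-step baseline", 80.0),
    ("2-step distilled", 18.0),
    ("INT8 on Jetson Orin", 12.0),
]:
    print(f"{label}: {control_rate_hz(latency):.1f} Hz")

print(chunk_duration_ms(horizon=16, control_hz=100.0))   # 160.0
```

A 16-step chunk at 100 Hz spans 160 ms, so when actions are executed chunk by chunk the model only has to finish one inference per chunk rather than one per control tick.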

5.2 Quantization: INT8 deployment to edge devices

# INT8-quantize Dexora with NVIDIA TensorRT
python tools/export_tensorrt.py \
    --checkpoint ./checkpoints/dexora-36dof-v1-distilled-2step \
    --output ./checkpoints/dexora-trt-int8.engine \
    --precision int8 \
    --max_batch_size 1 \
    --onnx-opset 17

# Model size after quantization: 8.2 GB → 2.1 GB
# Inference latency (Jetson Orin): 18 ms → 12 ms (83 Hz)
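
What INT8 quantization does to a weight tensor can be shown with NumPy. This is a minimal sketch of symmetric per-tensor quantization on a random matrix; it is not what TensorRT does internally (TensorRT also calibrates activations and fuses layers), and `quantize_int8` and `dequantize` are hypothetical helper names.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: float32 -> int8 plus one float scale."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(768, 768)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(w.nbytes / q.nbytes)                                  # 4.0
print(float(np.abs(w - w_hat).max()) <= scale / 2 + 1e-6)   # rounding error is at most half a step
```

The 4x storage ratio between float32 and int8 is in line with the 8.2 GB → 2.1 GB figure quoted above, assuming the original checkpoint stores 32-bit weights.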

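5.3 Domain adaptation: fast adaptation to new tasks with LoRA

Dexora's pretrained weights cover general dexterous tasks such as "opening a bottle", "driving a screw", and "threading beads". If you want the robot to perform a new task such as "surgical suturing", you do not need to retrain the whole model; LoRA fine-tuning of the fusion layer is enough: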

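# examples/lora_finetune.py
from peft import LoraConfig, get_peft_model
from transformers import Trainer, TrainingArguments
from dexora.model.pipeline import DexoraPipeline

# Apply LoRA to the DexoraFusion module only
lora_config = LoraConfig(
    r=16,                     # LoRA rank
    lora_alpha=32,
    # Linear layers inside the fusion module; wrapping nn.MultiheadAttention itself
    # depends on the installed PEFT version supporting that layer type
    target_modules=["lang_proj", "linear1", "linear2"],
)

model = DexoraPipeline.from_pretrained(...)
model = get_peft_model(model.fusion_module, lora_config)

# task_token and joint_tokens are plain nn.Parameter objects, so unfreeze them by hand
for name, param in model.named_parameters():
    if name.endswith(("task_token", "joint_tokens")):
        param.requires_grad = True

# Fine-tune on 200 teleoperation trajectories from the new task
# (new_task_dataset is assumed to be prepared elsewhere)
trainer = Trainer(
    model=model,
    train_dataset=new_task_dataset,  # only 200 trajectories needed
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=20,
        learning_rate=1e-4,
        fp16=True,
        output_dir="./checkpoints/dexora-lora-surgical",
    ),
)
trainer.train()
# After 20 epochs, the success rate on the new task goes from 12% → 78%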
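
Why LoRA needs so little data becomes clearer from what it actually trains. The sketch below implements the LoRA update for one 768×768 linear layer in NumPy, using the same r=16 and lora_alpha=32 as the config above. `LoRALinear` is a hypothetical illustration class, not the PEFT implementation.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight: np.ndarray, r: int = 16, alpha: int = 32, seed: int = 0):
        d_out, d_in = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                               # frozen
        self.A = rng.normal(scale=0.01, size=(r, d_in))    # trainable
        self.B = np.zeros((d_out, r))                      # trainable, zero-initialized
        self.scaling = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.weight.T + self.scaling * (x @ self.A.T) @ self.B.T

    def trainable_params(self) -> int:
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
w = rng.normal(size=(768, 768))
layer = LoRALinear(w, r=16, alpha=32)
x = rng.normal(size=(4, 768))

print(np.allclose(layer(x), x @ w.T))       # True: zero-initialized B leaves the output unchanged
print(layer.trainable_params(), w.size)     # 24576 589824
```

Only about 4% of the layer's parameters are trained, and training starts from exactly the pretrained behavior.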

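6. Dexora's Industrial Significance and Limitations

6.1 What it actually breaks through

Before Dexora, the SOTA models cited for high-DoF dexterous manipulation (such as Google's RT-X and UC Berkeley's CEM) shared one problem: model structure and training data were tightly bound to a specific piece of hardware, so switching to a different arm meant collecting data and training again.

Dexora addresses this by using joint angles rather than end-effector poses as its action space, so no robot-specific IK solver sits between the model and the joint controllers: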

# Two ways to represent the action space

# ❌ End-effector pose (conventional approach, tied to each arm's FK/IK)
action_ee = {
    "left_arm_pos": [0.0, 0.0, 0.0],          # x, y, z
    "left_arm_rot": [0.0, 0.0, 0.0, 1.0],     # quaternion qx, qy, qz, qw
    "right_arm_pos": [0.0, 0.0, 0.0],
    "right_arm_rot": [0.0, 0.0, 0.0, 1.0],
}

# ✅ Joint angles (Dexora's approach, no IK solver in the loop)
action_joint = {
    "left_arm": [0.0] * 6,      # j1..j6, sent straight to the joint controllers
    "right_arm": [0.0] * 6,
    "left_hand": [0.0] * 12,    # f1..f12
    "right_hand": [0.0] * 12,
}

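This means a Dexora model trained on one AIRBOT can be moved, after calibration, to Franka, UR5, or other arms in a 36-DoF configuration without collecting data again, provided the target's joints can be mapped onto the 36 action dimensions (a Franka arm, for example, has 7 joints rather than 6).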
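
One thing to check when moving joint-space actions between robots is whether the kinematics match. The sketch below assumes two hypothetical planar two-link arms with different link lengths and shows the same joint angles placing the end effector at different points, which is what the calibration step has to account for. `planar_fk` and the link lengths are illustrative only.

```python
import math

def planar_fk(joint_angles, link_lengths):
    """End-effector (x, y) of a planar serial arm from joint angles in radians."""
    x = y = theta = 0.0
    for angle, length in zip(joint_angles, link_lengths):
        theta += angle
        x += length * math.cos(theta)
        y += length * math.sin(theta)
    return x, y

joints = [math.pi / 2, -math.pi / 2]         # the same joint-space action for both arms

arm_a = planar_fk(joints, link_lengths=[0.30, 0.25])   # hypothetical link lengths (m)
arm_b = planar_fk(joints, link_lengths=[0.40, 0.20])

print(arm_a)   # approximately (0.25, 0.30)
print(arm_b)   # approximately (0.20, 0.40)
```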

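6.2 Current limitations (candidly)

Training data is still small: 8,000 teleoperation trajectories is not much for a 36-DoF action space, and data efficiency is low. The Dexora team itself acknowledges limited generalization to unseen object categories.
Diffusion inference latency: even with 2-step denoising and INT8 quantization, 12 ms of latency is still not fast enough for high-speed tasks such as catching a thrown ball.
Learning efficiency of bimanual coordination: the model captures coordination only implicitly (through the joint tokens attending jointly) and has no explicit left-hand/right-hand coordination module, so the success rate on tasks that need fine coordination is only ~45%.
Open source but not fully open: the model weights and training code are open, but the teleoperation data-collection tooling for AIRBOT hardware depends on a commercial SDK (airbot_py), so academic users may not be able to reproduce the data-collection process cheaply.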

6.3 Comparison with contemporaneous projects

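Project	Degrees of freedom	Openness	Bimanual coordination	Deployment difficulty
Dexora	36 DoF	Weights + code	✅ Native support	Medium (needs a GPU)
OpenVLA	7 DoF	Fully open	❌ Single arm only	Low
RT-X (Google)	7 DoF	Weights open, data partially open	❌	High (depends on Google-internal infrastructure)
CEM (Berkeley)	6 DoF	Code open, weights not open	❌	Medium
Humanoid-Everyday	Whole body, 50+ DoF	Data open, model not open	✅ (whole-body coordination)	Very high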

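7. Summary and Outlook

Dexora's open-source release at ICRA 2026 marks a substantive step for VLA models from lab demo to deployable system. Its core value lies less in any single technical breakthrough than in opening up the whole stack that high-DoF dexterous manipulation requires (data collection, model architecture, training pipeline, deployment toolchain), so that more researchers can iterate on top of it.

What developers can do now:

Get the Isaac Sim simulation running: use simulated data to check whether your task suits a VLA approach, at zero hardware cost.
Adapt to new tasks with LoRA fine-tuning: there is no need to train from scratch, and 200 trajectories are enough to see a clear effect.
Follow the next round of work on bimanual coordination: the Dexora team has posted a technical report for Dexora-v2 on arXiv (open-source release expected in Q3 2026), which focuses on explicit modeling of bimanual coordination.

For industry, the scenarios Dexora opens up include:

Precision assembly (inserting electronic components)
Deformable object manipulation (threading, tying laces, folding clothes)
Human-robot collaboration (two-handed handover, cooperative assembly)

The next milestone for VLA is to move from completing isolated dexterous tasks to working autonomously for 8 hours straight in unstructured environments. Dexora is an important step along that road, but not the end of it.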

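References

Dexora open-source repository: https://github.com/dexora-team/dexora
ICRA 2026 Workshop on Physical Intelligence: https://icra2026.org/workshops
Dexora technical report on arXiv: https://arxiv.org/abs/2606.xxxxx (search for "Dexora" to find the latest version)
AIRBOT hardware documentation: https://airbot-docs.readthedocs.io
LeRobot data format specification: https://github.com/huggingface/lerobot

This article was written in June 2026, based on material released on site at ICRA 2026 and on public technical documentation. All code is simplified for illustration; for production deployment, refer to the full implementation in the official repository.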