UI-TARS-desktop in Practice: Building a Cross-Platform Multimodal AI Agent with Qwen3-4B, from Screen Understanding to Desktop Automation

2026-05-15 23:18:28 +0800 CST

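UI-TARS-desktop Deep Dive: ByteDance's Open-Source Multimodal AI Agent Stack That Lets AI Truly "See" and "Operate" Your Desktop

Author: 程序员茄子
Published: May 15, 2026
Length: about 8,500 characters (Chinese original)
Tags: AI Agent, multimodal, ByteDance, UI-TARS, desktop automation, Qwen3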

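Abstract

In 2026, the AI Agent field is shifting from "chatbots" to "digital coworkers". UI-TARS-desktop, open-sourced by ByteDance, is a complete multimodal AI Agent stack. By tightly integrating state-of-the-art vision-language models (VLMs) with low-level agent infrastructure, it can autonomously understand and operate Windows, macOS, and Linux desktops. The project has 32,693 stars on GitHub at the time of writing and has become a reference project for multimodal agents. This article examines its architecture, core algorithms, deployment, and performance tuning, and shows how a perceive-plan-execute loop gives AI desktop-level automation capabilities.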

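1. Background: Why Do We Need a Multimodal Desktop AI Agent?

1.1 Limitations of Traditional RPA

Traditional robotic process automation (RPA) tools rely on element locators such as XPath and CSS selectors to identify interface elements. This approach has three fundamental problems:

Fragility: any UI change can break a selector, so maintenance costs are high
Cross-platform barriers: each operating system and application framework needs its own adapter
No visual understanding: image-based interfaces (Canvas rendering, game UIs, remote desktops) expose no elements to locate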

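1.2 The Breakthrough in Vision-Language Models (VLMs)

Between 2024 and 2026, vision-language models advanced rapidly:

UI-TARS-1.5: trained specifically for GUI interaction; recognizes buttons, input fields, and menus in screenshots
Qwen3-4B-Instruct-2507: a 4B-parameter model small enough to run on consumer GPUs (the base release is text-only, so screen understanding requires a vision-enabled variant such as the fine-tune used later in this article)
GPT-4o, Claude 3.5: commercial models that demonstrated strong screen understanding

These models can:

Analyze screenshots directly and identify UI elements
Understand spatial relations and states ("left side", "below", "closed")
Follow natural-language instructions ("Expand every 'closed' node in the tree view on the left")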

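1.3 Where UI-TARS-desktop Fits

UI-TARS-desktop is not another chatbot. It is an autonomous agent system that can operate the desktop. Its core value is:

"Connecting the most advanced vision models with low-level agent infrastructure to close the loop from 'understanding' to 'action'"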

2. Core Concepts

2.1 The Multimodal AI Agent Stack

UI-TARS-desktop uses a layered architecture in which each layer can be replaced or upgraded independently:

┌──────────────────────────────────────────────────┐
│ Natural-language instruction: "Tidy my desktop"  │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│ Planning layer                                   │
│ • Task decomposition • Action sequence generation│
│ • Dynamic path adjustment                        │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│ Perception layer                                 │
│ • Screenshots • UI element recognition           │
│ • State understanding                            │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│ Execution layer                                  │
│ • Mouse simulation • Keyboard input              │
│ • Cross-platform OS adapters                     │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│ Communication layer (Infra)                      │
│ • Agent protocol • Local/cloud model switching   │
│ • Tool calls                                     │
└──────────────────────────────────────────────────┘

2.2 Key Technical Innovations

2.2.1 Atomic Actions

UI-TARS breaks complex desktop operations down into atomic actions:

# Atomic action examples
class AtomicAction:
    CLICK = "click(x, y)"           # click at a coordinate
    TYPE = "type(text)"             # type text
    SCROLL = "scroll(dx, dy)"      # scroll
    DRAG = "drag(x1,y1,x2,y2)"    # drag
    SCREENSHOT = "screenshot()"    # take a screenshot
    OCR = "ocr_region(x,y,w,h)"   # recognize text in a region
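To act on these strings, an executor first has to turn a call such as `click(120, 340)` into a name plus typed arguments. The sketch below shows one way to do that. It assumes a simple grammar (integer arguments, and a single quoted string for `type`) that the article does not specify; `Action`, `parse_action`, and the arity table are illustrative names, not part of UI-TARS.

```python
import re
from dataclasses import dataclass

# Assumed grammar: name(arg, ...) with integer arguments, or one quoted string for type().
_ACTION_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$", re.DOTALL)
_ARITY = {"click": 2, "type": 1, "scroll": 2, "drag": 4,
          "screenshot": 0, "ocr_region": 4}


@dataclass(frozen=True)
class Action:
    name: str
    args: tuple


def parse_action(text: str) -> Action:
    """Parse one atomic-action string into a structured Action."""
    match = _ACTION_RE.match(text)
    if not match:
        raise ValueError(f"not an action call: {text!r}")
    name, raw = match.group(1), match.group(2).strip()
    if name not in _ARITY:
        raise ValueError(f"unknown action: {name}")
    if name == "type":
        # The whole argument is one quoted string, which may itself contain commas.
        if len(raw) < 2 or raw[0] != raw[-1] or raw[0] not in "'\"":
            raise ValueError("type() expects a single quoted string")
        args = (raw[1:-1],)
    else:
        args = tuple(int(part) for part in raw.split(",")) if raw else ()
    if len(args) != _ARITY[name]:
        raise ValueError(f"{name} expects {_ARITY[name]} argument(s), got {len(args)}")
    return Action(name, args)


print(parse_action("click(120, 340)"))
print(parse_action('type("hello, world")'))
```

Rejecting unknown names and wrong argument counts before execution keeps a malformed model output from reaching the mouse and keyboard layer.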

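2.2.2 Visual Reward Model

To improve action accuracy, UI-TARS introduces a visual reward model:

Pre-action screenshot → VLM analysis → execute action → post-action screenshot → reward model evaluation
                              ↓
                    Reward score > threshold? continue : replan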
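The loop in the diagram can be sketched as a small control function. This is a minimal illustration under stated assumptions: `capture`, `execute`, `score`, and `replan` are caller-supplied callables, and the threshold of 0.8 and the replan limit of 2 are placeholder values, not figures from UI-TARS.

```python
def run_with_reward_gate(action, capture, execute, score, replan,
                         threshold=0.8, max_replans=2):
    """Execute an action and accept it only if the reward exceeds the threshold.

    Returns (final_action, last_reward, accepted).
    """
    reward = 0.0
    for attempt in range(max_replans + 1):
        before = capture()                 # pre-action screenshot
        execute(action)
        after = capture()                  # post-action screenshot
        reward = score(before, after, action)
        if reward > threshold:
            return action, reward, True    # good enough: continue with the plan
        if attempt < max_replans:
            action = replan(action, reward)  # below threshold: try a revised action
    return action, reward, False


# Usage with stand-in callables: the first attempt scores low, the second passes.
rewards = iter([0.2, 0.9])
executed = []
outcome = run_with_reward_gate(
    "click(10, 10)",
    capture=lambda: "screenshot",
    execute=executed.append,
    score=lambda before, after, act: next(rewards),
    replan=lambda act, reward: "click(12, 12)",
)
print(outcome, executed)
```

Capping the number of replans matters in practice: without it, a reward model that never passes the threshold would keep the agent clicking indefinitely.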

2.2.3 Cross-Platform Execution Engine

// Cross-platform mouse abstraction
class MouseOperator {
  // Windows: user32.dll SendInput
  // macOS: CoreGraphics CGEventCreateMouseEvent
  // Linux: X11 XTest extension / Wayland compositor
  click(x, y) {
    // process.platform is the Node.js/Electron platform identifier
    const platform = process.platform;
    if (platform === 'win32') { /* ... */ }
    else if (platform === 'darwin') { /* ... */ }
    else if (platform === 'linux') { /* ... */ }
  }
}

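3. Architecture in Depth

3.1 Perception Layer: How Does the AI "See" the Screen?

3.1.1 Screen Understanding and Element Localization

UI-TARS uses a fine-tuned Qwen3-4B model for screen understanding: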

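from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

# Load the UI-TARS fine-tuned Qwen3 model (model ID as given in this article)
MODEL_ID = "bytedance/UI-TARS-Qwen3-4B-VL"
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

def understand_screen(screenshot_path, instruction):
    """Analyze a screenshot and locate the UI elements the instruction asks for"""
    image = Image.open(screenshot_path).convert("RGB")

    # Build the multimodal input
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": f"Analyze the screen and find: {instruction}"}
            ]
        }
    ]

    # Tokenize text and image together; return_dict=True yields the mapping
    # that model.generate(**inputs) expects
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt"
    ).to(model.device)

    # Generate, then decode only the newly generated tokens (not the prompt)
    outputs = model.generate(**inputs, max_new_tokens=512)
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = processor.decode(new_tokens, skip_special_tokens=True)

    # Expected response: {"element": "button", "bbox": [x1,y1,x2,y2], "description": "..."}
    # parse_ui_response is a helper that must be defined separately
    return parse_ui_response(response)

# Example: find tree nodes in the "closed" state
result = understand_screen(
    "screenshot.png",
    "Find every tree node shown as 'closed' and return its bounding-box coordinates"
)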
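The snippet above calls `parse_ui_response` without defining it. One possible sketch follows, assuming the model emits JSON objects in the shape shown in the comment (`element`, `bbox`, `description`), possibly surrounded by extra prose. The validation rules are illustrative choices, not UI-TARS behavior.

```python
import json


def parse_ui_response(response: str) -> list:
    """Extract every JSON object with a valid bbox from free-form model output."""
    decoder = json.JSONDecoder()
    elements = []
    idx = 0
    while True:
        start = response.find("{", idx)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value starting at `start` and
            # returns it together with the index where it ended.
            obj, end = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            idx = start + 1      # not valid JSON here; keep scanning
            continue
        idx = end
        bbox = obj.get("bbox")
        if (isinstance(bbox, list) and len(bbox) == 4
                and all(isinstance(v, (int, float)) for v in bbox)
                and bbox[0] < bbox[2] and bbox[1] < bbox[3]):
            elements.append(obj)
    return elements


sample = ('Found: {"element": "button", "bbox": [10, 20, 110, 60], '
          '"description": "OK"} and {"bbox": [5, 5, 1, 1]} {broken')
print(parse_ui_response(sample))
```

Objects whose box is missing, malformed, or inverted are dropped rather than passed on, since a bad box would translate directly into a misplaced click.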

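3.1.2 Visual Feature Extraction

The model can recognize:

UI element types: buttons, input fields, drop-down menus, tree controls
Element states: enabled/disabled, selected/unselected, expanded/collapsed
Spatial relations: left, right, above, below, containment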
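Once elements carry a type, a state, and a bounding box, the spatial relations listed above reduce to simple comparisons on box coordinates. The sketch below illustrates this; the `UIElement` structure and the strict non-overlap definitions of "left of" and "above" are assumptions made for the example, not the model's internal representation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class UIElement:
    kind: str                        # e.g. "button", "tree_node"
    state: str                       # e.g. "enabled", "collapsed"
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2), origin at top-left


def left_of(a: UIElement, b: UIElement) -> bool:
    """True if a lies entirely to the left of b."""
    return a.bbox[2] <= b.bbox[0]


def above(a: UIElement, b: UIElement) -> bool:
    """True if a lies entirely above b (screen y grows downward)."""
    return a.bbox[3] <= b.bbox[1]


def contains(outer: UIElement, inner: UIElement) -> bool:
    """True if inner's box lies fully inside outer's box."""
    ox1, oy1, ox2, oy2 = outer.bbox
    ix1, iy1, ix2, iy2 = inner.bbox
    return ox1 <= ix1 and oy1 <= iy1 and ix2 <= ox2 and iy2 <= oy2


tree = UIElement("tree", "expanded", (0, 0, 300, 800))
node = UIElement("tree_node", "collapsed", (20, 100, 280, 124))
button = UIElement("button", "enabled", (400, 100, 480, 130))

print(contains(tree, node), left_of(node, button), above(node, button))
```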

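3.2 Planning Layer: From Instruction to Action Sequence

3.2.1 Task Decomposition

class TaskPlanner:
    def __init__(self, vlm_model):
        # vlm_model is assumed to expose generate(prompt) -> str
        self.model = vlm_model

    def plan(self, instruction, screen_state):
        """
        Decompose a natural-language instruction into an action sequence
        instruction: "Expand every 'closed' node in the tree view on the left"
        screen_state: the current screen-understanding result
        """
        prompt = f"""
        Instruction: {instruction}
        Current screen state: {screen_state}

        Generate the action sequence needed to complete the instruction, in this format:
        [
          {{"action": "click", "target": "first 'closed' node", "bbox": [x1,y1,x2,y2]}},
          {{"action": "screenshot", "purpose": "verify the expansion"}},
          {{"action": "click", "target": "second 'closed' node", "bbox": [x1,y1,x2,y2]}},
          ...
        ]
        """

        response = self.model.generate(prompt)
        return self.parse_action_sequence(response)

    def replan(self, execution_result, error=None):
        """
        Adjust the plan dynamically based on the execution result
        (adjust_plan, retry_or_skip, and continue_plan are left to the implementer)
        """
        if error:
            # Error: re-identify the element or adjust the coordinates
            return self.adjust_plan(error)
        elif not execution_result["success"]:
            # Action failed: retry or skip
            return self.retry_or_skip(execution_result)
        else:
            # Action succeeded: move on to the next action
            return self.continue_plan()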
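`TaskPlanner` relies on a `parse_action_sequence` method that the listing does not show. The sketch below, written as a standalone function for brevity, is one way to implement it. It assumes the model returns the JSON list format requested in the prompt, and it limits action names to the atomic actions from section 2.2.1. Both the validation rules and the error handling are illustrative.

```python
import json

ALLOWED_ACTIONS = {"click", "type", "scroll", "drag", "screenshot", "ocr_region"}


def parse_action_sequence(response: str) -> list:
    """Extract and validate the JSON action list embedded in model output."""
    start, end = response.find("["), response.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON list found in response")
    steps = json.loads(response[start:end + 1])
    if not isinstance(steps, list):
        raise ValueError("action sequence must be a list")
    for index, step in enumerate(steps):
        if not isinstance(step, dict) or step.get("action") not in ALLOWED_ACTIONS:
            raise ValueError(f"step {index}: missing or unknown action")
        if step["action"] == "click":
            bbox = step.get("bbox")
            # A click needs a four-number box so a target point can be derived.
            if not (isinstance(bbox, list) and len(bbox) == 4
                    and all(isinstance(v, (int, float)) for v in bbox)):
                raise ValueError(f"step {index}: click requires a 4-number bbox")
    return steps


reply = ('Plan:\n[{"action": "click", "target": "node 1", "bbox": [10, 20, 30, 40]}, '
         '{"action": "screenshot", "purpose": "verify"}]\nDone.')
for step in parse_action_sequence(reply):
    print(step["action"])
```

Failing loudly on an invalid plan gives `replan` a clear error to react to, instead of letting a half-valid sequence execute.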

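3.2.2 Dynamic Path Adjustment

During execution the screen state can change (pop-ups, loading delays), so the planning layer has to adjust in real time:

// Execution loop
async function executePlan(plan, maxReplans = 3) {
  let i = 0;
  let replans = 0;
  while (i < plan.steps.length) {
    const step = plan.steps[i];

    // 1. Capture the current state
    const screenshot = await takeScreenshot();

    // 2. Check that the current state matches what the step expects
    const currentState = await perceiveScreen(screenshot);
    if (!step.precondition.check(currentState)) {
      // 3. State mismatch: replan, then restart from the first step of the new plan
      if (++replans > maxReplans) throw new Error("Too many replans");
      plan = await replan(plan, step, currentState);
      i = 0;
      continue;
    }

    // 4. Execute the action
    const result = await executeAction(step.action);

    // 5. Verify the result
    const postScreenshot = await takeScreenshot();
    const postState = await perceiveScreen(postScreenshot);
    if (!step.postcondition.check(postState)) {
      // The action failed: retry or report the error
      await handleExecutionFailure(step, result);
    }
    i++;
  }
}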

3.3 Execution Layer: Cross-Platform Desktop Automation

3.3.1 Windows Implementation

// Windows: simulate input with SendInput
#include <windows.h>
#include <string>
#include <vector>

void clickWindows(int x, int y) {
    // Move the cursor
    SetCursorPos(x, y);

    // Simulate a left-button press and release
    INPUT inputs[2] = {};

    // Press
    inputs[0].type = INPUT_MOUSE;
    inputs[0].mi.dwFlags = MOUSEEVENTF_LEFTDOWN;

    // Release
    inputs[1].type = INPUT_MOUSE;
    inputs[1].mi.dwFlags = MOUSEEVENTF_LEFTUP;

    SendInput(2, inputs, sizeof(INPUT));
}

void typeWindows(const std::wstring& text) {
    // Send each UTF-16 code unit as a Unicode key event, which works
    // regardless of keyboard layout and Shift state
    std::vector<INPUT> inputs;
    inputs.reserve(text.size() * 2);

    for (wchar_t c : text) {
        // Key down
        INPUT down = {};
        down.type = INPUT_KEYBOARD;
        down.ki.wScan = c;
        down.ki.dwFlags = KEYEVENTF_UNICODE;
        inputs.push_back(down);

        // Key up
        INPUT up = down;
        up.ki.dwFlags = KEYEVENTF_UNICODE | KEYEVENTF_KEYUP;
        inputs.push_back(up);
    }

    if (!inputs.empty()) {
        SendInput(static_cast<UINT>(inputs.size()), inputs.data(), sizeof(INPUT));
    }
}

3.3.2 macOS Implementation

// macOS: use CoreGraphics Event Services
// (posting events requires the Accessibility permission for the host app)
import CoreGraphics
import Foundation

func clickMacOS(x: Int, y: Int) {
    // Create the mouse events
    let point = CGPoint(x: x, y: y)
    let eventDown = CGEvent(mouseEventSource: nil,
                           mouseType: .leftMouseDown,
                           mouseCursorPosition: point,
                           mouseButton: .left)
    let eventUp = CGEvent(mouseEventSource: nil,
                         mouseType: .leftMouseUp,
                         mouseCursorPosition: point,
                         mouseButton: .left)

    // Post the events
    eventDown?.post(tap: .cghidEventTap)
    eventUp?.post(tap: .cghidEventTap)
}

func typeMacOS(text: String) {
    // Simulate keyboard input with CoreGraphics
    for char in text {
        // Attach the character as a Unicode string so no key-code table is needed
        let units = Array(String(char).utf16)

        let eventDown = CGEvent(keyboardEventSource: nil,
                              virtualKey: 0,
                              keyDown: true)
        let eventUp = CGEvent(keyboardEventSource: nil,
                            virtualKey: 0,
                            keyDown: false)
        eventDown?.keyboardSetUnicodeString(stringLength: units.count, unicodeString: units)
        eventUp?.keyboardSetUnicodeString(stringLength: units.count, unicodeString: units)

        eventDown?.post(tap: .cghidEventTap)
        eventUp?.post(tap: .cghidEventTap)

        // Small delay to keep input stable
        Thread.sleep(forTimeInterval: 0.01)
    }
}

3.3.3 Linux Implementation

// Linux: use the X11 XTest extension
#include <X11/Xlib.h>
#include <X11/extensions/XTest.h>

void clickLinux(Display* display, int x, int y) {
    // Move the pointer
    XWarpPointer(display, None, DefaultRootWindow(display), 0, 0, 0, 0, x, y);

    // Simulate the click
    XTestFakeButtonEvent(display, 1, True, CurrentTime);  // press
    XTestFakeButtonEvent(display, 1, False, CurrentTime); // release

    XFlush(display);
}

void typeLinux(Display* display, const char* text) {
    // Simulate keyboard input with XTest.
    // For printable Latin-1 characters the keysym value equals the character code.
    // Characters that need Shift (uppercase letters, some symbols) are not handled here.
    for (const char* p = text; *p != '\0'; ++p) {
        KeySym keysym = (KeySym)(unsigned char)*p;
        KeyCode keycode = XKeysymToKeycode(display, keysym);

        if (keycode != 0) {
            XTestFakeKeyEvent(display, keycode, True, CurrentTime);  // press
            XTestFakeKeyEvent(display, keycode, False, CurrentTime); // release
            XFlush(display);
        }
    }
}

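3.4 Communication Layer: Agent Protocol and Model Scheduling

3.4.1 Local Model Service (vLLM)

By default, UI-TARS-desktop runs Qwen3-4B on the vLLM inference framework:

# vLLM launch configuration
model: bytedance/UI-TARS-Qwen3-4B-VL
tensor-parallel-size: 1          # single GPU
gpu-memory-utilization: 0.9      # fraction of GPU memory to use
max-model-len: 8192              # maximum context length
dtype: bfloat16                  # compute precision
enable-prefix-caching: true      # speed up via prefix caching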

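# Call the local vLLM service
import base64
import requests

def call_local_vllm(prompt, image_path):
    """Call the local vLLM service for inference"""
    API_URL = "http://localhost:8000/v1/chat/completions"

    # Inline the image as a base64 data URL (assumes a PNG screenshot);
    # file:// URLs are rejected unless the server is configured to allow local media paths
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    # Build the multimodal message
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": prompt}
            ]
        }
    ]

    # Send the request
    response = requests.post(
        API_URL,
        json={
            "model": "bytedance/UI-TARS-Qwen3-4B-VL",
            "messages": messages,
            "max_tokens": 512,
            "temperature": 0.1  # low temperature for more deterministic output
        },
        timeout=120
    )
    response.raise_for_status()

    return response.json()["choices"][0]["message"]["content"]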

3.4.2 Switching to Cloud Models

UI-TARS-desktop supports switching to cloud models:

// Model configuration
const modelConfigs = {
  local: {
    endpoint: "http://localhost:8000/v1",
    model: "bytedance/UI-TARS-Qwen3-4B-VL",
    apiKey: null
  },
  openai: {
    endpoint: "https://api.openai.com/v1",
    model: "gpt-4o",
    apiKey: process.env.OPENAI_API_KEY
  },
  anthropic: {
    endpoint: "https://api.anthropic.com/v1",
    model: "claude-3-5-sonnet-20241022",
    apiKey: process.env.ANTHROPIC_API_KEY
  }
};

// Switch models at runtime (agent is the already-initialized Agent instance)
async function switchModel(provider) {
  const config = modelConfigs[provider];
  if (!config) {
    throw new Error(`Unknown provider: ${provider}`);
  }

  // Update the Agent configuration
  agent.updateConfig({
    model: config.model,
    baseURL: config.endpoint,
    apiKey: config.apiKey
  });

  console.log(`Switched to ${provider} model: ${config.model}`);
}

4. Hands-On: Deployment and Usage

4.1 Quick Install (Docker)

# 1. Pull the image
docker pull csdn-mirror/ui-tars-desktop:latest

# 2. Start the container (--gpus all enables GPU acceleration)
docker run -d \
  --name ui-tars \
  -p 7860:7860 \
  -v "$(pwd)/workspace:/root/workspace" \
  --gpus all \
  csdn-mirror/ui-tars-desktop:latest

# 3. Open the web interface
# In a browser: http://localhost:7860

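4.2 Building from Source

# 1. Clone the repository
git clone https://github.com/bytedance/UI-TARS-desktop.git
cd UI-TARS-desktop

# 2. Install dependencies
pip install -r requirements.txt

# 3. Download the model weights
huggingface-cli download bytedance/UI-TARS-Qwen3-4B-VL \
  --local-dir ./models/UI-TARS-Qwen3-4B-VL

# 4. Start the vLLM inference service
#    (--served-model-name makes the API accept the model ID used by the client code)
python -m vllm.entrypoints.openai.api_server \
  --model ./models/UI-TARS-Qwen3-4B-VL \
  --served-model-name bytedance/UI-TARS-Qwen3-4B-VL \
  --port 8000 \
  --gpu-memory-utilization 0.9

# 5. Start the UI-TARS-desktop frontend
cd frontend
npm install
npm run dev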

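4.3 Case Study: Automatically Organizing Desktop Files

# Case: use UI-TARS-desktop to tidy a cluttered desktop
# (the ui_tars Agent API is used as presented in this article)
from ui_tars import Agent

# 1. Initialize the Agent
agent = Agent(
    model="bytedance/UI-TARS-Qwen3-4B-VL",
    endpoint="http://localhost:8000/v1"
)

# 2. Define the task
task = """
Analyze all files on the desktop and sort them by type:
- Move image files (png, jpg, gif) to the "Images" folder
- Move document files (pdf, docx, txt) to the "Documents" folder
- Move code files (py, js, cpp) to the "Code" folder
- Leave all other files where they are
"""

# 3. Run the task
result = agent.execute_task(task)

# 4. Inspect the execution trace
print("Execution steps:")
for step in result["execution_trace"]:
    print(f"  {step['step']}. {step['action']} - {step['result']}")

print(f"\nTask complete! Processed {result['files_processed']} files")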
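The sorting rules in the task prompt are deterministic, so they can also be written down directly, for example to check what the agent did after it finishes. The sketch below encodes those rules; the English folder names and the `classify` and `plan_moves` helpers are illustrative stand-ins and not part of UI-TARS.

```python
from pathlib import PurePath
from typing import Dict, Iterable, Optional

# Extension-to-folder rules taken from the task description above.
RULES = {
    "Images": {".png", ".jpg", ".gif"},
    "Documents": {".pdf", ".docx", ".txt"},
    "Code": {".py", ".js", ".cpp"},
}


def classify(filename: str) -> Optional[str]:
    """Return the destination folder for a file, or None to leave it in place."""
    suffix = PurePath(filename).suffix.lower()
    for folder, extensions in RULES.items():
        if suffix in extensions:
            return folder
    return None


def plan_moves(filenames: Iterable[str]) -> Dict[str, str]:
    """Map each file that should move to its destination path."""
    moves = {}
    for name in filenames:
        folder = classify(name)
        if folder is not None:
            moves[name] = f"{folder}/{name}"
    return moves


print(plan_moves(["photo.PNG", "notes.txt", "main.cpp", "archive.zip"]))
```

Comparing this expected plan with the agent's execution trace is a cheap way to catch files that were moved to the wrong folder or skipped.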

4.4 Advanced: Integrating with Existing Tools

4.4.1 Selenium Integration (Web Automation)

from selenium import webdriver
from selenium.webdriver.common.by import By
from ui_tars import Agent

# 1. Launch the browser
driver = webdriver.Chrome()
driver.get("https://example.com")

# 2. Pass a browser screenshot to UI-TARS
screenshot = driver.get_screenshot_as_png()
with open("browser_screenshot.png", "wb") as f:
    f.write(screenshot)

# 3. Let UI-TARS interpret the page
agent = Agent(model="bytedance/UI-TARS-Qwen3-4B-VL")
page_analysis = agent.analyze_screen(
    "browser_screenshot.png",
    "Find the username field, password field, and login button of the login form"
)

# 4. Bounding boxes from the analysis (usable for coordinate-based actions)
username_bbox = page_analysis["elements"][0]["bbox"]
password_bbox = page_analysis["elements"][1]["bbox"]
login_button_bbox = page_analysis["elements"][2]["bbox"]

# Fill in the form with Selenium (UI-TARS could also act directly).
# Selenium 4 removed find_element_by_xpath; use find_element(By.XPATH, ...)
driver.find_element(By.XPATH, "//input[@name='username']").send_keys("myuser")
driver.find_element(By.XPATH, "//input[@name='password']").send_keys("mypass")
driver.find_element(By.XPATH, "//button[@type='submit']").click()

4.4.2 PyAutoGUI Integration (Cross-Platform GUI Automation)

import pyautogui
from ui_tars import Agent

# 1. Take a screenshot
screenshot = pyautogui.screenshot()
screenshot.save("current_screen.png")

# 2. Analyze the screen with UI-TARS
agent = Agent()
analysis = agent.analyze_screen(
    "current_screen.png",
    "Find the paint program's toolbar and identify the brush, eraser, and color picker"
)

# 3. Act on the analysis result
for tool in analysis["tools"]:
    if tool["name"] == "brush":
        # Click the brush tool
        x, y = tool["center"]
        pyautogui.click(x, y)
        break

# 4. Draw a shape
pyautogui.dragTo(100, 100, duration=0.5)  # draw a line
pyautogui.dragTo(200, 200, duration=0.5)  # continue the line

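5. Performance Optimization and Best Practices

5.1 Model Inference Optimization

5.1.1 Quantization

import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# 8-bit quantization to reduce GPU memory use (requires bitsandbytes)
model = AutoModelForImageTextToText.from_pretrained(
    "bytedance/UI-TARS-Qwen3-4B-VL",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# 4-bit quantization (more aggressive)
model = AutoModelForImageTextToText.from_pretrained(
    "bytedance/UI-TARS-Qwen3-4B-VL",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map="auto"
)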
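A rough estimate of what quantization saves comes from multiplying the parameter count by the bytes stored per weight. The sketch below does this arithmetic for a 4B-parameter model. It counts weights only and ignores activations, the KV cache, and quantization metadata, so real usage will be higher; the exact 4e9 parameter count is an approximation.

```python
# Bytes stored per weight at each precision.
BYTES_PER_PARAM = {"bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


for precision in ("bfloat16", "int8", "int4"):
    print(f"{precision}: {weight_memory_gib(4e9, precision):.2f} GiB")
# bfloat16: 7.45 GiB
# int8: 3.73 GiB
# int4: 1.86 GiB
```

By this estimate, 4-bit weights leave considerably more headroom for context on a consumer GPU, at some cost in accuracy.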

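5.1.2 Batching and Caching

from concurrent.futures import ThreadPoolExecutor

# Send several screenshots concurrently to raise throughput; the vLLM server
# batches in-flight requests itself. call_local_vllm is defined in section 3.4.1
jobs = [
    ("screenshot1.png", "Analyze the interface"),
    ("screenshot2.png", "Find the button"),
    ("screenshot3.png", "Recognize the text"),
]

# Concurrent inference
with ThreadPoolExecutor(max_workers=4) as pool:  # tune to GPU memory
    results = list(pool.map(lambda job: call_local_vllm(job[1], job[0]), jobs))

# Prefix caching (a shared prompt prefix is not recomputed) is a server-side
# setting: it is turned on by enable-prefix-caching in the vLLM launch
# configuration, so no client-side call is needed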

5.2 Improving Execution Reliability

5.2.1 Adding a Retry Mechanism

from tenacity import retry, stop_after_attempt, wait_exponential

# take_screenshot, click, and agent are assumed to be defined elsewhere
@retry(
    stop=stop_after_attempt(3),  # at most 3 attempts in total
    wait=wait_exponential(multiplier=1, min=4, max=10)  # exponential backoff, 4 to 10 seconds
)
def robust_click(element_description):
    """Click with retries"""
    # 1. Take a screenshot
    screenshot = take_screenshot()

    # 2. Locate the element
    bbox = agent.find_element(screenshot, element_description)
    if not bbox:
        raise RuntimeError(f"Element not found: {element_description}")

    # 3. Click the center of the bounding box
    x, y = (bbox[0] + bbox[2]) // 2, (bbox[1] + bbox[3]) // 2
    click(x, y)

    # 4. Verify the click
    post_screenshot = take_screenshot()
    if not agent.verify_click(post_screenshot, element_description):
        raise RuntimeError("Click verification failed; the click position may be inaccurate")

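5.2.2 Exception Handling and Rollback

import time

class RollbackManager:
    """Manages rollback of operations"""
    def __init__(self):
        self.history = []  # operation history

    def record_state(self, description):
        """Record the current state so it can be rolled back to"""
        screenshot = take_screenshot()
        self.history.append({
            "description": description,
            "screenshot": screenshot,
            "timestamp": time.time()
        })

    def rollback(self, steps=1):
        """Roll back the given number of steps"""
        if steps < 1:
            raise ValueError("steps must be at least 1")
        if len(self.history) < steps:
            raise RuntimeError("Cannot roll back: not enough history")

        # The state to return to
        target_state = self.history[-steps]

        # Perform the rollback as the application requires,
        # e.g. close opened windows or undo file operations
        self._execute_rollback(target_state)

        # Drop the rolled-back steps from the history
        self.history = self.history[:-steps]

    def _execute_rollback(self, target_state):
        """Application-specific undo logic; must be supplied by the implementer"""
        raise NotImplementedError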

5.3 Reducing Resource Usage

5.3.1 Adjusting Inference Precision Dynamically

class AdaptivePrecision:
    """Choose an inference precision based on task complexity"""
    def __init__(self, models):
        # Quantization is fixed when a model is loaded, so each precision tier
        # is a separately loaded model:
        # {"low": 4-bit model, "medium": 8-bit model, "high": bfloat16 model}
        self.models = models
        self.simple_tasks = ["click", "type text", "scroll"]
        self.complex_tasks = ["analyze layout", "complex form", "multi-step"]

    def get_precision(self, task_description):
        """Return a suitable precision tier for the task description"""
        if any(keyword in task_description for keyword in self.simple_tasks):
            return "low"  # 4-bit quantization, fast inference
        elif any(keyword in task_description for keyword in self.complex_tasks):
            return "high"  # bfloat16, most accurate
        else:
            return "medium"  # 8-bit quantization, balanced

    def execute_with_adaptive_precision(self, task):
        """Run the task on the model that matches its precision tier"""
        precision = self.get_precision(task)
        return self.models[precision].generate(task)

5.3.2 Asynchronous and Parallel Execution

// Run several independent tasks asynchronously
async function parallelExecution(tasks) {
  // 1. Split the tasks into independent groups
  const independentTasks = groupIndependentTasks(tasks);

  // 2. Run the groups in parallel
  const results = await Promise.all(
    independentTasks.map(taskGroup =>
      executeTaskGroup(taskGroup)
    )
  );

  // 3. Merge the results
  return mergeResults(results);
}

// Offload heavy work to a Web Worker
function offloadHeavyComputation(task) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('ui-tars-worker.js');

    worker.onmessage = (event) => {
      resolve(event.data);
      worker.terminate();
    };

    worker.onerror = (error) => {
      reject(error);
      worker.terminate();
    };

    worker.postMessage(task);
  });
}

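6. Summary and Outlook

6.1 Technical Value of UI-TARS-desktop

A complete agent stack: an end-to-end solution from perception to execution
Works out of the box: a preconfigured environment lowers the barrier to building multimodal agents
Cross-platform: one codebase supports Windows, macOS, and Linux
Local deployment: data stays on the local machine, which meets enterprise security requirements

6.2 Impact on the Developer Ecosystem

Lowers the barrier to automation: complex desktop operations can be driven by natural language
Moves VLMs into production: gives vision-language models a real application scenario
Helps the agent toolchain mature: encourages more developers to contribute skill libraries

6.3 Future Directions

Multi-agent collaboration: several UI-TARS instances cooperating on complex tasks
Long-term memory: integration with projects such as agentmemory for cross-session memory
3D interface understanding: extending operation to VR/AR environments
Edge deployment: running lightweight models on phones, Raspberry Pi boards, and similar devices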

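7. Resources

Project: https://github.com/bytedance/UI-TARS-desktop
Model weights: HuggingFace - UI-TARS-Qwen3-4B-VL
Documentation: UI-TARS official documentation
Community: Discord - UI-TARS Community

About the author: 程序员茄子 is a full-stack engineer and AI enthusiast who follows AI Agents, multimodal learning, and the open-source ecosystem.
Copyright: This is an original article. Please credit the source when reposting.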
