GLM-4.1V-Thinking Takes On Larger Models in Visual Reasoning with 9 Billion Parameters

2025-07-02

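Multimodal large models are entering a new phase, moving from simple image recognition ("seeing") to complex logical reasoning and deep understanding ("understanding and thinking it through"). Zhipu AI recently released and open-sourced GLM-4.1V-9B-Thinking, a model in the GLM-4.1V-Thinking series, showing new progress in the higher-order cognitive abilities of vision-language models.

The model's core innovation is a training strategy called Reinforcement Learning with Curriculum Sampling (RLCS). It trains the model on tasks arranged from easy to hard, much as people learn, and this yields marked gains on complex reasoning tasks.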
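The easy-to-hard idea can be illustrated with a small sampler. This is only a minimal sketch under stated assumptions, not the procedure used to train GLM-4.1V-Thinking: the `Task` class, the per-task `difficulty` score, the linear schedule, and the `sharpness` parameter are all hypothetical and do not come from the article.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    difficulty: float  # hypothetical score in [0, 1]; 0 is easiest


def curriculum_weights(tasks, progress, sharpness=8.0):
    """Weight tasks by closeness to the current target difficulty.

    progress in [0, 1] is the fraction of training completed. The target
    difficulty is assumed to move linearly from 0 (easy) to 1 (hard).
    """
    return [math.exp(-sharpness * abs(t.difficulty - progress)) for t in tasks]


def sample_batch(tasks, progress, batch_size, rng):
    """Draw a batch whose difficulty mix follows the curriculum weights."""
    weights = curriculum_weights(tasks, progress)
    return rng.choices(tasks, weights=weights, k=batch_size)


if __name__ == "__main__":
    pool = [Task("easy", 0.1), Task("medium", 0.5), Task("hard", 0.9)]
    rng = random.Random(0)
    for progress in (0.0, 0.5, 1.0):
        batch = sample_batch(pool, progress, 8, rng)
        print(progress, [t.name for t in batch])
```

Early in training the sampler mostly returns easy tasks; as `progress` approaches 1 the batches shift toward hard ones. A real system would probably estimate difficulty from the model's own success rate rather than from a fixed label.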

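Most striking is its performance. Although GLM-4.1V-9B-Thinking has only 9 billion parameters, on 18 authoritative benchmarks it matches or even surpasses Qwen2.5-VL-72B, which has 72 billion parameters. This result challenges the conventional view that bigger models are always more capable, and it highlights how much model architecture and efficient training strategies can contribute to both performance and resource savings.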

Figure: performance comparison between GLM-4.1V-9B-Thinking and other models

Related resources:

Paper: GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Open-source repository:
Online demo:
- Hugging Face Spaces
- ModelScope (魔搭社区) Studio
API documentation: Large Model Open Platform (大模型开放平台)

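Core Capabilities and Application Examples

GLM-4.1V-9B-Thinking introduces a chain-of-thought (Chain-of-Thought) mechanism, so it shows its detailed reasoning alongside the answer. This improves the accuracy and richness of its responses and makes the results easier to interpret. Through mixed training, the model integrates a broad range of multimodal capabilities:

Video and image understanding: It can parse videos up to two hours long and answer in-depth questions about complex images, demonstrating strong logical analysis.
Cross-disciplinary problem solving: It solves image-based problems in mathematics, physics, biology, chemistry, and other subjects, and gives detailed reasoning steps.
High-precision information extraction: It accurately recognizes text and charts in images and videos and outputs them in structured form.
Document and interface interaction: It natively understands documents from domains such as finance and government affairs, and it can recognize GUI elements and carry out commands such as tapping and swiping, acting as a "GUI agent".
Visual-to-code generation: It can write front-end code automatically from an input screenshot of an interface.

Several typical application examples follow: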

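Example 1: Chart analysis and reasoning

Image: a chart showing the annual profits of several companies

Input prompt:

Find the company with the highest annual profit in this chart, and that company's largest segment.

Model output:

Sony has the highest annual profit at $6.3B. For Sony, the note states, “Media is Sony’s second-largest segment after game consoles.” Thus, the largest segment is game consoles.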

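Example 2: Solving a Gaokao-level mathematics problem

Image: a solid geometry problem

Input prompt:

Please solve this problem and give the detailed steps and the answer.

Model output (part of the reasoning):

Given PA ⊥ plane ABCD, it follows that PA ⊥ AD and PA ⊥ AB.
Since AB ⊥ AD as well, and PA and AB intersect at A, AD ⊥ plane PAB.
Because AD ⊂ plane PAD, plane PAB ⊥ plane PAD.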

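Example 3: Graphical user interface (GUI) recognition and operation

Image: a screenshot of a calendar app

Input prompt:

In the app, create a meeting with Dr. Smith at 3 o'clock two weeks from now.

Model output (action command):

Action: {"action_type": "click", "box_2d": [[27,840,122,885]]}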
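An agent wrapper has to turn an action like the one above into a concrete tap position. The sketch below shows one way to do that, and it rests on assumptions the article does not confirm: that the action arrives as valid JSON with straight quotes, that `box_2d` holds `[x1, y1, x2, y2]`, and that coordinates are normalized to a 0-1000 range. The helper names `parse_action` and `click_point` are hypothetical.

```python
import json


def parse_action(text):
    """Parse a line such as 'Action: {...}' into a dict."""
    prefix = "Action:"
    stripped = text.strip()
    if not stripped.startswith(prefix):
        raise ValueError("expected the output to start with 'Action:'")
    return json.loads(stripped[len(prefix):])


def click_point(action, screen_w, screen_h, scale=1000):
    """Return the pixel at the center of the first box.

    Assumes box_2d is [x1, y1, x2, y2] normalized to [0, scale];
    this convention is an assumption, not stated in the article.
    """
    x1, y1, x2, y2 = action["box_2d"][0]
    cx = (x1 + x2) / 2 / scale * screen_w
    cy = (y1 + y2) / 2 / scale * screen_h
    return round(cx), round(cy)


if __name__ == "__main__":
    raw = 'Action: {"action_type": "click", "box_2d": [[27,840,122,885]]}'
    action = parse_action(raw)
    print(action["action_type"], click_point(action, 1080, 1920))
```

On an assumed 1080×1920 screen, the box center lands near the bottom-left corner, which is a plausible location for a button in a calendar app. If the model instead emits absolute pixel coordinates, the scaling step should be dropped.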

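Example 4: Generating front-end web page code from an image

This capability shows the model converting a visual design directly into usable code, which can speed up front-end development.

Image: a web page design for an order management system

Input prompt:

Please build a web page similar to the input image and convert it into React code.

React code generated by the model: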

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>订单管理系统</title>
<script src="https://cdn.jsdelivr.net/npm/react@18.0.0/umd/react.development.js"></script>
<script src="https://cdn.jsdelivr.net/npm/react-dom@18.0.0/umd/react-dom.development.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@babel/standalone/babel.js"></script>
<script src="https://cdn.tailwindcss.com"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.3/css/all.min.css"></link>
<link href="https://fonts.geekzu.org/css2?family=Noto+Sans+SC:wght@400;500;700&display=swap" rel="stylesheet">
<style>
body {
font-family: 'Noto Sans SC', sans-serif;
background-color: #f5f5f5;
}
</style>
</head>
<body>
<div id="root"></div>
<script type="text/babel">
const Header = () => {
return (
<header className="bg-white shadow-sm py-4 px-6 flex justify-between items-center border-b">
<h1 className="text-xl font-medium text-indigo-600">订单管理系统</h1>
<div className="flex items-center">
<img src="https://placehold.co/32x32" alt="管理员头像" className="w-8 h-8 rounded-full mr-2" />
<span>管理员</span>
</div>
</header>
);
};
const SearchForm = () => {
return (
<div className="bg-white rounded-md shadow-sm p-6 mb-4">
<h2 className="text-lg font-medium mb-4">搜索订单</h2>
<div className="grid grid-cols-1 md:grid-cols-4 gap-4">
<div>
<label className="block text-sm mb-1">订单编号</label>
<input type="text" className="w-full border rounded-md px-3 py-2" />
</div>
<div>
<label className="block text-sm mb-1">客户名称</label>
<input type="text" className="w-full border rounded-md px-3 py-2" />
</div>
<div>
<label className="block text-sm mb-1">订单状态</label>
<select className="w-full border rounded-md px-3 py-2 appearance-none bg-white">
<option>全部状态</option>
</select>
</div>
<div>
<label className="block text-sm mb-1">日期范围</label>
<div className="relative">
<input type="text" placeholder="选择日期范围" className="w-full border rounded-md px-3 py-2" />
<div className="absolute right-2 top-2 text-gray-400">
<i className="fas fa-calendar"></i>
</div>
</div>
<div className="flex flex-wrap mt-2 gap-2">
<button className="px-3 py-1 text-sm rounded-md bg-gray-100">今天</button>
<button className="px-3 py-1 text-sm rounded-md bg-gray-100">昨天</button>
<button className="px-3 py-1 text-sm rounded-md bg-gray-100">最近7天</button>
<button className="px-3 py-1 text-sm rounded-md bg-gray-100">最近30天</button>
<button className="px-3 py-1 text-sm rounded-md bg-gray-100">本月</button>
<button className="px-3 py-1 text-sm rounded-md bg-gray-100">上月</button>
</div>
</div>
</div>
<div className="flex items-center justify-between mt-4">
<a href="#" className="text-indigo-600 flex items-center text-sm">
<i className="fas fa-cog mr-1"></i> 高级搜索选项
</a>
<div className="flex gap-2">
<button className="px-4 py-2 border rounded-md text-sm">重置</button>
<button className="px-4 py-2 bg-indigo-600 text-white rounded-md text-sm">搜索</button>
</div>
</div>
</div>
);
};
const OrderStatusFilters = () => {
const filters = [
{ name: "全部订单", count: 152, active: true },
{ name: "待处理", count: 24, active: false },
{ name: "处理中", count: 38, active: false },
{ name: "已发货", count: 45, active: false },
{ name: "已送达", count: 32, active: false },
{ name: "已取消", count: 8, active: false },
{ name: "已退货", count: 5, active: false },
];
return (
<div className="flex flex-wrap gap-2 mb-4">
{filters.map((filter, index) => (
<button 
key={index} 
className={`px-4 py-2 rounded-md text-sm ${
filter.active 
? "bg-indigo-100 text-indigo-700" 
: "bg-white border"
}`}
>
{filter.name} {filter.count > 0 && <span className="ml-1">{filter.count}</span>}
</button>
))}
</div>
);
};
const OrderTable = () => {
const orders = [
{ id: "ORD-2025051301", customer: "张伟", amount: "¥1299.99", status: "待处理", date: "2025-05-13" },
{ id: "ORD-2025051302", customer: "李娜", amount: "¥458.50", status: "处理中", date: "2025-05-12" },
{ id: "ORD-2025051303", customer: "王芳", amount: "¥2199.00", status: "已发货", date: "2025-05-11" },
{ id: "ORD-2025051304", customer: "刘强", amount: "¥899.90", status: "已送达", date: "2025-05-10" },
{ id: "ORD-2025051305", customer: "陈明", amount: "¥3450.00", status: "已取消", date: "2025-05-09" },
{ id: "ORD-2025051306", customer: "赵丽", amount: "¥1788.00", status: "已退货", date: "2025-05-08" },
{ id: "ORD-2025051307", customer: "杨洋", amount: "¥599.99", status: "待处理", date: "2025-05-07" },
{ id: "ORD-2025051308", customer: "周杰", amount: "¥1299.00", status: "处理中", date: "2025-05-06" },
{ id: "ORD-2025051309", customer: "吴秀英", amount: "¥899.50", status: "已发货", date: "2025-05-05" },
{ id: "ORD-2025051310", customer: "郑伟", amount: "¥2499.00", status: "已送达", date: "2025-05-04" },
];
const getStatusClass = (status) => {
switch(status) {
case "待处理": return "bg-yellow-100 text-yellow-800";
case "处理中": return "bg-blue-100 text-blue-800";
case "已发货": return "bg-purple-100 text-purple-800";
case "已送达": return "bg-green-100 text-green-800";
case "已取消": return "bg-red-100 text-red-800";
case "已退货": return "bg-gray-100 text-gray-800";
default: return "bg-gray-100 text-gray-800";
}
};
return (
<div className="overflow-x-auto">
<table className="min-w-full bg-white">
<thead>
<tr className="bg-gray-50 text-left text-sm">
<th className="px-6 py-3 font-medium text-gray-500">订单编号 <i className="fas fa-sort ml-1"></i></th>
<th className="px-6 py-3 font-medium text-gray-500">客户名称 <i className="fas fa-sort ml-1"></i></th>
<th className="px-6 py-3 font-medium text-gray-500">订单金额 <i className="fas fa-sort ml-1"></i></th>
<th className="px-6 py-3 font-medium text-gray-500">状态 <i className="fas fa-sort ml-1"></i></th>
<th className="px-6 py-3 font-medium text-gray-500">创建日期 <i className="fas fa-sort ml-1"></i></th>
<th className="px-6 py-3 font-medium text-gray-500">操作</th>
</tr>
</thead>
<tbody className="divide-y divide-gray-200">
{orders.map((order, index) => (
<tr key={index} className="hover:bg-gray-50">
<td className="px-6 py-4 text-sm">{order.id}</td>
<td className="px-6 py-4 text-sm">{order.customer}</td>
<td className="px-6 py-4 text-sm">{order.amount}</td>
<td className="px-6 py-4">
<span className={`px-2 py-1 rounded-full text-xs ${getStatusClass(order.status)}`}>
{order.status}
</span>
</td>
<td className="px-6 py-4 text-sm">{order.date}</td>
<td className="px-6 py-4 text-sm">
<button className="text-indigo-600 mr-3">查看</button>
<button className="text-indigo-600">编辑</button>
</td>
</tr>
))}
</tbody>
</table>
</div>
);
};
const Pagination = () => {
return (
<div className="flex items-center justify-between mt-4 text-sm">
<div>显示第 1 到 10 条，共 10 条记录</div>
<div className="flex items-center">
<button className="px-3 py-1 border rounded-l-md">
<i className="fas fa-chevron-left"></i>
</button>
<button className="px-3 py-1 border-t border-b bg-indigo-100 text-indigo-600">1</button>
<button className="px-3 py-1 border rounded-r-md">
<i className="fas fa-chevron-right"></i>
</button>
</div>
</div>
);
};
const Footer = () => {
return (
<footer className="py-4 text-center text-gray-500 text-sm">
© 2025 订单管理系统. 保留所有权利.
</footer>
);
};
const App = () => {
return (
<div className="min-h-screen flex flex-col">
<Header />
<main className="flex-1 max-w-7xl mx-auto w-full px-4 py-6">
<h1 className="text-2xl font-medium mb-6">订单管理</h1>
<SearchForm />
<OrderStatusFilters />
<OrderTable />
<Pagination />
</main>
<Footer />
</div>
);
};
ReactDOM.render(<App />, document.getElementById('root'));
</script>
</body>
</html>

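Image: screenshot of the rendered order management system page

Example 5: Explaining the principle behind a physical phenomenon

Image: a capillary action experiment in which colored water from two cups flows through paper towels into an empty cup in the middle

Input prompt:

What causes this phenomenon? Why is there water in the middle cup?

Model output:

This phenomenon is caused by capillary action. Specifically, a paper towel is a porous material, and the gaps between its fibers act like capillary tubes. When the two cups of differently colored liquid on the left and right are connected to the empty middle cup by paper towels, the towels use their absorbency and the capillary action between fibers to carry the red liquid on the left and the blue liquid on the right through the fiber gaps into the empty middle cup. As this transport continues, the middle cup gradually fills with liquid, which is why the middle cup ends up with water.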

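Technical Architecture and Training Method

GLM-4.1V-Thinking's efficiency comes from its model architecture and its staged training pipeline.

Model Architecture

The architecture consists of three parts: a visual encoder (ViT Encoder), an adapter (MLP Projector), and a language decoder (Language Decoder).

Figure: GLM-4.1V-Thinking model architecture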

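The model uses AIMv2-Huge as its visual encoder and extends the encoder's 2D convolutions to 3D so that the temporal dimension of video input is handled efficiently. To adapt to images of arbitrary resolution and aspect ratio, the model introduces two key improvements:

2D rotary position embedding (2D-RoPE): This helps the model capture spatial relationships within an image, so it can stably handle extreme aspect ratios above 200:1 and high-resolution images beyond 4K.
Dynamic resolution adaptation: By retaining the pretrained ViT's absolute position embeddings and applying bicubic interpolation, the model adapts dynamically to inputs of different resolutions during training.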
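To make the dynamic resolution idea concrete, the following is a rough, dependency-free sketch of resizing a grid of absolute position embeddings with separable bicubic interpolation. The article names only the technique; the kernel coefficient `a=-0.5`, the border clamping, the pixel-center coordinate mapping, and the function names are assumptions for illustration. A real implementation would almost certainly use a tensor library instead of nested lists.

```python
import math


def cubic_weight(t, a=-0.5):
    """Keys cubic convolution kernel; the coefficient a is an assumed choice."""
    t = abs(t)
    if t <= 1:
        return (a + 2) * t**3 - (a + 3) * t**2 + 1
    if t < 2:
        return a * t**3 - 5 * a * t**2 + 8 * a * t - 4 * a
    return 0.0


def resize_axis(vectors, new_len):
    """Resample a list of equal-length vectors to new_len entries."""
    old_len, dim = len(vectors), len(vectors[0])
    out = []
    for i in range(new_len):
        # Map the output index to a source coordinate (pixel-center convention).
        src = (i + 0.5) * old_len / new_len - 0.5
        base = math.floor(src)
        acc = [0.0] * dim
        for k in range(base - 1, base + 3):  # four neighboring samples
            w = cubic_weight(src - k)
            vec = vectors[min(max(k, 0), old_len - 1)]  # clamp at the borders
            for d in range(dim):
                acc[d] += w * vec[d]
        out.append(acc)
    return out


def resize_pos_embed(grid, new_h, new_w):
    """grid[y][x] is an embedding vector; returns a new_h x new_w grid."""
    # Pass 1: interpolate along x inside every row.
    rows = [resize_axis(row, new_w) for row in grid]
    # Pass 2: interpolate along y inside every column, then transpose back.
    cols = [[rows[y][x] for y in range(len(rows))] for x in range(new_w)]
    cols = [resize_axis(col, new_h) for col in cols]
    return [[cols[x][y] for x in range(new_w)] for y in range(new_h)]


if __name__ == "__main__":
    # A 2x3 grid of 2-dimensional embeddings, resized to 4x6.
    grid = [[[float(x + 10 * y), 1.0] for x in range(3)] for y in range(2)]
    resized = resize_pos_embed(grid, 4, 6)
    print(len(resized), len(resized[0]), len(resized[0][0]))
```

Because the kernel weights sum to one, a constant embedding channel stays constant after resizing, and resizing a grid to its own size returns the original values.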

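In the language decoder, the original rotary position embedding (RoPE) is extended to a 3D rotary position embedding (3D-RoPE), which significantly strengthens the model's spatial understanding of mixed image, text, and video input without affecting its text-only performance.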
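One common way to extend RoPE to several axes is to split each attention head's vector into one chunk per axis and apply ordinary 1D RoPE to each chunk with that axis's position. The sketch below follows that scheme for both the 2D case (height, width) and the 3D case (time, height, width). The equal split of the head dimension, the base of 10000, and the function names are assumptions; the article does not describe how GLM-4.1V-Thinking partitions the dimensions.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angles pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def rope_nd(vec, position, base=10000.0):
    """Apply RoPE per axis: position=(h, w) for 2D or (t, h, w) for 3D.

    The vector is split into len(position) equal chunks (an assumed layout).
    """
    n = len(position)
    chunk = len(vec) // n
    if chunk * n != len(vec) or chunk % 2:
        raise ValueError("head dim must split into even-sized chunks per axis")
    out = []
    for axis, pos in enumerate(position):
        out += rope_1d(vec[axis * chunk:(axis + 1) * chunk], pos, base)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.1 * i for i in range(12)]
    k = [0.3 - 0.05 * i for i in range(12)]
    # Shifting both positions by the same offset leaves the score unchanged.
    s1 = dot(rope_nd(q, (1, 2, 3)), rope_nd(k, (4, 6, 5)))
    s2 = dot(rope_nd(q, (11, 12, 13)), rope_nd(k, (14, 16, 15)))
    print(abs(s1 - s2) < 1e-9)
```

The property checked at the end is the reason RoPE suits variable-size inputs: the attention score between a query and a key depends only on their relative offset along each axis, not on their absolute positions.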

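Training Pipeline

Training proceeds in three stages: pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL).

Pretraining: This stage has two sub-stages, general multimodal pretraining and long-context continued training. The former builds basic multimodal understanding; the latter introduces video frame sequences and very long image-text content, extending the sequence length the model can process to 32,768 to better handle high-resolution images and long videos.
Supervised fine-tuning (SFT): In this stage the model undergoes full-parameter fine-tuning on a high-quality chain-of-thought (CoT) dataset. All training samples use a single unified format that forces the model to learn to generate a detailed reasoning process rather than giving the answer directly.
```
<think> {reasoning process} </think> <answer> {final answer} </answer>
```
This step effectively strengthens the model's long-form causal reasoning.
Reinforcement Learning with Curriculum Sampling (RLCS): This stage is key to the model's performance. Starting from the fine-tuned model, the development team combined reinforcement learning with verifiable rewards (RLVR) and reinforcement learning from human feedback (RLHF). Through the "curriculum sampling" mechanism, the model starts with simple tasks across several dimensions, including STEM problem solving, GUI interaction, and document understanding, and gradually moves to complex ones. This easy-to-hard dynamic learning paradigm improves the model's usefulness, accuracy, and stability across the board.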
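The unified `<think>`/`<answer>` format from the SFT stage also makes a verifiable reward easy to compute: extract the final answer and compare it with a known reference. The following is a minimal sketch of that idea, not the reward system used by the GLM-4.1V-Thinking team. The exact-match rule, the zero reward for malformed output, and the function names are assumptions.

```python
import re

# Matches the unified format: <think> ... </think> <answer> ... </answer>
RESPONSE_PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


def split_response(text):
    """Return (reasoning, answer), or None if the format is not followed."""
    match = RESPONSE_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def verifiable_reward(response, reference):
    """Return 1.0 for a well-formed response whose answer matches exactly.

    Malformed output earns 0.0; this policy is an assumption for illustration.
    """
    parsed = split_response(response)
    if parsed is None:
        return 0.0
    _, answer = parsed
    return 1.0 if answer == reference.strip() else 0.0


if __name__ == "__main__":
    good = "<think> 2 + 2 = 4 </think> <answer> 4 </answer>"
    print(split_response(good))
    print(verifiable_reward(good, "4"), verifiable_reward("4", "4"))
```

Exact string matching only suits tasks with a single canonical answer. Tasks such as GUI grounding or open-ended document questions would need a task-specific checker, for example comparing box overlap instead of text.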
