The development of multimodal large models is entering a new stage, moving from simple image recognition ("seeing") to complex logical reasoning and deep understanding ("seeing and thinking"). Recently, Zhipu AI released and open-sourced GLM-4.1V-9B-Thinking, a model in the GLM-4.1V-Thinking series, demonstrating new advances in the higher-order cognitive capabilities of vision-language models.
The model's central innovation is a training strategy called Reinforcement Learning with Curriculum Sampling (RLCS). This approach schedules training tasks from easy to difficult, much like the human learning process, and yields significant gains on complex reasoning tasks.
Most striking is its performance. Despite having only 9 billion parameters, GLM-4.1V-9B-Thinking matches or even exceeds the 72-billion-parameter Qwen2.5-VL-72B on 18 authoritative benchmarks. This result challenges the traditional notion that "the bigger the model, the more powerful it is" and highlights the potential of advanced model architectures and efficient training strategies to improve performance while saving resources.
Links to related resources:
- Paper: GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
- Open-source code repository
- Online demo
- API documentation: BigModel Open Platform
Core model capabilities and application demonstrations
By introducing a Chain-of-Thought reasoning mechanism, GLM-4.1V-9B-Thinking can show its detailed reasoning process while outputting the answer. This not only improves the accuracy and richness of the answers but also makes the results more interpretable. Through hybrid training, the model integrates broad multimodal processing capabilities:
- Video and image understanding: Parses videos up to two hours long and conducts in-depth Q&A on complex image content, demonstrating strong logical analysis.
- Interdisciplinary problem solving: Solves problems presented as images across math, physics, biology, chemistry, and other subjects, and provides detailed reasoning steps.
- High-precision information extraction: Accurately recognizes text and graphical information in images and videos and outputs it in structured form.
- Document and interface interaction: Natively understands documents in finance, government, and other domains, recognizes GUI elements, and executes commands such as click and swipe, acting as a "GUI agent".
- Visual-to-code generation: Automatically writes front-end code from a screenshot of an interface.
Below are a few examples of typical applications:
Example 1: Graphical Analysis and Reasoning
Input prompt:
Identify the company with the highest annual profit in this chart, and what is that company's largest division?
Model output:
Sony has the highest annual profit at $6.3B. For Sony, the note states, "Media is Sony's second-largest segment after game consoles." Thus, the largest segment is game consoles.
Example 2: Solving Higher Level Math Questions
Input prompt:
Please help me solve this problem, giving the detailed process and the answer.
Model output (partial reasoning process):
- It is known that PA ⊥ plane ABCD, so PA ⊥ AD and PA ⊥ AB.
- Since AD ⊥ PA, AD ⊥ AB, and PA ∩ AB = A, it follows that AD ⊥ plane PAB.
- Since AD ⊂ plane PAD, plane PAB ⊥ plane PAD.
Example 3: Graphical User Interface (GUI) Identification and Operation
Input prompt:
In the app, help me create a meeting with Dr. Smith in two weeks at 3 o'clock.
Model output (operation instruction):
Action: {"action_type": "click", "box_2d": [[27,840,122,885]]}
Example 4: Generating front-end web page code from an image
This feature demonstrates the model's powerful ability to translate visual designs directly into usable code, greatly enhancing front-end development efficiency.
Input prompt:
Please build a web page similar to the input image and convert it to React code.
Model-generated React code:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>订单管理系统</title>
<script src="https://cdn.jsdelivr.net/npm/react@18.0.0/umd/react.development.js"></script>
<script src="https://cdn.jsdelivr.net/npm/react-dom@18.0.0/umd/react-dom.development.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@babel/standalone/babel.js"></script>
<script src="https://cdn.tailwindcss.com"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.3/css/all.min.css"></link>
<link href="https://fonts.geekzu.org/css2?family=Noto+Sans+SC:wght@400;500;700&display=swap" rel="stylesheet">
<style>
body {
font-family: 'Noto Sans SC', sans-serif;
background-color: #f5f5f5;
}
</style>
</head>
<body>
<div id="root"></div>
<script type="text/babel">
const Header = () => {
return (
<header className="bg-white shadow-sm py-4 px-6 flex justify-between items-center border-b">
<h1 className="text-xl font-medium text-indigo-600">订单管理系统</h1>
<div className="flex items-center">
<img src="https://placehold.co/32x32" alt="管理员头像" className="w-8 h-8 rounded-full mr-2" />
<span>管理员</span>
</div>
</header>
);
};
const SearchForm = () => {
return (
<div className="bg-white rounded-md shadow-sm p-6 mb-4">
<h2 className="text-lg font-medium mb-4">搜索订单</h2>
<div className="grid grid-cols-1 md:grid-cols-4 gap-4">
<div>
<label className="block text-sm mb-1">订单编号</label>
<input type="text" className="w-full border rounded-md px-3 py-2" />
</div>
<div>
<label className="block text-sm mb-1">客户名称</label>
<input type="text" className="w-full border rounded-md px-3 py-2" />
</div>
<div>
<label className="block text-sm mb-1">订单状态</label>
<select className="w-full border rounded-md px-3 py-2 appearance-none bg-white">
<option>全部状态</option>
</select>
</div>
<div>
<label className="block text-sm mb-1">日期范围</label>
<div className="relative">
<input type="text" placeholder="选择日期范围" className="w-full border rounded-md px-3 py-2" />
<div className="absolute right-2 top-2 text-gray-400">
<i className="fas fa-calendar"></i>
</div>
</div>
<div className="flex flex-wrap mt-2 gap-2">
<button className="px-3 py-1 text-sm rounded-md bg-gray-100">今天</button>
<button className="px-3 py-1 text-sm rounded-md bg-gray-100">昨天</button>
<button className="px-3 py-1 text-sm rounded-md bg-gray-100">最近7天</button>
<button className="px-3 py-1 text-sm rounded-md bg-gray-100">最近30天</button>
<button className="px-3 py-1 text-sm rounded-md bg-gray-100">本月</button>
<button className="px-3 py-1 text-sm rounded-md bg-gray-100">上月</button>
</div>
</div>
</div>
<div className="flex items-center justify-between mt-4">
<a href="#" className="text-indigo-600 flex items-center text-sm">
<i className="fas fa-cog mr-1"></i> 高级搜索选项
</a>
<div className="flex gap-2">
<button className="px-4 py-2 border rounded-md text-sm">重置</button>
<button className="px-4 py-2 bg-indigo-600 text-white rounded-md text-sm">搜索</button>
</div>
</div>
</div>
);
};
const OrderStatusFilters = () => {
const filters = [
{ name: "全部订单", count: 152, active: true },
{ name: "待处理", count: 24, active: false },
{ name: "处理中", count: 38, active: false },
{ name: "已发货", count: 45, active: false },
{ name: "已送达", count: 32, active: false },
{ name: "已取消", count: 8, active: false },
{ name: "已退货", count: 5, active: false },
];
return (
<div className="flex flex-wrap gap-2 mb-4">
{filters.map((filter, index) => (
<button
key={index}
className={`px-4 py-2 rounded-md text-sm ${
filter.active
? "bg-indigo-100 text-indigo-700"
: "bg-white border"
}`}
>
{filter.name} {filter.count > 0 && <span className="ml-1">{filter.count}</span>}
</button>
))}
</div>
);
};
const OrderTable = () => {
const orders = [
{ id: "ORD-2025051301", customer: "张伟", amount: "¥1299.99", status: "待处理", date: "2025-05-13" },
{ id: "ORD-2025051302", customer: "李娜", amount: "¥458.50", status: "处理中", date: "2025-05-12" },
{ id: "ORD-2025051303", customer: "王芳", amount: "¥2199.00", status: "已发货", date: "2025-05-11" },
{ id: "ORD-2025051304", customer: "刘强", amount: "¥899.90", status: "已送达", date: "2025-05-10" },
{ id: "ORD-2025051305", customer: "陈明", amount: "¥3450.00", status: "已取消", date: "2025-05-09" },
{ id: "ORD-2025051306", customer: "赵丽", amount: "¥1788.00", status: "已退货", date: "2025-05-08" },
{ id: "ORD-2025051307", customer: "杨洋", amount: "¥599.99", status: "待处理", date: "2025-05-07" },
{ id: "ORD-2025051308", customer: "周杰", amount: "¥1299.00", status: "处理中", date: "2025-05-06" },
{ id: "ORD-2025051309", customer: "吴秀英", amount: "¥899.50", status: "已发货", date: "2025-05-05" },
{ id: "ORD-2025051310", customer: "郑伟", amount: "¥2499.00", status: "已送达", date: "2025-05-04" },
];
const getStatusClass = (status) => {
switch(status) {
case "待处理": return "bg-yellow-100 text-yellow-800";
case "处理中": return "bg-blue-100 text-blue-800";
case "已发货": return "bg-purple-100 text-purple-800";
case "已送达": return "bg-green-100 text-green-800";
case "已取消": return "bg-red-100 text-red-800";
case "已退货": return "bg-gray-100 text-gray-800";
default: return "bg-gray-100 text-gray-800";
}
};
return (
<div className="overflow-x-auto">
<table className="min-w-full bg-white">
<thead>
<tr className="bg-gray-50 text-left text-sm">
<th className="px-6 py-3 font-medium text-gray-500">订单编号 <i className="fas fa-sort ml-1"></i></th>
<th className="px-6 py-3 font-medium text-gray-500">客户名称 <i className="fas fa-sort ml-1"></i></th>
<th className="px-6 py-3 font-medium text-gray-500">订单金额 <i className="fas fa-sort ml-1"></i></th>
<th className="px-6 py-3 font-medium text-gray-500">状态 <i className="fas fa-sort ml-1"></i></th>
<th className="px-6 py-3 font-medium text-gray-500">创建日期 <i className="fas fa-sort ml-1"></i></th>
<th className="px-6 py-3 font-medium text-gray-500">操作</th>
</tr>
</thead>
<tbody className="divide-y divide-gray-200">
{orders.map((order, index) => (
<tr key={index} className="hover:bg-gray-50">
<td className="px-6 py-4 text-sm">{order.id}</td>
<td className="px-6 py-4 text-sm">{order.customer}</td>
<td className="px-6 py-4 text-sm">{order.amount}</td>
<td className="px-6 py-4">
<span className={`px-2 py-1 rounded-full text-xs ${getStatusClass(order.status)}`}>
{order.status}
</span>
</td>
<td className="px-6 py-4 text-sm">{order.date}</td>
<td className="px-6 py-4 text-sm">
<button className="text-indigo-600 mr-3">查看</button>
<button className="text-indigo-600">编辑</button>
</td>
</tr>
))}
</tbody>
</table>
</div>
);
};
const Pagination = () => {
return (
<div className="flex items-center justify-between mt-4 text-sm">
<div>显示第 1 到 10 条,共 10 条记录</div>
<div className="flex items-center">
<button className="px-3 py-1 border rounded-l-md">
<i className="fas fa-chevron-left"></i>
</button>
<button className="px-3 py-1 border-t border-b bg-indigo-100 text-indigo-600">1</button>
<button className="px-3 py-1 border rounded-r-md">
<i className="fas fa-chevron-right"></i>
</button>
</div>
</div>
);
};
const Footer = () => {
return (
<footer className="py-4 text-center text-gray-500 text-sm">
© 2025 订单管理系统. 保留所有权利.
</footer>
);
};
const App = () => {
return (
<div className="min-h-screen flex flex-col">
<Header />
<main className="flex-1 max-w-7xl mx-auto w-full px-4 py-6">
<h1 className="text-2xl font-medium mb-6">订单管理</h1>
<SearchForm />
<OrderStatusFilters />
<OrderTable />
<Pagination />
</main>
<Footer />
</div>
);
};
ReactDOM.render(<App />, document.getElementById('root'));
</script>
</body>
</html>
Example 5: Principle Analysis of a Physical Phenomenon
Input prompt:
What causes this phenomenon? Why does water appear in the middle glass?
Model output:
This phenomenon is due to **capillary action (the capillary phenomenon)**. Specifically: tissue paper is a porous material, and the gaps between its fibers behave like "capillaries". When the left and right cups containing differently colored liquids are connected to the empty cup in the middle by paper towels, the towels use their absorbency and the capillary action between the fibers to transport the red liquid on the left and the blue liquid on the right through the fiber gaps into the empty middle cup. As this transport proceeds, the middle cup gradually fills with liquid, producing the phenomenon of "water in the middle cup".
Technical architecture and training methodology
GLM-4.1V-Thinking's efficient performance stems from its refined model architecture and staged training process.
Model architecture
The model architecture consists of three parts: a visual encoder (ViT Encoder), an adapter (MLP Projector) and a language decoder.
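To make this three-part layout concrete, here is a minimal PyTorch sketch of how such a pipeline can be wired together: visual features are encoded, projected by an MLP adapter into the language model's hidden space, and prefixed to the text embeddings. The class name, dimensions, and the Identity stand-ins for the encoder and decoder are illustrative assumptions, not the actual GLM-4.1V implementation.

```python
import torch
import torch.nn as nn

class VLMThreePartSketch(nn.Module):
    """Illustrative sketch: visual encoder -> MLP projector -> language decoder."""
    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.visual_encoder = nn.Identity()    # stands in for the ViT encoder
        self.projector = nn.Sequential(        # MLP adapter aligning feature spaces
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.language_decoder = nn.Identity()  # stands in for the decoder stack

    def forward(self, patch_feats: torch.Tensor, text_embeds: torch.Tensor):
        vis = self.visual_encoder(patch_feats)      # (B, N_img, vit_dim)
        vis = self.projector(vis)                   # (B, N_img, llm_dim)
        seq = torch.cat([vis, text_embeds], dim=1)  # visual tokens prefix the text
        return self.language_decoder(seq)

model = VLMThreePartSketch()
out = model(torch.randn(1, 256, 1024), torch.randn(1, 32, 4096))
print(out.shape)  # torch.Size([1, 288, 4096])
```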
The model uses AIMv2-Huge as its visual encoder, extending its 2D convolutions to 3D to efficiently handle the temporal dimension of video input. To enhance adaptability to images of arbitrary resolution and aspect ratio, two key improvements are introduced:
- Two-dimensional rotary position embedding (2D-RoPE): Helps the model better understand spatial relationships within an image, enabling it to handle extreme aspect ratios beyond 200:1 and resolutions above 4K stably.
- Dynamic resolution adaptation: By retaining the absolute position embeddings of the pre-trained ViT and resampling them with bicubic interpolation, the model dynamically adapts to inputs of different resolutions during training (see the sketch after this list).
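A minimal sketch of the bicubic resampling idea mentioned above: a ViT's learned absolute position embeddings, trained on one patch grid, are interpolated to a new grid so that images of other resolutions and aspect ratios can be encoded. Function and variable names are illustrative, and the handling of any class-token embedding is omitted.

```python
import torch
import torch.nn.functional as F

def resize_abs_pos_embed(pos_embed: torch.Tensor,
                         old_grid: tuple, new_grid: tuple) -> torch.Tensor:
    """Bicubically interpolate patch position embeddings to a new grid.
    pos_embed: (1, old_h * old_w, dim), patch tokens only."""
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    dim = pos_embed.shape[-1]
    # (1, N, dim) -> (1, dim, old_h, old_w) so spatial interpolation applies
    grid = pos_embed.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_h, new_w),
                         mode="bicubic", align_corners=False)
    # back to (1, new_h * new_w, dim)
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)

# Example: adapt a 16x16 patch grid to a 32x24 grid (a wider, taller input).
pe = torch.randn(1, 16 * 16, 1024)
pe_resized = resize_abs_pos_embed(pe, (16, 16), (32, 24))
print(pe_resized.shape)  # torch.Size([1, 768, 1024])
```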
In the language decoder, the model extends the original rotary position embedding (RoPE) to a three-dimensional rotary position embedding (3D-RoPE), which significantly enhances spatial comprehension when processing mixed image and video inputs without affecting plain-text performance.
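The sketch below illustrates the general idea behind assigning three-axis (time, height, width) positions to a mixed text-plus-video token sequence, in the spirit of 3D-RoPE. The exact indexing and offset scheme used by GLM-4.1V-Thinking may differ, so treat this as an assumption-laden toy, not the model's implementation.

```python
import torch

def build_3d_position_ids(num_text_tokens: int, t: int, h: int, w: int):
    """Toy (time, height, width) position ids for text tokens followed by
    t*h*w video patch tokens. For text tokens all three axes advance together,
    degenerating to ordinary 1D RoPE; for visual tokens each axis tracks the
    patch's temporal/spatial location. Illustrative only."""
    # text part: positions 0..num_text_tokens-1 repeated on every axis
    txt = torch.arange(num_text_tokens)
    text_ids = torch.stack([txt, txt, txt], dim=0)               # (3, n_text)
    # visual part: enumerate the t x h x w grid, offset after the text
    tt, hh, ww = torch.meshgrid(torch.arange(t), torch.arange(h),
                                torch.arange(w), indexing="ij")
    vis_ids = torch.stack([tt.flatten(), hh.flatten(), ww.flatten()],
                          dim=0) + num_text_tokens
    return torch.cat([text_ids, vis_ids], dim=1)                 # (3, n_total)

ids = build_3d_position_ids(num_text_tokens=8, t=4, h=3, w=3)
print(ids.shape)  # torch.Size([3, 44])
```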
Training process
The training of the model is divided into three stages: pre-training, supervised fine-tuning (SFT) and reinforcement learning (RL).
- Pre-training: Divided into two sub-phases, general multimodal pre-training and long-context continued training. The former establishes basic multimodal understanding; the latter extends the model's sequence length to 32,768 tokens by introducing video frame sequences and very long image-text content, improving its handling of high-resolution images and long videos.
- Supervised fine-tuning (SFT): In this phase, the model is fine-tuned with full parameters on a high-quality Chain-of-Thought (CoT) dataset. All training samples follow a uniform format, forcing the model to learn to generate a detailed reasoning process instead of answering directly (a minimal parsing sketch for this format appears at the end of this section):
<think> {reasoning process} </think> <answer> {final answer} </answer>
This step effectively strengthens the model's long-chain causal reasoning ability.
- Reinforcement Learning with Curriculum Sampling (RLCS): This is the key to improving model performance. Starting from the supervised fine-tuned model, the development team combined Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning from Human Feedback (RLHF). Through a curriculum sampling mechanism, the model begins with simple tasks across multiple dimensions, such as STEM problem solving, GUI interaction, and document understanding, and gradually transitions to complex tasks. This easy-to-difficult dynamic learning paradigm optimizes the model's utility, accuracy, and stability (a toy sampler sketch follows below).
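For the SFT format shown above, a downstream application typically needs to separate the reasoning trace from the final answer. The following sketch does this with a regular expression; it assumes the tags appear exactly once per response, which real outputs may violate.

```python
import re

THINK_ANSWER = re.compile(
    r"<think>\s*(?P<think>.*?)\s*</think>\s*<answer>\s*(?P<answer>.*?)\s*</answer>",
    re.DOTALL)

def split_cot(model_output: str):
    """Split a <think>...</think><answer>...</answer> response into its
    reasoning trace and final answer. Minimal sketch; real outputs may need
    more robust handling (missing tags, repeated tags, truncation)."""
    m = THINK_ANSWER.search(model_output)
    if m is None:
        return None, model_output.strip()  # fall back: treat everything as answer
    return m.group("think"), m.group("answer")

reasoning, answer = split_cot(
    "<think> PA is perpendicular to plane ABCD ... </think> "
    "<answer> plane PAB is perpendicular to plane PAD </answer>")
print(answer)  # plane PAB is perpendicular to plane PAD
```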
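And here is the toy difficulty-based sampler referenced in the RLCS description: tasks are drawn with weights that favour a target difficulty, which rises with training progress. The weighting scheme, field names, and temperature are illustrative assumptions; the paper's actual RLCS procedure (for example, how difficulty is estimated from online rollouts) is not reproduced here.

```python
import math
import random

def sample_training_batch(pool, progress, batch_size=8, temperature=0.2):
    """Toy easy-to-hard curriculum sampler (not the paper's RLCS algorithm).
    Each task in `pool` carries an estimated difficulty in [0, 1], e.g.
    1 - rollout pass rate. Early in training (progress near 0) easy tasks
    dominate; as progress approaches 1 harder tasks are favoured."""
    target = progress  # desired difficulty tracks training progress
    weights = [math.exp(-abs(task["difficulty"] - target) / temperature)
               for task in pool]
    return random.choices(pool, weights=weights, k=batch_size)

pool = [{"id": i, "difficulty": i / 99} for i in range(100)]
early = sample_training_batch(pool, progress=0.1)  # mostly easy tasks
late = sample_training_batch(pool, progress=0.9)   # mostly hard tasks
print(sorted(round(t["difficulty"], 2) for t in early))
print(sorted(round(t["difficulty"], 2) for t in late))
```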