[Paper Translation] Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

0. Abstract

Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting their out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precise pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process. All code will be made publicly available for future research.

1. Introduction

Reasoning segmentation generates pixel-wise masks by interpreting implicit queries through logical reasoning. This task shows significant potential in real-world applications, such as robotics. Unlike conventional segmentation tasks that rely on simple categorical labels (e.g., "person" or "car"), reasoning segmentation addresses more complex and nuanced queries, such as "identify food that provides sustained energy." Such queries require logical reasoning and the integration of cross-domain knowledge to produce accurate segmentation masks.
Early attempts [3, 17, 32], such as LISA [17], have explored the use of multimodal large language models (MLLMs) to enhance reasoning segmentation capabilities. These methods bridge the gap between MLLMs and segmentation models by leveraging implicit semantic tokens. However, typical methods [7, 17, 32] rely solely on supervised fine-tuning (SFT) applied to mixed datasets containing only simple categorical information or basic factual descriptions [12, 13, 43]. Although this paradigm effectively aligns MLLMs [23, 24, 40] with segmentation models [14] on specific datasets, we observe that it lacks generalization capability. This can be demonstrated by: (i) although existing methods excel on in-domain data, their performance significantly degrades on out-of-distribution (OOD) samples; (ii) SFT inevitably leads to catastrophic forgetting of general capabilities; (iii) the lack of an explicit reasoning process hinders their effectiveness in complex scenarios. These limitations motivate us to enhance general segmentation capabilities and improve reasoning performance by integrating an explicit reasoning process.
Recent studies [11] demonstrate that training with pure reinforcement learning (RL) activates an emergent test-time reasoning process, highlighting that reward-driven optimization is effective in enhancing model reasoning ability. Moreover, this approach often promotes generalization rather than overfitting to specific datasets. Inspired by this, we introduce Seg-Zero, a novel framework designed to enhance reasoning and cognitive capabilities for reasoning segmentation. Seg-Zero adopts a decoupled architecture, including a reasoning model and a segmentation model. The reasoning model is an MLLM capable of processing both images and user instructions. It outputs not only region-level bounding boxes (bbox) but also pixel-level points to precisely localize the target object. Subsequently, the segmentation model utilizes the bbox and points to produce pixel-level segmentation masks.
During training, we employ pure reinforcement learning, specifically GRPO [34], to fine-tune the reasoning model while keeping the segmentation model frozen. Rather than constructing datasets with explicitly annotated reasoning processes, we investigate the self-evolution potential of the MLLM to develop reasoning capabilities, thereby achieving emergent reasoning from zero. To achieve this, we develop a sophisticated reward mechanism to enhance the reasoning process and regulate the output. These reward functions comprise two types: format rewards, which enforce constraints on the structure of the reasoning process and segmentation outputs, and accuracy rewards, which are calculated based on intersection over union (IoU) and L1 distance metrics. As illustrated in Figure 1, by leveraging optimized reward-driven reinforcement learning, our Seg-Zero exhibits emergent test-time reasoning abilities, similar to those demonstrated in LLMs [11, 27]. This reasoning process enables the model to effectively handle complex instructions by breaking them down into sequential analytical steps, thus achieving precise localization of target objects. Seg-Zero demonstrates exceptional performance on both in-domain and OOD data, significantly exceeding the model trained through SFT. Furthermore, Seg-Zero maintains robust visual QA capability without the need for VQA training data.
Experimental results show that, with only 9,000 training samples derived from RefCOCOg [43], our Seg-Zero-7B exhibits strong test-time reasoning capabilities and achieves superior generalization performance compared to models of the same scale. It achieves a zero-shot performance of 57.5 on ReasonSeg [17], surpassing the previous LISA-7B by 18%. We summarize our contributions as follows:
• We propose Seg-Zero, a novel architecture designed for reasoning segmentation. Through the pure RL algorithm, Seg-Zero exhibits emergent reasoning abilities.
• We present a detailed comparison between SFT and RL, as well as the integration of the reasoning chain. Results demonstrate that RL, combined with the reasoning chain, consistently enhances model performance.
• Extensive experiments demonstrate the effectiveness of our design and offer valuable insights into fine-tuning models using RL.

Figure 1. Seg-Zero generates a reasoning chain before producing the final segmentation mask. It utilizes a pure reinforcement learning (RL) strategy, learning the reasoning process from zero. In comparison to supervised fine-tuning (SFT), the RL-based model demonstrates superior performance on both in-domain and out-of-domain data, and the integration of the reasoning chain further enhances its effectiveness.

2. Related Works

2.1. Reasoning in Large Models

In recent years, Large Language Models (LLMs) have exhibited remarkable reasoning capabilities. By extending the length of the Chain-of-Thought (CoT) reasoning process, OpenAI-o1 [27] introduces inference-time scaling, significantly improving its reasoning performance. In the research community, several studies have attempted to achieve test-time scaling through various approaches, including process-based reward models [20, 38, 39], reinforcement learning (RL) [15, 34], and search algorithms [10, 37]. In particular, the recent DeepSeek-R1 [11], which uses the GRPO [34] algorithm, achieves superior performance with only a few thousand RL training steps. Building on advances in the LLM community, several recent works have attempted to leverage the reasoning capabilities of MLLMs [16, 36]. For example, Open-R1-Multimodal [16] emphasizes mathematical reasoning, while R1-V [36] shows exceptional performance in counting tasks. However, these works primarily address high-level reasoning and do not consider fine-grained pixel-level understanding of images. To fill this gap, our Seg-Zero is designed to enhance pixel-level reasoning through reinforcement learning.

2.2. Semantic Segmentation with Reasoning

Semantic segmentation aims at predicting segmentation masks for specific classes. Numerous studies [1, 4, 5, 8, 21, 25, 33, 44], including DeepLab [6], MaskFormer [9], and SAM [14], have made significant progress on this task, making it a well-addressed problem. Instead of segmenting objects with explicit class labels, referring expression segmentation [13, 43] focuses on segmenting target objects based on short, explicit text queries. LISA [17] advances this field further by introducing the reasoning segmentation task. In this task, text queries are either more intricate or longer, demanding models with strong reasoning capabilities to accurately interpret and segment the target objects.

2.3. MLLMs for Segmentation

Since LISA [17, 41] introduced the '<SEG>' token to bridge the gap between MLLMs and segmentation models, several subsequent works [3, 7, 32] have explored the use of MLLMs for segmentation tasks. Most of these approaches, including OneTokenSegAll [3] and PixelLM [32], follow LISA's paradigm by using special tokens to connect MLLMs with segmentation models. However, this design necessitates extensive data to fine-tune both the MLLM and the segmentation decoder, and may even compromise the pixel precision of the original segmentation models. Our proposed Seg-Zero also employs a decoupled design for ease of adoption, while further leveraging the reasoning ability of MLLMs to achieve superior results.

3. Method

In this section, we introduce our Seg-Zero model and the associated reinforcement learning framework. We first describe how we formulate the segmentation problem in Section 3.1. Next, we present the architecture of Seg-Zero in Section 3.2. Finally, we describe the reward functions (Section 3.3) and the training details (Section 3.4) of the reinforcement learning framework.

3.1. Pipeline Formulation

Given an image $I$ and a label $T$, the segmentation task aims to produce a binary segmentation mask $M$ that accurately identifies the region corresponding to $T$. The label $T$ can vary in complexity, ranging from a simple class label (e.g., "bird"), to a straightforward phrase (e.g., "woman in blue"), or even to long and intricate expressions (e.g., "the unusual thing in the image"). The latter two types of expression require the model to perform reasoning to accurately segment the most relevant objects.
Inspired by recent advancements in the reasoning capabilities of large models [11, 34, 36], we leverage this ability to develop a pipeline for reasoning-based segmentation. Specifically, we decouple the reasoning process and the segmentation process. We first apply reinforcement learning to an MLLM to activate its reasoning ability, enabling it to generate a reasoning process and produce an accurate bounding box $B$ and two points $P_1$, $P_2$ that best localize the target object. The bounding box and points are then used as prompts for SOTA segmentation models [14, 30] to produce fine-grained segmentation masks. Seg-Zero is trained using reinforcement learning, as illustrated in Figure 2.
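To make the decoupled design concrete, here is a minimal sketch of the two-stage inference flow. The three callables are assumptions standing in for Qwen2.5-VL, the postprocessing function $G$, and SAM2; they are not the authors' released interfaces.

```python
# Minimal sketch of the decoupled Seg-Zero pipeline (assumed interfaces).

def segment_with_reasoning(image, query, reason_fn, parse_fn, segment_fn):
    """Two-stage reasoning segmentation: the reasoning model emits a
    chain-of-thought plus positional prompts; a frozen segmentation model
    turns the prompts into a pixel-level mask."""
    # Stage 1: the MLLM returns text containing <think>...</think> and an
    # <answer> block with one bbox and two in-object points.
    raw_output = reason_fn(image, query)
    bbox, points = parse_fn(raw_output)      # postprocessing function G

    # Stage 2: the frozen segmentation model (kept fixed during RL training)
    # converts the bbox/point prompts into a fine-grained binary mask M.
    mask = segment_fn(image, bbox, points)
    return mask
```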


Figure 2. Illustration of our RL training process. In this case, the model generates three samples by itself, calculates the rewards, and optimizes towards samples that achieve higher rewards.

3.2. Seg-Zero Model

Current MLLMs [2, 18, 24, 40, 45] exhibit impressive performance in processing multi-modal inputs but are unable to generate fine-grained segmentation masks. Conversely, modern segmentation models [14, 30] provide fine-grained segmentation ability but lack robust reasoning capabilities. To bridge this gap, we propose Seg-Zero, a framework that includes a reasoning model and a segmentation model. Additionally, we introduce a novel strategy to effectively activate the reasoning ability of the MLLM within the framework. The overall architecture is shown in Figure 3.
Reasoning Model. We employ Qwen2.5-VL [2] as our reasoning model $F_{reason}$. Although Qwen2.5-VL demonstrates exceptional performance in object detection by predicting the bbox, this region-level bbox is insufficient to provide more fine-grained pixel-level localization. Unlike object detection, segmentation requires a more precise understanding of pixel-level details, as multiple objects may exist within a single bounding box. Therefore, in addition to the bounding box, we also incorporate points that lie within the target object to improve localization accuracy. During the reinforcement learning stage, format rewards are employed to ensure the model generates structured outputs, which are subsequently processed by a postprocessing function $G$ to extract the bounding box $B$ and the two points $P_1$, $P_2$. This process can be formulated as follows:
$$B, P_1, P_2 = G\big(F_{reason}(I, T)\big)$$
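Given the answer format in Figure 4, the postprocessing function $G$ can plausibly be implemented as a small parser. The tag names and JSON keys below follow the prompt template; the error handling and quote normalization are assumptions.

```python
import json
import re

def parse_answer(raw_output):
    """Postprocessing function G: extract the bbox and the two points from
    the structured <answer>...</answer> block emitted by the reasoning model."""
    match = re.search(r"<answer>(.*?)</answer>", raw_output, re.DOTALL)
    if match is None:
        return None, None  # malformed output; format rewards would be 0
    # The prompt asks for JSON, but models often emit single quotes,
    # so normalize them before parsing.
    payload = json.loads(match.group(1).strip().replace("'", '"'))
    bbox = payload["bbox"]                                  # [x1, y1, x2, y2]
    points = [payload["points 1"], payload["points 2"]]     # [[x, y], [x, y]]
    return bbox, points
```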
Segmentation Model. Modern segmentation models [14, 30] accept various types of prompts, including bounding boxes and points, to generate accurate segmentation masks. We employ SAM2 [30] as our segmentation model $F_{seg}$ due to its superior performance and efficient inference speed. Leveraging the bounding box and points provided by the reasoning model, the segmentation model can generate a precise, fine-grained mask for the target object. This process can be formally expressed as follows:
$$M = F_{seg}(I, B, P_1, P_2)$$
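In code, the segmentation stage reduces to a single call into a SAM2 image predictor, with the bbox as a box prompt and the two points as positive point prompts. The sketch below assumes the `SAM2ImagePredictor` interface from the official sam2 package; the config and checkpoint paths are placeholders.

```python
import numpy as np
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder config/checkpoint names; substitute the actual SAM2-Large files.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "sam2_hiera_large.pt"))

def run_segmentation_model(image, bbox, points):
    """F_seg: turn the bbox + point prompts into a binary mask."""
    predictor.set_image(image)                               # HxWx3 RGB uint8 array
    masks, scores, _ = predictor.predict(
        box=np.array(bbox),                                  # [x1, y1, x2, y2]
        point_coords=np.array(points),                       # two in-object points
        point_labels=np.ones(len(points), dtype=np.int32),   # 1 = foreground
        multimask_output=False,
    )
    return masks[0] > 0                                      # binary mask M
```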
Test-time Reasoning. Reasoning is the crucial part of reasoning segmentation tasks. Inspired by DeepSeek-R1-Zero, we intentionally avoid using any explicit Chain-of-Thought (CoT) data to teach Seg-Zero reasoning skills. Instead, we aim to activate its reasoning capabilities from zero, enabling the model to autonomously generate a logical CoT before producing the final answer. To achieve this, we design a structured user prompt and a sophisticated reward mechanism to guide the reasoning model toward the correct optimization direction. As shown in Figure 4, the user prompt instructs Seg-Zero to analyze and compare objects in the image, beginning by generating a reasoning process, followed by the final answer in a pre-defined format. The reward mechanism then evaluates the answers and directs the optimization process, as illustrated in Figure 2.

Figure 3. Seg-Zero includes a reasoning model and a segmentation model. The reasoning model is an MLLM that generates a reasoning chain and provides segmentation prompts. Subsequently, the segmentation model produces a pixel-wise mask.

"Please find '{Question}' with bbox and points."
"Compare the differences between objects and find the most closely matched one."
"Output the thinking process in <think> </think> and the final answer in <answer> </answer> tags."
"Output one bbox and the center points of the two largest inscribed circles of the object of interest in JSON format, i.e., <think> thinking process here </think> <answer>{'bbox': [10,100,200,210], 'points 1': [30,110], 'points 2': [35,180]}</answer>"

Figure 4. User prompt for Seg-Zero. '{Question}' is replaced with the object description $T$ during training and inference.
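For reference, the template can be written out in Python as below; the wording is reconstructed from the translated figure, so it may differ slightly from the released prompt.

```python
# Reconstructed from Figure 4; the exact wording of the released prompt may differ.
USER_PROMPT_TEMPLATE = (
    "Please find '{question}' with bbox and points. "
    "Compare the differences between objects and find the most closely matched one. "
    "Output the thinking process in <think> </think> and the final answer in "
    "<answer> </answer> tags. Output one bbox and the center points of the two "
    "largest inscribed circles of the object of interest in JSON format, i.e., "
    "<think> thinking process here </think> "
    "<answer>{{'bbox': [10, 100, 200, 210], 'points 1': [30, 110], 'points 2': [35, 180]}}</answer>"
)

def build_user_prompt(question: str) -> str:
    # '{question}' corresponds to '{Question}' in Figure 4: the object description T.
    return USER_PROMPT_TEMPLATE.format(question=question)
```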

3.3. Reward Functions

Reward functions play a pivotal role in reinforcement learning, as they determine the optimization directions of the model. We manually design the following five reward functions for reinforcement learning.
Thinking Format Reward. This reward is designed to force the model to engage in a structured thinking process. It guides the model to output its reasoning steps within the <think> and </think> tags, with the final answer enclosed between the <answer> and </answer> tags.
Segmentation Format Reward. Unlike counting or other QA tasks, the segmentation task is highly dependent on the format of the answer. We provide two types of segmentation format rewards: soft and strict. Under the soft constraint, if the keywords bbox and points appear in the answer and their corresponding values consist of four and two coordinates, respectively, the format is considered correct. Under the strict constraint, the format is only considered correct if the model outputs the exact keywords (e.g., bbox, points 1, points 2) in the required structure.
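A minimal sketch of the thinking format reward and the two segmentation format variants, assuming the tag structure and answer keys from Figure 4 (the authors' exact checks are not specified):

```python
import re

def thinking_format_reward(response: str) -> float:
    """1 if the response has a <think>...</think> block followed by an
    <answer>...</answer> block, else 0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, response, re.DOTALL) else 0.0

def seg_format_reward_soft(answer: dict) -> float:
    """Soft constraint: 'bbox' and 'points' keywords appear, with 4 and 2 coordinates."""
    bbox_keys = [k for k in answer if "bbox" in k]
    point_keys = [k for k in answer if "points" in k]
    ok = (len(bbox_keys) >= 1 and all(len(answer[k]) == 4 for k in bbox_keys)
          and len(point_keys) >= 2 and all(len(answer[k]) == 2 for k in point_keys))
    return 1.0 if ok else 0.0

def seg_format_reward_strict(answer: dict) -> float:
    """Strict constraint: exact keys in the required structure."""
    ok = (set(answer) == {"bbox", "points 1", "points 2"}
          and len(answer["bbox"]) == 4
          and len(answer["points 1"]) == 2
          and len(answer["points 2"]) == 2)
    return 1.0 if ok else 0.0
```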
Bbox IoU Reward. This reward evaluates the IoU between the predicted bbox and the ground-truth bbox. A reward of 1 is assigned if their IoU is greater than 0.5; otherwise, the reward is 0.
Bbox L1 Reward. This reward evaluates the L1 distance between the predicted bbox and the ground-truth bbox. A reward of 1 is assigned if their L1 distance is less than 10 pixels; otherwise, the reward is 0.
Point L1 Reward. This reward evaluates the L1 distance between the predicted points and the ground-truth points. We first determine whether the predicted points lie inside the bounding box. The reward is then set to 1 if the minimal distance between the predicted points and the ground-truth points is less than 100 pixels; otherwise, the reward is 0.
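The three accuracy rewards can be sketched as hard-thresholded checks using the thresholds stated above (IoU > 0.5, bbox L1 < 10 px, point L1 < 100 px). The L1 reduction (mean vs. sum) and the choice of box for the inside-check are not specified in the text, so they are assumptions here.

```python
import numpy as np

def bbox_iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def bbox_iou_reward(pred_box, gt_box):
    return 1.0 if bbox_iou(pred_box, gt_box) > 0.5 else 0.0

def bbox_l1_reward(pred_box, gt_box):
    # Mean absolute error over the four coordinates (reduction is an assumption).
    l1 = np.abs(np.asarray(pred_box, float) - np.asarray(gt_box, float)).mean()
    return 1.0 if l1 < 10 else 0.0

def point_l1_reward(pred_points, gt_points, box):
    """`box` is the bounding box used for the inside-check; whether it is the
    predicted or the ground-truth bbox is an assumption."""
    pred = np.asarray(pred_points, float)
    inside = ((pred[:, 0] >= box[0]) & (pred[:, 0] <= box[2]) &
              (pred[:, 1] >= box[1]) & (pred[:, 1] <= box[3]))
    if not inside.all():
        return 0.0
    # Minimal L1 distance between any predicted point and any ground-truth point.
    dists = [np.abs(p - np.asarray(g, float)).sum() for p in pred for g in gt_points]
    return 1.0 if min(dists) < 100 else 0.0
```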

3.4. Training

We build the training data from publicly available segmentation datasets and train our Seg-Zero using the GRPO algorithm.
Data Preparation. The training data is generated using the original mask annotations from existing referring expression segmentation datasets (e.g., RefCOCOg [43]). Based on the mask, we extract the leftmost, topmost, rightmost, and bottommost pixels of the mask to generate the bounding box $B$. Additionally, we compute the center points of the two largest inscribed circles within the mask, denoted as $P_1$ and $P_2$. Consequently, the ground-truth data comprises the bbox coordinates $[B_{x1}, B_{y1}, B_{x2}, B_{y2}]$ and the coordinates of the two center points $[P_{1x}, P_{1y}]$ and $[P_{2x}, P_{2y}]$. We do not incorporate any CoT annotations into the training data. To ensure consistency, all images are rescaled to a uniform resolution of 840x840 pixels.
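One possible implementation of this label-construction step uses OpenCV's distance transform to approximate the centers of the two largest inscribed circles; the suppression step used to find the second center is an assumption, not the authors' exact procedure.

```python
import cv2
import numpy as np

def build_targets(mask: np.ndarray):
    """mask: HxW binary array. Returns bbox [x1, y1, x2, y2] and the two
    center points of (approximately) the two largest inscribed circles."""
    ys, xs = np.nonzero(mask)
    bbox = [int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())]

    # Distance transform: each foreground pixel's distance to the background,
    # so the argmax is the center of the largest inscribed circle.
    dist = cv2.distanceTransform(mask.astype(np.uint8), cv2.DIST_L2, 5)
    y1, x1 = np.unravel_index(np.argmax(dist), dist.shape)
    r1 = dist[y1, x1]

    # Suppress the first circle, then take the next peak as the second center.
    suppressed = dist.copy()
    cv2.circle(suppressed, (int(x1), int(y1)), int(r1), 0, thickness=-1)
    y2, x2 = np.unravel_index(np.argmax(suppressed), suppressed.shape)

    return bbox, [int(x1), int(y1)], [int(x2), int(y2)]
```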
GRPO. We do not include any reasoning data for a cold-start training process to teach the model reasoning ability. Instead, we let our Seg-Zero evolve from zero. Specifically, we initiate training directly from the pre-trained Qwen2.5-VL-3B model, utilizing the aforementioned rewards and applying the GRPO algorithm [34]. We illustrate our RL training process in Figure 2.
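The core of GRPO is a group-relative advantage: for each query, several responses are sampled, their scalar rewards are normalized within the group, and the policy is pushed toward the above-average samples. The sketch below shows only this advantage computation, omitting the clipped policy-gradient and KL terms of the full objective.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages used by GRPO: normalize each sampled
    response's total reward by the mean/std of its own group."""
    r = np.asarray(group_rewards, dtype=float)      # rewards of G samples for one query
    return (r - r.mean()) / (r.std() + eps)

# Example: 8 sampled responses for one training query (sampling number = 8).
rewards = [3.0, 1.0, 4.0, 1.0, 2.0, 4.0, 0.0, 3.0]  # sum of format + accuracy rewards
print(grpo_advantages(rewards))  # positive entries are reinforced, negative suppressed
```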

4. Experiment

4.1. Experimental Settings

Datasets. We train our Seg-Zero with only 9,000 samples from RefCOCOg, using the data preparation strategy described in Section 3.4. The test data includes ReasonSeg [17] and RefCOCO(+/g) [43].
Implementation Details. We employ Qwen2.5-VL-3B [2] and SAM2-Large [30] as our default reasoning model and segmentation model, respectively. Seg-Zero is trained using the DeepSpeed [29] library. During training, we use a total batch size of 16 with a sampling number of 8 per training step. The initial learning rate is set to 1e-6 and the weight decay is 0.01.
Evaluation Metrics. Following previous works [13, 43], we calculate gIoU and cIoU. The gIoU is the average of all per-image Intersection-over-Unions (IoUs), while the cIoU computes the cumulative intersection over the cumulative union. Unless specified, we use gIoU as our default metric, as it considers large and small objects equally.
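The two metrics differ only in where the averaging happens, as the short sketch below makes explicit:

```python
import numpy as np

def giou_ciou(pred_masks, gt_masks):
    """pred_masks, gt_masks: lists of HxW boolean arrays.
    gIoU averages per-image IoUs; cIoU divides the cumulative intersection
    by the cumulative union over the whole dataset."""
    per_image_ious, total_inter, total_union = [], 0, 0
    for p, g in zip(pred_masks, gt_masks):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        per_image_ious.append(inter / (union + 1e-6))
        total_inter += inter
        total_union += union
    g_iou = float(np.mean(per_image_ious))
    c_iou = total_inter / (total_union + 1e-6)
    return g_iou, c_iou
```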

4.2. SFT vs. RL

We compare the performance of SFT and RL. The baseline model is Qwen2.5-VL-3B + SAM2-Large. For the non-CoT setting, we eliminate the thinking format reward, so the model does not generate a CoT reasoning process before outputting the final answer. Our comparison includes both in-domain and OOD segmentation tasks [26, 35], as well as general QA tasks. The corresponding results are shown in Table 1, Figure 1, and Figure 5.
SFT vs. RL without CoT. From the first two rows in Table 1, we observe that on the in-domain dataset RefCOCOg, SFT achieves nearly the same performance as the baseline model. This may be due to the strong baseline performance of the original Qwen2.5-VL-3B. However, its performance significantly declines on the OOD ReasonSeg dataset, suggesting that SFT negatively impacts the model's generalization ability. In contrast, comparing the first and third rows, we find that RL consistently improves performance on both in-domain and OOD datasets, demonstrating the effectiveness of RL. Besides, from Figure 5, we observe that the SFT model suffers from catastrophic forgetting of its original visual QA ability, while the RL model effectively preserves this capability.
RL without CoT vs. RL with CoT. From the last two rows in Table 1, we find that both RL and RL with CoT achieve superior performance on both the in-domain RefCOCOg and OOD ReasonSeg datasets, significantly outperforming the baseline. This indicates that RL effectively boosts the model's capabilities. However, with CoT, our Seg-Zero demonstrates even better performance than its counterpart without CoT, indicating that the reasoning process enhances the model's ability to handle OOD data samples. From Figure 5, it is noteworthy that the introduction of CoT reasoning also leads to a slight performance improvement on visual QA tasks over the model trained without CoT.


4.3. Ablation Study

We conduct several ablation studies to verify the effectiveness of our design. The default settings for the ablation study are as follows: we perform reinforcement learning using the GRPO algorithm on 9,000 samples and evaluate the model on the RefCOCOg test set and the ReasonSeg test set.
Design of Bbox and Points. Table 2 demonstrates the effectiveness of our bbox and points prompt design. We observe that using only point prompts results in the worst performance. When both bbox and point prompts are utilized, Seg-Zero achieves its best performance, indicating that the combination of these prompts enhances pixel-level localization accuracy.
KL Loss Coefficient. The KL loss coefficient balances the model's 'pre-existing knowledge' with 'new knowledge'. Table 3 presents the performance variations across different KL loss coefficients. We find that a coefficient of 5e-3 performs optimally on both in-domain and OOD data, while a higher coefficient leads to performance degradation.
Number of Samples. We investigate the impact of the number of samples during the sampling stage. As shown in Table 4, we observe that as the number of samples increases, the model achieves better performance on both in-domain and out-of-distribution (OOD) data. This is reasonable because a larger number of samples expands the exploration space, enabling the model to identify more effective optimization directions.
User Prompt Sensitivity. The last two rows of Figure 4 show that we include an output example in the user prompt. We investigate the impact of this example in Table 5 and observe that its inclusion significantly enhances the model's performance. Through analysis of the outputs, we find that models without this example often fail to generate a reasoning process in their responses.
Soft vs. Hard Accuracy Rewards. In Section 3.3, we describe the bbox IoU reward, the bbox L1 reward, and the point L1 reward. We apply specific thresholds to convert these metrics into binary rewards. Additionally, we conduct ablation studies on their soft counterparts. For the bbox IoU reward, we directly use the IoU value as the soft reward. For L1-based rewards, we define the soft reward as $1 - \frac{\text{L1 dist}}{\max\{\text{image size}\}}$. From Table 6, we observe that while the soft reward achieves a minor improvement on ReasonSeg, it significantly underperforms the hard reward on RefCOCOg.
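The soft variants simply replace the thresholded values with their continuous counterparts. A minimal sketch, where the clipping to [0, 1] and the mean L1 reduction are assumptions:

```python
import numpy as np

def soft_bbox_iou_reward(pred_box, gt_box):
    # Soft variant: the IoU value itself, instead of thresholding it at 0.5
    # (bbox_iou as defined in the accuracy-reward sketch in Section 3.3).
    return bbox_iou(pred_box, gt_box)

def soft_l1_reward(pred, gt, image_size=840):
    # 1 - L1dist / max(image size); 840 matches the rescaled training resolution.
    l1 = np.abs(np.asarray(pred, float) - np.asarray(gt, float)).mean()
    return float(max(0.0, 1.0 - l1 / image_size))
```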
Soft vs. Strict Format Rewards. In Section 3.3, we introduce two types of segmentation format rewards: soft and strict. From Table 7, we find that the strict format reward significantly improves performance on OOD data in ReasonSeg. Through qualitative analysis of the training steps, we find that the strict format reward progresses slowly in the initial stages, as it is more challenging to sample formats that precisely match the strict criteria. However, as training steps increase, the model with the strict format reward tends to output longer responses.
Reasoning Model Scale. We conduct an ablation study on reasoning models of varying scales, ranging from 2B to 7B parameters, under the same rewards and training settings. As shown in Table 8, we observe that model performance on both in-domain and OOD data improves as the model scale increases.
Changes in Completion Length. Figure 6 illustrates the trends in completion length across different model sizes. The results indicate that a larger model tends to generate longer responses. As training progresses, the minimal completion length gradually increases. However, there is a drop in average completion length during the initial few steps. By analyzing the outputs during the training process, we find that this occurs because the model initially prioritizes learning the correct output format, which often results in shorter responses. Once the format reward saturates, the model shifts its focus to generating answers with higher accuracy, leading to longer and more detailed responses. The supplementary materials provide more analysis.


4.4. Comparison with Other Methods

In this part, we train our Seg-Zero using hard accuracy rewards and strict format rewards. The sampling number is set to 16, and we train Seg-Zero on only 9,000 samples from RefCOCOg. We compare against OVSeg [19], Grounded-SAM [31], LISA [17], SAM4MLLM [7], LAVT [42], ReLA [22], PixelLM [32], and PerceptionGPT [28].
Reasoning Segmentation. We compare zero-shot performance on ReasonSeg [17]; the results are shown in Table 9. Our Seg-Zero achieves SOTA zero-shot performance among the various methods.
Referring Expression Segmentation. The results on referring expression segmentation are shown in Table 10. Moreover, we find that the ground-truth annotations in RefCOCO(+/g) are not precise enough, which suggests that our Seg-Zero model should, in principle, achieve better performance than the values in the table. The supplementary materials provide a detailed analysis.


4.5. Qualitative Results

We provide several examples in Figure 7. We can easily observe that the reasoning process is helpful in analyzing user instructions, especially when there are multiple objects of the same class category. For instance, Seg-Zero demonstrates its ability to discern that a 'recreational vehicle' is more appropriate than a 'truck' in the context of a 'road trip', and correctly identifies that a 'conductor' is 'positioned at the front of the stage'.


5. Conclusion

In this paper, we propose Seg-Zero, a novel framework that integrates the CoT reasoning process into segmentation tasks. We design a sophisticated reward mechanism, incorporating both format and accuracy constraints, to guide the optimization directions. By training exclusively with RL, Seg-Zero develops emergent reasoning capabilities without relying on any supervised reasoning data. We present a detailed comparison between SFT and RL, as well as the introduction of the reasoning chain. Additionally, we offer insightful perspectives on the design of RL and the reward functions.
