ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
Authors: Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun
Venue: ICCV 2025
First: 2025-04-01T07:47:55+00:00 · Latest: 2025-11-03T17:23:02+00:00
Comments: Published as a conference paper at ICCV 2025. Project page:
https://github.com/icip-cas/ShortV
Abstract
Multimodal Large Language Models (MLLMs) suffer from high computational costs
due to their massive size and the large number of visual tokens. In this paper,
we investigate layer-wise redundancy in MLLMs by introducing a novel metric,
Layer Contribution (LC), which quantifies the impact of a layer's
transformations on visual and text tokens, respectively. The calculation of LC
involves measuring the divergence in model output that results from removing
the layer's transformations on the specified tokens. Our pilot experiment
reveals that many layers of MLLMs exhibit minimal contribution during the
processing of visual tokens. Motivated by this observation, we propose ShortV,
a training-free method that leverages LC to identify ineffective layers, and
freezes visual token updates in these layers. Experiments show that ShortV can
freeze visual tokens in approximately 60% of the MLLM layers, thereby
dramatically reducing computational costs related to updating visual tokens.
For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while
maintaining superior performance. The code will be publicly available at
https://github.com/icip-cas/ShortV
中文标题/摘要
标题:ShortV:通过冻结无效层中的视觉标记提高多模态大型语言模型的效率
多模态大型语言模型(MLLMs)由于其庞大的规模和大量的视觉标记而面临高昂的计算成本。本文通过引入一个新的度量标准——层贡献(LC),研究了MLLMs中的层间冗余性,该度量标准量化了层的变换对视觉和文本标记的影响。LC的计算涉及测量移除层对指定标记的变换后模型输出的差异。我们的初步实验表明,在处理视觉标记时,MLLMs中的许多层几乎没有贡献。受此观察的启发,我们提出了一种无需训练的方法——ShortV,利用LC来识别无效层,并在这些层中冻结视觉标记的更新。实验表明,ShortV可以在大约60%的MLLM层中冻结视觉标记,从而大幅降低与更新视觉标记相关的计算成本。例如,它在LLaVA-NeXT-13B上实现了50%的FLOPs减少,同时保持了优越的性能。代码将在https://github.com/icip-cas/ShortV公开。
Summary / 总结
This paper addresses the high computational cost of Multimodal Large Language Models (MLLMs) by introducing Layer Contribution (LC), a metric that quantifies the impact of each layer's transformations on visual and text tokens. The authors find that many layers contribute minimally to visual token processing. They propose ShortV, a training-free method that freezes visual token updates in these ineffective layers (approximately 60% of the MLLM layers), achieving a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining performance. The code is publicly available.
论文旨在通过识别并冻结无效层来减少多模态大型语言模型(MLLMs)的计算成本。它引入了一种新的度量标准,层贡献(LC),以量化每层对视觉和文本标记的影响。实验表明,ShortV,一种使用LC的无训练方法,可以在大约60%的MLLM层中冻结视觉标记的更新,从而在LLaVA-NeXT-13B上实现50%的FLOPs减少,同时保持性能。代码已公开发布在https://github.com/icip-cas/ShortV。
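The Layer Contribution idea in the ShortV entry above (measure how much the output diverges when one layer's update to the visual tokens is removed, then freeze the layers where that divergence is smallest) can be illustrated with a small sketch. Everything here is an assumption for illustration: the toy "layers" are a crude mean-mixing stand-in for attention, KL divergence is one possible choice of divergence, and names such as `make_toy_layer` and `layers_to_freeze` are hypothetical rather than taken from the ShortV code.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def make_toy_layer(scale):
    """Hypothetical layer: every token receives scale * (mean of all tokens)."""
    def layer(hidden):
        dim = len(hidden[0])
        mean = [sum(tok[d] for tok in hidden) / len(hidden) for d in range(dim)]
        return [[scale * m for m in mean] for _ in hidden]
    return layer


def run_model(layers, tokens, is_visual, frozen=()):
    """Run the toy stack; in `frozen` layers, visual tokens skip their update."""
    hidden = [list(tok) for tok in tokens]
    for idx, layer in enumerate(layers):
        updates = layer(hidden)
        for t, update in enumerate(updates):
            if idx in frozen and is_visual[t]:
                continue  # visual token keeps its previous hidden state
            hidden[t] = [h + u for h, u in zip(hidden[t], update)]
    return softmax(hidden[-1])  # last (text) token's state stands in for logits


def layer_contribution(layers, tokens, is_visual, layer_idx):
    """Divergence in output caused by removing one layer's visual-token update."""
    full = run_model(layers, tokens, is_visual)
    ablated = run_model(layers, tokens, is_visual, frozen={layer_idx})
    return kl_divergence(full, ablated)


def layers_to_freeze(layers, tokens, is_visual, count):
    """Pick the `count` layers with the smallest contribution on visual tokens."""
    scores = [layer_contribution(layers, tokens, is_visual, i)
              for i in range(len(layers))]
    ranked = sorted(range(len(layers)), key=lambda i: scores[i])
    return sorted(ranked[:count])


# Two visual tokens followed by one text token, three toy layers.
layers = [make_toy_layer(0.5), make_toy_layer(0.0), make_toy_layer(0.3)]
tokens = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
is_visual = [True, True, False]
scores = [layer_contribution(layers, tokens, is_visual, i) for i in range(3)]
chosen = layers_to_freeze(layers, tokens, is_visual, count=2)
```

In this toy setup, freezing visual tokens in the first layer changes the output because later layers still read those tokens, while the no-op middle layer and the final layer contribute nothing on visual tokens and are the ones selected for freezing.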
Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
Authors: Yehna Kim, Young-Eun Kim, Seong-Whan Lee
First: 2025-10-31T07:45:44+00:00 · Latest: 2025-11-03T07:33:58+00:00
Abstract
Vision-Language Models (VLMs) have demonstrated impressive capabilities in
zero-shot action recognition by learning to associate video embeddings with
class embeddings. However, a significant challenge arises when relying solely
on action classes to provide semantic context, particularly due to the presence
of multi-semantic words, which can introduce ambiguity in understanding the
intended concepts of actions. To address this issue, we propose an innovative
approach that harnesses web-crawled descriptions, leveraging a large-language
model to extract relevant keywords. This method reduces the need for human
annotators and eliminates the laborious manual process of attribute data
creation. Additionally, we introduce a spatio-temporal interaction module
designed to focus on objects and action units, facilitating alignment between
description attributes and video content. In our zero-shot experiments, our
model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and
68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the
model's adaptability and effectiveness across various downstream tasks.
中文标题/摘要
标题:利用语言驱动描述属性增强时空零样本动作识别
视觉-语言模型(VLMs)在零样本动作识别方面通过学习将视频嵌入与类别嵌入关联起来,展示了令人印象深刻的性能。然而,仅依赖动作类别来提供语义上下文时存在重大挑战,尤其是由于多义词的存在,这可能导致对动作意图概念的理解产生歧义。为了解决这一问题,我们提出了一种创新方法,利用网络抓取的描述,并利用大型语言模型提取相关关键词。这种方法减少了对人工注释者的依赖,并消除了属性数据创建的繁琐手动过程。此外,我们引入了一个时空交互模块,旨在关注对象和动作单元,促进描述属性与视频内容之间的对齐。在我们的零样本实验中,我们的模型取得了令人印象深刻的结果,在UCF-101、HMDB-51和Kinetics-600上的准确率分别为81.0%、53.1%和68.9%,突显了模型在各种下游任务中的适应性和有效性。
Summary / 总结
The research aims to enhance zero-shot action recognition by addressing the ambiguity introduced by multi-semantic words. The method involves using web-crawled descriptions and a large-language model to extract relevant keywords, reducing the need for human annotation. The model also includes a spatio-temporal interaction module to align description attributes with video content. Experimental results show the model achieves accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, demonstrating its effectiveness in various tasks.
论文通过使用网络抓取的描述和大型语言模型提取相关关键词来解决由于多义词导致的动作零样本识别中的语义模糊问题,减少了人工标注的需求,并引入了时空交互模块以对齐描述属性和视频内容。该模型在UCF-101、HMDB-51和Kinetics-600上的准确率分别为81.0%、53.1%和68.9%,展示了其在各种下游任务中的适应性和有效性。
ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Authors: Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou
First: 2025-10-30T17:56:31+00:00 · Latest: 2025-11-03T06:01:32+00:00
Abstract
Charts play an important role in visualization, reasoning, data analysis, and
the exchange of ideas among humans. However, existing vision-language models
(VLMs) still lack accurate perception of details and struggle to extract
fine-grained structures from charts. Such limitations in chart grounding also
hinder their ability to compare multiple charts and reason over them. In this
paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a
comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting
tabular data, localizing visualization elements, and recognizing various
attributes from charts of diverse types and complexities. We design a JSON
template to facilitate the calculation of evaluation metrics specifically
tailored for each grounding task. By incorporating a novel two-stage inference
workflow, the benchmark can further evaluate VLMs' capability to align and
compare elements/attributes across two charts. Our analysis of evaluations on
several recent VLMs reveals new insights into their perception biases,
weaknesses, robustness, and hallucinations in chart understanding. These
findings highlight the fine-grained discrepancies among VLMs in chart
understanding tasks and point to specific skills that need to be strengthened
in current models.
中文标题/摘要
标题:ChartAB:图表定位与密集对齐基准
图表在可视化、推理、数据分析以及人类思想交流中起着重要作用。然而,现有的视觉-语言模型(VLMs)在细节感知方面仍存在不足,难以从图表中提取精细结构。这种图表定位的限制也阻碍了它们比较多个图表和推理的能力。在本文中,我们引入了一个新的“图表对齐基准(ChartAB)”,以全面评估VLMs在图表定位任务中的表现,即提取表格数据、定位可视化元素以及从不同类型和复杂度的图表中识别各种属性。我们设计了一个JSON模板,以方便计算每个定位任务的评估指标。通过引入一种新颖的两阶段推理工作流,基准还可以进一步评估VLMs在两个图表之间对齐和比较元素/属性的能力。我们对几个最近的VLMs的评估分析揭示了它们在图表理解中的感知偏差、弱点、鲁棒性和幻觉。这些发现突显了VLMs在图表理解任务中的细微差异,并指出了当前模型需要加强的具体技能。
Summary / 总结
The paper introduces ChartAB, a benchmark for evaluating vision-language models in chart grounding tasks, including extracting tabular data, localizing visualization elements, and recognizing attributes. It uses a JSON template to calculate specific evaluation metrics and a two-stage inference workflow to assess models' ability to align and compare elements across charts. The analysis reveals biases, weaknesses, and hallucinations in recent models, highlighting the need to improve their fine-grained understanding of charts.
论文提出了ChartAB,一个用于评估视觉-语言模型在图表定位任务中的基准,包括提取表格数据、定位可视化元素和识别属性。它使用JSON模板来计算特定的评估指标,并采用两阶段推理工作流来评估模型在跨图表对齐和比较元素的能力。分析揭示了模型中的偏见、弱点和幻觉,强调了需要改进其对图表的细粒度理解。
MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification
Authors: Anh-Tien Nguyen, Duy Minh Ho Nguyen, Nghiem Tuong Diep, Trung Quoc Nguyen, Nhat Ho, Jacqueline Michelle Metsch, Miriam Cindy Maurer, Daniel Sonntag, Hanibal Bohnenberger, Anne-Christin Hauschild
Venue: Transactions on Machine Learning Research (09/2025)
First: 2025-02-11T09:42:13+00:00 · Latest: 2025-11-03T05:18:33+00:00
Comments: Published in Transactions on Machine Learning Research (09/2025)
Abstract
Whole slide pathology image classification presents challenges due to
gigapixel image sizes and limited annotation labels, hindering model
generalization. This paper introduces a prompt learning method to adapt large
vision-language models for few-shot pathology classification. We first extend
the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology
image tiles, into a vision-language model by adding adaptors and aligning it
with medical text encoders via contrastive learning on 923K image-text pairs.
The model is then used to extract visual features and text embeddings from
few-shot annotations and is fine-tuned with learnable prompt embeddings. Unlike
prior methods that combine prompts with frozen features using prefix embeddings
or self-attention, we propose multi-granular attention that compares
interactions of learnable prompts with individual image patches and with groups
of them. This approach improves the model's ability to capture both
fine-grained details and broader context, enhancing its recognition of complex
patterns across sub-regions. To further improve accuracy, we leverage
(unbalanced) optimal transport-based visual-text distance to secure model
robustness by mitigating perturbations that might occur during the data
augmentation process. Empirical experiments on lung, kidney, and breast
pathology modalities validate the effectiveness of our approach; thereby, we
surpass several of the latest competitors and consistently improve performance
across diverse architectures, including CLIP, PLIP, and Prov-GigaPath-integrated
PLIP.
中文标题/摘要
标题:MGPATH:多粒度提示学习的视觉-语言模型在少量样本WSI分类中的应用
全切片病理图像分类由于其吉像素级的图像大小和有限的标注标签而面临挑战,阻碍了模型的泛化能力。本文介绍了一种提示学习方法,以适应大型视觉-语言模型进行少量样本病理分类。我们首先将预训练在13亿病理图像块上的Prov-GigaPath视觉基础模型扩展为视觉-语言模型,通过对比学习92.3万张图像-文本对与医学文本编码器对齐。然后,该模型用于从少量样本注释中提取视觉特征和文本嵌入,并通过可学习的提示嵌入进行微调。与先前方法将提示与冻结特征结合使用前缀嵌入或自注意力不同,我们提出了一种多粒度注意力机制,比较可学习提示与单个图像块及其组之间的交互。这种方法提高了模型捕捉细微细节和更广泛上下文的能力,增强了其在子区域复杂模式识别中的表现。为了进一步提高准确性,我们利用不平衡最优传输视觉-文本距离来确保模型的鲁棒性,以减轻数据增强过程中可能出现的扰动。在肺、肾和乳腺病理模态上的实验证明了我们方法的有效性;因此,我们超越了几个最新竞争对手,并在多种架构中持续改进性能,包括CLIP、PLIP和Prov-GigaPath集成PLIP。
Summary / 总结
MGPATH is a vision-language model that uses multi-granular prompt learning to improve few-shot classification of whole slide pathology images. It builds on the pre-trained Prov-GigaPath model, adding adaptors and aligning it with medical text encoders via contrastive learning. The model is fine-tuned with learnable prompt embeddings and uses multi-granular attention to compare how prompts interact with individual image patches and with groups of patches, capturing both fine-grained details and broader context. Experiments on lung, kidney, and breast pathology show that MGPATH surpasses several recent competitors across different architectures.
MGPATH 是一种使用多粒度提示学习的视觉-语言模型,旨在解决少量标注的全切片病理图像分类挑战。该模型扩展了 Prov-GigaPath 模型,并通过对比学习与医学文本编码器对齐。模型通过可学习的提示嵌入进行微调,并采用多粒度注意力机制比较提示与图像块及其组之间的交互,从而增强其对细粒度细节和更广泛上下文的捕捉能力。实验证明,MGPATH 在各种架构(包括 CLIP、PLIP 和 Prov-GigaPath 集成的 PLIP)上均优于多个最新竞争对手。
Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Authors: Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang
First: 2025-10-30T13:11:23+00:00 · Latest: 2025-11-03T05:03:18+00:00
Abstract
Object-context shortcuts remain a persistent challenge in vision-language
models, undermining zero-shot reliability when test-time scenes differ from
familiar training co-occurrences. We recast this issue as a causal inference
problem and ask: Would the prediction remain if the object appeared in a
different environment? To answer this at inference time, we estimate object and
background expectations within CLIP's representation space, and synthesize
counterfactual embeddings by recombining object features with diverse
alternative contexts sampled from external datasets, batch neighbors, or
text-derived descriptions. By estimating the Total Direct Effect and simulating
intervention, we further subtract background-only activation, preserving
beneficial object-context interactions while mitigating hallucinated scores.
Without retraining or prompt design, our method substantially improves both
worst-group and average accuracy on context-sensitive benchmarks, establishing
a new zero-shot state of the art. Beyond performance, our framework provides a
lightweight representation-level counterfactual approach, offering a practical
causal avenue for debiased and reliable multimodal reasoning.
中文标题/摘要
标题:表示级反事实校准以实现去偏零样本识别
物体-上下文捷径仍然是视觉-语言模型中的一个持续性挑战,当测试场景与熟悉的训练共现情况不同时,会削弱零样本识别的可靠性。我们将此问题重新表述为因果推理问题,并提出:如果物体出现在不同的环境中,预测结果会如何?为了在推理时回答这一问题,我们估计CLIP表示空间中的物体和背景期望,并通过重新组合来自外部数据集、批邻居或文本描述的多样化替代上下文中的物体特征,合成反事实嵌入。通过估计总直接效应和模拟干预,我们进一步减去背景激活,保留有益的物体-上下文交互,同时减轻幻觉得分。无需重新训练或设计提示,我们的方法在上下文敏感基准测试中显著提高了最差群体和平均准确率,建立了新的零样本状态最先进水平。除了性能,我们的框架提供了一种轻量级的代表级反事实方法,为无偏和可靠的多模态推理提供了实用的因果途径。
Summary / 总结
The paper addresses the challenge of object-context shortcuts in vision-language models, which can affect zero-shot recognition reliability. It proposes a method to estimate object and background expectations within CLIP's representation space and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts. This approach improves both worst-group and average accuracy on context-sensitive benchmarks, setting a new zero-shot state of the art without requiring retraining or prompt design. The method provides a lightweight causal framework for debiased and reliable multimodal reasoning.
论文通过将问题重新定义为因果推理问题,解决了视觉-语言模型中的对象-上下文捷径问题。提出了一种方法,在CLIP的表示空间中估计对象和背景的期望,并通过重新组合对象特征与多样化的替代上下文来合成反事实嵌入。该方法在上下文敏感基准测试中提高了最坏群体和平均准确率,无需重新训练或设计提示,建立了新的零样本状态的前沿。该方法提供了一种轻量级的因果框架,用于实现去偏见和可靠的多模态推理。
LiteTracker: Leveraging Temporal Causality for Accurate Low-latency Tissue Tracking
Authors: Mert Asim Karaoglu, Wenbo Ji, Ahmed Abbas, Nassir Navab, Benjamin Busam, Alexander Ladikos
First: 2025-04-14T05:53:57+00:00 · Latest: 2025-11-02T18:40:58+00:00
Abstract
Tissue tracking plays a critical role in various surgical navigation and
extended reality (XR) applications. While current methods trained on large
synthetic datasets achieve high tracking accuracy and generalize well to
endoscopic scenes, their runtime performances fail to meet the low-latency
requirements necessary for real-time surgical applications. To address this
limitation, we propose LiteTracker, a low-latency method for tissue tracking in
endoscopic video streams. LiteTracker builds on a state-of-the-art long-term
point tracking method, and introduces a set of training-free runtime
optimizations. These optimizations enable online, frame-by-frame tracking by
leveraging a temporal memory buffer for efficient feature reuse and utilizing
prior motion for accurate track initialization. LiteTracker demonstrates
significant runtime improvements, running around 7x faster than its predecessor
and 2x faster than the state of the art. Beyond its primary focus on efficiency,
LiteTracker delivers high-accuracy tracking and occlusion prediction,
performing competitively on both the STIR and SuPer datasets. We believe
LiteTracker is an important step toward low-latency tissue tracking for
real-time surgical applications in the operating room. Our code is publicly
available at https://github.com/ImFusionGmbH/lite-tracker.
中文标题/摘要
标题:LiteTracker:利用时间因果性实现准确的低延迟组织追踪
组织追踪在各种手术导航和扩展现实(XR)应用中起着关键作用。虽然当前方法在大型合成数据集上训练,能够实现高追踪精度并很好地泛化到内窥镜场景,但它们的运行时性能无法满足实时手术应用所需的低延迟要求。为了解决这一限制,我们提出了一种名为LiteTracker的低延迟组织追踪方法,用于内窥镜视频流。LiteTracker基于最先进的长期点追踪方法,并引入了一组无需训练的运行时优化。这些优化通过利用时间记忆缓冲区高效重用特征和利用先验运动进行准确的追踪初始化,实现了在线逐帧追踪。LiteTracker在运行时性能上取得了显著改进,比其前身快约7倍,比最先进的方法快约2倍。除了主要关注效率外,LiteTracker还实现了高精度追踪和遮挡预测,在STIR和SuPer数据集上表现竞争力。我们认为LiteTracker是朝着手术室中的实时手术应用低延迟组织追踪迈出的重要一步。我们的代码已公开发布在https://github.com/ImFusionGmbH/lite-tracker。
Summary / 总结
LiteTracker targets low-latency tissue tracking in endoscopic video streams, addressing the runtime limitations of existing methods. It builds on a state-of-the-art long-term point tracking method and introduces training-free runtime optimizations: a temporal memory buffer for efficient feature reuse and prior motion for accurate track initialization. LiteTracker is around 7 times faster than its predecessor and 2 times faster than the state of the art, while remaining competitive in tracking accuracy and occlusion prediction on the STIR and SuPer datasets.
LiteTracker旨在提高内窥镜视频中组织跟踪的效率,以满足实时手术应用的需求。它基于最先进的长期点跟踪方法,并包含运行时优化,以实现高效的特征重用和准确的跟踪初始化。LiteTracker显著减少了运行时间,相比之前的方法快约7倍,并在基准数据集上保持了高精度和遮挡预测能力。
CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward
Authors: Zhiqiang Wang, Pengbin Feng, Yanbin Lin, Shuzhang Cai, Zongao Bian, Jinghua Yan, Xingquan Zhu
Venue: 2025 IEEE International Conference on Big Data (IEEE BigData 2025)
First: 2025-03-31T03:57:16+00:00 · Latest: 2025-11-02T15:32:31+00:00
Comments: 10 pages, 6 figures and 4 tables
Abstract
We propose Fuzzy Group Relative Policy Reward (FGRPR), a novel framework that
integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward
function to enhance learning efficiency. Unlike the conventional binary 0/1
accuracy reward, our fuzzy reward model provides nuanced incentives,
encouraging more precise outputs. Experimental results demonstrate that GRPO
with a standard 0/1 accuracy reward underperforms compared to supervised
fine-tuning (SFT). In contrast, FGRPR, applied to Qwen2.5-VL (3B and 7B),
surpasses all baseline models, including GPT4o, LLaMA2 (90B), and SFT, across
five in-domain datasets. On an out-of-domain dataset, FGRPR achieves
performance comparable to SFT but excels when target values are larger, as its
fuzzy reward function assigns higher rewards to closer approximations. This
approach is broadly applicable to tasks where the precision of the answer is
critical. Code and data: https://github.com/yeyimilk/CrowdVLM-R1
中文标题/摘要
标题:CrowdVLM-R1:利用模糊组相对策略奖励扩展R1能力以视觉语言模型进行人群计数
我们提出了一种新颖的框架Fuzzy Group Relative Policy Reward (FGRPR),该框架将Group Relative Policy Optimization (GRPO)与模糊奖励函数结合,以提高学习效率。与传统的二元0/1准确度奖励不同,我们的模糊奖励模型提供了更细致的激励,鼓励更精确的输出。实验结果表明,使用标准0/1准确度奖励的GRPO在性能上不如监督微调(SFT)。相比之下,FGRPR应用于Qwen2.5-VL(3B和7B)时,超越了所有基线模型,包括GPT4o、LLaMA2(90B)和SFT,在五个领域内数据集上均表现出色。在领域外数据集上,FGRPR的性能与SFT相当,但在目标值较大时,其模糊奖励函数会给予更接近的近似值更高的奖励。这种方法广泛适用于答案精度至关重要的任务。代码和数据:https://github.com/yeyimilk/CrowdVLM-R1
Summary / 总结
The research proposes Fuzzy Group Relative Policy Reward (FGRPR), which integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function to improve learning efficiency. Applied to Qwen2.5-VL (3B and 7B), FGRPR outperforms all baselines, including GPT4o, LLaMA2 (90B), and supervised fine-tuning (SFT), across five in-domain datasets. On an out-of-domain dataset it performs comparably to SFT overall but does better when target values are larger. The method suits tasks where the precision of the answer is critical.
研究引入了Fuzzy Group Relative Policy Reward (FGRPR)框架,结合了Group Relative Policy Optimization (GRPO)和模糊奖励函数,以提高学习效率。与传统的二元准确度奖励不同,FGRPR提供更细腻的激励,导致更精确的输出。实验表明,FGRPR在五个领域内数据集上优于监督微调和其他基线模型,并且在领域外数据集上与监督微调具有可比性,尤其是在目标值较大时表现更优。这种方法适用于需要高精度答案的任务。
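The contrast between a binary 0/1 accuracy reward and a fuzzy reward in the CrowdVLM-R1 entry above can be sketched as follows. The linear reward shape in `fuzzy_reward` is an assumption chosen for illustration (the abstract only says that closer approximations receive higher rewards), and `group_relative_advantages` shows the usual GRPO-style normalization of rewards within one sampled group; none of these function names come from the paper's code.

```python
import statistics


def binary_reward(predicted_count, true_count):
    """Conventional 0/1 accuracy reward: only an exact match is rewarded."""
    return 1.0 if predicted_count == true_count else 0.0


def fuzzy_reward(predicted_count, true_count):
    """Assumed fuzzy reward: falls off linearly with relative counting error."""
    if true_count <= 0:
        return binary_reward(predicted_count, true_count)
    relative_error = abs(predicted_count - true_count) / true_count
    return max(0.0, 1.0 - relative_error)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled answers (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One image with 100 people, four sampled answers, none exactly right.
true_count = 100
samples = [90, 105, 130, 250]

binary = [binary_reward(s, true_count) for s in samples]
fuzzy = [fuzzy_reward(s, true_count) for s in samples]

binary_adv = group_relative_advantages(binary)
fuzzy_adv = group_relative_advantages(fuzzy)
```

When no sample is exactly correct, the binary rewards are all zero and so are the group-relative advantages, which leaves no learning signal. The fuzzy rewards still rank the answers by closeness, so the nearest estimate gets the largest advantage.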
Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models
Authors: Wei Chen, Xin Yan, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Long Chen
First: 2025-04-09T02:59:18+00:00 · Latest: 2025-11-02T15:10:26+00:00
Comments: 17 pages, 4 figures
Abstract
Although multimodal large language models (MLLMs) exhibit remarkable
reasoning capabilities on complex multimodal understanding tasks, they still
suffer from the notorious hallucination issue: generating outputs misaligned
with obvious visual or factual evidence. Currently, training-based solutions,
like direct preference optimization (DPO), leverage paired preference data to
suppress hallucinations. However, they risk sacrificing general reasoning
capabilities due to the likelihood displacement. Meanwhile, training-free
solutions, like contrastive decoding, achieve this goal by subtracting the
estimated hallucination pattern from a distorted input. Yet, these handcrafted
perturbations (e.g., adding noise to images) may poorly capture authentic
hallucination patterns. To avoid these weaknesses of existing methods, and
realize robust hallucination mitigation (i.e., maintaining general reasoning
performance), we propose a novel framework: Decoupling Contrastive Decoding
(DCD). Specifically, DCD decouples the learning of positive and negative
samples in preference datasets, and trains separate positive and negative image
projections within the MLLM. The negative projection implicitly models real
hallucination patterns, which enables vision-aware negative images in the
contrastive decoding inference stage. Our DCD alleviates likelihood
displacement by avoiding pairwise optimization and generalizes robustly without
handcrafted degradation. Extensive ablations across hallucination benchmarks
and general reasoning tasks demonstrate the effectiveness of DCD, i.e., it
matches DPO's hallucination suppression while preserving general capabilities
and outperforms the handcrafted contrastive decoding methods.
中文标题/摘要
标题:解耦对比解码:多模态大型语言模型中幻觉抑制的稳健性
尽管多模态大型语言模型(MLLMs)在复杂的多模态理解任务中表现出显著的推理能力,但它们仍然受到著名的幻觉问题困扰:生成与明显视觉或事实证据不符的输出。目前,基于训练的解决方案,如直接偏好优化(DPO),利用配对偏好数据来抑制幻觉。然而,它们可能会因似然性位移而牺牲一般推理能力。同时,无需训练的解决方案,如对比解码,通过从失真的输入中减去估计的幻觉模式来实现这一目标。然而,这些手工制作的扰动(例如,向图像中添加噪声)可能无法很好地捕捉真实的幻觉模式。为避免现有方法的这些弱点,并实现稳健的幻觉抑制(即保持一般推理性能),我们提出了一种新的框架:解耦对比解码(DCD)。具体而言,DCD 解耦了偏好数据集中正样本和负样本的学习,并在 MLLM 中分别训练正样本和负样本的图像投影。负投影隐式地建模了真实的幻觉模式,这使得在对比解码推理阶段能够生成具有视觉感知的负图像。我们的 DCD 通过避免成对优化来缓解似然性位移,并在无需手工制作降级的情况下稳健地泛化。广泛的消融实验跨越了幻觉基准和一般推理任务,证明了 DCD 的有效性,即它在幻觉抑制方面与 DPO 相当,同时保留了通用能力,并优于手工制作的对比解码方法。
Summary / 总结
The paper proposes Decoupling Contrastive Decoding (DCD) to mitigate hallucination in multimodal large language models (MLLMs) while maintaining general reasoning capabilities. DCD decouples positive and negative sample learning and trains separate projections within the MLLM. This approach avoids likelihood displacement and handcrafted perturbations, leading to robust hallucination mitigation and better general reasoning performance compared to existing methods. Extensive experiments show that DCD matches DPO's hallucination suppression while preserving general capabilities and outperforms handcrafted contrastive decoding methods.
研究旨在通过减轻多模态大型语言模型(MLLM)的幻觉现象,同时保持其一般推理能力。提出的解耦对比解码(DCD)框架将正样本和负样本的学习分离,在MLLM中分别训练投影。这种方法避免了似然性偏移和手工制作的扰动,从而实现稳健的幻觉缓解。实验表明,DCD在幻觉基准测试和一般推理任务中均能匹配DPO的幻觉抑制效果,并且优于手工制作的对比解码方法。
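The contrastive decoding step mentioned in the DCD entry above can be sketched with toy logits. The combination rule used here, `(1 + alpha) * positive - alpha * negative`, is a form commonly used in contrastive decoding and is an assumption, not a formula taken from the DCD paper. Likewise, the two logit lists merely stand in for next-token logits produced through the positive and negative image projections, and the function names are hypothetical.

```python
def contrastive_logits(positive_logits, negative_logits, alpha=1.0):
    """Amplify the positive branch and subtract the negative (hallucination) branch."""
    return [
        (1.0 + alpha) * pos - alpha * neg
        for pos, neg in zip(positive_logits, negative_logits)
    ]


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy vocabulary of three tokens; token 0 plays the hallucinated answer.
positive = [2.0, 1.8, 0.1]   # logits through the positive projection
negative = [2.5, 0.5, 0.1]   # logits through the negative projection

combined = contrastive_logits(positive, negative, alpha=1.0)
greedy_before = argmax(positive)
greedy_after = argmax(combined)
```

In this example the positive branch alone slightly prefers token 0, but the negative branch prefers it far more strongly. After the subtraction, greedy decoding moves to token 1.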
Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Authors: Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, Adams Wai-Kin Kong
First: 2025-09-25T15:01:49+00:00 · Latest: 2025-11-02T12:29:52+00:00
Comments: Preprint
Abstract
Image composition aims to seamlessly insert a user-specified object into a
new scene, but existing models struggle with complex lighting (e.g., accurate
shadows, water reflections) and diverse, high-resolution inputs. Modern
text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential
physical and resolution priors, yet lack a framework to unleash them without
resorting to latent inversion, which often locks object poses into contextually
inappropriate orientations, or brittle attention surgery. We propose SHINE, a
training-free framework for Seamless, High-fidelity Insertion with Neutralized
Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained
customization adapters (e.g., IP-Adapter) to guide latents for faithful subject
representation while preserving background integrity. Degradation-suppression
guidance and adaptive background blending are proposed to further eliminate
low-quality outputs and visible seams. To address the lack of rigorous
benchmarks, we introduce ComplexCompo, featuring diverse resolutions and
challenging conditions such as low lighting, strong illumination, intricate
shadows, and reflective surfaces. Experiments on ComplexCompo and
DreamEditBench show state-of-the-art performance on standard metrics (e.g.,
DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward).
Code and benchmark will be publicly available upon publication.
中文标题/摘要
标题:FLUX 是否已经掌握了进行物理上可信的图像合成的方法?
图像合成旨在无缝地将用户指定的对象插入到新场景中,但现有模型在处理复杂光照(例如准确的阴影、水面反射)和多样、高分辨率输入方面存在困难。现代文本到图像的扩散模型(例如SD3.5、FLUX)已经编码了重要的物理和分辨率先验知识,但缺乏一个框架来释放这些先验知识而不依赖于潜在空间反转,这通常会将物体姿态锁定为上下文不合适的姿态,或者导致脆弱的注意力手术。我们提出了SHINE,一种无需训练的无缝、高保真插入框架,以中和错误。SHINE引入了流形导向的锚点损失,利用预训练的自定义适配器(例如IP-Adapter)引导潜在空间,以实现忠实的主题表示,同时保留背景完整性。我们提出了降级抑制指导和自适应背景融合,以进一步消除低质量输出和可见接缝。为了解决缺乏严格的基准问题,我们引入了ComplexCompo,它包含多种分辨率和具有挑战性的条件,如低光照、强照明、复杂的阴影和反射表面。在ComplexCompo和DreamEditBench上的实验表明,SHINE在标准指标(例如DINOv2)和人类对齐评分(例如DreamSim、ImageReward、VisionReward)上表现出最先进的性能。代码和基准将在发表后公开。
SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
Authors: Lin Lin, Jiefeng Long, Zhihe Wan, Yuchi Wang, Dingkang Yang, Shuang Yang, Yueyang Yao, Xu Chen, Zirui Guo, Shengqiang Li, Weiran Li, Hanyu Li, Yaling Mou, Yan Qiu, Haiyang Yu, Xiao Liang, Hongsheng Li, Chao Feng
First: 2025-10-14T16:43:22+00:00 · Latest: 2025-11-02T11:09:19+00:00
Comments: Technical Report
Abstract
Multimodal embedding models aim to yield informative unified representations
that empower diverse cross-modal tasks. Despite promising developments in the
evolution from CLIP-based dual-tower architectures to large vision-language
models, prior works still face unavoidable challenges in real-world
applications and business scenarios, such as the limited modality support,
unstable training mechanisms, and industrial domain gaps. In this work, we
introduce SAIL-Embedding, an omni-modal embedding foundation model that
addresses these issues through tailored training strategies and architectural
design. In the optimization procedure, we propose a multi-stage training scheme
to boost the multifaceted effectiveness of representation learning.
Specifically, the content-aware progressive training aims to enhance the
model's adaptability to diverse downstream tasks and master enriched
cross-modal proficiency. The collaboration-aware recommendation enhancement
training further adapts multimodal representations for recommendation scenarios
by distilling knowledge from sequence-to-item and ID-to-item embeddings while
mining user historical interests. Concurrently, we develop the stochastic
specialization and dataset-driven pattern matching to strengthen model training
flexibility and generalizability. Experimental results show that SAIL-Embedding
achieves SOTA performance compared to other methods in different retrieval
tasks. In online experiments across various real-world scenarios integrated
with our model, we observe a significant increase in Lifetime (LT), which is a
crucial indicator for the recommendation experience. For instance, the model
delivers a 7-day LT gain of +0.5% in the Douyin-Selected scenario. For the
Douyin feed rank model, the match features produced by SAIL-Embedding yield a
+0.1% AUC gain.
中文标题/摘要
标题:SAIL-嵌入技术报告:全模态嵌入基础模型
多模态嵌入模型旨在生成具有信息性的统一表示,以赋能多模态任务。尽管从基于CLIP的双塔架构到大型视觉语言模型的发展取得了令人鼓舞的进展,但先前的工作在实际应用和商业场景中仍面临诸多挑战,如模态支持有限、训练机制不稳定以及工业领域差距。在本工作中,我们引入了SAIL-嵌入,这是一种通过定制化的训练策略和架构设计来解决这些问题的全模态嵌入基础模型。在优化过程中,我们提出了一种多阶段训练方案,以增强表示学习的多面有效性。具体而言,内容感知渐进式训练旨在增强模型对多种下游任务的适应性,并掌握丰富的跨模态能力。协作感知推荐增强训练进一步通过从序列到项目和ID到项目的嵌入中提炼知识,并挖掘用户历史兴趣,来适应推荐场景中的多模态表示。同时,我们开发了随机专业化和数据驱动的模式匹配,以增强模型训练的灵活性和泛化能力。实验结果表明,SAIL-嵌入在不同检索任务中实现了SOTA性能。在与我们的模型集成的各种实际场景中的在线实验中,我们观察到显著的生命周期(LT)提升,这是推荐体验的关键指标。例如,在抖音精选场景中,模型实现了7天LT提升+0.5%。对于抖音信息流排名模型,SAIL-嵌入生成的匹配特征实现了+0.1%的AUC提升。
Summary / 总结
SAIL-Embedding is an omni-modal embedding foundation model designed to address limitations in multimodal embedding models, such as limited modality support and unstable training mechanisms. It employs a multi-stage training scheme, including content-aware progressive training and collaboration-aware recommendation enhancement training, to improve model adaptability and cross-modal proficiency. Experimental results demonstrate that SAIL-Embedding outperforms other methods in various retrieval tasks and achieves a 7-day Lifetime gain of +0.5% in the Douyin-Selected scenario and a +0.1% AUC gain in the Douyin feed rank model.
SAIL-Embedding 是一种全模态嵌入基础模型,旨在解决多模态嵌入模型中的限制,如模态支持有限和训练机制不稳定。它采用多阶段训练方案,包括内容感知渐进训练和协作感知推荐增强训练,以提高模型的适应性和跨模态能力。实验结果表明,SAIL-Embedding 在各种检索任务中表现出色,并在抖音精选场景中实现了 7 天 Lifetime 增益 +0.5%,在抖音信息流排名模型中实现了 +0.1% 的 AUC 增益。
CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding
Authors: Hongyong Han, Wei Wang, Gaowei Zhang, Mingjie Li, Yi Wang
First: 2025-07-14T16:29:10+00:00 · Latest: 2025-11-02T08:33:28+00:00
Abstract
Coral reefs are vital yet vulnerable ecosystems that require continuous
monitoring to support conservation. While coral reef images provide essential
information in coral monitoring, interpreting such images remains challenging
due to the need for domain expertise. Visual Question Answering (VQA), powered
by Large Vision-Language Models (LVLMs), has great potential in user-friendly
interaction with coral reef images. However, applying VQA to coral imagery
demands a dedicated dataset that addresses two key challenges: domain-specific
annotations and multidimensional questions. In this work, we introduce
CoralVQA, the first large-scale VQA dataset for coral reef analysis. It
contains 12,805 real-world coral images from 67 coral genera collected from 3
oceans, along with 277,653 question-answer pairs that comprehensively assess
ecological and health-related conditions. To construct this dataset, we develop
a semi-automatic data construction pipeline in collaboration with marine
biologists to ensure both scalability and professional-grade data quality.
CoralVQA presents novel challenges and provides a comprehensive benchmark for
studying vision-language reasoning in the context of coral reef images. By
evaluating several state-of-the-art LVLMs, we reveal key limitations and
opportunities. These insights form a foundation for future LVLM development,
with a particular emphasis on supporting coral conservation efforts.
中文标题/摘要
标题:CoralVQA:珊瑚礁图像理解的大规模视觉问答数据集
珊瑚礁是至关重要的但又脆弱的生态系统,需要持续监测以支持保护工作。虽然珊瑚礁图像为珊瑚监测提供了重要信息,但由于需要领域专业知识,解读这些图像仍然具有挑战性。视觉问答(VQA),借助大型视觉-语言模型(LVLM),在用户友好地与珊瑚礁图像互动方面具有巨大潜力。然而,将VQA应用于珊瑚图像需要一个专门的数据集,以解决两个关键挑战:领域特定的注释和多维度问题。在本文中,我们介绍了CoralVQA,这是首个用于珊瑚礁分析的大规模VQA数据集。它包含来自3个海洋67种珊瑚属的12,805张真实珊瑚图像,以及277,653个问题-答案对,全面评估生态和健康状况。为了构建此数据集,我们与海洋生物学家合作开发了一种半自动数据构建管道,以确保可扩展性和专业级数据质量。CoralVQA提出了新的挑战,并为研究珊瑚礁图像中的视觉-语言推理提供了全面基准。通过评估几种最先进的LVLM,我们揭示了关键的局限性和机会。这些见解为未来LVLM的发展奠定了基础,特别强调支持珊瑚保护工作。
Summary / 总结
CoralVQA is a large-scale VQA dataset for coral reef images, addressing the challenges of domain-specific annotations and multidimensional questions. It includes 12,805 images from 67 coral genera and 277,653 question-answer pairs. The dataset was constructed using a semi-automatic pipeline with marine biologists to ensure quality. Evaluations of state-of-the-art LVLMs on this dataset highlight key limitations and opportunities for future development in supporting coral conservation efforts.
CoralVQA 是一个大规模的视觉问答数据集,用于珊瑚礁图像理解,解决了领域特定注释和多维度问题。它包含来自 67 种珊瑚属的 12,805 张图像和 277,653 个问答对。该数据集通过与海洋生物学家合作开发的半自动管道构建,确保了质量和规模。对最先进的 LVLM 在此数据集上的评估揭示了关键的局限性和改进机会,特别是在珊瑚礁图像上下文中的视觉语言推理方面。
Federated Vision-Language-Recommendation with Personalized Fusion
Authors: Zhiwei Li, Guodong Long, Jing Jiang, Chengqi Zhang, Qiang Yang
First: 2024-10-11T03:10:09+00:00 · Latest: 2025-11-02T00:09:30+00:00
Comments: 15 pages, 10 figures, 7 tables, conference
Abstract
Applying large pre-trained Vision-Language Models to recommendation is a
burgeoning field, a direction we term Vision-Language-Recommendation (VLR).
Bringing VLR to user-oriented on-device intelligence within a federated
learning framework is a crucial step for enhancing user privacy and delivering
personalized experiences. This paper introduces FedVLR, a federated VLR
framework specially designed for user-specific personalized fusion of
vision-language representations. At its core is a novel bi-level fusion
mechanism: The server-side multi-view fusion module first generates a diverse
set of pre-fused multimodal views. Subsequently, each client employs a
user-specific mixture-of-expert mechanism to adaptively integrate these views
based on individual user interaction history. This lightweight
personalized fusion module provides an efficient way to implement a
federated VLR system. The effectiveness of our proposed FedVLR has been
validated on seven benchmark datasets.
中文标题/摘要
标题:联邦视觉-语言-推荐个性化融合
将大型预训练视觉-语言模型应用于推荐是一个新兴领域,我们称之为视觉-语言-推荐(VLR)。在联邦学习框架下将VLR引入用户导向的设备端智能是增强用户隐私和提供个性化体验的关键步骤。本文介绍了FedVLR,这是一种特别设计用于用户特定个性化融合视觉-语言表示的联邦VLR框架。其核心是一种新颖的双层融合机制:服务器端多视图融合模块首先生成一组多模态预融合视图。随后,每个客户端使用用户特定的专家混合机制,根据个人用户交互历史自适应地整合这些视图。这种设计的轻量级个性化融合模块为实现联邦VLR系统提供了一个高效的解决方案。我们提出的FedVLR的有效性已在七个基准数据集上得到验证。
Summary / 总结
The research motivation is to enhance user privacy and provide personalized experiences by applying large pre-trained Vision-Language Models to recommendation within a federated learning framework. The main method is a bi-level fusion mechanism in which the server generates a diverse set of pre-fused multimodal views and each client uses a user-specific mixture-of-expert mechanism to adaptively integrate these views based on the user's interaction history. The effectiveness of the proposed FedVLR framework is validated on seven benchmark datasets.
研究动机是通过在联邦学习框架下应用大型预训练的视觉-语言模型来进行推荐,以增强用户隐私并提供个性化体验。主要方法是采用双层融合机制,服务器生成多种多模态视图,每个客户端根据用户的交互历史使用特定用户的专家混合机制来适应性地整合这些视图。实验结果表明,提出的FedVLR框架在七个基准数据集上有效提高了推荐准确性。
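The client-side half of the bi-level fusion described in the FedVLR entry above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `personalized_fusion`, the linear gate, the use of a single vector to summarize interaction history, and all shapes are assumptions made for illustration. It only shows the general idea of a user-specific mixture-of-experts gate weighting a set of server-provided views.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def personalized_fusion(views, user_history, gate_weights):
    """Mix K pre-fused views with user-specific gate weights.

    views:        (K, d) pre-fused multimodal views from the server (assumed shape)
    user_history: (d,)   summary vector of the user's interactions (hypothetical)
    gate_weights: (K, d) gating parameters kept on the client (hypothetical)
    """
    logits = gate_weights @ user_history   # one score per view, shape (K,)
    mix = softmax(logits)                  # non-negative weights summing to 1
    fused = mix @ views                    # weighted combination, shape (d,)
    return fused, mix

# Toy data standing in for real embeddings.
rng = np.random.default_rng(0)
K, d = 3, 4
views = rng.normal(size=(K, d))
user_history = rng.normal(size=d)
gate_weights = rng.normal(size=(K, d))

fused, mix = personalized_fusion(views, user_history, gate_weights)
print("mixture weights:", mix)
print("fused representation:", fused)
```

In this sketch only `gate_weights` is user-specific, so the per-client state stays small, which is one way to read the "lightweight" claim in the abstract.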
OpinioRAG: Towards Generating User-Centric Opinion Highlights from Large-scale Online Reviews
Authors: Mir Tafseer Nayeem, Davood Rafiei
First: 2025-08-30T00:00:34+00:00 · Latest: 2025-11-01T20:48:15+00:00
Comments: COLM 2025
Abstract
We study the problem of opinion highlights generation from large volumes of
user reviews, often exceeding thousands per entity, where existing methods
either fail to scale or produce generic, one-size-fits-all summaries that
overlook personalized needs. To tackle this, we introduce OpinioRAG, a
scalable, training-free framework that combines RAG-based evidence retrieval
with LLMs to efficiently produce tailored summaries. Additionally, we propose
novel reference-free verification metrics designed for sentiment-rich domains,
where accurately capturing opinions and sentiment alignment is essential. These
metrics offer a fine-grained, context-sensitive assessment of factual
consistency. To facilitate evaluation, we contribute the first large-scale
dataset of long-form user reviews, comprising entities with over a thousand
reviews each, paired with unbiased expert summaries and manually annotated
queries. Through extensive experiments, we identify key challenges, provide
actionable insights into improving systems, pave the way for future research,
and position OpinioRAG as a robust framework for generating accurate, relevant,
and structured summaries at scale.
中文标题/摘要
标题:OpinioRAG:从大规模在线评论中生成用户中心的意见要点
我们研究了从大量用户评论中生成意见要点的问题,这些评论通常每个实体超过数千条,现有方法要么无法扩展,要么生成通用的、一刀切的摘要,忽视了个性化需求。为了解决这一问题,我们引入了OpinioRAG,这是一种可扩展、无需训练的框架,结合了基于RAG的证据检索与LLMs,以高效地生成定制化的摘要。此外,我们还提出了针对情感丰富的领域的新颖无参考验证指标,这些指标旨在准确捕捉意见和情感一致性。这些指标提供了细粒度的、上下文敏感的事实一致性评估。为了便于评估,我们贡献了第一个大规模的长格式用户评论数据集,包含每个实体超过一千条评论,配以无偏见的专家摘要和手动标注的查询。通过广泛的实验,我们确定了关键挑战,提供了改进系统的可操作见解,为未来研究铺平了道路,并将OpinioRAG定位为一种在大规模生成准确、相关和结构化摘要方面的稳健框架。
Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Lung Nodule Malignancy Prediction
Authors: Luoting Zhuang, Seyed Mohammad Hossein Tabatabaei, Ramin Salehi-Rad, Linh M. Tran, Denise R. Aberle, Ashley E. Prosper, William Hsu
Venue: Journal of Biomedical Informatics 172 (2025) 104947
First: 2025-04-30T06:11:34+00:00 · Latest: 2025-11-01T19:04:13+00:00
Abstract
Machine learning models have utilized semantic features, deep features, or
both to assess lung nodule malignancy. However, their reliance on manual
annotation during inference, limited interpretability, and sensitivity to
imaging variations hinder their application in real-world clinical settings.
Thus, this research aims to integrate semantic features derived from
radiologists' assessments of nodules, guiding the model to learn clinically
relevant, robust, and explainable imaging features for predicting lung cancer.
We obtained 938 low-dose CT scans from the National Lung Screening Trial (NLST)
with 1,261 nodules and semantic features. Additionally, the Lung Image Database
Consortium dataset contains 1,018 CT scans, with 2,625 lesions annotated for
nodule characteristics. Three external datasets were obtained from UCLA Health,
the LUNGx Challenge, and the Duke Lung Cancer Screening. We fine-tuned a
pretrained Contrastive Language-Image Pretraining (CLIP) model with a
parameter-efficient fine-tuning approach to align imaging and semantic text
features and predict the one-year lung cancer diagnosis. Our model outperformed
state-of-the-art (SOTA) models in the NLST test set with an AUROC of 0.901 and
AUPRC of 0.776. It also showed robust results in external datasets. Using CLIP,
we also obtained predictions on semantic features through zero-shot inference,
such as nodule margin (AUROC: 0.807), nodule consistency (0.812), and pleural
attachment (0.840). Our approach surpasses the SOTA models in predicting lung
cancer across datasets collected from diverse clinical settings and provides
explainable outputs that help clinicians understand the underlying meaning
of model predictions. This approach also prevents the model from learning
shortcuts and generalizes across clinical settings. The code is available at
https://github.com/luotingzhuang/CLIP_nodule.
中文标题/摘要
标题:基于视觉-语言模型的语义引导影像生物标志物用于肺癌结节恶性程度预测
机器学习模型利用语义特征、深度特征或两者来评估肺结节的恶性程度。然而,它们在推理过程中依赖手动注释、解释性有限以及对影像变异的敏感性限制了其在临床环境中的应用。因此,本研究旨在结合放射科医生对结节的评估中提取的语义特征,引导模型学习临床相关、稳健且可解释的影像特征,用于预测肺癌。我们从国家肺癌筛查试验(NLST)获得了938例低剂量CT扫描和1,261个结节的语义特征。此外,肺部影像数据库联盟数据集中包含1,018例CT扫描,其中2,625个病灶被标注了结节特征。我们从UCLA Health、LUNGx挑战和杜克肺癌筛查获得了三个外部数据集。我们使用参数高效微调方法对预训练的对比语言-图像预训练(CLIP)模型进行微调,以对齐影像和语义文本特征并预测一年后的肺癌诊断。我们的模型在NLST测试集中的AUROC为0.901,AUPRC为0.776,并在外部数据集中表现出稳健的结果。使用CLIP,我们还通过零样本推理获得了结节边缘(AUROC:0.807)、结节一致性(0.812)和胸膜附着(0.840)等语义特征的预测。我们的方法在来自不同临床环境的数据集中预测肺癌方面超越了SOTA模型,提供了可解释的输出,帮助临床医生理解模型预测的含义。这种方法还防止模型学习捷径,并在不同临床环境中泛化。代码可在https://github.com/luotingzhuang/CLIP_nodule/获取。
Summary / 总结
This research aims to improve lung nodule malignancy prediction by integrating semantic features from radiologists' assessments with deep learning models. The study fine-tuned a pretrained Contrastive Language-Image Pretraining (CLIP) model with a parameter-efficient approach to align imaging and semantic text features, achieving an AUROC of 0.901 and AUPRC of 0.776 on the NLST test set. The model also showed robust performance on external datasets and, through zero-shot inference, predicted semantic features such as nodule margin (AUROC 0.807), consistency (0.812), and pleural attachment (0.840), giving clinicians explainable outputs.
研究旨在开发一种结合放射科医生评估的语义特征的模型,以预测肺结节恶性程度,解决之前模型依赖手动注释和解释性差的问题。该模型使用了细调的对比语言-图像预训练(CLIP)方法来对齐影像和语义文本特征,在NLST测试集上实现了0.901的AUROC和0.776的AUPRC,并在外部数据集上表现出稳健的结果。该模型在预测结节边缘、一致性及胸膜附着等特定特征方面也表现出高AUROC,为临床医生提供可解释的输出。
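The zero-shot inference step mentioned in the lung nodule entry above follows the general CLIP recipe: compare one image embedding against the embeddings of several candidate text prompts and turn the cosine similarities into probabilities. The sketch below shows that recipe only. The embeddings are toy vectors, the prompt strings are hypothetical, and the temperature of 0.01 is an assumed value, so none of it reflects the authors' actual model or prompts.

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """CLIP-style zero-shot scoring.

    image_emb: (d,)   embedding of one image
    text_embs: (C, d) embeddings of C candidate text prompts
    Returns a length-C probability vector over the prompts.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (txt @ img) / temperature   # scaled cosine similarities
    logits = logits - logits.max()       # stabilize the exponentials
    p = np.exp(logits)
    return p / p.sum()

# Hypothetical prompts; a real system would encode them with the text encoder.
prompts = ["a nodule with a smooth margin", "a nodule with a spiculated margin"]
text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])  # toy prompt embeddings
image_emb = np.array([0.9, 0.1])                # toy image embedding

probs = zero_shot_probs(image_emb, text_embs)
print(prompts[int(np.argmax(probs))], probs)
```

Per-feature AUROC values like those reported in the abstract would then be computed by scoring such probabilities against radiologist labels over a whole test set.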
A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models
Authors: Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi
Venue: EMNLP 2025
First: 2025-05-29T18:55:05+00:00 · Latest: 2025-11-01T17:22:45+00:00
Comments: Accepted at EMNLP 2025, 34 pages, 25 figures
Abstract
Chain-of-thought (CoT) reasoning enhances the performance of large language
models, but questions remain about whether these reasoning traces faithfully
reflect the internal processes of the model. We present the first comprehensive
study of CoT faithfulness in large vision-language models (LVLMs),
investigating how both text-based and previously unexplored image-based biases
affect reasoning and bias articulation. Our work introduces a novel,
fine-grained evaluation pipeline for categorizing bias articulation patterns,
enabling significantly more precise analysis of CoT reasoning than previous
methods. This framework reveals critical distinctions in how models process and
respond to different types of biases, providing new insights into LVLM CoT
faithfulness. Our findings reveal that subtle image-based biases are rarely
articulated compared to explicit text-based ones, even in models specialized
for reasoning. Additionally, many models exhibit a previously unidentified
phenomenon we term "inconsistent" reasoning: correctly reasoning before
abruptly changing answers, serving as a potential canary for detecting biased
reasoning from unfaithful CoTs. We then apply the same evaluation pipeline to
revisit CoT faithfulness in LLMs across various levels of implicit cues. Our
findings reveal that current language-only reasoning models continue to
struggle with articulating cues that are not overtly stated.
中文标题/摘要
标题:大型(视觉)语言模型中的偏见与链式思维忠实性探究
链式思维(CoT)推理增强了大型语言模型的性能,但关于这些推理过程是否忠实反映模型内部过程的问题仍然存在。我们首次全面研究了大型视觉-语言模型(LVLM)中的CoT忠实性,探讨了基于文本和以前未探索的基于图像的偏见如何影响推理和偏见表达。我们的工作引入了一种新颖的细粒度评估管道,用于分类偏见表达模式,使CoT推理的分析比以前的方法更加精确。该框架揭示了模型处理和响应不同类型偏见的关键区别,提供了关于LVLM CoT忠实性的新见解。我们的研究发现,与明确的基于文本的偏见相比,微妙的基于图像的偏见很少被表达,即使在专门用于推理的模型中也是如此。此外,许多模型表现出一种以前未被识别的现象,我们称之为“不一致”的推理——正确推理后突然改变答案,这可能是一个检测不忠实CoT推理的潜在指标。然后,我们使用相同的评估管道重新审视了各种隐含线索水平下LLM的CoT忠实性。我们的研究发现,当前的语言仅推理模型仍然难以表达未明确陈述的线索。
Summary / 总结
This study investigates the faithfulness of chain-of-thought (CoT) reasoning in large vision-language models (LVLMs), focusing on both text-based and image-based biases. It introduces a novel evaluation pipeline to categorize bias articulation patterns, revealing that subtle image-based biases are rarely articulated compared to explicit text-based ones. The research also identifies a new phenomenon called 'inconsistent' reasoning, where models may change answers abruptly after correct reasoning, potentially indicating biased reasoning from unfaithful CoTs. The findings suggest that current language-only reasoning models still struggle with implicit cues that are not explicitly stated.
该研究探讨了大型视觉语言模型(LVLM)中链式思考(CoT)推理的忠实性,重点关注文本和图像偏见。研究引入了一种新的评估管道来分类偏见表达模式,发现微妙的图像偏见很少被表达,相比之下,明确的文本偏见则更为常见。此外,许多模型表现出一种先前未识别的现象,即不一致的推理——在正确推理后突然改变答案,这可能是检测不忠实CoT偏见的潜在指标。研究结果表明,当前的语言-only推理模型仍然难以处理未明确陈述在输入中的暗示线索。