arXiv 论文速递

2025-12-25 03:29
Snapshot: 20251225_0329
FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models
Authors: Kaitong Cai, Jusheng Zhang, Jing Yang, Yijia Fan, Pengtao Xie, Jian Wang, Keze Wang
First: 2025-12-23T18:05:43+00:00 · Latest: 2025-12-23T18:05:43+00:00
Comments: Under submission
Abstract
Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment. We propose FlashVLM, a text-guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language model space. This extrinsic relevance is fused with intrinsic visual saliency using log-domain weighting and temperature-controlled sharpening. In addition, a diversity-preserving partition retains a minimal yet representative set of background tokens to maintain global context. Under identical token budgets and evaluation protocols, FlashVLM achieves beyond-lossless compression, slightly surpassing the unpruned baseline while pruning up to 77.8 percent of visual tokens on LLaVA 1.5, and maintaining 92.8 percent accuracy even under 94.4 percent compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state-of-the-art efficiency-performance trade-offs while maintaining strong robustness and generalization across mainstream VLMs.
中文标题/摘要
标题:FlashVLM:文本引导的视觉标记选择框架用于大型多模态模型
大型视觉-语言模型(VLMs)通常每张图像或视频帧处理数百或数千个视觉标记,导致二次注意力成本和大量冗余。现有的标记减少方法往往忽视了文本查询或依赖于深度注意力图,这些图在剧烈剪枝下的不稳定性导致语义对齐下降。 我们提出了一种FlashVLM,这是一种文本引导的视觉标记选择框架,能够动态适应查询。FlashVLM 不依赖于嘈杂的注意力权重,而是计算投影图像标记与语言模型空间中归一化文本嵌入之间的显式跨模态相似性。这种外在的相关性与内在的视觉显著性通过对数域加权和温度控制锐化进行融合。此外,一种保留多样性的划分保留了少量但具有代表性的背景标记,以保持全局上下文。 在相同的标记预算和评估协议下,FlashVLM 实现了超越无损压缩的效果,即使在对LLaVA 1.5剪枝高达77.8%的情况下,仍略优于未剪枝的基线,同时保持92.8%的准确率,即使在高达94.4%的压缩下也是如此。在14个图像和视频基准上的广泛实验表明,FlashVLM 在保持强大鲁棒性和泛化能力的同时,提供了最先进的效率性能折衷。
Summary / 总结
FlashVLM is a text-guided visual token selection framework that dynamically adapts visual inputs to textual queries. It computes an explicit cross-modal similarity between projected image tokens and text embeddings, fuses it with intrinsic visual saliency, and retains a minimal set of background tokens for global context. On LLaVA 1.5 it slightly surpasses the unpruned baseline while pruning up to 77.8% of visual tokens, and it maintains 92.8% accuracy under 94.4% compression. Across 14 image and video benchmarks it delivers state-of-the-art efficiency-performance trade-offs.
FlashVLM 是一种文本引导的视觉标记选择框架,通过计算显式的跨模态相似性并将其与内在的视觉显著性融合,动态适应视觉输入到文本查询。它实现了超越无损压缩,超越未剪枝基线,在LLaVA 1.5 上剪枝高达 77.8% 的视觉标记,同时在 94.4% 压缩下保持 92.8% 的准确性。在 14 个图像和视频基准上的广泛实验表明,FlashVLM 提供了最先进的效率性能折衷,具有强大的鲁棒性和泛化能力。
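The selection steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `select_tokens`, the single pooled text embedding, the mixing weight `alpha`, the default `temperature`, and the greedy "least similar to kept tokens" rule for background tokens are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def select_tokens(tokens, text_emb, saliency, budget, n_background=1,
                  alpha=0.5, temperature=0.5):
    """Return sorted indices of the visual tokens to keep (hypothetical helper).

    tokens: projected visual token vectors; text_emb: one pooled text embedding;
    saliency: positive intrinsic saliency score per token.
    """
    if not 0 <= n_background < budget <= len(tokens):
        raise ValueError("need 0 <= n_background < budget <= len(tokens)")
    eps = 1e-6
    # Extrinsic relevance: cosine similarity mapped from [-1, 1] to (0, 1].
    relevance = [(cosine(t, text_emb) + 1.0) / 2.0 + eps for t in tokens]
    # Log-domain weighted fusion. Dividing by a temperature below 1 widens the
    # score gaps; it does not change the ranking used for top-k below.
    scores = [
        (alpha * math.log(r) + (1.0 - alpha) * math.log(s + eps)) / temperature
        for r, s in zip(relevance, saliency)
    ]
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    n_fg = budget - n_background
    keep, rest = order[:n_fg], order[n_fg:]
    # Background partition (assumed rule): greedily add the leftover token
    # that is least similar to anything already kept.
    for _ in range(n_background):
        best = min(rest, key=lambda i: max(cosine(tokens[i], tokens[j]) for j in keep))
        keep.append(best)
        rest.remove(best)
    return sorted(keep)
```

With four 2-D tokens, a query aligned with the first axis, and a budget of three, the sketch keeps the two query-relevant tokens plus the leftover token that differs most from them.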
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Authors: Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi
First: 2025-12-23T17:56:36+00:00 · Latest: 2025-12-23T17:56:36+00:00
Abstract
Vision-language models (VLM) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolvement of object geometry and relationship in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.
中文标题/摘要
标题:在四维中学习推理:视觉语言模型的动态空间理解
视觉语言模型(VLM)在一般理解方面表现出色,但在动态空间推理(DSR)方面仍然较弱,即在时间维度上对3D空间中物体几何形状和关系的变化进行推理,这主要是由于缺乏可扩展的四维感知训练资源。为了在数据集、基准和模型的各个方面弥合这一差距,我们引入了DSR套件。首先,我们提出了一种自动流水线,从野外视频中生成DSR的多项选择题-答案对。通过利用现代视觉基础模型,该流水线提取了丰富的几何和运动信息,包括相机姿态、局部点云、物体掩码、方向和3D轨迹。这些几何线索使DSR-Train得以构建,进一步通过人工精炼构建DSR-基准用于评估。与以往工作相比,我们的数据强调(i)野外视频来源,(ii)物体和场景级别的3D需求,(iii)视角变换,(iv)多物体交互,以及(v)细粒度、程序化的答案。除了数据,我们还提出了一种轻量级的几何选择模块(GSM),以无缝地将几何先验整合到VLM中,该模块压缩了问题语义并从预训练的四维重建先验中提取与问题相关的信息,形成一组紧凑的几何标记。这种有针对性的提取避免了向模型灌输无关知识。实验表明,将DSR-Train和GSM集成到Qwen2.5-VL-7B中显著增强了其动态空间推理能力,同时在通用视频理解基准测试中保持了准确性。
Summary / 总结
The research aims to improve vision-language models' dynamic spatial reasoning (DSR) by addressing the scarcity of 4D-aware training resources. The authors introduce DSR Suite, which includes an automated pipeline for generating DSR question-answer pairs from in-the-wild videos and a lightweight Geometry Selection Module (GSM) to integrate geometric priors into VLMs. Experimental results show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability while maintaining accuracy on general video understanding benchmarks.
研究旨在通过解决4D感知训练资源稀缺问题,提升视觉语言模型的动态空间推理能力(DSR)。提出了DSR套件,包括从野生视频中自动生成DSR问答对的自动化管道和轻量级的几何选择模块(GSM),以将几何先验整合到VLM中。该套件强调野生视频来源、3D需求、视角变换、多对象交互和细粒度答案。实验表明,将DSR-Train和GSM整合到Qwen2.5-VL-7B中可以显著增强其DSR能力,同时保持一般视频理解的准确性。
Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios
Authors: Mingwei Tang, Jiahao Nie, Guang Yang, Ziqing Cui, Jie Li
Venue: WACV 2026
First: 2025-12-23T17:55:35+00:00 · Latest: 2025-12-23T17:55:35+00:00
Comments: Accepted to WACV 2026
Abstract
Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical cross-modal modulation module. Second, it involves supervision signals at each granularity to facilitate alignment between visual and textual features and enhance the utility of auxiliary text. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content, further strengthening the cross-modal modulation and alignment. Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks.
中文标题/摘要
标题:多粒度文本引导图像融合以应对多曝光和多焦点场景
图像融合旨在从在具有挑战性条件下拍摄的一对输入中合成一张高质量的图像,例如不同的曝光水平或焦深。核心挑战在于有效处理输入之间的动态范围和焦深差异。随着视觉语言模型的出现,最近的方法将文本描述作为辅助指导以提高融合质量。然而,简单地引入粗粒度描述会阻碍对细粒度细节的理解,并且对跨模态对齐提出了挑战。为了解决这些限制,我们提出了多粒度文本引导图像融合(MTIF),这是一种具有三个关键设计的新型融合范式。首先,它引入了多粒度文本描述,分别捕捉细粒度细节、结构线索和语义内容,通过分层跨模态调制模块引导图像融合。其次,它在每个粒度级别引入监督信号,以促进视觉和文本特征之间的对齐并增强辅助文本的实用性。第三,它采用了一种基于显著性的增强模块,通过密集的语义内容增强训练数据,进一步加强跨模态调制和对齐。广泛的实验表明,MTIF在多曝光和多焦点图像融合任务中始终优于先前的方法。
Summary / 总结
The paper addresses the challenge of fusing images with different exposure levels and focus depths by proposing Multi-grained Text-guided Image Fusion (MTIF). MTIF uses hierarchical textual descriptions to guide the fusion process, including fine details, structural cues, and semantic content. It also incorporates granularity-specific supervision signals and a saliency-driven enrichment module to improve cross-modal alignment. Experimental results demonstrate that MTIF outperforms existing methods in both multi-exposure and multi-focus image fusion tasks.
论文提出了一种多粒度文本引导图像融合方法(MTIF),以解决不同曝光和焦深的图像融合问题。MTIF通过层次化的文本描述来引导融合过程,包含不同粒度的监督信号,并采用显著性驱动的增强模块。实验表明,MTIF在多曝光和多焦点场景中均优于现有方法。
Video Generation Models Are Good Latent Reward Models
Authors: Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Fan Tang
First: 2025-11-26T16:14:18+00:00 · Latest: 2025-12-23T15:17:06+00:00
Abstract
Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.
中文标题/摘要
标题:视频生成模型是良好的潜在奖励模型
奖励反馈学习(ReFL)已被证明对于使图像生成与人类偏好对齐非常有效。然而,将其扩展到视频生成面临着重大挑战。现有的视频奖励模型依赖于为像素空间输入设计的视觉语言模型,这将ReFL优化限制在昂贵的VAE解码之后的近完全去噪步骤中。这种像素空间的方法会产生大量的内存开销并增加训练时间,而且其后期优化缺乏早期监督,仅能优化视觉质量而不能优化基本的运动动态和结构一致性。在本文中,我们展示了预训练的视频生成模型自然适合在噪声潜在空间中进行奖励建模,因为它们明确设计为可以处理任意时间步的噪声潜在表示,并通过其序列建模能力内在地保留时间信息。因此,我们提出了过程奖励反馈学习(PRFL)框架,该框架在潜在空间中完全进行偏好优化,从而在整个去噪链中实现高效的梯度反向传播,而无需VAE解码。广泛的实验表明,PRFL在与人类偏好对齐方面显著提高,同时与RGB ReFL相比在内存消耗和训练时间上实现了显著减少。
Summary / 总结
This work addresses the challenge of applying reward feedback learning (ReFL) to video generation by proposing Process Reward Feedback Learning (PRFL). PRFL uses pre-trained video generation models as reward models in the noisy latent space, so preference optimization can backpropagate through the full denoising chain without computationally expensive VAE decoding. Experiments show significantly improved alignment with human preferences, together with substantial reductions in memory consumption and training time compared to RGB ReFL.
本文针对将奖励反馈学习(ReFL)应用于视频生成所面临的挑战,提出了Process Reward Feedback Learning(PRFL)框架。PRFL利用预训练的视频生成模型直接在噪声的潜在空间中优化偏好,避免了昂贵的VAE解码。这种方法减少了内存使用和训练时间,同时在人类偏好匹配方面优于传统的像素空间ReFL方法。
Scaling Laws for Energy Efficiency of Local LLMs
Authors: Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Mikail Okyay, Bakbergen Ryskulov, David Montero, Samuel Mugel, Román Orús
First: 2025-12-18T13:40:33+00:00 · Latest: 2025-12-23T15:02:39+00:00
Abstract
Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware--including laptops, desktops, industrial controllers, and embedded systems--relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven "resolution knee", where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.
中文标题/摘要
标题:局部LLM能效的标度律
在边缘设备上部署局部大型语言模型和视觉-语言模型需要在准确性与受限的计算和能源预算之间进行平衡。尽管图形处理器主导了现代人工智能部署,但大多数消费级硬件——包括笔记本电脑、台式机、工业控制器和嵌入式系统——仍然依赖于中央处理器。尽管如此,仅中央处理器的推理计算法则对局部语言和视觉-语言工作负载的研究仍然相对较少。我们系统地在两个广泛用于局部推理的中央处理器级别上对大型语言和视觉-语言模型进行了基准测试:一台搭载M2芯片的MacBook Pro,代表主流笔记本电脑级部署,以及一个Raspberry Pi 5,代表受限的、低功耗嵌入式设置。通过基于连续采样处理器和内存使用情况并结合面积-曲线积分的方法,我们描述了计算负载随输入文本长度对语言模型和随图像分辨率对视觉-语言模型的标度关系。我们发现了两条经验标度律:(1)语言模型推理的计算成本大约与标记长度成线性关系;(2)视觉-语言模型表现出一种预处理驱动的“分辨率拐点”,其中计算在内部分辨率限制以上保持恒定,在以下则急剧下降。除了这些定律,我们展示了量子启发式压缩可以将处理器和内存使用量最多减少71.9%,能源消耗最多减少62%,同时保持或提高语义准确性。这些结果提供了对局部语言和视觉-语言工作负载的多模态仅中央处理器计算法则的系统量化,并指出了模型压缩和输入分辨率预处理作为可持续边缘推理的有效、低成本杠杆。
Summary / 总结
This study investigates the energy efficiency of deploying large language models and vision-language models on edge devices, focusing on CPU-only inference. By benchmarking these models on a MacBook Pro M2 and a Raspberry Pi 5, the researchers identified two empirical scaling laws: the computational cost of language-model inference scales approximately linearly with token length, while vision-language models show a preprocessing-driven "resolution knee", where compute remains constant above an internal resolution clamp and decreases sharply below it. Additionally, quantum-inspired compression was found to reduce processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These findings quantify computational scaling for local inference and point to model compression and input-resolution preprocessing as low-cost levers for sustainable edge inference.
研究探讨了在边缘设备上部署大规模语言模型和视觉-语言模型时的能效问题,重点关注中央处理单元。通过在MacBook Pro M2和Raspberry Pi 5上进行基准测试,研究人员发现两个缩放定律:语言模型的计算成本随词元长度线性增加,而视觉-语言模型则表现出一个预处理驱动的“分辨率拐点”,即在某一分辨率以上,计算量保持不变,在此之下则急剧下降。此外,量子启发式压缩最多可减少71.9%的处理器和内存使用,以及62%的能量消耗,同时保持或提高语义准确性。这些发现为局部推理任务的计算缩放提供了系统性理解,并指出了可持续边缘推理的有效策略。
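The measurement idea in the abstract (continuous sampling plus area-under-curve integration, then a scaling fit) can be illustrated with a short sketch. Everything here is an assumption for illustration: the helper names, the trapezoidal rule as the integration method, the least-squares line fit, and especially the clamp value of 448 pixels, which is a made-up stand-in for whatever internal resolution limit a given model's preprocessing applies.

```python
def area_under_curve(times, values):
    """Trapezoidal integral of sampled usage values over time."""
    total = 0.0
    for i in range(1, len(times)):
        total += 0.5 * (values[i] + values[i - 1]) * (times[i] - times[i - 1])
    return total

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept, e.g. cost versus token length."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def effective_pixels(width, height, clamp=448):
    """Pixels reaching the vision encoder if preprocessing downsizes the longer
    side to `clamp` (hypothetical value) and never upsizes."""
    scale = min(1.0, clamp / max(width, height))
    return round(width * scale) * round(height * scale)
```

Under this toy model, any image larger than the clamp yields the same pixel count after preprocessing, which is one way a flat region above the knee and a drop below it could arise.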
Chain-of-Anomaly Thoughts with Large Vision-Language Models
Authors: Pedro Domingos, João Pereira, Vasco Lopes, João Neves, David Semedo
First: 2025-12-23T15:01:05+00:00 · Latest: 2025-12-23T15:01:05+00:00
Comments: 2 pages, 3 figures, 1 table. Accepted for RECPAD 2025
Abstract
Automated video surveillance with Large Vision-Language Models is limited by their inherent bias towards normality, often failing to detect crimes. While Chain-of-Thought reasoning strategies show significant potential for improving performance in language tasks, the lack of inductive anomaly biases in their reasoning further steers the models towards normal interpretations. To address this, we propose Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces inductive criminal bias in the reasoning process through a final, anomaly-focused classification layer. Our method significantly improves Anomaly Detection, boosting F1-score by 11.8 p.p. on challenging low-resolution footage and Anomaly Classification by 3.78 p.p. in high-resolution videos.
中文标题/摘要
标题:大型视觉语言模型中的异常链思考
大型视觉语言模型在自动化视频监控中受限于其对正常情况的固有偏见,往往无法检测犯罪。虽然链式思考推理策略在语言任务中显示出显著的潜力,但在推理过程中缺乏归纳异常偏见进一步引导模型向正常解释。为了解决这一问题,我们提出了一种名为异常链思考(CoAT)的多智能体推理框架,通过最终的异常分类层引入归纳犯罪偏见。我们的方法显著提高了异常检测,F1分数在低分辨率视频中提高了11.8个百分点,在高分辨率视频中的异常分类提高了3.78个百分点。
Summary / 总结
The paper addresses the limitation of large vision-language models in detecting crimes during automated video surveillance due to their bias towards normality. It proposes Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces inductive criminal bias through a final anomaly-focused classification layer. The method improves Anomaly Detection F1-score by 11.8 percentage points on challenging low-resolution footage and Anomaly Classification by 3.78 percentage points on high-resolution videos.
研究旨在通过解决大型视觉-语言模型在检测异常,尤其是犯罪方面的局限性,来提升自动视频监控。提出的Chain-of-Anomaly-Thoughts (CoAT)框架通过引入异常分类层来改进推理。关键实验结果显示,在低分辨率视频上异常检测的F1分数提高了11.8个百分点,在高分辨率视频上异常分类提高了3.78个百分点。
LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer
Authors: Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang
First: 2025-08-01T09:51:54+00:00 · Latest: 2025-12-23T14:27:42+00:00
Comments: 8 pages, 5 figures, 3 tables
Abstract
In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
中文标题/摘要
标题:LAMIC:基于布局感知的多图像合成通过多模态扩散变换器的可扩展性
在可控图像合成中,从多个参考中生成具有空间布局意识的连贯且一致的图像仍然是一个开放的挑战。我们提出了LAMIC,一种布局感知的多图像合成框架,首次以无需训练的方式将单参考扩散模型扩展到多参考场景。基于MMDiT模型,LAMIC引入了两种即插即用的注意力机制:1)组隔离注意力(GIA)以增强实体分离;2)区域调节注意力(RMA)以实现布局感知生成。为了全面评估模型能力,我们进一步引入了三个指标:1)包含比(IN-R)和填充比(FI-R)以评估布局控制;2)背景相似度(BG-S)以衡量背景一致性。大量实验表明,LAMIC在大多数主要指标上均取得了最先进的性能:在所有设置中,它在ID-S、BG-S、IN-R和AVG得分上始终优于现有的多参考基线,并在复杂合成任务中实现了最佳的DPG。这些结果表明,LAMIC在保持身份、保存背景、布局控制和遵循提示方面具有优越的能力,所有这些均无需任何训练或微调,展示了强大的零样本泛化能力。通过继承先进的单参考模型的优势并使其无缝扩展到多图像场景,LAMIC为可控多图像合成建立了一个新的无需训练的范式。随着基础模型的不断进化,LAMIC的性能预计会相应地扩展。我们的实现可在以下链接获取:https://github.com/Suchenl/LAMIC。
Summary / 总结
LAMIC is a Layout-Aware Multi-Image Composition framework that extends single-reference diffusion models to multi-reference scenarios without training. Built on MMDiT, it introduces Group Isolation Attention (GIA) for entity disentanglement and Region-Modulated Attention (RMA) for layout-aware generation, and it proposes three metrics: Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for layout control, and Background Similarity (BG-S) for background consistency. LAMIC consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings and achieves the best DPG in complex composition tasks, indicating strong identity keeping, background preservation, layout control, and prompt-following. All of this is achieved without training or fine-tuning, and performance is expected to scale as foundation models evolve.
LAMIC 是一个框架,旨在从多个参考中生成既连贯又一致的图像,并保持空间布局意识。它基于 MMDiT,并引入了两种注意力机制:Group Isolation Attention (GIA) 用于实体分离,以及 Region-Modulated Attention (RMA) 用于布局感知生成。LAMIC 在 ID-S、BG-S、IN-R 和 AVG 等指标上始终优于现有的多参考基线,并在复杂合成任务中取得最佳 DPG,展示了在身份保持、背景保留、布局控制和指令跟随方面的优越能力,且无需任何训练或微调,体现了强大的零样本泛化能力,并为可控多图像合成建立了新的无训练范式。
CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation
Authors: V. Kovalev, A. Kuvshinov, A. Buzovkin, D. Pokidov, D. Timonin
First: 2025-12-23T13:44:41+00:00 · Latest: 2025-12-23T13:44:41+00:00
Comments: 37 pages, 42 figures
Abstract
Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of **thinking** based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free, model-agnostic framework that brings this structured reasoning paradigm to multimodal image generation. CRAFT decomposes a prompt into dependency-structured visual questions, verifies generated images using a vision-language model, and applies targeted prompt edits through an LLM agent only where constraints fail. The process iterates with an explicit stopping criterion once all constraints are satisfied, yielding an interpretable and controllable inference-time refinement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.
中文标题/摘要
标题:CRAFT:连续推理和代理反馈调优的多模态文本到图像生成
近期研究表明,在不重新训练的情况下,推理时的推理和反思可以提高文本到图像生成的效果。然而,现有方法往往依赖于隐式的整体批评或不受限制的提示重写,这使得它们的行为难以解释、控制或可靠地停止。相比之下,大型语言模型得益于基于验证、目标修正和早期停止的明确结构化形式的**思考**。 我们提出了CRAFT(连续推理和代理反馈调优),这是一种无需训练、模型无关的框架,将这种结构化推理范式引入多模态图像生成。CRAFT 将提示分解为依赖结构化的视觉问题,使用视觉语言模型验证生成的图像,并仅在约束失败时通过LLM代理应用目标提示编辑。该过程在所有约束条件都满足时使用明确的停止标准进行迭代,从而产生可解释且可控的推理时细化循环。 在多个模型家族和具有挑战性的基准测试中,CRAFT 一致地提高了组合准确性、文本呈现和基于偏好的评估,特别是对于轻量级生成器而言效果显著。重要的是,这些改进仅带来微不足道的推理时开销,使得较小或更便宜的模型能够接近成本高昂系统的质量。我们的结果表明,明确结构化、基于约束的推理是提高多模态生成模型可靠性的关键成分。
Summary / 总结
CRAFT is a training-free, model-agnostic framework that enhances text-to-image generation by incorporating structured reasoning and targeted prompt edits. It decomposes prompts into visual questions, verifies generated images using a vision-language model, and applies edits through an LLM agent where necessary. This process iterates until all constraints are satisfied, resulting in improved compositional accuracy, text rendering, and preference-based evaluations, especially for lightweight generators, with minimal inference-time overhead.
CRAFT 是一个无需训练、适用于多种模型的框架,通过引入结构化推理和目标导向的提示编辑来提升文本到图像的生成。它将提示分解为视觉问题,使用视觉语言模型验证生成的图像,并在必要时通过LLM代理应用编辑。这一过程在所有约束条件满足时迭代进行,从而提高了组合准确性、文本渲染和基于偏好的评估,尤其是对于轻量级生成器,同时仅产生微小的推理时间开销。
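The verify-then-edit loop with an explicit stopping criterion can be sketched as below. This is only a schematic under assumptions: the `refine` function, its `generate`, `verify`, and `edit` callbacks, the `(question, dependencies)` representation, and the `max_rounds` cap are hypothetical placeholders standing in for an image generator, a vision-language verifier, and an LLM editing agent.

```python
def refine(prompt, questions, generate, verify, edit, max_rounds=4):
    """Iterate generate -> verify -> targeted edit until every check passes.

    questions: list of (question, dependencies) in dependency order.
    generate(prompt) -> image; verify(image, question) -> bool;
    edit(prompt, failed_questions) -> new prompt. All are caller-supplied.
    """
    image = None
    for round_idx in range(1, max_rounds + 1):
        image = generate(prompt)
        passed, failed = set(), []
        for question, deps in questions:
            # A question whose prerequisites failed cannot be judged yet.
            if any(d not in passed for d in deps):
                continue
            if verify(image, question):
                passed.add(question)
            else:
                failed.append(question)
        if len(passed) == len(questions):
            return prompt, image, round_idx  # explicit stopping criterion
        # Edit only with respect to the constraints that actually failed.
        prompt = edit(prompt, failed)
    return prompt, image, max_rounds
```

A toy run with string stand-ins (the "image" is just the prompt, and a check passes when its text appears in it) stops in the second round, once the failed constraint has been edited into the prompt.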
ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge
Authors: Yuntao Dai, Hang Gu, Teng Wang, Qianyu Cheng, Yifei Zheng, Zhiyong Qiu, Lei Gong, Wenqi Lou, Xuehai Zhou
First: 2025-12-23T11:29:03+00:00 · Latest: 2025-12-23T11:29:03+00:00
Abstract
Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20 to 30 Hz, current VLA models typically operate at only 3-5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms. At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55x improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.
中文标题/摘要
标题:ActionFlow:边缘设备上视觉语言动作模型的流水线加速
视觉-语言-动作(VLA)模型已成为机器人感知和控制的统一范式,使长期任务执行成为可能。然而,它们在动态现实环境中的部署受到高推理延迟的严重阻碍。虽然平滑的机器人交互需要20到30赫兹的控制频率,但当前的VLA模型由于自回归解码的内存限制,通常在边缘设备上只能以3到5赫兹的速度运行。现有的优化往往需要大量的重新训练或牺牲模型的准确性。为了解决这一问题,我们提出了ActionFlow,这是一种针对资源受限边缘平台的系统级推理框架。ActionFlow的核心是一种跨请求流水线策略,这是一种新颖的调度器,重新定义了VLA推理为宏流水线中的微请求。该策略在连续时间步中智能地将内存受限的解码阶段与计算受限的预填充阶段进行批处理,以最大化硬件利用率。此外,为了支持这种调度,我们提出了跨请求状态打包前向运算符和统一的KV环形缓冲区,将碎片化的内存操作融合为高效的密集计算。实验结果表明,ActionFlow在OpenVLA-7B模型上实现了2.55倍的FPS提升,无需重新训练,从而在边缘硬件上实现实时动态操作。我们的工作可在https://anonymous.4open.science/r/ActionFlow-1D47获取。
Summary / 总结
ActionFlow is a system-level inference framework designed to reduce the high inference latency of Vision-Language-Action (VLA) models on edge devices. It employs a Cross-Request Pipelining strategy that batches memory-bound Decode phases with compute-bound Prefill phases across consecutive time steps to maximize hardware utilization, supported by a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer. Experimental results show a 2.55x improvement in FPS on the OpenVLA-7B model without retraining, narrowing the gap between the 3-5 Hz typical of current VLA models on edge devices and the 20 to 30 Hz that smooth robotic interaction requires.
ActionFlow 是一个系统级推理框架,旨在减少 Vision-Language-Action 模型在边缘设备上的高推理延迟,实现实时的机器人交互。它采用了一种跨请求流水线策略,将内存受限的解码阶段与计算受限的预填充阶段进行批处理,以最大化硬件利用率。实验结果显示,ActionFlow 在 OpenVLA-7B 模型上将 FPS 提高了 2.55 倍,无需重新训练模型即可实现边缘硬件上的实时动态操作。
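As a rough back-of-the-envelope check, the figures quoted in the abstract can be combined as follows. This assumes, purely for illustration, that the 2.55x FPS gain applies to a baseline in the 3-5 Hz range cited for current VLA models; the abstract does not state the actual baseline or final frame rate, and the helper names are made up for this example.

```python
def scaled_rate(base_hz, speedup):
    """Control rate after applying a throughput speedup factor."""
    return base_hz * speedup

def speedup_needed(base_hz, target_hz):
    """Speedup factor required to reach a target control rate."""
    return target_hz / base_hz

# Assumed baseline range (3-5 Hz) combined with the reported 2.55x gain.
low = scaled_rate(3.0, 2.55)
high = scaled_rate(5.0, 2.55)
print(f"{low:.2f} to {high:.2f} Hz")

# Factor that would be needed to reach 20 Hz from a 5 Hz baseline.
print(f"{speedup_needed(5.0, 20.0):.1f}x")
```

Under that assumption the sped-up rate lands at roughly 7.65 to 12.75 Hz, and reaching 20 Hz from 5 Hz would take a 4x factor, so the reported gain should be read as a substantial step toward the 20 to 30 Hz range rather than proof of reaching it.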
LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation
Authors: Daniele Cardullo, Simone Teglia, Irene Amerini
First: 2025-12-23T11:14:58+00:00 · Latest: 2025-12-23T11:14:58+00:00
Abstract
With the rise of easily accessible tools for generating and manipulating multimedia content, realistic synthetic alterations to digital media have become a widespread threat, often involving manipulations across multiple modalities simultaneously. Recently, such techniques have been increasingly employed to distort narratives of important events and to spread misinformation on social media, prompting the development of misinformation detectors. In the context of misinformation conveyed through image-text pairs, several detection methods have been proposed. However, these approaches typically rely on computationally intensive architectures or require large amounts of annotated data. In this work we introduce LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation, a model-soup initialized multimodal misinformation detector designed to operate under a limited annotation setup and constrained training resources. LADLE-MM is composed of two unimodal branches and a third multimodal one that enhances image and text representations with additional multimodal embeddings extracted from BLIP, serving as fixed reference space. Despite using 60.3% fewer trainable parameters than previous state-of-the-art models, LADLE-MM achieves competitive performance on both binary and multi-label classification tasks on the DGM4 benchmark, outperforming existing methods when trained without grounding annotations. Moreover, when evaluated on the VERITE dataset, LADLE-MM outperforms current state-of-the-art approaches that utilize more complex architectures involving Large Vision-Language-Models, demonstrating the effective generalization ability in an open-set setting and strong robustness to unimodal bias.
中文标题/摘要
标题:LADLE-MM:基于有限标注的学习集成多模态虚假信息检测器
随着生成和操控多媒体内容的工具变得易于获取,数字媒体中的现实合成篡改已成为一种普遍威胁,通常涉及多种模态的同时篡改。近年来,此类技术被越来越多地用于扭曲重要事件的叙述并在社交媒体上传播虚假信息,促使开发虚假信息检测器。在图像-文本对的虚假信息传播背景下,已经提出了几种检测方法。然而,这些方法通常依赖于计算密集型架构或需要大量标注数据。在本工作中,我们引入了LADLE-MM:基于有限标注的学习集成多模态虚假信息检测器,这是一种在有限标注设置和受限训练资源下运行的多模态虚假信息检测器。LADLE-MM 由两个单模态分支和一个增强图像和文本表示的第三个多模态分支组成,该分支使用从BLIP提取的附加多模态嵌入作为固定参考空间。尽管LADLE-MM 的可训练参数比之前最先进的模型少60.3%,但在DGM4基准上的二分类和多标签分类任务中,LADLE-MM 达到了竞争性性能,且在未使用语义标注进行训练时优于现有方法。此外,在VERITE数据集上评估时,LADLE-MM 超越了使用更复杂架构(涉及大型视觉-语言模型)的现有方法,展示了在开放集设置下的有效泛化能力和对单模态偏差的强大鲁棒性。
Summary / 总结
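LADLE-MM is a model-soup initialized multimodal misinformation detector designed for limited-annotation scenarios and constrained training resources. It uses two unimodal branches and a multimodal branch that enriches image and text representations with embeddings from BLIP, which serve as a fixed reference space. With 60.3% fewer trainable parameters than previous state-of-the-art models, LADLE-MM performs competitively on binary and multi-label classification on the DGM4 benchmark, outperforming existing methods when trained without grounding annotations. On the VERITE dataset it outperforms approaches built on Large Vision-Language Models, showing effective open-set generalization and strong robustness to unimodal bias.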
LADLE-MM 是一种针对有限标注场景设计的多模态 misinformation 检测器。它包含两个单模态分支和一个通过 BLIP 提取的多模态嵌入增强的多模态分支,所需的可训练参数比之前的方法少得多。尽管如此,LADLE-MM 在 DGM4 基准上的性能与现有方法相当,并且在 VERITE 数据集上的表现优于使用更复杂架构的方法,展示了其在开放集环境中的有效泛化能力和对单模态偏差的强大鲁棒性。
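The abstract describes the detector as "model-soup initialized" without saying which variant is used. As a general illustration of the idea, a uniform model soup simply averages the parameters of several checkpoints that share an architecture. The sketch below assumes plain dictionaries of flat float lists in place of real framework tensors, and the name `uniform_soup` is hypothetical.

```python
def uniform_soup(state_dicts):
    """Average checkpoints that share the same parameter names and shapes.

    Each checkpoint is a dict mapping a parameter name to a flat list of floats
    (a simplification of framework-specific tensors).
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    soup = {}
    for key in state_dicts[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(sd[key] for sd in state_dicts))
        soup[key] = [sum(col) / len(state_dicts) for col in columns]
    return soup
```

For example, averaging `{"w": [1.0, 2.0]}` and `{"w": [3.0, 4.0]}` gives `{"w": [2.0, 3.0]}`; the averaged weights would then serve as the starting point for further training.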
Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation
Authors: Teqiang Zou, Hongliang Zeng, Yuxuan Nong, Yifan Li, Kehui Liu, Haotian Yang, Xinyang Ling, Xin Li, Lianyang Ma
First: 2025-12-23T09:28:20+00:00 · Latest: 2025-12-23T09:28:20+00:00
Abstract
Most Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals, yet both typically run at a single unified frequency. As a result, policy performance is constrained by the low inference speed of large VLMs. This mandatory synchronous execution severely limits control stability and real-time performance in whole-body robotic manipulation, which involves more joints, larger motion spaces, and dynamically changing views. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS), organizing the system into a fast pathway for high-frequency action generation and a slow pathway for rich VLM reasoning. The system is characterized by two key features. First, a latent representation buffer bridges the slow and fast systems. It stores instruction semantics and action-reasoning representation aligned with the scene-instruction context, providing high-level guidance to the fast pathway. Second, a whole-body action tokenizer provides a compact, unified representation of whole-body actions. Importantly, the VLM and action expert are still jointly trained end-to-end, preserving unified policy learning while enabling asynchronous execution. DuoCore-FS supports a 3B-parameter VLM while achieving 30 Hz whole-body action-chunk generation, approximately three times as fast as prior VLA models with comparable model sizes. Real-world whole-body manipulation experiments demonstrate improved task success rates and significantly enhanced responsiveness compared to synchronous Fast-Slow VLA baselines. The implementation of DuoCore-FS, including training, inference, and deployment, is provided to commercial users by Astribot as part of the Astribot robotic platform.
中文标题/摘要
标题:异步快速-缓慢视觉-语言-行动策略用于全身机器人操作
大多数视觉-语言-行动(VLA)系统将视觉-语言模型(VLM)用于语义推理,并由动作专家生成连续的动作信号,但两者通常以单一的统一频率运行。因此,策略性能受限于大型VLM的低推理速度。这种强制同步执行严重限制了全身机器人操作中的控制稳定性和实时性能,因为这种操作涉及更多的关节、更大的运动空间以及动态变化的视角。我们引入了一个真正异步的快速-缓慢VLA框架(DuoCore-FS),将系统组织成一个高频动作生成的快速路径和一个丰富的VLM推理的缓慢路径。该系统具有两个关键特征。首先,一个潜在表示缓冲区连接慢速和快速系统。它存储与场景指令上下文对齐的指令语义和动作推理表示,为快速路径提供高层次的指导。其次,一个全身动作分词器提供了一个紧凑的、统一的全身动作表示。重要的是,VLM和动作专家仍然以端到端的方式联合训练,保持统一策略学习的同时允许异步执行。DuoCore-FS 支持一个3B参数的VLM,同时实现30 Hz的全身动作片段生成,大约比具有可比模型大小的先前VLA模型快三倍。现实世界的全身操作实验表明,与同步快速-缓慢VLA基线相比,任务成功率和响应性显著提高。DuoCore-FS 的实现,包括训练、推理和部署,由Astribot提供给商业用户,作为Astribot机器人平台的一部分。
Summary / 总结
The research addresses the limitations of synchronous execution in Vision-Language-Action (VLA) systems for whole-body robotic manipulation, where policy performance is constrained by the slow inference speed of large Vision-Language Models (VLMs). It introduces DuoCore-FS, an asynchronous Fast-Slow VLA framework that decouples high-frequency action generation from rich VLM reasoning. A latent representation buffer bridges the two pathways, and a whole-body action tokenizer provides a compact action representation. With a 3B-parameter VLM, the system generates whole-body action chunks at 30 Hz, approximately three times as fast as prior VLA models of comparable size. Real-world experiments show improved task success rates and enhanced responsiveness compared to synchronous Fast-Slow VLA baselines.
该论文提出了DuoCore-FS,一种异步快速-缓慢视觉-语言-动作框架,用于全身机器人操作。它通过将系统分为用于高频动作生成的快速路径和用于VLM推理的缓慢路径来解决同步执行的限制。关键特性包括潜在表示缓冲区和全身动作分词器。DuoCore-FS 在支持3B参数VLM的同时以30 Hz生成全身动作片段,速度约为同等规模的先前VLA模型的三倍,并在真实世界实验中相比同步快速-缓慢VLA基线提高了任务成功率和响应性。该系统作为Astribot机器人平台的一部分提供给商业用户使用。
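The fast-slow split described in this entry can be illustrated with a minimal, deterministic sketch. It is not the DuoCore-FS implementation: the `LatentBuffer` class, the `slow_reasoner` and `fast_policy` functions, and the 10 Hz slow rate are hypothetical placeholders, and only the 30 Hz action rate comes from the abstract. The sketch simulates ticks instead of running real threads, so it only shows the data flow: the fast pathway always reads the most recent latent and never waits for the slow pathway.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LatentBuffer:
    """Holds the most recent latent written by the slow pathway (hypothetical)."""
    latent: Optional[Tuple[float, ...]] = None
    version: int = 0

    def write(self, latent: Tuple[float, ...]) -> None:
        # The slow pathway overwrites the previous latent; no queue is kept.
        self.latent = latent
        self.version += 1

    def read(self) -> Tuple[Optional[Tuple[float, ...]], int]:
        # The fast pathway reads whatever is currently stored.
        return self.latent, self.version


def slow_reasoner(observation: float) -> Tuple[float, ...]:
    # Hypothetical stand-in for VLM reasoning over scene and instruction.
    return (observation, observation * 0.5)


def fast_policy(observation: float, latent: Tuple[float, ...]) -> float:
    # Hypothetical stand-in for the action expert conditioned on the latent.
    return observation + sum(latent)


def simulate(seconds: int = 1, fast_hz: int = 30, slow_hz: int = 10) -> List[Tuple[float, int]]:
    """Run the fast loop every tick and the slow loop every few ticks."""
    buffer = LatentBuffer()
    ticks_per_slow_update = fast_hz // slow_hz
    log: List[Tuple[float, int]] = []
    for tick in range(seconds * fast_hz):
        observation = float(tick)
        if tick % ticks_per_slow_update == 0:
            buffer.write(slow_reasoner(observation))
        latent, version = buffer.read()
        log.append((fast_policy(observation, latent), version))
    return log


log = simulate()
print(len(log), "actions generated;", log[-1][1], "latent updates consumed")
```

In a real deployment the two pathways would run concurrently and the buffer would need synchronization; the tick-based loop here is a simplification chosen so the example stays reproducible.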
Drifting Away from Truth: GenAI-Driven News Diversity Challenges LVLM-Based Misinformation Detection
Authors: Fanxiao Li, Jiaying Wu, Tingchao Fu, Yunyun Dong, Bingbing Song, Wei Zhou
First: 2025-08-18T08:19:43+00:00 · Latest: 2025-12-23T09:17:31+00:00
Abstract
The proliferation of multimodal misinformation poses growing threats to public discourse and societal trust. While Large Vision-Language Models (LVLMs) have enabled recent progress in multimodal misinformation detection (MMD), the rise of generative AI (GenAI) tools introduces a new challenge: GenAI-driven news diversity, characterized by highly varied and complex content. We show that this diversity induces multi-level drift, comprising (1) model-level misperception drift, where stylistic variations disrupt a model's internal reasoning, and (2) evidence-level drift, where expression diversity degrades the quality or relevance of retrieved external evidence. These drifts significantly degrade the robustness of current LVLM-based MMD systems. To systematically study this problem, we introduce DriftBench, a large-scale benchmark comprising 16,000 news instances across six categories of diversification. We design three evaluation tasks: (1) robustness of truth verification under multi-level drift; (2) susceptibility to adversarial evidence contamination generated by GenAI; and (3) analysis of reasoning consistency across diverse inputs. Experiments with six state-of-the-art LVLM-based detectors show substantial performance drops (average F1 -14.8%) and increasingly unstable reasoning traces, with even more severe failures under adversarial evidence injection. Our findings uncover fundamental vulnerabilities in existing MMD systems and suggest an urgent need for more resilient approaches in the GenAI era.
中文标题/摘要
标题:偏离真相:GenAI驱动的新闻多样性挑战基于LVLM的虚假信息检测
多模态虚假信息的泛滥对公共话语和社会信任构成了日益增长的威胁。虽然大型视觉-语言模型(LVLM)推动了多模态虚假信息检测(MMD)的近期进展,但生成式AI(GenAI)工具的兴起带来了一个新的挑战:由GenAI驱动的新闻多样性,其特征是内容高度多样化和复杂化。我们表明,这种多样性导致了多级漂移,包括(1)模型级误感知漂移,其中风格变化干扰了模型的内部推理,以及(2)证据级漂移,其中表达多样性降低了检索到的外部证据的质量或相关性。这些漂移显著削弱了当前基于LVLM的MMD系统的稳健性。为了系统地研究这一问题,我们引入了DriftBench,这是一个包含16,000个新闻实例的大规模基准,涵盖六类多样化方式。我们设计了三个评估任务:(1)多级漂移下真实性验证的稳健性;(2)对GenAI生成的对抗性证据污染的易感性;以及(3)对多样化输入的推理一致性分析。六种最先进的基于LVLM的检测器的实验显示,性能显著下降(平均F1 -14.8%),推理轨迹越来越不稳定,在对抗性证据注入下失效更为严重。我们的研究揭示了现有MMD系统的根本性漏洞,并表明在GenAI时代迫切需要更稳健的方法。
Summary / 总结
This paper studies how GenAI-driven news diversity challenges multimodal misinformation detection (MMD). Such diversity induces model-level misperception drift and evidence-level drift, both of which degrade the robustness of current LVLM-based MMD systems. The authors introduce DriftBench, a benchmark of 16,000 news instances across six categories of diversification, and evaluate six state-of-the-art LVLM-based detectors, observing substantial performance drops (average F1 -14.8%), increasingly unstable reasoning traces, and even more severe failures under adversarial evidence injection. The study highlights the need for more resilient MMD approaches in the GenAI era.
该论文探讨了生成AI驱动的新闻多样性对多模态虚假信息检测的挑战,这种多样性导致了模型级和证据级的漂移,削弱了当前基于LVLM的系统的鲁棒性。作者引入了包含16,000个新闻实例的DriftBench基准,并评估了六种最先进的LVLM基检测器,在这些漂移下表现出显著的性能下降和不稳定的推理轨迹,尤其是在对抗性证据注入的情况下。这项研究揭示了现有MMD系统的根本性漏洞,并强调了在生成AI时代需要更加稳健的方法。
Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark
Authors: Hao Guo, Xugong Qin, Jun Jie Ou Yang, Peng Zhang, Gangyan Zeng, Yubo Li, Hailun Lin
Venue: CVPR 2025
First: 2025-12-23T09:14:16+00:00 · Latest: 2025-12-23T09:14:16+00:00
Comments: CVPR 2025
Abstract
Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios where textual queries with fine-grained semantics are usually provided. To bridge this gap, we introduce a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We perform zero-shot and fine-tuning evaluations of existing mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models. A two-stage retrieval method is further investigated for performance improvement while achieving both time and space efficiency. We hope the proposed NL-DIR benchmark can bring new opportunities and facilitate research for the VDU community. Datasets and codes will be publicly available at huggingface.co/datasets/nianbing/NL-DIR.
中文标题/摘要
标题:基于自然语言的文档图像检索:新数据集与基准
文档图像检索(DIR)旨在根据给定的查询从图像库中检索文档图像。现有的DIR方法主要基于图像查询,检索同一粗略语义类别中的文档,例如报纸或收据。然而,这些方法在现实场景中难以有效检索提供有细粒度语义的文本查询的文档图像。为弥合这一差距,我们引入了一个新的基于自然语言的文档图像检索(NL-DIR)基准及其相应的评估指标。在此工作中,自然语言描述作为DIR任务的语义丰富的查询。NL-DIR数据集包含41000张真实的文档图像,每张图像配对五个高质量的细粒度语义查询,这些查询通过大型语言模型生成并结合人工验证进行评估。我们对现有的主流对比视觉-语言模型和无OCR视觉文档理解(VDU)模型进行了零样本和微调评估。进一步研究了两阶段检索方法以提高性能,同时实现时间和空间效率。我们希望提出的NL-DIR基准能带来新的机遇并促进VDU社区的研究。数据集和代码将在huggingface.co/datasets/nianbing/NL-DIR公开。
Summary / 总结
This paper addresses document image retrieval with natural language queries, which carry finer-grained semantics than the image queries used by existing methods. It introduces a new benchmark, NL-DIR, containing 41K document images, each paired with five high-quality, fine-grained semantic queries generated with large language models and manually verified. The study performs zero-shot and fine-tuning evaluations of mainstream contrastive vision-language models and OCR-free VDU models, and further investigates a two-stage retrieval method that improves performance while remaining time- and space-efficient.
该研究旨在通过自然语言查询解决文档图像检索问题,这些查询比现有的图像查询更精细和现实。它引入了一个新的基准NL-DIR,包含41K文档图像,每张图像配以五个高质量的语义查询。研究评估了现有的对比视觉-语言模型和无OCR视觉文档理解(VDU)模型,发现微调这些模型可以提高性能。还提出了一种两阶段检索方法以提高效率。NL-DIR基准旨在推动视觉文档理解领域的研究进展。
Vision Language Models are Confused Tourists
Authors: Patrick Amadeus Irawan, Ikhlasul Akmal Hanif, Muhammad Dehan Al Kautsar, Genta Indra Winata, Fajri Koto, Alham Fikri Aji
First: 2025-11-21T07:14:46+00:00 · Latest: 2025-12-23T08:46:41+00:00
Abstract
Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, despite being crucial to support diversity and multicultural societies. Existing evaluations often rely on benchmarks featuring only a singular cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs' stability against perturbed geographical cues. Our experiments reveal a critical vulnerability, where accuracy drops heavily under simple image-stacking perturbations and even worsens with its image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.
中文标题/摘要
标题:视觉语言模型是困惑的游客
尽管文化维度一直是评估视觉-语言模型(VLMs)的关键方面之一,但它们在面对多样文化输入时的稳定性仍然很少被测试,尽管这对于支持多样性和多文化社会至关重要。现有的评估往往依赖于仅包含每张图像一个单一文化概念的基准测试,忽视了多个可能无关的文化线索共存的场景。为解决这一缺口,我们引入了ConfusedTourist,这是一种新的文化对抗鲁棒性套件,旨在评估VLMs在受到地理线索干扰时的稳定性。我们的实验揭示了一个关键的脆弱性,即在简单的图像堆叠干扰下准确率大幅下降,甚至在基于图像生成的变体中进一步恶化。可解释性分析进一步表明,这些失败源于系统性地将注意力转移到分散注意力的线索上,使模型偏离其预期的焦点。这些发现突显了一个关键挑战:视觉文化概念混杂可以显著损害最先进的VLMs,强调了对更具有文化鲁棒性的跨模态理解的迫切需求。
Summary / 总结
The research evaluates the stability of Vision-Language Models (VLMs) across diverse cultural inputs, addressing a gap in existing evaluations, which typically feature a single cultural concept per image. The study introduces ConfusedTourist, a cultural adversarial robustness suite that assesses VLMs' stability against perturbed geographical cues. Accuracy drops heavily under simple image-stacking perturbations and worsens further under the image-generation-based variant; interpretability analyses attribute these failures to systematic attention shifts toward distracting cues. The findings show that mixing visual cultural concepts can substantially impair even state-of-the-art VLMs, underscoring the need for more culturally robust multimodal understanding.
研究旨在评估视觉语言模型(VLMs)在面对多样文化输入时的稳定性,填补现有评估中仅测试单一文化概念的空白。研究引入了ConfusedTourist,一个新套件来评估模型在受到地理线索扰动时的鲁棒性。关键发现表明,模型在简单图像堆叠扰动下显著降低准确性,甚至在基于图像生成的变体中更差,原因是系统性地将注意力转移到了分散注意力的线索上。这突显了处理混合文化概念对VLMs构成的重大挑战,强调了需要更具有文化鲁棒性的多模态理解的迫切性。
Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models
Authors: Ruofan Wang, Xin Wang, Yang Yao, Xuan Tong, Xingjun Ma
First: 2025-08-03T12:51:47+00:00 · Latest: 2025-12-23T05:52:33+00:00
Abstract
Fine-tuning open-source Vision-Language Models (VLMs) creates a critical yet underexplored attack surface: vulnerabilities in the base VLM could be retained in fine-tuned variants, rendering them susceptible to transferable jailbreak attacks. To demonstrate this risk, we introduce the Simulated Ensemble Attack (SEA), a novel grey-box jailbreak method in which the adversary has full access to the base VLM but no knowledge of the fine-tuned target's weights or training configuration. To improve jailbreak transferability across fine-tuned VLMs, SEA combines two key techniques: Fine-tuning Trajectory Simulation (FTS) and Targeted Prompt Guidance (TPG). FTS generates transferable adversarial images by simulating the vision encoder's parameter shifts, while TPG is a textual strategy that steers the language decoder toward adversarially optimized outputs. Experiments on the Qwen2-VL family (2B and 7B) demonstrate that SEA achieves high transfer attack success rates exceeding 86.5% and toxicity rates near 49.5% across diverse fine-tuned variants, even those specifically fine-tuned to improve safety behaviors. Notably, while direct PGD-based image jailbreaks rarely transfer across fine-tuned VLMs, SEA reliably exploits inherited vulnerabilities from the base model, significantly enhancing transferability. These findings highlight an urgent need to safeguard fine-tuned proprietary VLMs against transferable vulnerabilities inherited from open-source foundations, motivating the development of holistic defenses across the entire model lifecycle.
中文标题/摘要
标题:模拟集成攻击:跨微调视觉语言模型转移破解
对开源视觉语言模型(VLMs)进行微调创建了一个关键但尚未充分探索的攻击面:基础VLM中的漏洞可能保留在微调变体中,使其容易受到转移性破解攻击。为了展示这种风险,我们引入了模拟集成攻击(SEA),这是一种新颖的灰盒破解方法,其中攻击者可以完全访问基础VLM,但不知道微调目标的权重或训练配置。为了提高跨微调VLM的破解转移性,SEA结合了两种关键技术:微调轨迹模拟(FTS)和目标提示引导(TPG)。FTS通过模拟视觉编码器参数的变化生成可转移的对抗图像,而TPG是一种文本策略,引导语言解码器朝着对抗优化的输出方向发展。在Qwen2-VL家族(2B和7B)上的实验表明,SEA实现了超过86.5%的高转移攻击成功率和接近49.5%的毒性率,即使是在那些特别微调以提高安全行为的多种变体中也是如此。值得注意的是,虽然直接基于PGD的图像破解很少在微调VLM之间转移,但SEA可靠地利用了从基础模型继承的漏洞,显著提高了转移性。这些发现强调了迫切需要保护微调的专有VLM免受从开源基础继承的转移性漏洞的影响,从而推动了整个模型生命周期中全面防御的发展。
Summary / 总结
The research highlights the risk of transferable jailbreak attacks on fine-tuned Vision-Language Models (VLMs) by introducing the Simulated Ensemble Attack (SEA), a grey-box method that combines Fine-tuning Trajectory Simulation (FTS) and Targeted Prompt Guidance (TPG). On the Qwen2-VL family (2B and 7B), SEA achieves transfer attack success rates exceeding 86.5% and toxicity rates near 49.5% across diverse fine-tuned variants, even those fine-tuned for safety. Whereas direct PGD-based image jailbreaks rarely transfer across fine-tuned VLMs, SEA reliably exploits vulnerabilities inherited from the base model.
研究旨在通过引入Simulated Ensemble Attack (SEA)这一新颖的灰盒方法,强调细调Vision-Language Models (VLMs)面临的可转移劫持攻击风险。SEA结合了Fine-tuning Trajectory Simulation (FTS)和Targeted Prompt Guidance (TPG),分别生成可转移的对抗图像和引导语言输出。实验结果显示,SEA在各种细调VLMs中实现了高转移攻击成功率和接近49.5%的毒性率,即使那些专门细调以提高安全行为的VLMs也不例外,这表明需要全面防御继承自开源基础的漏洞。
CardAIc-Agents: A Multimodal Framework with Hierarchical Adaptation for Cardiac Care Support
Authors: Yuting Zhang, Karina V. Bunting, Asgher Champsi, Xiaoxia Wang, Wenqi Lu, Alexander Thorley, Sandeep S Hothi, Zhaowen Qiu, Baturalp Buyukates, Dipak Kotecha, Jinming Duan
First: 2025-08-18T16:17:12+00:00 · Latest: 2025-12-23T04:17:03+00:00
Abstract
Cardiovascular diseases (CVDs) remain the foremost cause of mortality worldwide, a burden worsened by a severe deficit of healthcare workers. Artificial intelligence (AI) agents have shown potential to alleviate this gap through automated detection and proactive screening, yet their clinical application remains limited by: 1) rigid sequential workflows, whereas clinical care often requires adaptive reasoning that selects specific tests and, based on their results, guides personalised next steps; 2) reliance solely on intrinsic model capabilities to perform role assignment without domain-specific tool support; 3) general and static knowledge bases without continuous learning capability; and 4) fixed unimodal or bimodal inputs and lack of on-demand visual outputs when clinicians require visual clarification. In response, a multimodal framework, CardAIc-Agents, was proposed to augment models with external tools and adaptively support diverse cardiac tasks. First, a CardiacRAG agent generated task-aware plans from updatable cardiac knowledge, while the Chief agent integrated tools to autonomously execute these plans and deliver decisions. Second, to enable adaptive and case-specific customization, a stepwise update strategy was developed to dynamically refine plans based on preceding execution results, once the task was assessed as complex. Third, a multidisciplinary discussion team was proposed which was automatically invoked to interpret challenging cases, thereby supporting further adaptation. In addition, visual review panels were provided to assist validation when clinicians raised concerns. Experiments across three datasets showed the efficiency of CardAIc-Agents compared to mainstream Vision-Language Models (VLMs) and state-of-the-art agentic systems.
中文标题/摘要
标题:CardAIc-Agents:一种具有层次适应性的多模态框架,用于心脏护理支持
心血管疾病(CVDs)仍然是全球首要的致死原因,而严重的医疗工作者短缺加剧了这一负担。人工智能(AI)代理展示了通过自动化检测和主动筛查来缓解这一差距的潜力,但其临床应用受限于:1)僵化的顺序工作流程,而临床护理往往需要适应性推理,根据特定测试的结果指导个性化的下一步;2)仅依赖模型本身的内在能力进行角色分配,而缺乏特定领域的工具支持;3)通用且静态的知识库,缺乏持续学习能力;4)固定的一模态或二模态输入,以及在临床医生需要视觉澄清时缺乏按需的视觉输出。为此,提出了一种多模态框架CardAIc-Agents,以增强模型的外部工具并适应性地支持多种心脏任务。首先,CardiacRAG代理从可更新的心脏知识中生成任务感知计划,而Chief代理则整合工具以自主执行这些计划并提供决策。其次,为实现适应性和案例特定的定制,开发了一种逐步更新策略,根据先前执行结果动态细化计划,一旦任务被评估为复杂。第三,提出了一支多学科讨论团队,自动调用以解释具有挑战性的病例,从而支持进一步的适应。此外,提供了视觉审查小组以协助验证,当临床医生提出疑虑时。在三个数据集上的实验表明,CardAIc-Agents相比主流的视觉-语言模型(VLMs)和最先进的代理系统更有效。
Summary / 总结
CardAIc-Agents is a multimodal framework designed to support cardiac care by addressing limitations in existing AI agents, such as rigid workflows and lack of adaptability. It uses a CardiacRAG agent to generate task-aware plans from updatable cardiac knowledge, and a Chief agent to execute these plans and deliver decisions. The framework also includes a stepwise update strategy for refining plans based on execution results and a multidisciplinary discussion team to interpret complex cases. Visual review panels are provided for clinician validation. Experiments demonstrated that CardAIc-Agents outperformed mainstream Vision-Language Models and state-of-the-art agentic systems in terms of efficiency across three datasets.
CardAIc-Agents 是一个多模态框架,旨在通过解决当前 AI 系统的局限性(如僵化的流程和缺乏适应性)来支持心脏护理。该框架包括一个 CardiacRAG 代理以生成任务感知的计划和一个首席代理以执行这些计划。此外,该框架还包含一个逐步更新策略,根据执行结果动态改进计划,并有一个多学科讨论团队来解释复杂的案例。视觉审查面板有助于验证决策。实验表明,CardAIc-Agents 在效率上优于主流的视觉-语言模型和最先进的代理系统。
Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting
Authors: Sangoh Lee, Sangwoo Mo, Wook-Shin Han
First: 2025-12-23T03:13:39+00:00 · Latest: 2025-12-23T03:13:39+00:00
Abstract
While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup", where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.
中文标题/摘要
标题:Bring My Cup!使用视觉注意提示个性化视觉-语言-动作模型
尽管视觉-语言-动作(VLA)模型在通用指令上表现出色,但在处理个性化命令如“bring my cup”时却遇到困难,其中机器人必须在视觉上相似的对象中执行特定实例的操作。我们研究了操作个人物品的场景,在这种场景中,VLA 必须使用少量参考图像来识别和控制训练期间未见过的用户特定对象。为了解决这一挑战,我们提出了视觉注意提示(VAP),这是一种简单而有效的无需训练的感知适配器,为冻结的VLA配备自上而下的选择性注意力。VAP 将参考图像视为非参数化的视觉记忆,通过开放式词汇检测和基于嵌入的匹配将个人对象定位在场景中,然后通过突出显示对象并重写指令将这种定位作为视觉提示注入。我们构建了两个模拟基准 Personalized-SIMPLER 和 Personalized-VLABench,以及一个真实世界的桌面基准,以评估跨多个机器人和任务的个性化操作。实验表明,VAP 在成功率和正确对象操作方面始终优于通用策略和标记学习基线,有助于弥合语义理解与实例级控制之间的差距。
Summary / 总结
The research aims to improve VLA models' ability to handle personalized commands like 'bring my cup', where the robot must identify and manipulate a specific object among similar ones. The study proposes Visual Attentive Prompting (VAP), a training-free method that uses reference images to guide the model's attention to the correct object. Experiments show that VAP enhances success rates and correct-object manipulation compared to generic policies and token-learning baselines, bridging the gap between semantic understanding and instance-level control.
研究旨在提高VLA模型处理个性化命令如'bring my cup'的能力,其中机器人需要在相似的物体中识别并操作特定的物体。研究提出了一种名为视觉注意提示(VAP)的方法,该方法使用参考图像引导模型关注正确的物体。实验表明,VAP在成功率和正确物体操作方面优于通用策略和基于标记的学习基线,填补了语义理解和实例级控制之间的差距。
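The grounding step described in this entry (embedding-based matching against a few reference images, followed by rewriting the instruction) can be sketched as follows. This is an illustrative assumption, not the authors' VAP code: the function names, the toy 2-D embeddings, the max-over-references scoring rule, and the replacement phrase "the highlighted cup" are all hypothetical. In practice the embeddings would come from a visual encoder applied to reference images and to crops returned by an open-vocabulary detector.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_personal_object(
    reference_embeddings: List[Sequence[float]],
    candidate_embeddings: List[Sequence[float]],
) -> Tuple[int, float]:
    """Return (index, score) of the candidate most similar to any reference.

    The references act as a non-parametric memory: nothing is trained,
    each candidate is simply scored by its best match in the memory.
    """
    best_index, best_score = -1, -1.0
    for index, candidate in enumerate(candidate_embeddings):
        score = max(cosine_similarity(candidate, ref) for ref in reference_embeddings)
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


def rewrite_instruction(instruction: str, personal_phrase: str, grounded_phrase: str) -> str:
    # Hypothetical rewrite rule: swap the personal reference for a phrase
    # that points at the visually highlighted object.
    return instruction.replace(personal_phrase, grounded_phrase)


# Toy embeddings (assumed values, for illustration only).
references = [[1.0, 0.0], [0.9, 0.1]]
candidates = [[0.0, 1.0], [1.0, 0.05], [0.5, 0.5]]

index, score = match_personal_object(references, candidates)
prompt = rewrite_instruction("bring my cup", "my cup", "the highlighted cup")
print(index, round(score, 3), prompt)
```

Scoring each candidate by its best reference match is one simple choice; averaging over references or applying a similarity threshold to reject all candidates would be equally plausible variants.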
VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
Authors: Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang
First: 2025-12-17T17:58:35+00:00 · Latest: 2025-12-23T03:03:58+00:00
Abstract
The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-processed information, failing to capture long associations or dependencies in the context. This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
中文标题/摘要
标题:VTCBench:视觉语言模型能否理解经视觉文本压缩后的长上下文?
与扩展LLM上下文窗口相关的计算和内存开销严重限制了其可扩展性。值得注意的解决方案是视觉文本压缩(VTC),如DeepSeek-OCR和Glyph等框架,将长文本转换为密集的二维视觉表示,从而实现3倍至20倍的标记压缩比。然而,这种高信息密度对视觉语言模型(VLMs)的核心长上下文能力的影响仍研究不足。为填补这一空白,我们首次引入了VTC基准,并系统评估了VLMs在三种长上下文理解设置中的性能:VTC-检索,评估模型检索和聚合信息的能力;VTC-推理,要求模型通过最小的词汇重叠来推断潜在关联以定位事实;VTC-记忆,衡量模型在长期对话记忆中进行综合问答的能力。此外,我们建立了VTCBench-Wild以模拟多样化的输入场景。我们在基准上全面评估了领先开源和专有模型。结果表明,尽管大多数VLMs能够很好地解码文本信息(如OCR),但在处理VTC处理的信息时,它们在长上下文理解方面表现出令人惊讶的差的能力,无法捕捉上下文中的长期关联或依赖关系。本研究为理解VTC提供了深入的见解,并为设计更高效和可扩展的VLMs奠定了基础。
Summary / 总结
This study introduces VTCBench to evaluate the long-context understanding of vision-language models (VLMs) under vision-text compression (VTC), in which long texts are rendered as dense 2D visual representations. The benchmark covers three settings: VTC-Retrieval (retrieving and aggregating information), VTC-Reasoning (inferring latent associations to locate facts with minimal lexical overlap), and VTC-Memory (question answering within long-term dialogue memory), plus VTCBench-Wild for diverse input scenarios. The results show that most VLMs decode the rendered text well yet understand VTC-processed long contexts poorly, failing to capture long-range associations or dependencies.
该研究引入了VTCBench,以评估使用视觉文本压缩(VTC)的视觉语言模型(VLM)的长上下文理解能力。基准包括三个任务:VTC-Retrieval、VTC-Reasoning和VTC-Memory,分别评估模型在压缩视觉和文本信息基础上检索、推理和回答问题的能力。结果表明,大多数VLM在使用VTC处理的信息时,难以理解长上下文,这表明需要改进模型以有效处理高信息密度。
FiGO: Fine-Grained Object Counting without Annotations
Authors: Adriano D'Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh
First: 2025-04-16T02:05:47+00:00 · Latest: 2025-12-23T01:57:40+00:00
Comments: data - https://dalessandro.dev/datasets/lookalikes/
Abstract
Class-agnostic counting (CAC) methods reduce annotation costs by letting users define what to count at test-time through text or visual exemplars. However, current open-vocabulary approaches work well for broad categories but fail when fine-grained category distinctions are needed, such as telling apart waterfowl species or pepper cultivars. We present FiGO, a new annotation-free method that adapts existing counting models to fine-grained categories using only the category name. Our approach uses a text-to-image diffusion model to create synthetic examples and a joint positive/hard-negative loss to learn a compact concept embedding that conditions a specialization module to convert outputs from any frozen counter into accurate, fine-grained estimates. To evaluate fine-grained counting, we introduce LOOKALIKES, a dataset of 37 subcategories across 14 parent categories with many visually similar objects per image. Our method substantially outperforms strong open-vocabulary baselines, moving counting systems from "count all the peppers" to "count only the habaneros."
中文标题/摘要
标题:FiGO:无需注释的细粒度对象计数
类别无关计数(CAC)方法通过让用户在测试时通过文本或视觉示例定义要计数的内容来减少注释成本。然而,当前的开放式词汇方法在处理广泛的类别时效果良好,但在需要细粒度类别区分的情况下会失效,例如区分水禽种类或辣椒品种。我们提出了FiGO,这是一种新的无需注释的方法,仅使用类别名称即可将现有的计数模型适应到细粒度类别。我们的方法使用文本到图像的扩散模型生成合成示例,并使用联合正样本/困难负样本损失来学习一个紧凑的概念嵌入,该嵌入条件化一个专门化模块,将任何冻结计数器的输出转换为准确的细粒度估计。为了评估细粒度计数,我们引入了LOOKALIKES数据集,该数据集包含14个父类别下的37个子类别,每个图像中有许多视觉上相似的对象。我们的方法显著优于强大的开放式词汇基线,使计数系统从“计数所有的辣椒”转变为“仅计数哈瓦那辣椒(habaneros)”。
Summary / 总结
The research aims to address the limitations of current class-agnostic counting methods in handling fine-grained categories. The method, FiGO, uses a text-to-image diffusion model to generate synthetic examples and a joint loss function to learn a compact concept embedding, which conditions a specialization module to produce accurate fine-grained counts. The method significantly outperforms existing open-vocabulary baselines, demonstrating its capability to count specific subcategories like 'habaneros' among 'peppers' in images. The LOOKALIKES dataset, consisting of 37 subcategories across 14 parent categories, was used to evaluate the method's performance in fine-grained counting tasks.
研究旨在解决当前类别无关计数方法在处理细粒度类别时的局限性。方法FiGO使用文本到图像的扩散模型生成合成样本,并使用联合损失函数学习紧凑的概念嵌入,该嵌入条件化一个专业化模块以生成准确的细粒度计数。该方法显著优于现有的开放词汇基线,展示了其在图像中区分特定子类别(如辣椒中的哈瓦那辣椒)的能力。LOOKALIKES数据集包含14个父类别中的37个子类别,用于评估方法在细粒度计数任务中的性能。
SE360: Semantic Edit in 360° Panoramas via Hierarchical Data Construction
Authors: Haoyi Zhong, Fang-Lue Zhang, Andrew Chalmers, Taehyun Rhee
First: 2025-12-23T00:24:46+00:00 · Latest: 2025-12-23T00:24:46+00:00
Abstract
While instruction-based image editing is emerging, extending it to 360° panoramas introduces additional challenges. Existing methods often produce implausible results in both equirectangular projections (ERP) and perspective views. To address these limitations, we propose SE360, a novel framework for multi-condition guided object editing in 360° panoramas. At its core is a novel coarse-to-fine autonomous data generation pipeline without manual intervention. This pipeline leverages a Vision-Language Model (VLM) and adaptive projection adjustment for hierarchical analysis, ensuring the holistic segmentation of objects and their physical context. The resulting data pairs are both semantically meaningful and geometrically consistent, even when sourced from unlabeled panoramas. Furthermore, we introduce a cost-effective, two-stage data refinement strategy to improve data realism and mitigate model overfitting to erase artifacts. Based on the constructed dataset, we train a Transformer-based diffusion model to allow flexible object editing guided by text, mask, or reference image in 360° panoramas. Our experiments demonstrate that our method outperforms existing methods in both visual quality and semantic accuracy.
中文标题/摘要
标题:SE360:通过分层数据构建在360°全景图中的语义编辑
尽管基于指令的图像编辑正在兴起,将其扩展到360°全景图带来了额外的挑战。现有方法在等距柱状投影(ERP)和透视视图中经常产生不合理的结果。为了解决这些限制,我们提出了SE360,一种新颖的框架,用于在360°全景图中进行多条件引导的对象编辑。其核心是一个无需人工干预的、由粗到细的新颖自主数据生成管道。该管道利用视觉语言模型(VLM)和自适应投影调整进行分层分析,确保对象及其物理上下文的整体分割。生成的数据对既具有语义意义又具有几何一致性,即使这些数据来自未标记的全景图。此外,我们引入了一种成本效益高的两阶段数据精炼策略,以提高数据的真实性并缓解模型对擦除伪影的过拟合。基于构建的数据集,我们训练了一个基于Transformer的扩散模型,以允许在360°全景图中根据文本、掩码或参考图像进行灵活的对象编辑。我们的实验表明,我们的方法在视觉质量和语义准确性方面均优于现有方法。
Summary / 总结
The research aims to address the challenges of instruction-based image editing in 360° panoramas by proposing SE360, a novel framework that uses a coarse-to-fine autonomous data generation pipeline. This pipeline, which incorporates a Vision-Language Model and adaptive projection adjustment, ensures holistic segmentation and geometric consistency. The method introduces a two-stage data refinement strategy to enhance realism and reduce overfitting. Experiments show that SE360 outperforms existing methods in visual quality and semantic accuracy.
研究旨在通过提出SE360框架解决360°全景图指令式图像编辑的挑战。该框架采用由粗到细的自主数据生成管道,结合视觉语言模型和自适应投影调整,确保整体分割和几何一致性。方法引入了两阶段数据精炼策略以增强真实性并减少过拟合。实验表明,SE360在视觉质量和语义准确性方面优于现有方法。
Weakly Supervised Ephemeral Gully Detection In Remote Sensing Images Using Vision Language Models
Authors: Seyed Mohamad Ali Tousi, Ramy Farag, John A. Lory, G. N. DeSouza
First: 2025-11-17T20:29:44+00:00 · Latest: 2025-12-22T21:37:32+00:00
Abstract
Among soil erosion problems, Ephemeral Gullies are one of the most concerning phenomena occurring in agricultural fields. Their short temporal cycles increase the difficulty in automatically detecting them using classical computer vision approaches and remote sensing. Also, due to scarcity of and the difficulty in producing accurate labeled data, automatic detection of ephemeral gullies using Machine Learning is limited to zero-shot approaches which are hard to implement. To overcome these challenges, we present the first weakly supervised pipeline for detection of ephemeral gullies. Our method relies on remote sensing and uses Vision Language Models (VLMs) to drastically reduce the labor-intensive task of manual labeling. In order to achieve that, the method exploits: 1) the knowledge embedded in the VLM's pretraining; 2) a teacher-student model where the teacher learns from noisy labels coming from the VLMs, and the student learns by weak supervision using teacher-generated labels and a noise-aware loss function. We also make available the first-of-its-kind dataset for semi-supervised detection of ephemeral gullies from remote-sensed images. The dataset consists of a number of locations labeled by a group of soil and plant scientists, as well as a large number of unlabeled locations. The dataset represents more than 18,000 high-resolution remote-sensing images obtained over the course of 13 years. Our experimental results demonstrate the validity of our approach by showing superior performance compared to VLMs and the label model itself when using weak supervision to train a student model. The code and dataset for this work are made publicly available.
中文标题/摘要
标题:使用视觉语言模型在遥感图像中弱监督检测临时冲沟
在土壤侵蚀问题中,临时冲沟是农业田地中最令人关注的现象之一。它们短暂的时间周期增加了使用经典计算机视觉方法和遥感自动检测它们的难度。由于缺乏准确标注数据以及生成准确标注数据的困难,使用机器学习自动检测临时冲沟受到限制,仅限于零样本方法,这些方法难以实现。为克服这些挑战,我们提出了第一个用于检测临时冲沟的弱监督管道。该方法依赖于遥感,并利用视觉语言模型(VLMs)大大减少了手动标注的劳动密集型任务。为了实现这一点,该方法利用了:1)VLMs预训练中嵌入的知识;2)教师-学生模型,其中教师从VLMs产生的嘈杂标签中学习,学生通过弱监督使用教师生成的标签和噪声感知损失函数学习。我们还提供了首个用于半监督检测遥感图像中临时冲沟的数据集。该数据集由一群土壤和植物科学家标注了多个位置,以及大量未标注的位置。数据集包含超过18,000张高分辨率遥感图像,跨越13年。我们的实验结果通过显示在使用弱监督训练学生模型时,我们的方法优于VLMs和标签模型本身,证明了我们方法的有效性。该工作的代码和数据集已公开提供。
Summary / 总结
This paper addresses the challenge of detecting ephemeral gullies in agricultural fields using weak supervision and Vision Language Models (VLMs). The method leverages the pretraining knowledge of VLMs and a teacher-student framework to reduce the need for manual labeling. Experimental results show that the proposed approach outperforms VLMs and the label model itself when using weak supervision to train a student model, demonstrating the effectiveness of the method in semi-supervised detection of ephemeral gullies from remote-sensed images.
本文提出了一种使用弱监督和视觉语言模型(VLMs)检测农业田地中的临时冲沟的方法。该方法利用VLMs的预训练知识和教师-学生框架来减少手动标注的需求。实验结果表明,该方法在使用弱监督训练学生模型时,优于VLMs和标签模型本身,证明了该方法在半监督检测遥感图像中的临时冲沟方面的有效性。
Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis
Authors: Argha Kamal Samanta, Harshika Goyal, Vasudha Joshi, Tushar Mungle, Pabitra Mitra
First: 2025-12-22T18:41:45+00:00 · Latest: 2025-12-22T18:41:45+00:00
Comments: 14 pages, 14 figures
Abstract
Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP's 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.
中文标题/摘要
标题:超越CLIP:知识增强的多模态变换器在糖尿病视网膜病变诊断中的跨模态对齐
糖尿病视网膜病变(DR)是全球可预防失明的主要原因,需要准确的自动化诊断系统。虽然通用领域的视觉-语言模型如对比语言-图像预训练(CLIP)在自然图像任务上表现良好,但在医学领域的应用中却遇到困难,特别是在眼科图像的跨模态检索方面。我们提出了一种新颖的知识增强联合嵌入框架,通过多模态变换器架构将视网膜眼底图像、临床文本和结构化患者数据结合起来,以解决医学图像-文本对齐的关键差距。我们的方法为每种模态使用单独的编码器:用于视网膜图像的视觉变换器(ViT-B/16),用于临床叙述的Bio-ClinicalBERT,以及用于结构化人口统计和临床特征的多层感知器。这些模态通过具有模态特定嵌入的联合变换器融合,使用包括模态对之间的对比损失、图像和文本的重构损失以及根据ICDR和SDRG方案的DR严重程度分类损失的多个目标进行训练。在巴西多标签眼科数据集(BRSET)上的实验结果表明,与基线模型相比有显著改进。我们的框架在文本到图像检索性能上达到近乎完美的99.94%的召回率@1,而微调后的CLIP仅为1.29%,同时保持了SDRG的97.05%和ICDR的97.97%的最先进的分类准确性。此外,对未见过的DeepEyeNet数据集的零样本评估验证了强大的泛化能力,召回率@1为93.95%,而微调后的CLIP仅为0.22%。这些结果表明,我们的多模态训练方法有效地捕捉了医学领域的跨模态关系,建立了卓越的检索能力和稳健的诊断性能。
Summary / 总结
The research aims to improve automated diagnostic systems for diabetic retinopathy (DR) by addressing the limitations of general-domain vision-language models in medical applications. The proposed method uses a knowledge-enhanced joint embedding framework with a multimodal transformer architecture, incorporating retinal images, clinical text, and structured patient data. The framework achieves significant improvements over baseline models, with near-perfect text-to-image retrieval performance and state-of-the-art classification accuracy for DR severity grading. Zero-shot evaluation on an unseen dataset further validates its generalizability.
本文提出了一种知识增强的多模态嵌入框架,使用多模态变换器架构来解决糖尿病视网膜病变(DR)的准确自动化诊断问题。该框架整合了视网膜图像、临床文本和结构化患者数据。它为每个模态使用单独的编码器,并通过联合变换器进行融合,以多种目标进行训练。在BRSET数据集上的实验显示,该模型在文本到图像检索中的召回率@1为99.94%,同时保持了最先进的分类准确性。在DeepEyeNet数据集上的零样本评估进一步验证了模型的泛化能力。
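The Recall@1 figures quoted in this entry measure how often the correct image is the top-ranked result for a text query. A minimal sketch of that metric follows, assuming one ground-truth image per query and a precomputed similarity matrix. The function name `recall_at_k` and the toy scores are illustrative and are not taken from the paper.

```python
def recall_at_k(similarity, k=1):
    """Fraction of text queries whose matching image ranks in the top k.

    similarity[i][j] is the score between text query i and image j.
    Assumption for this sketch: the ground-truth image for query i is image i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Image indices sorted by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 score matrix: queries 0 and 1 retrieve their own image first,
# while query 2 ranks image 0 above its own image.
scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.6],
]
print(recall_at_k(scores, k=1))
print(recall_at_k(scores, k=2))
```

With real embeddings, `similarity` would typically hold cosine similarities between text and image vectors in the joint space.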
AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models
Authors: Heng Zhang, Haichuan Hu, Yaomin Shen, Weihao Yu, Yilei Yuan, Haochen You, Guo Cheng, Zijian Zhang, Lubin Gan, Huihui Wei, Hao Zhang, Jin Huang
First: 2025-09-16T06:16:05+00:00 · Latest: 2025-12-22T18:22:20+00:00
Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results
Abstract
Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. However, existing Mixture of Experts (MoE) approaches face challenges due to the asymmetry between visual and linguistic processing. Visual information is spatially complete, while language requires maintaining sequential context. As a result, MoE models struggle to balance modality-specific features and cross-modal interactions. Through systematic analysis, we observe that language experts in deeper layers progressively lose contextual grounding and rely more on parametric knowledge rather than utilizing the provided visual and linguistic information. To address this, we propose AsyMoE, a novel architecture that models this asymmetry using three specialized expert groups. We design intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to suppress parametric biases and maintain contextual grounding. Extensive experiments demonstrate that AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.
中文标题/摘要
标题:AsyMoE:利用模态不对称性增强大型视觉-语言模型中的专家专业化
大型视觉-语言模型(LVLMs)通过扩展架构和大量训练,在多模态任务中表现出色。然而,现有的混合专家(MoE)方法由于视觉处理和语言处理之间的不对称性而面临挑战。视觉信息在空间上是完整的,而语言需要保持顺序上下文。因此,MoE模型难以平衡模态特定特征和跨模态交互。通过系统分析,我们观察到,深层的语言专家逐渐失去上下文依据,并更多依赖参数知识,而不是利用提供的视觉和语言信息。为了解决这个问题,我们提出了一种新的AsyMoE架构,该架构使用三个专门的专家组来建模这种不对称性。我们设计了模态内专家进行模态特定处理,双曲跨模态专家进行分层跨模态交互,并设计了证据优先的语言专家以抑制参数偏差并保持上下文依据。广泛的实验表明,与vanilla MoE和模态特定MoE相比,AsyMoE分别实现了26.58%和15.45%的准确率提升,且激活的参数比密集模型少25.45%。
Summary / 总结
The research aims to improve the performance of large vision-language models by addressing the asymmetry between visual and linguistic processing in existing Mixture of Experts (MoE) approaches. AsyMoE, a novel architecture, is proposed to model this asymmetry using three specialized expert groups: intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to suppress parametric biases and maintain contextual grounding. Despite the promising results showing 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, the study was withdrawn due to a fundamental error in the methodology that affects the validity of the main results.
研究旨在通过解决现有Mixture of Experts (MoE)方法中视觉和语言处理之间的不对称性,来提高大型视觉-语言模型的性能。提出了AsyMoE这一新型架构,通过三个专门的专家组来建模这种不对称性:模态内专家进行模态特定处理,双曲跨模态专家进行分层跨模态交互,以及证据优先语言专家来抑制参数偏差并保持上下文依据。尽管研究显示AsyMoE相比vanilla MoE和模态特定MoE分别实现了26.58%和15.45%的准确率提升,但由于方法中的根本性错误影响了主要结果的有效性,该研究已被作者撤回。
GraphGeo: Multi-Agent Debate Framework for Visual Geo-localization with Heterogeneous Graph Neural Networks
Authors: Heng Zheng, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Hao Zhang, Wenjun Huang, Jin Huang
First: 2025-11-02T11:58:55+00:00 · Latest: 2025-12-22T18:21:18+00:00
Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results
Abstract
Visual geo-localization requires extensive geographic knowledge and sophisticated reasoning to determine image locations without GPS metadata. Traditional retrieval methods are constrained by database coverage and quality. Recent Large Vision-Language Models (LVLMs) enable direct location reasoning from image content, yet individual models struggle with diverse geographic regions and complex scenes. Existing multi-agent systems improve performance through model collaboration but treat all agent interactions uniformly. They lack mechanisms to handle conflicting predictions effectively. We propose GraphGeo, a multi-agent debate framework using heterogeneous graph neural networks for visual geo-localization. Our approach models diverse debate relationships through typed edges, distinguishing supportive collaboration, competitive argumentation, and knowledge transfer. We introduce a dual-level debate mechanism combining node-level refinement and edge-level argumentation modeling. A cross-level topology refinement strategy enables co-evolution between graph structure and agent representations. Experiments on multiple benchmarks demonstrate GraphGeo significantly outperforms state-of-the-art methods. Our framework transforms cognitive conflicts between agents into enhanced geo-localization accuracy through structured debate.
中文标题/摘要
标题:GraphGeo:基于异构图神经网络的多智能体辩论视觉地理定位框架
视觉地理定位需要广泛的地理知识和复杂的推理来确定图像位置,而不依赖GPS元数据。传统的检索方法受到数据库覆盖范围和质量的限制。最近的大规模视觉-语言模型(LVLMs)能够直接从图像内容进行位置推理,但单个模型在处理多样化的地理区域和复杂的场景时存在困难。现有的多智能体系统通过模型协作提高了性能,但对所有智能体交互一视同仁。它们缺乏有效处理相互矛盾预测的机制。我们提出 GraphGeo,一种使用异构图神经网络的多智能体辩论框架,用于视觉地理定位。我们的方法通过类型化的边建模多样的辩论关系,区分支持性合作、竞争性论辩和知识转移。我们引入了一种节点级细化与边级论辩建模相结合的双层辩论机制。跨层拓扑细化策略使图结构和智能体表示能够共同进化。在多个基准上的实验表明,GraphGeo 显著优于现有最佳方法。我们的框架通过结构化的辩论将智能体之间的认知冲突转化为更高的地理定位准确性。
Summary / 总结
The research aims to improve visual geo-localization by addressing the limitations of traditional methods and recent large vision-language models. GraphGeo proposes a multi-agent debate framework using heterogeneous graph neural networks to model diverse debate relationships and enhance reasoning. The dual-level debate mechanism and cross-level topology refinement strategy are key components. However, the submission was withdrawn due to a fundamental error in the methodology that affects the validity of the main results.
研究旨在通过解决传统方法和近期大型视觉-语言模型的局限性,提高视觉地理定位的准确性。GraphGeo 提出了一种使用异构图神经网络的多代理辩论框架,以建模多样化的辩论关系并增强推理能力。双层辩论机制和跨层拓扑结构优化策略是关键组成部分。然而,由于方法论中的根本错误,该提交已被作者撤回,这影响了主要结果的有效性。
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion
Authors: Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez
First: 2025-12-22T16:21:39+00:00 · Latest: 2025-12-22T16:21:39+00:00
Abstract
Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa .
中文标题/摘要
标题:CASA:通过自注意力实现的跨注意力高效视觉-语言融合
视觉-语言模型(VLMs)通常通过将预训练视觉编码器中的图像标记插入语言模型的文字流中来进行训练。这使得文本和图像信息能够在模型内部完全相互注意,但对高分辨率图像、长对话或流式视频来说,这在内存和计算上都变得极其昂贵。利用跨注意力的VLMs是标记插入的高效替代方案,但在涉及精细视觉细节的任务上表现出明显的性能差距。我们发现,提高此类模型的关键在于在专门的跨注意力层中也启用局部文本到文本的交互。在此基础上,我们提出了CASA(Cross-Attention via Self-Attention),一种简单而高效的范式,它在常见的图像理解基准测试中显著减少了与完整标记插入的差距,同时在长上下文多模态任务如流式视频字幕生成中保持与跨注意力模型相同的可扩展性。有关示例和代码,请参见我们的项目页面https://kyutai.org/casa 。
Summary / 总结
The research aims to improve the efficiency of vision-language models (VLMs) by addressing the memory and compute cost of token insertion for high-resolution images, long conversations, and streaming video. CASA, Cross-Attention via Self-Attention, enables local text-to-text interaction within the dedicated cross-attention layers. Experiments show that CASA substantially narrows the gap to full token insertion on common image understanding benchmarks while retaining the scalability of cross-attention models on long-context tasks such as streaming video captioning.
研究旨在通过解决高分辨率图像、长对话和流式视频中由token插入导致的内存与计算成本问题,提高视觉语言模型(VLMs)的效率。提出了CASA(Cross-Attention via Self-Attention),在专门的跨注意力层中实现局部文本到文本的交互。实验表明,CASA在常见的图像理解基准测试中显著缩小了与完整token插入之间的性能差距,并且在长上下文任务如流式视频字幕生成中保持与跨注意力模型相同的可扩展性。
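One possible reading of "local text-to-text interaction in the dedicated cross-attention layers" is sketched below: each text query attends over all image tokens plus a small causal window of nearby text tokens. This is an illustrative single-head toy without learned projections. The windowing rule, the function name `cross_attention_with_local_text`, and the toy vectors are all assumptions, not the CASA implementation.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention_with_local_text(text, image, window):
    """Text queries attend to all image tokens plus nearby text tokens.

    Text token i sees every image token and text tokens j with
    i - window < j <= i (a causal local window, assumed for this sketch).
    Query/key/value projections are omitted for brevity.
    """
    d = len(text[0])
    outputs, weights = [], []
    for i, q in enumerate(text):
        lo = max(0, i - window + 1)
        keys = image + text[lo:i + 1]  # image tokens first, then local text
        logits = [dot(q, k) / math.sqrt(d) for k in keys]
        w = softmax(logits)
        # Values equal keys here because projections are omitted.
        out = [sum(w[n] * keys[n][c] for n in range(len(keys))) for c in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
outs, attn = cross_attention_with_local_text(text_tokens, image_tokens, window=2)
print([len(w) for w in attn])  # number of keys visible to each text token
```

Because each query sees only the image tokens and a bounded text window, the number of keys per query stays constant as the conversation grows, which is the scalability property the abstract attributes to cross-attention models.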
QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
Authors: Li Puyin, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-fei, Ehsan Adeli
First: 2025-12-22T16:18:00+00:00 · Latest: 2025-12-22T16:18:00+00:00
Abstract
Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason physical properties quantitatively. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors like background noise, counterfactual priors, and strategic prompting and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs as references when reasoning kinematic properties quantitatively. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.
中文标题/摘要
标题:QuantiPhy:评估视觉-语言模型物理推理能力的定量基准
理解物理世界对于通用人工智能代理至关重要。然而,尚不清楚最先进的视觉感知模型(例如大型VLM)是否能够进行定量的物理属性推理。现有的评估主要基于VQA且为定性的,提供的关于这些模型能否从视频观察中推断出移动物体的运动学量的见解有限。为了解决这一问题,我们提出了QuantiPhy,这是第一个用于定量测量VLM物理推理能力的基准。QuantiPhy包含超过3300个视频-文本实例,具有数值真实值,评估VLM在给定时间戳时估计物体大小、速度和加速度的表现,并以其中一个属性作为输入先验。基准标准化了提示和评分,以评估数值准确性,从而实现模型之间的公平比较。我们在最先进的VLM上的实验揭示了它们的定性合理性与实际数值正确性之间的一致差距。我们进一步深入分析了关键因素,如背景噪声、反事实先验和策略性提示,发现最先进的VLM在进行定量运动学属性推理时,严重依赖预训练的世界知识,而不是忠实使用提供的视觉和文本输入作为参考。QuantiPhy提供了第一个严格的、可扩展的测试平台,推动VLM超越单纯的口头合理性,向基于数值的物理理解迈进。
Summary / 总结
QuantiPhy is a benchmark designed to evaluate the quantitative physical reasoning abilities of vision-language models. It includes over 3,300 video-text instances with numerical ground truth to assess models' performance in estimating object size, velocity, and acceleration. Experiments show a gap between models' qualitative plausibility and numerical accuracy, indicating that models rely more on pre-trained knowledge than on visual and textual inputs when reasoning about kinematic properties.
QuantiPhy 是一个基准,用于评估视觉-语言模型的定量物理推理能力。它包含超过 3,300 个带有数值 ground truth 的视频-文本实例,以评估模型在估计物体的大小、速度和加速度方面的表现。实验表明,模型的定性合理性与数值准确性之间存在差距,表明它们在进行定量推理时依赖于预训练的知识,而不是视觉和文本输入。
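The idea of "using one of these properties as an input prior" can be illustrated with a small worked example: a known object size fixes the pixel-to-meter scale, which turns pixel displacement into a metric velocity. The sketch assumes motion parallel to the image plane at constant depth. The helper names, the toy track, and the relative-error score are illustrative assumptions, not the benchmark's actual protocol.

```python
def meters_per_pixel(real_size_m, pixel_size):
    """Scale factor derived from a known object size (the input prior)."""
    return real_size_m / pixel_size


def velocity_from_track(x_pixels, fps, scale):
    """Per-frame speed in m/s via finite differences of pixel positions."""
    dt = 1.0 / fps
    return [(b - a) * scale / dt for a, b in zip(x_pixels, x_pixels[1:])]


def relative_error(pred, truth):
    """Hypothetical score: absolute error normalised by the ground truth."""
    return abs(pred - truth) / abs(truth)


# A ball known to be 0.22 m wide spans 44 pixels in the frame.
scale = meters_per_pixel(real_size_m=0.22, pixel_size=44)
# Horizontal pixel position of the ball in four consecutive frames at 30 fps.
track = [100, 112, 124, 136]
speeds = velocity_from_track(track, fps=30, scale=scale)
print(speeds)
print(relative_error(pred=2.0, truth=speeds[0]))
```

Acceleration could be estimated the same way by differencing `speeds` once more. In this toy track it is zero because the displacement per frame is constant.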
VERDI: VLM-Embedded Reasoning for Autonomous Driving
Authors: Bowen Feng, Zhiting Mei, Baiang Li, Julian Ost, Filippo Ghilotti, Roger Girgis, Anirudha Majumdar, Felix Heide
First: 2025-05-21T18:24:36+00:00 · Latest: 2025-12-22T15:37:49+00:00
Abstract
While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (a 70B parameter VLM inference at merely 8 tokens per second requires more than 160G of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous Driving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning, without incurring the inference-time costs of large VLMs. We validate VERDI in both open-loop (NuScenes and Bench2Drive benchmarks) and closed-loop (HugSim Simulator) settings. We find that VERDI outperforms existing e2e methods that do not embed reasoning by up to 11% in $\ell_{2}$ distance and 11% in driving performance, while maintaining real-time inference speed.
中文标题/摘要
标题:VERDI: VLM嵌入式自主驾驶推理
在面对部分可观测性和现实复杂性带来的决策难题时,自主驾驶(AD)系统往往难以做出最优决策,而人类驾驶员则能够利用常识推理在信息有限的情况下做出近乎最优的决策。近期的研究尝试利用微调后的视觉-语言模型(VLMs)在推理阶段进行轨迹规划,以模仿人类的行为。尽管这些方法在基准测试中表现出色,但它们在部署时往往不切实际(一个700亿参数的VLM仅以每秒8个词元的速度推理,就需要超过160G的内存),而且其单体式网络结构也妨碍了安全性分解。为解决这一问题,我们提出了VLM嵌入式自主驾驶推理(VERDI),这是一种训练时框架,将VLM的推理过程和常识知识蒸馏到AD系统中。VERDI将感知、预测和规划阶段的中间模块输出与VLM生成的、解释驾驶推理过程的文本特征对齐,以此增强模块化可微端到端(e2e)AD模型。通过鼓励潜在空间中的对齐,VERDI使模块化AD系统能够内化结构化推理,而不必承担大型VLM的推理时成本。我们在开环(NuScenes和Bench2Drive基准)和闭环(HugSim模拟器)环境中验证了VERDI。我们发现,与不嵌入推理的现有端到端方法相比,VERDI在$\ell_{2}$距离上最多提高11%,在驾驶性能上最多提高11%,同时保持了实时推理速度。
Summary / 总结
VERDI is a training-time framework that embeds commonsense reasoning from Vision-Language Models into autonomous driving systems to improve decision-making under partial observability. It aligns intermediate outputs of perception, prediction, and planning modules with text features from VLMs, enabling the AD stack to internalize structured reasoning without the high inference costs of large VLMs. VERDI outperforms existing end-to-end methods by up to 11% in $\ell_{2}$ distance and driving performance, while maintaining real-time inference speed.
VERDI 是一个训练时框架,将视觉-语言模型中的常识推理嵌入到自动驾驶系统中,以改善在部分可观测性下的决策能力。它通过将感知、预测和规划模块的中间输出与 VLM 生成的文本特征对齐,使 AD 堆栈能够内化结构化的推理,而不增加大型 VLM 的高推理成本。VERDI 在 $\ell_{2}$ 距离和驾驶性能上分别比现有端到端方法高出最多 11%,同时保持实时推理速度。
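The abstract's deployment figure (more than 160G of memory for a 70B-parameter VLM) can be sanity-checked with back-of-the-envelope arithmetic on the weights alone. The precision the authors had in mind is not stated, so the 16-bit case below is an assumption. The function name is illustrative.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


params = 70e9  # 70B parameters, as quoted in the abstract
for name, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{name}: {weight_memory_gb(params, nbytes):.0f} GB")
```

At 16-bit precision the weights alone take 140 GB. KV cache, activations, and runtime buffers come on top of that, which is consistent with a total above 160 GB.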
SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning
Authors: Xiaojun Guo, Runyu Zhou, Yifei Wang, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang
First: 2025-10-18T09:22:40+00:00 · Latest: 2025-12-22T15:14:59+00:00
Abstract
Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors, such as task difficulty, model scale, and semantic alignment with the target domain, that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.
中文标题/摘要
标题:SSL4RL:重新审视自我监督学习作为视觉-语言推理内在奖励
视觉-语言模型(VLMs)通过结合大型语言模型和视觉输入,展示了显著的能力。然而,它们往往未能充分利用视觉证据,要么依赖视觉中心任务中的语言先验,要么在推理过程中求助于文本捷径。尽管强化学习(RL)可以将模型与期望的行为对齐,但将其应用于VLMs受到了缺乏可扩展且可靠的奖励机制的阻碍。为克服这一挑战,我们提出了一种名为SSL4RL的新框架,该框架利用自我监督学习(SSL)任务作为RL基础微调的验证性奖励来源。我们的方法将SSL目标,如预测图像旋转或重建遮罩片段,重新表述为密集的自动奖励信号,从而消除了对人类偏好数据或不可靠的人工智能评估者的需要。实验表明,SSL4RL在视觉中心和视觉-语言推理基准测试中显著提高了性能。此外,通过系统性的消融实验,我们确定了影响SSL4RL任务有效性的关键因素,如任务难度、模型规模和与目标领域的语义对齐,为未来工作提供了新的设计原则。我们还通过将其应用于图学习,展示了该框架的通用性,其中它带来了显著的收益。SSL4RL建立了一种灵活且有效的范式,用于使用可验证的自我监督目标对多模态模型进行对齐。
Summary / 总结
The paper proposes SSL4RL, a framework that uses self-supervised learning (SSL) tasks as intrinsic rewards for reinforcement learning (RL) fine-tuning of vision-language models (VLMs). This approach improves performance on both vision-centric and vision-language reasoning benchmarks by providing dense, automatic reward signals without the need for human preference data. The study identifies key factors affecting the effectiveness of SSL4RL tasks and demonstrates its generality by applying it to graph learning, showing significant gains.
研究旨在通过将自我监督学习(SSL)作为强化学习(RL)的内在奖励来提升视觉语言模型的表现。方法是将SSL任务,如预测图像旋转或重建遮罩的片段,转化为密集的自动奖励信号。实验表明,这种方法在各种基准测试中显著提高了性能,并确定了影响其有效性的关键因素。该框架还被应用于图学习,取得了显著的改进。
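To make the "SSL task as verifiable reward" idea concrete, here is a minimal sketch built on the rotation-prediction task named in the abstract. An image is rotated by a random number of quarter turns, and the policy is rewarded only when it names the correct rotation. The binary reward, the function names, and the toy 2x2 "image" are assumptions for illustration. The paper's actual reward design is not specified in this digest.

```python
import random


def rotate90(grid, times):
    """Rotate a 2D list counter-clockwise by 90 degrees, `times` times."""
    for _ in range(times % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


def make_rotation_task(image, rng):
    """Return (rotated_image, label); label in {0, 1, 2, 3} counts quarter turns."""
    label = rng.randrange(4)
    return rotate90(image, label), label


def rotation_reward(predicted_label, true_label):
    """Verifiable reward: 1.0 for the correct rotation, 0.0 otherwise."""
    return 1.0 if predicted_label == true_label else 0.0


rng = random.Random(0)
image = [[1, 2], [3, 4]]
rotated, label = make_rotation_task(image, rng)
# A model's answer would be scored automatically, with no human labels:
print(rotation_reward(predicted_label=label, true_label=label))
```

The label is generated together with the task, so the reward can be checked mechanically. This is what removes the need for preference data or a learned judge.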
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
Authors: Runze Li, Yuwen Zhai, Bo Xu, LiWu Xu, Nian Shi, Wei Zhang, Ran Lin, Liang Wang
First: 2025-12-22T13:42:18+00:00 · Latest: 2025-12-22T13:42:18+00:00
Abstract
Contemporary GUI agents, while increasingly capable due to advances in Large Vision-Language Models (VLMs), often operate with a critical limitation: they treat each task in isolation, lacking a mechanism to systematically learn from past successes. This digital "amnesia" results in sub-optimal performance, repeated errors, and poor generalization to novel challenges. To bridge this gap, we introduce EchoTrail-GUI, a novel framework designed to mimic human-like experiential learning by equipping agents with a dynamic, accessible memory. Our framework operates in three distinct stages. First, during Experience Exploration, an agent autonomously interacts with GUI environments to build a curated database of successful task trajectories, validated by a reward model. Crucially, the entire knowledge base construction is thus fully automated, requiring no human supervision. Second, in the Memory Injection stage, upon receiving a new task, our system efficiently retrieves the most relevant past trajectories to serve as actionable "memories". Finally, during GUI Task Inference, these memories are injected as in-context guidance to inform the agent's reasoning and decision-making process. We demonstrate the efficacy of our approach on benchmarks including Android World and AndroidLab. The results show that EchoTrail-GUI significantly improves the task success rate and operational efficiency of baseline agents, validating the power of structured memory in creating more robust and intelligent GUI automation.
中文标题/摘要
标题:EchoTrail-GUI:通过评判器引导的自我探索为GUI智能体构建可操作记忆
当代GUI代理虽然由于大型视觉-语言模型(VLMs)的进步而变得越来越强大,但它们通常以一个关键限制为代价:它们将每个任务视为独立的,缺乏系统地从过去成功中学习的机制。这种数字“健忘症”导致了次优性能、重复错误和对新挑战的不良泛化。为了弥合这一差距,我们提出了EchoTrail-GUI,这是一种新型框架,旨在通过为代理提供动态且易于访问的记忆来模拟人类经验学习。我们的框架分为三个阶段。首先,在经验探索阶段,代理自主与GUI环境交互,构建由奖励模型验证的成功任务轨迹数据库。重要的是,整个知识库构建过程完全自动化,无需人类监督。其次,在记忆注入阶段,当收到新任务时,我们的系统高效地检索最相关的过去轨迹,作为可操作的“记忆”。最后,在GUI任务推理阶段,这些记忆作为上下文指导注入,以指导代理的推理和决策过程。我们在Android World和AndroidLab等基准测试上展示了我们方法的有效性。结果表明,EchoTrail-GUI 显著提高了基线代理的任务成功率和操作效率,验证了结构化记忆在创建更强大和智能的GUI自动化方面的力量。
Summary / 总结
EchoTrail-GUI is a framework that addresses digital amnesia in GUI agents by giving them a dynamic memory. It works in three stages: Experience Exploration, where an agent autonomously interacts with GUI environments to build a database of successful task trajectories validated by a reward model; Memory Injection, where the most relevant past trajectories are retrieved for a new task; and GUI Task Inference, where these memories are injected as in-context guidance for the agent's reasoning and decisions. Experiments on Android World and AndroidLab show that EchoTrail-GUI improves the task success rate and operational efficiency of baseline agents, highlighting the value of structured memory in GUI automation.
EchoTrail-GUI通过引入一种动态记忆系统来解决GUI代理的数字健忘问题,使代理能够从过去的成功中学习。该框架包括三个阶段:经验探索、记忆注入和GUI任务推理。在经验探索阶段,代理自主构建由奖励模型验证的成功任务轨迹数据库。在记忆注入阶段,相关过去的轨迹被检索以作为新任务的指导。在GUI任务推理阶段,这些记忆作为上下文指导来指导代理的推理和决策。实验结果表明,EchoTrail-GUI在Android World和AndroidLab基准测试中提高了任务成功率和操作效率,验证了结构化记忆在创建更强大和智能的GUI自动化中的作用。
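The Memory Injection and GUI Task Inference stages can be sketched as a retrieve-then-prompt loop. The abstract does not say how relevance is computed, so the bag-of-words cosine similarity below is a stand-in. The memory schema, function names, and prompt layout are likewise hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (stand-in for a learned text encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    num = sum(a[t] * b[t] for t in a if t in b)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def retrieve(memory, task, k=1):
    """Return the k stored trajectories whose task description best matches `task`."""
    query = embed(task)
    ranked = sorted(memory, key=lambda m: cosine(query, embed(m["task"])), reverse=True)
    return ranked[:k]


def build_prompt(task, memories):
    """Inject retrieved trajectories as in-context guidance ahead of the new task."""
    lines = ["Relevant past trajectories:"]
    for m in memories:
        lines.append(f"- {m['task']}: " + " -> ".join(m["steps"]))
    lines.append(f"New task: {task}")
    return "\n".join(lines)


# Hypothetical memory of successful trajectories (already validated upstream).
memory = [
    {"task": "turn on wifi in settings", "steps": ["open Settings", "tap Network", "toggle Wi-Fi"]},
    {"task": "set an alarm for 7 am", "steps": ["open Clock", "tap Alarm", "add 07:00"]},
]
top = retrieve(memory, "turn off wifi", k=1)
prompt = build_prompt("turn off wifi", top)
print(prompt)
```

In the framework described above, only trajectories that pass the reward model would enter `memory`. That filtering step is outside this sketch.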
Xiaomi MiMo-VL-Miloco Technical Report
Authors: Jiaze Li, Jingyang Chen, Yuxun Qu, Shijie Xu, Zhenru Lin, Junyou Zhu, Boshen Xu, Wenhui Tan, Pei Fu, Jianzhong Ju, Zhenbo Luo, Jian Luan
First: 2025-12-19T10:43:37+00:00 · Latest: 2025-12-22T13:27:24+00:00
Abstract
We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco to support research and deployment in real-world smart-home applications.
中文标题/摘要
标题:小米MiMo-VL-Miloco技术报告
我们开源了MiMo-VL-Miloco-7B及其量化变体MiMo-VL-Miloco-7B-GGUF,这是一个专注于家庭场景理解与通用多模态推理的家为中心的视觉-语言模型对。基于MiMo-VL-7B骨干网络,MiMo-VL-Miloco-7B专门针对智能家居环境,实现了手势识别和常见家庭场景理解的领先F1分数,并在视频基准测试(如Video-MME、Video-MMMU和Charades-STA)以及语言理解基准测试(如MMMU-Pro和MMLU-Pro)中也取得了持续的改进。在我们的实验中,MiMo-VL-Miloco-7B在家庭场景理解和多个多模态推理基准测试中均优于强大的闭源和开源基线。为了平衡专业化和通用性,我们设计了一种两阶段训练管道,结合了监督微调和基于组相对策略优化的强化学习,利用高效的多域数据。我们进一步引入了思维链监督和令牌预算感知推理,使模型能够在数据高效学习的同时高效推理。我们的分析表明,针对家庭场景的训练不仅增强了活动和手势理解,还仅以适度的文档中心任务权衡提高了文本推理能力。模型检查点、量化GGUF权重以及我们的家庭场景评估工具包可在https://github.com/XiaoMi/xiaomi-mimo-vl-miloco 公开获取,以支持在实际智能家居应用中的研究和部署。
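The reinforcement learning stage is based on Group Relative Policy Optimization, whose core idea is to score each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that group-normalised advantage follows. The function name, the epsilon term, and the use of the population standard deviation are choices made for this sketch, not details taken from the report.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses to the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations may differ
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt: two judged correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
print(adv)
```

Responses above the group mean get positive advantages and the rest get negative ones, so no separate value network is needed to form a baseline.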