arXiv 论文速递

DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Authors: Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, Rui Liu, Hongsheng Li

First: 2025-12-04T18:59:53+00:00 · Latest: 2025-12-04T18:59:53+00:00

Comments: Project Page: https://github.com/CaraJ7/DraCo

Abstract

Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.

中文标题/摘要

标题：DraCo：草图作为CoT用于文本到图像预览和稀有概念生成

近期统一的多模态大型语言模型（MLLMs）展示了令人印象深刻的性能，通过链式推理（CoT）增强了文本到图像生成能力。然而，现有方法仍然有限，要么仅将模型视为独立生成器，要么依赖抽象的文本规划。为此，我们提出了一种名为Draft-as-CoT（DraCo）的新颖交替推理范式，该范式充分利用了文本和视觉内容在CoT中的双重作用，以更好地进行规划和验证。我们的方法首先生成低分辨率的草图图像作为预览，提供更具体的视觉规划和指导。然后，我们利用模型的内在理解能力验证草图与输入提示之间潜在的语义不一致，并通过选择性修正进行超分辨率细化。这样，我们的方法解决了两个基本挑战：文本规划的粗粒度性质和生成稀有属性组合的困难。为了支持训练，我们整理了DraCo-240K，旨在增强一般修正、实例操作和布局重组的三种原子能力。借助DraCo-CFG，一种专门的交替推理无分类器引导（CFG）策略，DraCo在GenEval上取得了8%的巨大提升，在Imagine-Bench上提升了0.91，在GenEval++上提升了3%，显著优于直接生成和其他基于CoT的生成方法。

Summary / 总结

DraCo aims to improve text-to-image generation by integrating chain-of-thought (CoT) reasoning, generating a low-resolution draft image first for visual guidance, and then refining it through selective corrections. This approach addresses the limitations of coarse textual planning and the difficulty in generating rare attributes. DraCo outperforms direct generation and other CoT-empowered methods on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%).

DraCo 通过结合链式思考（CoT）推理，首先生成低分辨率的草图以提供视觉指导，然后通过选择性修正进行细化。这种方法解决了粗略的文字规划限制和生成稀有属性的困难。DraCo 在 GenEval (+8%)、Imagine-Bench (+0.91) 和 GenEval++ (+3%) 上的表现优于直接生成和其他 CoT 增强的方法。

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Authors: Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang

First: 2025-12-04T18:59:52+00:00 · Latest: 2025-12-04T18:59:52+00:00

Abstract

Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.

中文标题/摘要

标题：ARM-Thinker：通过自主工具使用和视觉推理强化多模态生成奖励模型

奖励模型对于使视觉-语言系统与人类偏好保持一致至关重要，但当前的方法存在幻觉、视觉定位弱以及无法使用工具进行验证的问题，这限制了它们在复杂多模态推理任务中的可靠性。我们提出了ARM-Thinker，这是一种自主多模态奖励模型，能够自主调用外部工具（例如，图像裁剪、文档页面检索）来使判断基于可验证的证据，替代静态、非交互式的奖励评分。这使模型能够验证细微的视觉细节，跨参考多页证据，并验证推理声明，而这些能力在现有的奖励模型中是不存在的。我们使用多阶段强化学习训练ARM-Thinker，同时优化工具调用决策和判断准确性。为了评估自主奖励建模，我们引入了ARMBench-VL，包含三个基准测试，分别评估细微的视觉定位（图像级工具）、多页文档理解（检索工具）和指令遵循（文本级验证）。ARM-Thinker 在奖励模型基准测试中平均提高了16.2%，在工具使用任务中提高了9.6%，并在多模态数学和逻辑推理基准测试中优于基线模型。我们的结果表明，自主能力显著提高了奖励模型的准确性和可解释性。

Summary / 总结

ARM-Thinker is designed to improve the reliability of vision-language systems by incorporating agentic tool use and visual reasoning into multimodal reward models. It autonomously invokes external tools to verify visual details and cross-reference evidence, addressing issues like hallucination and weak visual grounding. ARM-Thinker was trained using multi-stage reinforcement learning and achieved significant improvements on various benchmarks, including a 16.2% average improvement on reward modeling tasks and 9.6% on tool-use tasks, outperforming existing baselines.

ARM-Thinker 是一种使用外部工具进行验证的自主多模态奖励模型，解决了现有模型中的幻觉和弱视觉接地问题。它通过多阶段强化学习来优化工具调用决策和判断准确性。ARM-Thinker 在奖励建模基准测试中平均提高了 16.2% 的性能，并在多模态推理任务中优于基线模型。

TV2TV: A Unified Framework for Interleaved Language and Video Generation

Authors: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan

First: 2025-12-04T18:59:09+00:00 · Latest: 2025-12-04T18:59:09+00:00

Abstract

Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.

中文标题/摘要

标题：TV2TV：一种统一的交错语言和视频生成框架

视频生成模型正在迅速发展，但仍可能在需要大量语义分支或反复进行下一步应该发生什么的高层推理的复杂视频输出上遇到困难。在本文中，我们介绍了一类新的全能视频-文本模型，这些模型结合了最近语言模型推理进展的想法，以应对这一挑战。具体来说，我们提出了TV2TV，这是一种统一的生成建模框架，将视频生成分解为交错的文本和视频生成过程。TV2TV 使用 Mixture-of-Transformers（MoT）架构联合学习语言建模（下一个标记预测）和视频流匹配（下一个帧预测）。在推理时，TV2TV 决定何时在生成文本和视频帧之间交替，使模型能够先“用文字思考”后续内容，再“用像素行动”来生成帧。这种设计将决定下一步应该发生什么的责任大部分转移到了语言建模塔上，从而提高了生成视频的视觉质量和提示对齐。它还实现了精细的可控性，使用户能够在过程中的任何时间点通过文本干预来修改视频生成轨迹。在视频游戏数据的受控实验中，TV2TV 在视觉质量和可控性方面都取得了显著改进。TV2TV 还可以扩展到自然视频，我们通过使用视觉-语言模型（VLMs）为体育视频添加交错的自然语言动作描述展示了这一点。在该语料库上训练 TV2TV 产生了强大的视觉质量和提示对齐，展示了模型推理和生成复杂现实动作序列的能力。这些结果共同突显了 TV2TV 是朝着具有开放式文本推理和控制的视频生成迈出的有希望的一步。

Summary / 总结

TV2TV is a unified generative modeling framework that addresses the challenge of generating complex videos by integrating language and video generation processes. It uses a Mixture-of-Transformers architecture to jointly learn language modeling and video flow matching. Experiments show that TV2TV improves visual quality and controllability in generated videos, especially in video game data, and scales to natural videos with the help of vision-language models.

TV2TV 是一种统一的生成模型框架，将文本和视频生成交织在一起，以应对复杂视频输出的挑战。它使用 Mixture-of-Transformers 架构同时学习语言建模和视频流匹配。实验表明，TV2TV 在视频游戏数据上提高了视觉质量和可控性，并通过使用视觉语言模型（VLM）对体育视频进行文本描述的增强，将其扩展到自然视频，展示了模型在生成复杂现实动作序列方面的强大视觉质量和提示对齐能力。

Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

Authors: Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim

First: 2025-12-04T18:46:44+00:00 · Latest: 2025-12-04T18:46:44+00:00

Comments: Project Page: https://cvlab-kaist.github.io/DeepForcing/

Abstract

Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.

中文标题/摘要

标题：Deep Forcing：基于 Deep Sink 与参与式压缩（Participative Compression）的免训练长视频生成

近期自回归视频扩散技术的进步使得实时帧流成为可能，但现有解决方案仍然存在时间重复、漂移和运动减速的问题。我们发现，直接将 StreamingLLM 式的注意力汇（attention sink）应用于视频扩散会导致保真度下降和运动停滞。为了解决这个问题，我们引入了 Deep Forcing，它由两种无需训练的机制组成，可以在不进行任何微调的情况下解决这些问题。具体来说，1) Deep Sink 将滑动窗口的一半用于持久的 sink 标记，并将它们的时间 RoPE 相位重新对齐到当前时间线，从而在长时间展开过程中稳定全局上下文。2) 参与式压缩（Participative Compression）执行重要性感知的 KV 缓存剪枝，仅保留在最近的注意力中积极参与的标记，同时安全地丢弃冗余和退化的历史记录，从而在生成超出训练分布的长度时最小化误差累积。这些组件结合在一起，实现了超过 12 倍的长度外推（例如，从 5 秒训练到 60 秒以上的生成），成像质量优于 LongLive，美学质量优于 RollingForcing，整体一致性几乎保持不变，动态程度显著提升，同时保持实时生成。我们的结果表明，在自回归流式长视频生成中，无需训练的 KV 缓存管理可以媲美甚至超越基于训练的方法。

Summary / 总结

Deep Forcing is a training-free method for long video generation that addresses temporal repetition and motion deceleration issues in existing solutions. It introduces two mechanisms: Deep Sink, which stabilizes global context by re-aligning temporal RoPE phases, and Participative Compression, which prunes the KV cache to preserve only active tokens. These components enable over 12x extrapolation with better imaging and aesthetic quality, maintaining consistency and dynamic degree, and supporting real-time generation.

Deep Forcing 是一种无需训练的方法，用于解决现有长视频生成方法中的时间重复和运动减速问题。它引入了两种机制：Deep Sink 通过重新对齐持久的 sink 标记来稳定全局上下文，而 Participative Compression 则通过仅保留积极参与的标记来修剪 KV 缓存，减少误差累积。这些机制使得生成能力提高了超过 12 倍，同时在成像质量和美学质量上优于先前的方法，保持了一致性和动态程度，并支持实时生成。
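
The two cache mechanisms described in the abstract can be illustrated with a small sketch. This is only a toy under stated assumptions, not the authors' implementation: the function names, the use of a plain dictionary of attention mass as the importance score, and the choice to place sink positions immediately before the recent window are all hypothetical.

```python
def select_cache(sink_ids, history_ids, attention, window):
    """Choose which cached tokens to keep (hypothetical helper).

    sink_ids:    persistent sink token ids, oldest first
    history_ids: remaining cached token ids, oldest first
    attention:   dict mapping token id -> attention mass received from
                 recent queries (assumed importance score)
    window:      total cache budget
    """
    # Deep-Sink-style split: half of the window is reserved for sinks.
    sink_budget = window // 2
    kept_sinks = sink_ids[:sink_budget]

    # Participative-compression-style pruning: keep the history tokens
    # that recent queries attended to most, drop the rest.
    budget = window - len(kept_sinks)
    ranked = sorted(history_ids, key=lambda t: attention.get(t, 0.0), reverse=True)
    kept = set(ranked[:budget])

    # Preserve temporal order among the surviving history tokens.
    kept_history = [t for t in history_ids if t in kept]
    return kept_sinks + kept_history


def realign_sink_positions(num_sinks, first_recent_position):
    """Give sinks positions directly before the recent window (assumption)."""
    start = first_recent_position - num_sinks
    return [start + i for i in range(num_sinks)]


if __name__ == "__main__":
    scores = {10: 0.1, 11: 0.5, 12: 0.05, 13: 0.3, 14: 0.9}
    print(select_cache([0, 1, 2, 3], [10, 11, 12, 13, 14], scores, window=4))
    print(realign_sink_positions(2, first_recent_position=100))
```

In a real model the importance scores would come from attention weights over key/value tensors, and the realigned positions would feed the rotary position embedding; both are abstracted away here.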

Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

Authors: Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso

Venue: NeurIPS 2025

First: 2025-06-11T19:36:17+00:00 · Latest: 2025-12-04T18:28:33+00:00

Comments: Presented at the Mechanistic Interpretability Workshop at NeurIPS 2025. 34 pages, 26 figures

Abstract

We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box-pushing game Sokoban. We find that the RNN stores future moves (plans) as activations in particular channels of the hidden state, which we call path channels. A high activation in a particular location means that, when a box is in that location, it will get pushed in the channel's assigned direction. We examine the convolutional kernels between path channels and find that they encode the change in position resulting from each possible action, thus representing part of a learned transition model. The RNN constructs plans by starting at the boxes and goals. These kernels extend activations in path channels forwards from boxes and backwards from the goal. Negative values are placed in channels at obstacles. This causes the extension kernels to propagate the negative value in reverse, thus pruning the last few steps and letting an alternative plan emerge, a form of backtracking. Our work shows that a precise understanding of the plan representation allows us to directly understand the bidirectional planning-like algorithm learned by model-free training in more familiar terms.

中文标题/摘要

标题：路径通道和计划扩展核：Sokoban RNN规划的机理描述

我们部分逆向工程了一个通过无模型强化学习训练、用于玩推箱子游戏Sokoban的卷积递归神经网络（RNN）。我们发现，RNN将未来的动作（计划）以激活的形式存储在隐藏状态的特定通道中，我们称之为路径通道。特定位置的高激活意味着当箱子位于该位置时，它将被朝该通道指定的方向推动。我们检查了路径通道之间的卷积核，发现它们编码了每种可能动作导致的位置变化，从而代表了学习到的转移模型的一部分。RNN从箱子和目标出发构建计划。这些核将路径通道中的激活从箱子向前、从目标向后扩展。在障碍物处，通道中会被置入负值。这导致扩展核将负值反向传播，从而修剪最后几步，让另一种计划浮现，这是一种形式的回溯。我们的工作表明，对计划表示的精确理解使我们能够直接用更熟悉的术语理解无模型训练所学到的类双向规划算法。

Summary / 总结

This study partially reverse-engineers a convolutional RNN trained for Sokoban to reveal that the RNN stores future moves (plans) in specific channels, called path channels, and uses convolutional kernels to encode the effects of actions. The RNN constructs plans by extending activations from boxes and goals, with negative values at obstacles causing backtracking. This work demonstrates that understanding the plan representation helps interpret the bidirectional planning algorithm learned by the RNN.

研究部分反向工程了一个用于解Sokoban的卷积递归神经网络（RNN），发现RNN将未来的移动（计划）存储在隐藏状态的特定通道中，称为路径通道。这些通道指示当箱子处于某个位置时将被推的方向。路径通道之间的卷积核编码每个动作后的位置变化，代表了一个学习到的转移模型。RNN从箱子和目标开始构建计划，使用扩展核从前向箱子和从目标向后传播激活值，障碍物处的负值导致回溯的出现。这项工作表明，理解计划表示有助于解释RNN通过无模型训练学习到的双向规划算法。
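
The forward plan extension described above can be pictured with a toy one-dimensional example. This is a loose illustration, not the network's actual computation: the single "push right" channel, the shift-by-one update, and the `extend_right` name are assumptions, and the negative-value backtracking mechanism is left out.

```python
def extend_right(channel, walls, steps):
    """Extend a 'push right' path channel one cell per step (toy model).

    channel: list of activations, positive where a rightward push is planned
    walls:   list of booleans, True where the cell is an obstacle
    steps:   number of recurrent updates to apply
    """
    width = len(channel)
    current = list(channel)
    for _ in range(steps):
        nxt = list(current)
        for c in range(1, width):
            # A shift kernel copies activation from the left neighbour,
            # mirroring how a rightward push moves the box one cell right.
            if current[c - 1] > 0 and not walls[c]:
                nxt[c] = max(nxt[c], current[c - 1])
        current = nxt
    return current


if __name__ == "__main__":
    row = [1, 0, 0, 0, 0]                          # box at the leftmost cell
    walls = [False, False, False, True, False]     # obstacle at index 3
    print(extend_right(row, walls, steps=4))
```

The plan grows by one cell per update and stops at the obstacle, which matches the qualitative picture in the abstract of a plan being extended step by step across recurrent iterations.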

4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer

Authors: Xianfeng Wu, Yajing Bai, Minghan Li, Xianzu Wu, Xueqi Zhao, Zhongyuan Lai, Wenyu Liu, Xinggang Wang

First: 2025-12-04T18:15:27+00:00 · Latest: 2025-12-04T18:15:27+00:00

Comments: Code: https://github.com/hustvl/4DLangVGGT, Webpage: https://hustvl.github.io/4DLangVGGT

Abstract

Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, which jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on the HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with gains of up to 2% under per-scene training and improvements of 1% under multi-scene training. Our code is released at https://github.com/hustvl/4DLangVGGT

中文标题/摘要

标题：4DLangVGGT：四维语言-视觉几何接地变换器

构建四维语言场对于具身人工智能、增强/虚拟现实以及四维场景理解至关重要，因为它们提供了动态环境的丰富语义表示，并在复杂场景中支持开放词汇查询。然而，现有的四维语义场构建方法主要依赖于场景特定的高斯泼溅（Gaussian splatting），这需要逐场景优化，泛化能力有限，难以扩展到实际应用。为了解决这些限制，我们提出了4DLangVGGT，这是首个基于Transformer的前馈统一框架，用于四维语言接地，该框架在单一架构中联合整合了几何感知和语言对齐。4DLangVGGT有两个关键组件：四维视觉几何Transformer StreamVGGT，用于捕获动态场景的时空几何表示；以及语义桥接解码器（SBD），将几何感知特征投影到语言对齐的语义空间，从而增强语义可解释性并保持结构保真度。与依赖于昂贵的逐场景优化的先前方法不同，4DLangVGGT可以在多个动态场景上联合训练，并在推理时直接应用，实现部署效率和强大的泛化能力。这种设计显著提高了大规模部署的实用性，并建立了开放词汇四维场景理解的新范式。在HyperNeRF和Neu3D数据集上的实验表明，我们的方法不仅泛化效果良好，还实现了最先进的性能，在逐场景训练下最高提升2%，在多场景训练下提升1%。我们在https://github.com/hustvl/4DLangVGGT发布了代码。

Summary / 总结

The research aims to construct 4D language fields for embodied AI and 4D scene understanding by proposing 4DLangVGGT, a Transformer-based unified framework that integrates geometric perception and language alignment. Key components include the 4D Visual Geometry Transformer (StreamVGGT) for capturing spatio-temporal geometric representations and the Semantic Bridging Decoder (SBD) for projecting these features into a language-aligned semantic space. Experiments show that 4DLangVGGT outperforms existing methods, achieving up to 2% gains under per-scene training and 1% improvements under multi-scene training, with better generalization and deployment efficiency. Code is available at https://github.com/hustvl/4DLangVGGT.

研究旨在通过提出4DLangVGGT，一种基于Transformer的统一框架，将几何感知和语言对齐结合起来，构建4D语言场，以支持具身AI和4D场景理解。关键组件包括4D视觉几何Transformer（StreamVGGT）用于捕获时空几何表示，以及语义桥接解码器（SBD）将这些特征投影到语言对齐的语义空间。实验表明，4DLangVGGT在逐场景训练中可获得最高2%的性能提升，在多场景训练中可获得1%的改进，具有更好的泛化能力和部署效率。代码可在https://github.com/hustvl/4DLangVGGT获取。

Towards a unified framework for guided diffusion models

Authors: Yuchen Jiao, Yuxin Chen, Gen Li

First: 2025-12-04T16:55:20+00:00 · Latest: 2025-12-04T16:55:20+00:00

Abstract

Guided or controlled data generation with diffusion models has become a cornerstone of modern generative modeling. (Partial preliminary results of this work appeared in the International Conference on Machine Learning 2025 [li2025provable].) Despite substantial advances in diffusion model theory, the theoretical understanding of guided diffusion samplers remains severely limited. We make progress by developing a unified algorithmic and theoretical framework that accommodates both diffusion guidance and reward-guided diffusion. Aimed at fine-tuning diffusion models to improve certain rewards, we propose injecting a reward guidance term -- constructed from the difference between the original and reward-reweighted scores -- into the backward diffusion process, and rigorously quantify the resulting reward improvement over the unguided counterpart. As a key application, our framework shows that classifier-free guidance (CFG) decreases the expected reciprocal of the classifier probability, providing the first theoretical characterization of the specific performance metric that CFG improves for general target distributions. When applied to reward-guided diffusion, our framework yields a new sampler that is easy to train and requires no full diffusion trajectories during training. Numerical experiments further corroborate our theoretical findings.

中文标题/摘要

标题：迈向统一的引导扩散模型框架

带有扩散模型的引导或控制数据生成已成为现代生成建模的基石。尽管在扩散模型理论方面取得了重大进展，但对引导扩散采样器的理论理解仍然非常有限。我们通过开发一个统一的算法和理论框架取得了进展，该框架可以容纳扩散引导和奖励引导扩散。旨在微调扩散模型以提高某些奖励，我们提出将奖励引导项——由原始分数和奖励加权分数之差构建——注入反向扩散过程，并严格量化与未引导的对应物相比的奖励改进。作为关键应用，我们的框架表明，无分类器引导（CFG）降低了分类器概率的期望倒数，首次为通用目标分布提供了CFG改进的具体性能指标的理论表征。当应用于奖励引导扩散时，我们的框架产生了一种新的采样器，该采样器易于训练，并且在训练过程中不需要完整的扩散轨迹。数值实验进一步证实了我们的理论发现。

Summary / 总结

This paper aims to develop a unified framework for guided diffusion models to enhance theoretical understanding and practical applications. The authors propose injecting a reward guidance term into the backward diffusion process, which improves certain rewards compared to unguided counterparts. Key findings include a theoretical characterization of classifier-free guidance (CFG) and a new reward-guided diffusion sampler that is easy to train and does not require full diffusion trajectories during training.

本文旨在开发统一框架以增强对引导扩散采样的理论理解。作者提出在反向扩散过程中注入奖励引导项以提高某些奖励。他们严格量化了奖励改进，并表明无分类器引导减少了分类器概率的倒数的期望值，首次为通用目标分布提供了无分类器引导改进的性能指标的理论表征。该框架还为奖励引导扩散生成了一个新的易于训练的采样器，无需在训练期间使用完整的扩散轨迹。
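
For readers unfamiliar with classifier-free guidance, the quantities the abstract refers to can be written out as follows. The first two lines are the standard CFG construction; the notation ($s_t$ for score functions, $w \ge 0$ for the guidance weight, $c$ for the condition) is assumed here and may differ from the paper's, and the last line only restates the metric named in the abstract, without the paper's precise conditions.

```latex
% Assumed notation: s_t(x) = \nabla_x \log p_t(x),  s_t(x | c) = \nabla_x \log p_t(x | c).
\begin{align*}
\tilde{s}_t(x \mid c)
  &= s_t(x \mid c) + w\,\bigl(s_t(x \mid c) - s_t(x)\bigr)
  && \text{(CFG-guided score)} \\
s_t(x \mid c) - s_t(x)
  &= \nabla_x \log p_t(c \mid x)
  && \text{(Bayes' rule; } \nabla_x \log p(c) = 0\text{)} \\
\text{metric:}\quad
  & \mathbb{E}\!\left[\frac{1}{p(c \mid X)}\right]
  && \text{(expected reciprocal classifier probability)}
\end{align*}
```

The second line shows why the guidance term behaves like a classifier gradient: increasing $w$ pushes samples toward regions where $p(c \mid x)$ is larger, which is consistent with the abstract's claim that CFG decreases the expected reciprocal of the classifier probability.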

Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models

Authors: NaHyeon Park, Namin An, Kunhee Kim, Soyeon Yoon, Jiahao Huo, Hyunjung Shim

First: 2025-12-04T16:52:45+00:00 · Latest: 2025-12-04T16:52:45+00:00

Comments: Project page: https://fairpro-t2i.github.io

Abstract

Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.

中文标题/摘要

标题：对齐但刻板？LVLM 基础文本到图像模型中社会偏见的隐秘影响

基于大型视觉语言模型（LVLM）的文本到图像（T2I）系统已成为图像生成的主导范式，但它们是否放大了社会偏见尚不充分理解。在本文中，我们展示了基于LVLM的模型生成的社会偏见图像明显多于非LVLM基础模型。我们引入了一个包含四个语言复杂度级别的1024个提示基准，并以系统的方式评估了多个属性上的人口统计学偏见。我们的分析确定系统提示，即引导LVLM的预定义指令，是偏见行为的主要驱动因素。通过解码中间表示、标记概率诊断和嵌入关联分析，我们揭示了系统提示如何编码人口统计学先入之见并传播到图像合成中。为此，我们提出了FairPro，一种无需训练的元提示框架，使LVLM能够在测试时自我审计并构建公平意识的系统提示。在两个基于LVLM的T2I模型SANA和Qwen-Image上的实验表明，FairPro在保持文本图像对齐的同时显著减少了人口统计学偏见。我们认为我们的发现提供了对系统提示在偏见传播中核心作用的更深入理解，并提供了一种实用的、可部署的方法来构建更具社会责任感的T2I系统。

Summary / 总结

This paper investigates the social bias in large vision-language model (LVLM)-based text-to-image (T2I) systems and finds that these models produce more socially biased images than non-LVLM-based models. The authors introduce a 1,024 prompt benchmark to evaluate demographic bias and identify system prompts as a key driver. They propose FairPro, a meta-prompting framework that reduces demographic bias without compromising text-image alignment, demonstrating its effectiveness on SANA and Qwen-Image models.

该研究探讨了基于LVLM的图文生成模型中的社会偏见问题，发现这些模型生成的图像比非LVLM模型更具偏见。通过分析一个包含1,024个提示的基准，研究确定系统提示是偏见传播的关键因素。作者提出了一种名为FairPro的元提示框架，帮助LVLM生成公平意识的系统提示，从而减少社会偏见同时保持文本与图像的一致性。

A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models

Authors: X. Y. Han, Yuan Zhong

First: 2025-12-03T16:00:02+00:00 · Latest: 2025-12-04T16:34:28+00:00

Abstract

In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of (costly) GPUs. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure -- proposed by DeepSeek's Wang et al. (2024) -- by casting it as a one-step-per-iteration primal-dual method for an assignment problem. First, in a stylized deterministic setting, our framework yields several insightful structural properties: (i) a monotonic improvement of a Lagrangian objective, (ii) a preference rule that moves tokens from overloaded to underloaded experts, and (iii) an approximate-balancing guarantee. Then, we incorporate the stochastic and dynamic nature of AI training using a generalized online optimization formulation. In the online setting, we derive a strong convexity property of the objective that leads to a logarithmic expected regret bound under certain step-size choices. Additionally, we present real experiments on 1B-parameter DeepSeekMoE models to complement our theoretical findings. Together, these results build a principled framework for analyzing the Auxiliary-Loss-Free Load Balancing of s-MoE in AI models.

Summary / 总结

The research aims to address the operational challenge of load balancing in Sparse Mixture-of-Experts (s-MoE) layers by providing a theoretical framework for the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure. The method is analyzed as a primal-dual method for an assignment problem, yielding insights into the monotonic improvement of a Lagrangian objective, a preference rule for moving tokens, and an approximate-balancing guarantee. The framework is further extended to handle the stochastic and dynamic nature of AI training, leading to a logarithmic expected regret bound under certain step-size choices. Experimental results on 1B-parameter DeepSeekMoE models support the theoretical findings.

论文提供了一种分析Sparse Mixture-of-Experts (s-MoE) 层中Auxiliary-Loss-Free Load Balancing (ALF-LB) 程序的理论框架，将其视为一个分配问题的原始-对偶方法。在确定性环境中，该框架揭示了几个结构性特征，包括拉格朗日目标的单调改进和近似平衡保证。作者将此扩展到随机和动态环境，推导出在某些步长选择下的对数期望后悔界。通过在DeepSeekMoE模型上的实际实验验证了理论发现，提供了一种分析s-MoE层中负载均衡的原理性方法。
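
A minimal sketch of a bias-based balancing loop in the spirit of ALF-LB is shown below. It is an illustration under assumptions, not the procedure analyzed in the paper: the function names, the fixed step size, and the sign-based update are choices made for this example, and the routing scores are held fixed rather than produced by a training model.

```python
def route_top_k(scores, bias, k):
    """Pick k experts per token by biased score (bias affects selection only)."""
    assignments = []
    for token_scores in scores:
        biased = [s + b for s, b in zip(token_scores, bias)]
        ranked = sorted(range(len(biased)), key=lambda e: biased[e], reverse=True)
        assignments.append(ranked[:k])
    return assignments


def expert_loads(assignments, num_experts):
    """Count how many tokens each expert received."""
    loads = [0] * num_experts
    for chosen in assignments:
        for e in chosen:
            loads[e] += 1
    return loads


def update_bias(bias, loads, step):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(loads) / len(loads)
    new_bias = []
    for b, load in zip(bias, loads):
        if load > mean_load:
            new_bias.append(b - step)
        elif load < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


def balance(scores, k, step, iterations):
    """Run one bias update per iteration and record the loads seen."""
    num_experts = len(scores[0])
    bias = [0.0] * num_experts
    history = []
    for _ in range(iterations):
        loads = expert_loads(route_top_k(scores, bias, k), num_experts)
        history.append(loads)
        bias = update_bias(bias, loads, step)
    return bias, history


if __name__ == "__main__":
    toy_scores = [[0.6, 0.4], [0.7, 0.3], [0.85, 0.15], [0.9, 0.1]]
    final_bias, load_history = balance(toy_scores, k=1, step=0.15, iterations=4)
    print(load_history)
```

In this toy run every token initially prefers the first expert, and the per-expert bias gradually shifts tokens toward the idle expert until the loads are even, which is the qualitative behavior the preference rule in the abstract describes.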

LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics

Authors: Weiye Shi, Zhaowei Zhang, Shaoheng Yan, Yaodong Yang

First: 2025-12-04T16:26:42+00:00 · Latest: 2025-12-04T16:26:42+00:00

Abstract

Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns, from raw text remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural-language tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public domain literary works. The dataset comprises thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.

中文标题/摘要

标题：LLMs 知识远超文字：一种涉及句法、隐喻与音韵的体裁研究

大型语言模型（LLMs）在多种语言相关任务中展现出显著潜力，但它们是否能够捕捉到更深层次的语言特性，如句法结构、音素提示和音节模式，仍然不清楚。为了分析LLMs是否能够有效学习这些特征并应用于重要的自然语言相关任务，我们引入了一个新颖的多语言体裁分类数据集，该数据集源自Project Gutenberg，这是一个提供数千篇公共领域文学作品的大型数字图书馆，包含六种语言（英语、法语、德语、意大利语、西班牙语和葡萄牙语）的数千个句子，每种二元任务（诗歌 vs. 小说；戏剧 vs. 诗歌；戏剧 vs. 小说）。我们为每个任务集增加了三个明确的语言特征集（句法树结构、隐喻计数和音韵指标），以评估它们对分类性能的影响。实验表明，尽管LLM分类器可以从原始文本或明确提供的特征中学习潜在的语言结构，但不同特征在不同任务中的贡献是不均衡的，这突显了在模型训练过程中整合更复杂语言信号的重要性。

Summary / 总结

This study investigates whether large language models (LLMs) can learn and apply deeper linguistic properties such as syntax, metaphor, and phonetics. A multilingual genre classification dataset was created using Project Gutenberg texts, with explicit linguistic features added to evaluate their impact. Experiments show that LLMs can learn these features from both raw text and explicit features, but the effectiveness varies across different tasks, highlighting the need for incorporating complex linguistic signals during training.

研究探讨了大型语言模型（LLMs）是否能够学习和应用更深层次的语法规则、隐喻和音韵等语言特性。使用Project Gutenberg文本创建了一个多语言体裁分类数据集，并添加了显式的语言特征以评估其影响。实验表明，LLMs可以从原始文本或显式输入中学习这些特征，但在不同任务上的效果不同，强调了在模型训练中需要包含更复杂的语言信号的重要性。

FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via neural Action Tokenization

Authors: Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, Liangtao Zheng, Tao Jiang, Jingjing Gong, Xipeng Qiu, Hang Zhao

First: 2025-12-04T16:21:38+00:00 · Latest: 2025-12-04T16:21:38+00:00

Abstract

Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.

中文标题/摘要

标题：FASTer：通过神经动作分词实现高效自回归视觉语言动作建模

自回归视觉-语言-动作（VLA）模型最近在机器人操作方面展示了强大的能力。然而，它们的核心动作分词过程通常会在重建保真度和推理效率之间进行权衡。我们引入了FASTer，这是一种统一的高效且可泛化的机器人学习框架，该框架结合了一个可学习的分词器和基于它的自回归策略。FASTerVQ 将动作片段编码为单通道图像，捕获全局时空依赖关系的同时保持高压缩比。FASTerVLA 在此基础上使用块状自回归解码和轻量级动作专家，实现更快的推理和更高的任务性能。广泛的实验表明，FASTerVQ 提供了卓越的重建质量、高分词利用率和强大的跨任务和跨载体泛化能力，而 FASTerVLA 进一步提高了整体能力，在推理速度和任务性能方面均超越了之前的最先进的 VLA 模型。

Summary / 总结

FASTer is a framework designed to improve the efficiency of autoregressive vision-language-action models in robotic manipulation. It uses a learnable tokenizer to encode action chunks as single-channel images, enhancing both reconstruction quality and inference speed. Experiments show that FASTerVQ outperforms previous models in reconstruction quality and token utilization, and FASTerVLA further enhances task performance and inference speed, surpassing state-of-the-art VLA models.

研究旨在通过动作分词解决自回归视觉-语言-动作模型在机器人操作中重建保真度和推理效率之间的权衡问题。FASTer框架引入了可学习的分词器和自回归策略，其中FASTerVQ专注于高质量编码，而FASTerVLA则侧重于更快的推理和更好的任务性能。实验结果表明，FASTerVQ在重建质量和分词利用率方面表现出色，而FASTerVLA在推理速度和任务性能方面超越了之前的最先进模型，在各种基准测试中均表现出色。

"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments

Authors: Ziyi Zhang, Zhen Sun, Zongmin Zhang, Zifan Peng, Yuemeng Zhao, Zichun Wang, Zeren Luo, Ruiting Zuo, Xinlei He

First: 2025-05-07T15:03:16+00:00 · Latest: 2025-12-04T16:15:45+00:00

Comments: 17 pages

Abs · PDF · Code1 · Code2

Abstract

The visually impaired population faces significant challenges in daily activities. While prior works employ vision language models for assistance, most focus on static content and cannot address real-time perception needs in complex environments. Recent VideoLLMs enable real-time vision and speech interaction, offering promising potential for assistive tasks. In this work, we conduct the first study evaluating their effectiveness in supporting daily life for visually impaired individuals. We first conducted a user survey with visually impaired participants to design the benchmark VisAssistDaily for daily life evaluation. Using VisAssistDaily, we evaluate popular VideoLLMs and find GPT-4o achieves the highest task success rate. We further conduct a user study to reveal concerns about hazard perception. To address this, we propose SafeVid, an environment-awareness dataset, and fine-tune VITA-1.5, improving risk recognition accuracy from 25.00% to 76.00%. We hope this work provides valuable insights and inspiration for future research in this field.

中文标题/摘要

标题："我能看到永远！": 评估实时视频LLM在辅助视觉障碍个体中的效果

视觉障碍人群在日常活动中面临重大挑战。尽管先前的工作利用视觉语言模型进行辅助，但大多数工作集中在静态内容上，无法解决复杂环境中实时感知的需求。最近的视频LLM能够实现实时视觉和语音交互，为辅助任务提供了巨大的潜力。在本文中，我们首次研究了它们在支持视觉障碍个体日常生活的有效性。我们首先对视觉障碍参与者进行了用户调查，设计了用于日常生活的基准测试VisAssistDaily。使用VisAssistDaily，我们评估了流行的视频LLM，发现GPT-4o的任务成功率最高。我们进一步进行了一项用户研究，揭示了对危险感知的担忧。为了解决这个问题，我们提出了SafeVid，一个环境感知数据集，并对VITA-1.5进行了微调，将风险识别准确性从25.00%提高到76.00%。我们希望这项工作为该领域的未来研究提供有价值的见解和灵感。

Multi-Agent Reinforcement Learning for Intraday Operating Rooms Scheduling under Uncertainty

Authors: Kailiang Liu, Ying Chen, Ralf Borndörfer, Thorsten Koch

First: 2025-12-04T15:47:08+00:00 · Latest: 2025-12-04T15:47:08+00:00

Abs · PDF · Code1 · Code2

Abstract

Intraday surgical scheduling is a multi-objective decision problem under uncertainty, balancing elective throughput, urgent and emergency demand, delays, sequence-dependent setups, and overtime. We formulate the problem as a cooperative Markov game and propose a multi-agent reinforcement learning (MARL) framework in which each operating room (OR) is an agent trained with centralized training and decentralized execution. All agents share a policy trained via Proximal Policy Optimization (PPO), which maps rich system states to actions, while a within-epoch sequential assignment protocol constructs conflict-free joint schedules across ORs. A mixed-integer pre-schedule provides reference starting times for electives; we impose type-specific quadratic delay penalties relative to these references and a terminal overtime penalty, yielding a single reward that captures throughput, timeliness, and staff workload. In simulations reflecting a realistic hospital mix (six ORs, eight surgery types, random urgent and emergency arrivals), the learned policy outperforms six rule-based heuristics across seven metrics and three evaluation subsets, and, relative to an ex post MIP oracle, quantifies optimality gaps. Policy analytics reveal interpretable behavior: prioritizing emergencies, batching similar cases to reduce setups, and deferring lower-value electives. We also derive a suboptimality bound for the sequential decomposition under simplifying assumptions. We discuss limitations, including OR homogeneity and the omission of explicit staffing constraints, and outline extensions. Overall, the approach offers a practical, interpretable, and tunable data-driven complement to optimization for real-time OR scheduling.
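
The penalty terms named in the abstract (type-specific quadratic delay penalties relative to reference start times, plus a terminal overtime penalty) can be sketched roughly as below. This is only an illustration under stated assumptions: the type names, the weights, the clamping of early starts to zero delay, and the linear form of the overtime term are all hypothetical and are not taken from the paper, and any throughput term is omitted.

```python
from dataclasses import dataclass


@dataclass
class Case:
    surgery_type: str       # hypothetical type label
    reference_start: float  # minutes; from the pre-schedule
    actual_start: float     # minutes; realized start time


# Hypothetical per-type delay weights and overtime weight (not from the paper).
DELAY_WEIGHT = {"type_a": 0.01, "type_b": 0.02}
OVERTIME_WEIGHT = 1.0


def delay_penalty(case: Case) -> float:
    """Quadratic penalty on lateness relative to the reference start.

    Assumption: starting early is not penalized, so delay is clamped at zero.
    """
    delay = max(0.0, case.actual_start - case.reference_start)
    return DELAY_WEIGHT[case.surgery_type] * delay ** 2


def episode_reward(cases, or_finish_times, shift_end: float) -> float:
    """Negative sum of delay penalties and a terminal overtime penalty.

    Assumption: overtime is charged linearly per minute past shift_end,
    summed over operating rooms.
    """
    delay_cost = sum(delay_penalty(c) for c in cases)
    overtime = sum(max(0.0, t - shift_end) for t in or_finish_times)
    return -(delay_cost + OVERTIME_WEIGHT * overtime)
```

With these assumed weights, a case that starts 10 minutes late costs 0.01 × 10² = 1.0, which shows why a quadratic form discourages long delays much more strongly than short ones.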

中文标题/摘要

标题：在不确定性下的日内手术室调度的多智能体强化学习

日内手术调度是一个在不确定性下的多目标决策问题，需要平衡择期手术量、紧急和急诊需求、延迟、顺序相关的设置以及加班。我们将问题形式化为合作马尔可夫博弈，并提出一个多智能体强化学习（MARL）框架，其中每个手术室（OR）是一个通过集中训练和分散执行训练的智能体。所有智能体共享一个通过近端策略优化（PPO）训练的策略，该策略将丰富的系统状态映射为动作，而一个在周期内的顺序分配协议构建了OR之间的无冲突联合调度。混合整数预调度提供择期手术的参考开始时间；我们对这些参考施加类型特定的二次延迟惩罚，并施加一个终端加班惩罚，产生一个单一的奖励，该奖励捕捉了吞吐量、及时性和工作人员的工作量。在反映现实医院情况的模拟中（六个OR，八种手术类型，随机的紧急和急诊到达），学习到的策略在七个指标和三个评估子集上均优于六种基于规则的启发式方法，并且相对于事后MIP优化器，量化了最优性差距。策略分析揭示了可解释的行为-优先处理紧急情况、批量处理相似案例以减少设置以及推迟低价值的择期手术。我们还在简化假设下推导了顺序分解的次优性界。我们讨论了限制，包括OR同质性和未明确包含的人员配置约束，并概述了扩展。总体而言，该方法为实时手术室调度提供了实用、可解释且可调节的数据驱动补充，与优化方法相结合。

Summary / 总结

The paper addresses intraday surgical scheduling under uncertainty by formulating the problem as a cooperative Markov game and proposing a multi-agent reinforcement learning (MARL) framework. Each operating room is an agent, and all agents share a policy trained with Proximal Policy Optimization (PPO) under centralized training and decentralized execution. In simulation, the learned policy outperforms six rule-based heuristics across seven metrics and three evaluation subsets, using a single reward that captures throughput, timeliness, and staff workload. The approach also yields interpretable policy behavior and a suboptimality bound under simplifying assumptions.

论文针对手术室在日内的调度问题，将其表述为一个在不确定性下的多目标决策问题。通过合作马尔可夫游戏的形式，提出了一个多智能体强化学习（MARL）框架，其中每个手术室是一个智能体，使用集中训练和分散执行的方式，并采用近端策略优化（PPO）。学习到的策略在七个指标和三个评估子集上优于六种基于规则的启发式方法，提供了诸如优先处理紧急情况和推迟低价值择期手术等可解释的行为。该方法还提供了一种实用、可解释且可调的数据驱动解决方案，用于实时手术室调度。

Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment

Authors: Yang Chen, Xiaowei Xu, Shuai Wang, Chenhui Zhu, Ruxue Wen, Xubin Li, Tiezheng Ge, Limin Wang

Venue: AAAI 2026

First: 2025-11-27T11:35:08+00:00 · Latest: 2025-12-04T15:44:45+00:00

Comments: Accepted by AAAI 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF's embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3×, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64×64 and 256×256. Our code is available at https://github.com/MCG-NJU/FlowBack.

中文标题/摘要

标题：逆向表示对齐改进：通过逆向表示对齐提高流动变换

流动变换（NFs）是一类生成模型，以其数学可逆的架构为特征，其中前向传递将数据转换到潜在空间进行密度估计，而逆向传递则从该空间生成新的样本。这一特性在表示学习和数据生成之间创造了内在的协同作用。然而，标准NFs的生成质量受限于从对数似然优化中获得的较差语义表示。为了解决这一问题，我们提出了一种新颖的对齐策略，创造性地利用了NFs的可逆性：而不是正则化前向传递，我们对生成（逆向）传递的中间特征与强视觉基础模型的表示进行对齐，显示出比简单对齐更有效的效果。我们还引入了一种新的无需训练、测试时的优化算法，用于分类，这为NF嵌入的语义知识提供了更内在的评估。全面的实验表明，我们的方法不仅将NFs的训练加速了3.3倍以上，还在生成质量和分类准确性方面取得了显著的改进。在ImageNet 64×64和256×256上，我们建立了NFs的新最佳结果。我们的代码可在https://github.com/MCG-NJU/FlowBack获取。

Summary / 总结

The paper addresses the limited generative quality of standard Normalizing Flows (NFs), which stems from the poor semantic representations learned through log-likelihood optimization. It introduces an alignment strategy that aligns the intermediate features of the reverse (generative) pass with representations from a vision foundation model, improving both generative quality and classification accuracy. Experiments show training accelerated by over 3.3 times and new state-of-the-art results for NFs on ImageNet 64×64 and 256×256. A training-free, test-time optimization algorithm for classification is also proposed to evaluate the semantic knowledge embedded in the NF.

本文针对标准Normalizing Flows (NFs)因语义表示较差而导致生成高质量数据的局限性，提出了一种逆向表示对齐策略，该策略将生成过程与强大的视觉基础模型对齐，从而提高生成质量和分类准确性。实验显示训练速度提高了3.3倍，并在ImageNet 64x64和256x256上取得了新的最佳结果。

EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?

Authors: Pierre Adorni, Minh-Tan Pham, Stéphane May, Sébastien Lefèvre

First: 2025-11-26T15:52:56+00:00 · Latest: 2025-12-04T15:22:57+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs. All codes and pretrained models are available at https://github.com/pierreadorni/EoS-FM.

中文标题/摘要

标题：EoS-FM：一组专家模型能否充当通用特征提取器？

基础模型在自然语言处理和计算机视觉等领域的最新进展显示了巨大的潜力，类似的努力现在也在地球观测领域出现。这些模型旨在在有限监督的情况下泛化任务，减少为每个任务单独训练模型的需要。然而，当前的策略主要集中在扩大模型规模和数据集的大小上，这需要巨大的计算和数据资源，限制了其仅对少数大型机构的可用性。此外，这种不断扩大的模型范式与可持续和环境友好的人工智能原则形成了鲜明对比，因为它导致了巨大的碳足迹和资源低效。在本文中，我们提出了一种新颖且高效的替代方案：用于构建遥感基础模型（RSFM）的专家模型组框架。我们的方法将训练过程分解为轻量级、任务特定的ConvNeXtV2专家，这些专家可以冻结并重用。这种模块化方法在效率、可解释性和可扩展性方面具有明显优势。此外，它自然支持联邦训练、剪枝和连续专家集成，使其特别适合协作和资源受限的环境。我们的框架为构建可扩展和高效的RSFM设定了新的方向。所有代码和预训练模型均可在https://github.com/pierreadorni/EoS-FM获取。

Summary / 总结

This paper explores the feasibility of using an Ensemble-of-Specialists framework to build Remote Sensing Foundation Models (RSFMs), addressing the prohibitive computational and data requirements of current large-scale models. The method decomposes training into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused, offering advantages in efficiency, interpretability, and extensibility. The modular design also supports federated training, pruning, and continuous specialist integration, making it suitable for collaborative and resource-constrained settings.

本文探讨了使用Ensemble-of-Specialists框架构建遥感基础模型（RSFMs）的可能性，以解决当前大规模模型在计算和数据资源需求方面的局限性。该方法涉及训练轻量级、任务特定的ConvNeXtV2专家模型，并可重复使用和冻结，从而在效率、可解释性和可扩展性方面提供优势。关键发现包括与单一模型相比，该框架在性能和资源效率方面有所提升，特别适合协作和资源受限的环境。

Chameleon: Adaptive Adversarial Agents for Scaling-Based Visual Prompt Injection in Multimodal AI Systems

Authors: M Zeeshan, Saud Satti

First: 2025-12-04T15:22:28+00:00 · Latest: 2025-12-04T15:22:28+00:00

Comments: 5 pages, 2 figures, IEEE Transactions on Dependable and Secure Computing

Abs · PDF · Code1 · Code2

Abstract

Multimodal Artificial Intelligence (AI) systems, particularly Vision-Language Models (VLMs), have become integral to critical applications ranging from autonomous decision-making to automated document processing. As these systems scale, they rely heavily on preprocessing pipelines to handle diverse inputs efficiently. However, this dependency on standard preprocessing operations, specifically image downscaling, creates a significant yet often overlooked security vulnerability. While intended for computational optimization, scaling algorithms can be exploited to conceal malicious visual prompts that are invisible to human observers but become active semantic instructions once processed by the model. Current adversarial strategies remain largely static, failing to account for the dynamic nature of modern agentic workflows. To address this gap, we propose Chameleon, a novel, adaptive adversarial framework designed to expose and exploit scaling vulnerabilities in production VLMs. Unlike traditional static attacks, Chameleon employs an iterative, agent-based optimization mechanism that dynamically refines image perturbations based on the target model's real-time feedback. This allows the framework to craft highly robust adversarial examples that survive standard downscaling operations to hijack downstream execution. We evaluate Chameleon against the Gemini 2.5 Flash model. Our experiments demonstrate that Chameleon achieves an Attack Success Rate (ASR) of 84.5% across varying scaling factors, significantly outperforming static baseline attacks, which average only 32.1%. Furthermore, we show that these attacks effectively compromise agentic pipelines, reducing decision-making accuracy by over 45% in multi-step tasks. Finally, we discuss the implications of these vulnerabilities and propose multi-scale consistency checks as a necessary defense mechanism.
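
The abstract names multi-scale consistency checks as a defense but does not specify them. One possible reading, sketched below purely as an assumption, is to downscale the same image with two different kernels at several factors and flag inputs where the results disagree strongly, since content that only appears under one sampling scheme is a sign of a scaling attack. The function names, the choice of kernels, and the threshold are hypothetical; images are plain 2D lists of grayscale values to keep the sketch dependency-free.

```python
def _check_divisible(img, factor):
    """Assumption for simplicity: image sides are multiples of the factor."""
    if len(img) % factor or len(img[0]) % factor:
        raise ValueError("image dimensions must be divisible by the factor")


def downscale_nearest(img, factor):
    """Keep the top-left pixel of every factor x factor block."""
    _check_divisible(img, factor)
    return [row[::factor] for row in img[::factor]]


def downscale_area(img, factor):
    """Average every factor x factor block."""
    _check_divisible(img, factor)
    out = []
    for y in range(0, len(img), factor):
        out_row = []
        for x in range(0, len(img[0]), factor):
            block = [img[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def mean_abs_diff(a, b):
    """Mean absolute pixel difference between two equally sized images."""
    total = sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def is_suspicious(img, factors=(2, 4), threshold=40.0):
    """Flag the image if the two kernels disagree strongly at any scale.

    The threshold is a hypothetical value that would need tuning on real data.
    """
    return any(
        mean_abs_diff(downscale_nearest(img, f), downscale_area(img, f)) > threshold
        for f in factors
    )
```

A uniform image passes this check, while an image whose bright pixels sit exactly on the sampling grid is flagged, because nearest-neighbor sampling sees only those pixels and area averaging does not.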

中文标题/摘要

标题：变色龙：基于缩放的视觉提示注入适应性对抗代理在多模态AI系统中的应用

多模态人工智能（AI）系统，特别是视觉-语言模型（VLMs），已成为从自主决策到自动化文档处理等关键应用的重要组成部分。随着这些系统的扩展，它们依赖于预处理管道来高效处理各种输入。然而，对标准预处理操作，特别是图像缩放的依赖，创造了一个重要的但经常被忽视的安全漏洞。虽然缩放算法旨在进行计算优化，但它们可以被利用来隐藏对人类观察者不可见但被模型处理后成为有效语义指令的恶意视觉提示。当前的对抗策略大多保持静态，未能考虑到现代代理工作流程的动态性。为了解决这一差距，我们提出了变色龙，这是一种新颖的、适应性的对抗框架，旨在揭示并利用生产VLMs中的缩放漏洞。与传统的静态攻击不同，变色龙采用了一种迭代的、基于代理的优化机制，根据目标模型的实时反馈动态细化图像扰动。这使得框架能够生成高度鲁棒的对抗样本，这些样本能够生存下来标准的缩放操作，从而劫持下游执行。我们使用Gemini 2.5 Flash模型对变色龙进行了评估。我们的实验表明，变色龙在不同缩放因子下的攻击成功率（ASR）达到了84.5%，远高于平均32.1%的静态基线攻击。此外，我们展示了这些攻击有效地破坏了代理管道，在多步骤任务中使决策准确性降低了超过45%。最后，我们讨论了这些漏洞的影响，并提出了多尺度一致性检查作为必要的防御机制。

Summary / 总结

Chameleon is an adaptive adversarial framework designed to exploit scaling vulnerabilities in Vision-Language Models (VLMs). Unlike static attacks, Chameleon uses an iterative optimization mechanism to refine image perturbations based on real-time feedback from the target model. Experiments show that Chameleon achieves an Attack Success Rate of 84.5% across different scaling factors, significantly outperforming static attacks with an average ASR of 32.1%. The attacks compromise agentic pipelines, reducing decision-making accuracy by over 45% in multi-step tasks.

论文提出了Chameleon，一种针对视觉语言模型（VLM）缩放漏洞的自适应对抗框架。不同于静态攻击，Chameleon 使用基于目标模型实时反馈的迭代、代理优化机制来细化图像扰动。实验显示，Chameleon 在不同缩放因子下的攻击成功率高达84.5%，远超平均32.1%的静态攻击成功率。这些攻击会破坏代理管道，使多步任务的决策准确性降低超过45%。

You Only Train Once (YOTO): A Retraining-Free Object Detection Framework

Authors: Priyanto Hidayatullah, Nurjannah Syakrani, Yudi Widhiyasana, Muhammad Rizqi Sholahuddin, Refdinal Tubagus, Zahri Al Adzani Hidayat, Hanri Fajar Ramadhan, Dafa Alfarizki Pratama, Farhan Muhammad Yasin

First: 2025-12-04T15:15:43+00:00 · Latest: 2025-12-04T15:15:43+00:00

Comments: under review in the Elsevier Engineering Journal

Abs · PDF · Code1 · Code2

Abstract

Object detection constitutes the primary task within the domain of computer vision. It is utilized in numerous domains. Nonetheless, object detection continues to encounter the issue of catastrophic forgetting. The model must be retrained whenever new products are introduced, utilizing not only the new products dataset but also the entirety of the previous dataset. The outcome is obvious: increasing model training expenses and significant time consumption. In numerous sectors, particularly retail checkout, the frequent introduction of new products presents a great challenge. This study introduces You Only Train Once (YOTO), a methodology designed to address the issue of catastrophic forgetting by integrating YOLO11n for object localization with DeIT and Proxy Anchor Loss for feature extraction and metric learning. For classification, we utilize cosine similarity between the embedding features of the target product and those in the Qdrant vector database. In a case study conducted in a retail store with 140 products, the experimental results demonstrate that our proposed framework achieves encouraging accuracy, whether for detecting new or existing products. Furthermore, without retraining, the training duration difference is significant. We achieve almost 3 times the training time efficiency compared to classical object detection approaches. This efficiency escalates as additional new products are added to the product database. The average inference time is 580 ms per image containing multiple products, on an edge device, validating the proposed framework's feasibility for practical use.
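
The classification step described above, cosine similarity between a query embedding and stored product embeddings, can be illustrated with a minimal sketch. It assumes an in-memory list standing in for the vector database and does not use the Qdrant API; the class name, the `min_score` threshold, and the toy three-dimensional embeddings are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ProductIndex:
    """Toy in-memory stand-in for a vector database of product embeddings."""

    def __init__(self):
        self._items = []  # list of (label, embedding) pairs

    def add(self, label, embedding):
        """Register a product; no model retraining is involved."""
        self._items.append((label, list(embedding)))

    def classify(self, query, min_score=0.5):
        """Return (label, score) of the most similar product.

        Returns (None, score) when the best score is below min_score,
        a hypothetical rejection threshold.
        """
        best_label, best_score = None, -1.0
        for label, embedding in self._items:
            score = cosine_similarity(query, embedding)
            if score > best_score:
                best_label, best_score = label, score
        if best_score < min_score:
            return None, best_score
        return best_label, best_score
```

In this sketch, supporting a new product amounts to one `add` call with its embedding, which illustrates why a retrieval-based classifier can avoid retraining when the catalog grows.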

中文标题/摘要

标题：你只需训练一次（YOTO）：一种无需重新训练的目标检测框架

目标检测是计算机视觉领域的主要任务，被广泛应用于多个领域。然而，目标检测仍然面临灾难性遗忘的问题。每当引入新产品时，模型必须重新训练，不仅需要新产品的数据集，还需要整个之前的数据集。结果显而易见：增加了模型训练成本和大量时间消耗。在许多领域，尤其是零售结账领域，频繁引入新产品是一个巨大的挑战。本研究引入了你只需训练一次（YOTO）的方法，通过将YOLO11n用于目标定位、DeIT和Proxy Anchor Loss用于特征提取和度量学习来解决灾难性遗忘问题。对于分类，我们使用目标产品嵌入特征与Qdrant向量数据库中特征的余弦相似度。在一家拥有140种产品的零售店进行的案例研究中，实验结果表明，我们提出的框架在检测新产品或现有产品时均取得了令人鼓舞的准确性。此外，无需重新训练，训练时间差异显著。我们实现了与经典目标检测方法相比几乎3倍的训练时间效率。随着产品数据库中新增产品的数量增加，这种效率会进一步提高。在边缘设备上，每张包含多个产品的图像平均推理时间为580毫秒，验证了所提框架在实际应用中的可行性。

Summary / 总结

The study addresses catastrophic forgetting in object detection by proposing You Only Train Once (YOTO), which combines YOLO11n for localization with DeIT and Proxy Anchor Loss for feature extraction and metric learning, and classifies products by cosine similarity against embeddings stored in a Qdrant vector database. In a retail case study with 140 products, the framework achieves encouraging accuracy on both new and existing products. Because adding products requires no retraining, it is almost three times as efficient in training time as classical object detection approaches. The average inference time is 580 ms per multi-product image on an edge device, supporting its feasibility for practical use.

研究通过提出You Only Train Once (YOTO) 方法，整合了YOLO11n、DeIT 和 Proxy Anchor Loss，解决了对象检测中的灾难性遗忘问题。实验结果表明，该框架在包含140种产品的零售店中，对新旧产品都具有较高的检测精度，相比传统方法，训练时间效率提高了近三倍，边缘设备上的平均推理时间为每张包含多个产品的图像580毫秒，验证了该框架的实际可行性。

SDG-Track: A Heterogeneous Observer-Follower Framework for High-Resolution UAV Tracking on Embedded Platforms

Authors: Jiawen Wen, Yu Hu, Suixuan Qiu, Jinshan Huang, Xiaowen Chu

First: 2025-12-04T15:11:43+00:00 · Latest: 2025-12-04T15:11:43+00:00

Comments: https://github.com/Jeffry-wen/SDG-Track

Abs · PDF · Code1 · Code2 · Code3

Abstract

Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our paper code is publicly available at https://github.com/Jeffry-wen/SDG-Track

中文标题/摘要

标题：SDG-Track：嵌入式平台高分辨率无人机跟踪的异构观察者-跟随框架

边缘设备上实时跟踪小型无人机面临着分辨率与速度的基本冲突。将高分辨率图像下采样为标准检测输入大小会导致小目标特征低于可检测阈值。而在资源受限平台上处理原生1080p帧则无法提供足够的吞吐量以实现平滑的云台控制。我们提出SDG-Track，一种稀疏检测引导跟踪器，采用观察者-跟随架构来解决这一冲突。观察者流在GPU上以低频率运行高容量检测器，从1920x1080帧中提供准确的位置锚点。跟随者流在CPU上通过ROI约束的稀疏光流进行高频率轨迹插值。为处理由光谱相似干扰物引起的遮挡或模型漂移导致的跟踪失败，我们引入了双空间恢复机制，这是一种无需训练的重新获取机制，结合了颜色直方图匹配与几何一致性约束。在地面到空中跟踪站上的实验表明，SDG-Track实现了35.1 FPS系统吞吐量，同时保留了97.2%的逐帧检测精度。该系统在NVIDIA Jetson Orin Nano上成功跟踪了现实世界操作条件下的敏捷FPV无人机。我们的论文代码已公开发布在https://github.com/Jeffry-wen/SDG-Track

Summary / 总结

SDG-Track addresses the challenge of real-time tracking of small UAVs on edge devices by proposing an Observer-Follower architecture. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from high-resolution frames, while the Follower stream performs high-frequency trajectory interpolation on the CPU. To handle tracking failures, Dual-Space Recovery combines color histogram matching with geometric consistency constraints. Experiments show SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision, successfully tracking agile FPV drones under real-world conditions on an NVIDIA Jetson Orin Nano.

SDG-Track通过提出Observer-Follower架构解决了边缘设备上小无人机实时跟踪的挑战。Observer流在GPU上以低频率使用高容量检测器从1920x1080帧中提供准确的位置锚点，而Follower流在CPU上使用ROI约束的稀疏光流进行高频率轨迹插值。系统引入了Dual-Space Recovery来处理由遮挡或模型漂移引起的跟踪失败。实验表明，SDG-Track实现了35.1 FPS的吞吐量，同时保持了97.2%的帧间检测精度，并成功在真实世界条件下跟踪敏捷的FPV无人机。

Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens

Authors: Ziran Qin, Youru Lv, Mingbao Lin, Zeren Zhang, Chanfan Gan, Tieyuan Chen, Weiyao Lin

First: 2025-12-04T14:41:21+00:00 · Latest: 2025-12-04T14:41:21+00:00

Abs · PDF · Code1 · Code2

Abstract

Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient AR image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 on LlamaGen-XL and Janus-Pro-1B, while retaining only 1/6 KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.
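
To make the idea of managing the cache "at the line level using a 2D view" concrete, here is a deliberately simplified sketch. It assumes tokens are generated in raster order and keeps only the tokens from the most recent few image lines. This fixed recency window is a stand-in: the method described above selects which tokens to evict using inter-line attention, which this sketch does not model. The function name and parameters are hypothetical.

```python
def retained_positions(num_generated: int, line_width: int, keep_lines: int) -> range:
    """Raster-order indices of visual tokens kept in the cache.

    Assumptions (not from the paper): tokens are indexed 0..num_generated-1
    in raster order, each image line holds line_width tokens, and the cache
    keeps the keep_lines most recent lines, counting a partially generated
    line as one of them.
    """
    if line_width <= 0 or keep_lines <= 0:
        raise ValueError("line_width and keep_lines must be positive")
    if num_generated <= 0:
        return range(0)
    last_line = (num_generated - 1) // line_width
    first_kept_line = max(0, last_line - keep_lines + 1)
    return range(first_kept_line * line_width, num_generated)
```

Under this simplification the cache never holds more than `keep_lines * line_width` tokens, regardless of image height, which is the source of the memory saving in a line-level scheme.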

中文标题/摘要

标题：自回归图像生成仅需几行缓存令牌

自回归（AR）视觉生成已成为图像和多模态合成的强大范式，得益于其可扩展性和通用性。然而，现有的AR图像生成由于解码过程中需要缓存所有之前生成的视觉令牌而遭受严重的内存瓶颈，导致高存储需求和低吞吐量。本文介绍了一种名为LineAR的新型、无需训练的渐进式键值（KV）缓存压缩管道，用于自回归图像生成。通过充分利用视觉注意力的内在特性，LineAR在二维视图中按行级管理缓存，保留视觉依赖区域的同时，逐步淘汰对后续行生成无害的、信息量较少的令牌，由行间注意力引导。LineAR通过仅使用几行缓存实现高效的自回归（AR）图像生成，同时实现内存节省和吞吐量提升，同时保持或甚至提高生成质量。在六个自回归图像生成模型中，包括类别条件和文本到图像生成的广泛实验验证了其有效性和通用性。LineAR在LlamaGen-XL和Janus-Pro-1B上将ImageNet FID从2.77提高到2.68，COCO FID从23.85提高到22.86，同时仅保留1/6的KV缓存。它还在Lumina-mGPT-768上仅使用1/8的KV缓存提高了DPG。此外，LineAR实现了显著的内存和吞吐量增益，包括LlamaGen-XL上的67.61%内存减少和7.57倍速度提升，Janus-Pro-7B上的39.66%内存减少和5.62倍速度提升。

Summary / 总结

This paper addresses the memory bottleneck issue in autoregressive (AR) image generation by introducing LineAR, a training-free method that compresses the key-value cache. LineAR uses a 2D view to manage cache at the line level, preserving visual dependencies while evicting less-informative tokens. This approach enables efficient AR image generation with reduced memory usage and improved throughput, while maintaining or enhancing generation quality. Experiments across various AR models show significant improvements in FID scores and memory/throughput gains, with LineAR retaining only a fraction of the traditional cache size.

本文提出了一种名为LineAR的方法，通过一种无需训练的渐进式键值缓存压缩管道来解决自回归（AR）图像生成中的内存瓶颈问题。通过在线性级别管理和利用跨行注意机制，LineAR减少了需要缓存所有先前生成的视觉令牌的需求，从而节省内存并提高吞吐量。实验表明，LineAR在ImageNet和COCO数据集上的FID分数有所提升，并且在各种AR模型上实现了显著的内存和速度提升。

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

Authors: GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye, Yukun Zhou, Zhehao Dong, Zhenan Wang, Zhichao Liu, Zheng Zhu

First: 2025-10-22T09:57:13+00:00 · Latest: 2025-12-04T14:28:04+00:00

Comments: https://gigabrain0.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.

中文标题/摘要

标题：GigaBrain-0：一种基于世界模型的视觉-语言-行动模型

训练通用机器人视觉-语言-行动（VLA）模型通常需要大量的真实世界机器人数据，这在收集上既昂贵又耗时。物理数据收集的低效严重限制了当前VLA系统的可扩展性和泛化能力。为了解决这一挑战，我们引入了GigaBrain-0，这是一种由世界模型生成数据（例如视频生成、真实到真实转移、人类转移、视角转移、模拟到真实转移数据）赋能的新型VLA基础模型。通过利用世界模型生成大规模的多样化数据，GigaBrain-0显著减少了对真实机器人数据的依赖，同时提高了跨任务的泛化能力。我们的方法进一步通过RGBD输入建模和具身思维链（CoT）监督，提高了策略的鲁棒性，使模型在执行任务时能够推理空间几何、物体状态和长时依赖关系，从而在灵巧、长时依赖和移动操作任务上取得了显著的现实世界性能提升。大量实验表明，GigaBrain-0在外观变化（例如纹理、颜色）、物体摆放和摄像机视角等方面实现了卓越的泛化能力。此外，我们还介绍了GigaBrain-0-Small，这是一种优化的轻量级变体，旨在高效运行在NVIDIA Jetson AGX Orin等设备上。

Summary / 总结

GigaBrain-0 is a Vision-Language-Action (VLA) foundation model that uses world model-generated data to reduce reliance on expensive real-world robot data while improving cross-task generalization. RGBD input modeling and embodied Chain-of-Thought supervision further improve policy robustness, yielding gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Experiments show superior generalization across variations in appearance, object placement, and camera viewpoint. A lightweight variant, GigaBrain-0-Small, is designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.

GigaBrain-0 是一种 Vision-Language-Action 基础模型，通过使用世界模型生成的数据来减少对昂贵的现实世界机器人数据的依赖，提高跨任务泛化能力和策略鲁棒性。它结合了 RGBD 输入建模和具身思维链监督，导致在灵巧、长时序和移动操作任务上的更好表现。实验表明，它在各种任务变化和相机视角下具有更好的泛化能力。

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

Authors: Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li

First: 2025-05-21T12:18:15+00:00 · Latest: 2025-12-04T14:24:47+00:00

Comments: https://github.com/xtong-zhang/Chain-of-Focus

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to adaptively focus and zoom in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage the outcome accuracies and formats as rewards to update the Qwen2.5-VL model, further refining the model's search and reasoning strategy without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark, which requires strong visual reasoning capability, our model outperforms existing VLMs by 5% across 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating the more efficient deployment of VLMs in practical applications.

中文标题/摘要

标题：基于动态视觉搜索与缩放的自适应焦点链推理方法以提高高效VLMs

视觉语言模型（VLMs）在各种计算机视觉任务中取得了令人印象深刻的性能。然而，现有的模型尚未充分探索其多模态推理能力。本文提出了一种焦点链（CoF）方法，使VLMs能够根据获得的视觉线索和给定的问题，自适应地聚焦并放大关键图像区域，实现高效的多模态推理。为了实现这种CoF能力，我们提出了一种两阶段训练管道，包括监督微调（SFT）和强化学习（RL）。在SFT阶段，我们构建了MM-CoF数据集，包含3000个样本，这些样本来自一个视觉代理，该代理能够自适应地识别关键区域以解决不同图像分辨率和问题的视觉任务。我们使用MM-CoF对Qwen2.5-VL模型进行冷启动微调。在RL阶段，我们利用结果准确性和格式作为奖励来更新Qwen2.5-VL模型，从而进一步优化模型的搜索和推理策略，无需人类先验知识。我们的模型在多个基准测试中取得了显著的改进。在V*基准测试中，该基准测试要求强大的视觉推理能力，我们的模型在从224到4K的8种图像分辨率中优于现有VLMs 5%，证明了所提出的CoF方法的有效性，并促进了VLMs在实际应用中的更高效部署。

Summary / 总结

This paper proposes a Chain-of-Focus (CoF) method for VLMs to perform adaptive focusing and zooming on key image regions based on visual cues and questions, enhancing multimodal reasoning. The method involves a two-stage training pipeline: supervised fine-tuning (SFT) using the MM-CoF dataset and reinforcement learning (RL) to refine search and reasoning strategies. The model shows significant improvements on multiple benchmarks, particularly on the V* benchmark, where it outperforms existing VLMs by 5% across various image resolutions, demonstrating the effectiveness of the CoF method.

本文提出了一种Chain-of-Focus (CoF) 方法，使VLMs能够根据视觉线索和问题进行自适应聚焦和放大关键图像区域。该方法采用两阶段训练流程：监督微调（SFT）和强化学习（RL）。SFT阶段涉及构建MM-CoF数据集并微调Qwen2.5-VL模型，而RL阶段则进一步优化模型的搜索和推理策略。该模型在多个基准测试中表现出显著改进，特别是在V*基准测试中，其在从224到4K的8种图像分辨率下比现有VLMs高出5%，突显了CoF方法的有效性。

FreeGen: Feed-Forward Reconstruction-Generation Co-Training for Free-Viewpoint Driving Scene Synthesis

Authors: Shijie Chen, Peixi Peng

First: 2025-12-04T14:14:21+00:00 · Latest: 2025-12-04T14:14:21+00:00

Comments: Novel View Synthesis, Driving Scene, Free Trajectory, Image Generation

Abs · PDF · Code1 · Code2

Abstract

Closed-loop simulation and scalable pre-training for autonomous driving require synthesizing free-viewpoint driving scenes. However, existing datasets and generative pipelines rarely provide consistent off-trajectory observations, limiting large-scale evaluation and training. While recent generative models demonstrate strong visual realism, they struggle to jointly achieve interpolation consistency and extrapolation realism without per-scene optimization. To address this, we propose FreeGen, a feed-forward reconstruction-generation co-training framework for free-viewpoint driving scene synthesis. The reconstruction model provides stable geometric representations to ensure interpolation consistency, while the generation model performs geometry-aware enhancement to improve realism at unseen viewpoints. Through co-training, generative priors are distilled into the reconstruction model to improve off-trajectory rendering, and the refined geometry in turn offers stronger structural guidance for generation. Experiments demonstrate that FreeGen achieves state-of-the-art performance for free-viewpoint driving scene synthesis.

中文标题/摘要

标题：FreeGen：前馈重建-生成协同训练在自由视角驾驶场景合成中的应用

自动驾驶的闭环模拟和可扩展预训练需要合成自由视角驾驶场景。然而，现有数据集和生成管道很少提供一致的离轨迹观测，限制了大规模评估和训练。尽管最近的生成模型展示了强大的视觉真实性，但在无需逐场景优化的情况下同时实现插值一致性和外推真实性方面仍存在困难。为了解决这个问题，我们提出了一种前馈重建-生成协同训练框架FreeGen，用于自由视角驾驶场景合成。重建模型提供稳定的几何表示以确保插值一致性，而生成模型则进行几何感知增强以提高在未见视角下的真实性。通过协同训练，生成先验知识被蒸馏到重建模型中以改善离轨迹渲染，而细化的几何结构反过来为生成提供了更强的结构指导。实验表明，FreeGen 在自由视角驾驶场景合成中达到了最先进的性能。

Summary / 总结

The research aims to address the limitations of existing datasets and generative pipelines in providing consistent off-trajectory observations for autonomous driving simulation and training. FreeGen, a feed-forward reconstruction-generation co-training framework, is proposed to ensure interpolation consistency and improve realism at unseen viewpoints. Experiments show that FreeGen outperforms existing methods in free-viewpoint driving scene synthesis.

研究旨在通过合成自由视角驾驶场景来支持自动驾驶的仿真和训练。提出的FreeGen框架采用前馈重建-生成协同训练方法，其中重建模型确保插值一致性，生成模型提高在未见视角下的逼真度。实验表明，FreeGen在自由视角驾驶场景合成中达到最先进的性能。

Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships

Authors: Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen

Venue: WACV 2026

First: 2024-05-29T05:20:02+00:00 · Latest: 2025-12-04T13:44:07+00:00

Comments: WACV 2026 Accepted. Code available at https://github.com/CyberAgentAI/multimodal-adversarial-training

Abs · PDF · Code1 · Code2 · Code3

Abstract

Pre-trained vision-language (VL) models are highly vulnerable to adversarial attacks. However, existing defense methods primarily focus on image classification, overlooking two key aspects of VL tasks: multimodal attacks, where both image and text can be perturbed, and the one-to-many relationship of images and texts, where a single image can correspond to multiple textual descriptions and vice versa (1:N and N:1). This work is the first to explore defense strategies against multimodal attacks in VL tasks, whereas prior VL defense methods focus on vision robustness. We propose multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities during training, significantly outperforming existing unimodal defenses. Furthermore, we discover that MAT is limited by deterministic one-to-one (1:1) image-text pairs in VL training data. To address this, we conduct a comprehensive study on leveraging one-to-many relationships to enhance robustness, investigating diverse augmentation techniques. Our analysis shows that, for a more effective defense, augmented image-text pairs should be well-aligned, diverse, yet avoid distribution shift -- conditions overlooked by prior research. This work pioneers defense strategies against multimodal attacks, providing insights for building robust VLMs from both optimization and data perspectives. Our code is publicly available at https://github.com/CyberAgentAI/multimodal-adversarial-training.

中文标题/摘要

标题：利用一对多关系的多模态对抗防御方法

预训练的视觉-语言（VL）模型对对抗攻击极为敏感。然而，现有的防御方法主要集中在图像分类上，忽视了VL任务中的两个关键方面：多模态攻击，其中图像和文本都可以被扰动，以及一对多关系，即一个图像可以对应多个文本描述，反之亦然（1:N和N:1）。本工作是首次探索VL任务中对抗多模态攻击的防御策略，而之前的VL防御方法主要关注视觉鲁棒性。我们提出了多模态对抗训练（MAT），在训练过程中同时在图像和文本模态中引入对抗扰动，显著优于现有的单模态防御方法。此外，我们发现MAT受限于VL训练数据中确定的一对一（1:1）图像-文本对。为了解决这一问题，我们对利用一对多关系增强鲁棒性进行了全面研究，探讨了多种增强技术。我们的分析表明，为了更有效的防御，增强的图像-文本对应该对齐良好、多样化，但要避免分布偏移——这是之前研究中忽视的条件。本工作开创了对抗多模态攻击的防御策略，从优化和数据两个角度提供了构建鲁棒VL模型的见解。我们的代码已公开发布在https://github.com/CyberAgentAI/multimodal-adversarial-training。

Summary / 总结

This work addresses the vulnerability of pre-trained vision-language models to adversarial attacks by proposing multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities. The study highlights the importance of leveraging one-to-many relationships in vision-language training data to enhance robustness. Experimental results show that MAT significantly outperforms existing unimodal defenses, but is limited by deterministic one-to-one image-text pairs. The research provides insights for building more robust vision-language models from both optimization and data perspectives.

该研究提出多模态对抗训练（MAT）方法，在训练中将对抗扰动同时应用于图像和文本模态，显著优于现有的单模态防御方法。研究还强调了在图像-文本对中利用一对多关系（1:N和N:1）以增强鲁棒性的重要性，建议增强后的图像-文本对应当对齐良好、多样化且避免分布偏移。该研究从优化和数据两个角度为构建鲁棒的视觉-语言模型提供了新的见解。

ASTRIDE: A Security Threat Modeling Platform for Agentic-AI Applications

Authors: Eranga Bandara, Amin Hass, Ross Gore, Sachin Shetty, Ravi Mukkamala, Safdar H. Bouk, Xueping Liang, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan

First: 2025-12-04T13:32:40+00:00 · Latest: 2025-12-04T13:32:40+00:00

Abs · PDF · Code1 · Code2

Abstract

AI agent-based systems are becoming increasingly integral to modern software architectures, enabling autonomous decision-making, dynamic task execution, and multimodal interactions through large language models (LLMs). However, these systems introduce novel and evolving security challenges, including prompt injection attacks, context poisoning, model manipulation, and opaque agent-to-agent communication, that are not effectively captured by traditional threat modeling frameworks. In this paper, we introduce ASTRIDE, an automated threat modeling platform purpose-built for AI agent-based systems. ASTRIDE extends the classical STRIDE framework by introducing a new threat category, A for AI Agent-Specific Attacks, which encompasses emerging vulnerabilities such as prompt injection, unsafe tool invocation, and reasoning subversion, unique to agent-based applications. To automate threat modeling, ASTRIDE combines a consortium of fine-tuned vision-language models (VLMs) with the OpenAI-gpt-oss reasoning LLM to perform end-to-end analysis directly from visual agent architecture diagrams, such as data flow diagrams (DFDs). LLM agents orchestrate the end-to-end threat modeling automation process by coordinating interactions between the VLM consortium and the reasoning LLM. Our evaluations demonstrate that ASTRIDE provides accurate, scalable, and explainable threat modeling for next-generation intelligent systems. To the best of our knowledge, ASTRIDE is the first framework to both extend STRIDE with AI-specific threats and integrate fine-tuned VLMs with a reasoning LLM to fully automate diagram-driven threat modeling in AI agent-based applications.

中文标题/摘要

标题：ASTRIDE：面向代理AI应用的安全威胁建模平台

基于AI代理的系统正日益成为现代软件架构中的重要组成部分，通过大型语言模型（LLMs）实现自主决策、动态任务执行和多模态交互。然而，这些系统引入了新型且不断演变的安全挑战，包括提示注入攻击、上下文污染、模型操控和代理间不透明的通信，这些挑战未能被传统的威胁建模框架有效捕捉。在本文中，我们介绍了ASTRIDE，一个专为基于AI代理的系统设计的自动化威胁建模平台。ASTRIDE通过引入一个新的威胁类别A（针对AI代理的特定攻击），扩展了经典的STRIDE框架，该类别涵盖了诸如提示注入、不安全工具调用和推理颠覆等新兴漏洞，这些漏洞是代理应用特有的。为了自动化威胁建模，ASTRIDE结合了一个由微调的视觉-语言模型（VLMs）组成的联盟和OpenAI-gpt-oss推理LLM，直接从视觉代理架构图（如数据流图DFDs）进行端到端分析。LLM代理协调整个威胁建模自动化过程，协调VLM联盟与推理LLM之间的交互。我们的评估表明，ASTRIDE能够为下一代智能系统提供准确、可扩展和可解释的威胁建模。据我们所知，ASTRIDE是第一个扩展STRIDE以包含AI特定威胁并结合微调的VLMs与推理LLM以完全自动化基于代理的AI应用中的图驱动威胁建模的框架。

Summary / 总结

ASTRIDE is an automated threat modeling platform designed for AI agent-based systems, extending the classical STRIDE framework to include AI-specific threats. It uses a consortium of fine-tuned vision-language models and the OpenAI-gpt-oss reasoning LLM to analyze visual agent architecture diagrams, automating the threat modeling process. Evaluations show that ASTRIDE offers accurate, scalable, and explainable threat modeling for next-generation intelligent systems, making it the first framework to integrate fine-tuned VLMs with a reasoning LLM for fully automated diagram-driven threat modeling in AI agent-based applications.

ASTRIDE 是一个自动化威胁建模平台，针对基于 AI 代理的系统中的新型安全挑战，如提示注入和上下文污染。它扩展了 STRIDE 框架，增加了 A 类别以涵盖 AI 代理特有的威胁，并使用一组微调的视觉语言模型和 OpenAI-gpt-oss 推理 LLM，直接从可视化的代理架构图中自动完成威胁建模。评估表明，ASTRIDE 提供了准确、可扩展和可解释的威胁建模，是第一个将微调的 VLM 与推理 LLM 结合起来进行全自动图驱动威胁建模的框架，适用于基于 AI 代理的应用。
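
The category extension described above can be pictured as a small data structure. The sketch below is illustrative only and is not taken from the ASTRIDE codebase: the six STRIDE categories are the standard ones, the "A" category and its three example threats come from the abstract, and the names `AstrideCategory`, `AGENT_SPECIFIC_THREATS`, and `is_agent_specific` are hypothetical.

```python
from enum import Enum


class AstrideCategory(Enum):
    """STRIDE categories plus the added 'A' category (names are illustrative)."""
    AI_AGENT_SPECIFIC = "A"       # new category described in the abstract
    SPOOFING = "S"
    TAMPERING = "T"
    REPUDIATION = "R"
    INFORMATION_DISCLOSURE = "I"
    DENIAL_OF_SERVICE = "D"
    ELEVATION_OF_PRIVILEGE = "E"


# Example threats the abstract lists under the 'A' category.
AGENT_SPECIFIC_THREATS = {
    "prompt injection",
    "unsafe tool invocation",
    "reasoning subversion",
}


def is_agent_specific(threat_name: str) -> bool:
    """Return True if the threat is one of the abstract's 'A' examples."""
    return threat_name.strip().lower() in AGENT_SPECIFIC_THREATS


if __name__ == "__main__":
    print("".join(c.value for c in AstrideCategory))
    print(is_agent_specific("Prompt Injection"))
```

How the platform actually assigns threats to categories (via the VLM consortium and the reasoning LLM) is not specified in the abstract, so the lookup above is only a stand-in for that step.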

TTRV: Test-Time Reinforcement Learning for Vision Language Models

Authors: Akshit Singh, Shyam Marjit, Wei Lin, Paul Gavrikov, Serena Yeung-Levy, Hilde Kuehne, Rogerio Feris, Sivan Doveh, James Glass, M. Jehanzeb Mirza

First: 2025-10-08T09:10:31+00:00 · Latest: 2025-12-04T13:17:07+00:00

Abs · PDF · Code1 · Code2

Abstract

Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.

中文标题/摘要

标题：TTRV：视觉语言模型的测试时强化学习

现有的强化学习中提取奖励信号的方法通常依赖于标记数据和专门的训练分割，这与人类直接从环境中学习的方式不同。在本工作中，我们提出了TTRV，通过在推理时使模型实时适应，从而增强视觉语言理解，无需任何标记数据。具体而言，我们通过基于基模型输出频率设计奖励，结合多次对每个测试样本进行推理，改进了Group Relative Policy Optimization (GRPO)框架。此外，我们还提出通过同时奖励模型以获得输出经验分布的低熵来控制模型输出的多样性。我们的方法在对象识别和视觉问答（VQA）中均表现出一致的改进，最高分别提高52.4%和29.8%，并在16个数据集中平均分别提高24.6%和10.0%。值得注意的是，在图像识别方面，TTRV应用于InternVL 8B在8个基准测试中平均优于GPT-4o 2.3%，同时在VQA方面保持高度竞争力，表明测试时的强化学习可以匹配或超越最强的专有模型。最后，我们发现测试时的RL对于VLMs有许多有趣的特性：例如，在极端数据受限的场景中，即使在单个随机选择的未标记测试样本上进行适应，TTRV仍能带来高达5.5%的识别任务改进。

Summary / 总结

TTRV enhances vision language models by adapting the model at inference time without labeled data, using frequency-based rewards and low entropy to control output diversity. It achieves consistent improvements in object recognition and visual question answering, with gains up to 52.4% and 29.8% respectively, and average boosts of 24.6% and 10.0% across 16 datasets. TTRV also outperforms GPT-4o on image recognition by 2.3% on average across 8 benchmarks while maintaining strong performance on VQA.

TTRV提出了一种在推理时使用强化学习来增强视觉语言模型的方法，无需使用标注数据即可对模型进行适应。它对每个测试样本进行多次推理，利用基模型输出的频率来设计奖励，并通过奖励输出经验分布的低熵来控制输出多样性。TTRV在物体识别和视觉问答方面取得了一致的改进，增益最高分别达52.4%和29.8%，在16个数据集上的平均提升分别为24.6%和10.0%。在图像识别方面，应用于InternVL 8B的TTRV在8个基准测试中平均比GPT-4o高出2.3%，同时在视觉问答方面保持竞争力。
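
The label-free reward idea in the TTRV abstract (reward by output frequency over repeated inference, plus a term favoring low entropy of the empirical output distribution) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' formulation: the abstract does not give the exact reward, so the additive combination, the `entropy_weight` parameter, and all function names here are hypothetical, and the GRPO policy update itself is omitted.

```python
import math
from collections import Counter


def frequency_rewards(answers):
    """Reward each sampled answer by its relative frequency in the group."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]


def empirical_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def combined_rewards(answers, entropy_weight=0.5):
    """Assumed combination: frequency reward minus a shared entropy penalty."""
    penalty = entropy_weight * empirical_entropy(answers)
    return [r - penalty for r in frequency_rewards(answers)]


if __name__ == "__main__":
    # Four hypothetical samples drawn for one unlabeled test image.
    samples = ["cat", "cat", "dog", "cat"]
    print(frequency_rewards(samples))
    print(combined_rewards(samples))
```

In this sketch the majority answer receives the largest reward, and a group whose samples all agree incurs no entropy penalty, which matches the abstract's stated goal of rewarding a low-entropy output distribution.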

MemLoRA: Distilling Expert Adapters for On-Device Memory Systems

Authors: Massimo Bini, Ondrej Bohdal, Umberto Michieli, Zeynep Akata, Mete Ozay, Taha Ceritli

First: 2025-12-04T12:56:30+00:00 · Latest: 2025-12-04T12:56:30+00:00

Abs · PDF · Code1 · Code2

Abstract

Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during prolonged dialogues by storing relevant memories and incorporating them as context. Such memory-based personalization is also key in on-device settings that allow users to keep their conversations and data private. However, memory-augmented systems typically rely on LLMs that are too costly for local on-device deployment. Even though Small Language Models (SLMs) are more suitable for on-device inference than LLMs, they cannot achieve sufficient performance. Additionally, these LLM-based systems lack native visual capabilities, limiting their applicability in multimodal contexts. In this paper, we introduce (i) MemLoRA, a novel memory system that enables local deployment by equipping SLMs with specialized memory adapters, and (ii) its vision extension MemLoRA-V, which integrates small Vision-Language Models (SVLMs) to memory systems, enabling native visual understanding. Following knowledge distillation principles, each adapter is trained separately for specific memory operations: knowledge extraction, memory update, and memory-augmented generation. Equipped with memory adapters, small models enable accurate on-device memory operations without cloud dependency. On text-only operations, MemLoRA outperforms 10× larger baseline models (e.g., Gemma2-27B) and achieves performance comparable to 60× larger models (e.g., GPT-OSS-120B) on the LoCoMo benchmark. To evaluate visual understanding operations instead, we extend LoCoMo with challenging Visual Question Answering tasks that require direct visual reasoning. On this, our VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches (81.3 vs. 23.7 accuracy) while keeping strong performance in text-based tasks, demonstrating the efficacy of our method in multimodal contexts.

中文标题/摘要

标题：MemLoRA：为端侧记忆系统蒸馏专家适配器

增强内存的大语言模型（LLMs）在长时间对话中表现出显著的一致性，通过存储相关记忆并在上下文中加以利用。这种基于记忆的个性化在允许用户保持对话和数据隐私的本地设备设置中也至关重要。然而，增强内存的系统通常依赖于成本过高的LLMs，不适合本地设备部署。尽管小型语言模型（SLMs）比LLMs更适合本地推理，但它们无法达到足够的性能。此外，这些基于LLM的系统缺乏原生的视觉能力，限制了它们在多模态环境中的应用。在本文中，我们介绍了（i）MemLoRA，一种通过为SLMs配备专门的记忆适配器来实现本地部署的新颖内存系统，以及（ii）其视觉扩展MemLoRA-V，将小型视觉-语言模型（SVLMs）集成到内存系统中，使系统具备原生的视觉理解能力。遵循知识蒸馏原则，每个适配器分别针对特定的记忆操作进行训练——知识提取、记忆更新和增强记忆的生成。配备记忆适配器的小型模型能够在没有云依赖的情况下实现准确的本地内存操作。在仅文本操作上，MemLoRA在LoCoMo基准测试中优于10倍更大的基线模型（例如，Gemma2-27B），并在性能上与60倍更大的模型（例如，GPT-OSS-120B）相当。为了评估视觉理解操作，我们扩展了LoCoMo，加入了具有直接视觉推理要求的挑战性视觉问答任务。在这些任务上，我们的VLM集成的MemLoRA-V在准确率上大幅优于基于字幕的方法（81.3 vs. 23.7），同时在基于文本的任务上保持了强大的性能，证明了我们方法在多模态环境中的有效性。

Summary / 总结

The research aims to enable local deployment of memory-augmented systems by equipping Small Language Models (SLMs) with specialized memory adapters, leading to MemLoRA. This method involves training separate adapters for memory extraction, update, and augmented generation. Experimental results show that MemLoRA outperforms larger models on text-only operations and significantly improves visual understanding tasks compared to caption-based approaches, while maintaining strong performance in text-based tasks. MemLoRA-V, an extension that integrates Vision-Language Models, further enhances visual reasoning capabilities in multimodal contexts.

研究旨在通过为小型语言模型（SLMs）配备专门的记忆适配器来实现本地部署，从而提出了MemLoRA。该方法包括分别训练用于记忆提取、更新和增强生成的适配器。实验结果显示，MemLoRA 在文本操作中优于更大规模的模型，并且在视觉理解任务中显著优于基于描述的方法，同时在文本任务中保持了强大的性能。MemLoRA-V 的扩展则进一步增强了在多模态环境中的视觉推理能力。

Jina-VLM: Small Multilingual Vision Language Model

Authors: Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao

First: 2025-12-03T18:13:41+00:00 · Latest: 2025-12-04T12:45:29+00:00

Comments: 18 pages, 1-7 main content, 13-18 appendix for tables and dataset

Abs · PDF · Code1 · Code2 · Code3

Abstract

We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm .

中文标题/摘要

标题：Jina-VLM：小型多语言视觉语言模型

我们提出了Jina-VLM，这是一种参数量为24亿的视觉-语言模型，在开放的2B规模VLM中实现了最先进的多语言视觉问答效果。该模型通过注意力池化连接器将SigLIP2视觉编码器与Qwen3语言骨干网络耦合，从而能够高效处理任意分辨率的图像。该模型在标准VQA基准测试和多语言评估中取得了领先结果，同时保持了竞争力的纯文本性能。模型权重和代码已公开发布在https://huggingface.co/jinaai/jina-vlm 。

Summary / 总结

Jina-VLM is a 2.4 billion parameter vision-language model designed for multilingual visual question answering, achieving state-of-the-art results. It integrates a SigLIP2 vision encoder with a Qwen3 language model via an attention-pooling connector, allowing efficient processing of images. The model excels on standard VQA benchmarks and multilingual evaluations while maintaining strong text-only performance. The model weights and code are publicly available.

Jina-VLM 是一个 24 亿参数的视觉语言模型，在开放的 2B 级别模型中，其在多语言视觉问答任务上表现出色。该模型通过注意力池化连接器将 SigLIP2 视觉编码器与 Qwen3 语言骨干网络结合，实现高效的图像处理。模型在标准 VQA 基准测试和多语言评估中表现出色，同时保持了竞争力的纯文本性能。模型权重和代码已公开发布。

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

Authors: Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu

First: 2025-12-04T12:17:25+00:00 · Latest: 2025-12-04T12:17:25+00:00

Abs · PDF · Code1 · Code2

Abstract

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.

中文标题/摘要

标题：E3AD：面向以人为本的端到端自动驾驶的情绪感知视觉-语言-行动模型

端到端自动驾驶（AD）系统越来越多地采用视觉-语言-行动（VLA）模型，但通常会忽略乘客的情绪状态，而这对舒适性和AD接受度至关重要。我们提出了开放域端到端（OD-E2E）自动驾驶，其中自动驾驶车辆（AV）必须解释自由形式的自然语言命令，推断情绪，并规划一个物理上可行的轨迹。我们提出了E3AD，这是一种情绪感知的VLA框架，通过两个受认知启发的组件增强了语义理解：一个连续的效价-唤醒-支配（VAD）情绪模型，用于捕捉语言中的语气和紧迫感，以及一个双路径空间推理模块，将自我中心视角与他者中心视角融合起来，实现类似人类的空间认知。一种以一致性为导向的训练方案，结合模态预训练与基于偏好的对齐，进一步确保了情绪意图与驾驶行为之间的连贯性。在现实世界的数据集上，E3AD 改进了视觉定位和航点规划，并在情绪估计上实现了最先进的（SOTA）VAD相关性。这些结果表明，将情绪注入VLA风格的驾驶中，可以实现更符合人类的定位、规划和以人为本的反馈。

Summary / 总结

The research aims to enhance end-to-end autonomous driving systems by incorporating the passenger's emotional state, which is crucial for comfort and acceptance. E3AD, an emotion-aware vision-language-action framework, uses a continuous VAD emotion model and a dual-pathway spatial reasoning module to interpret natural-language commands and plan trajectories. The model demonstrates improved visual grounding and waypoint planning, and achieves state-of-the-art VAD correlation for emotion estimation, indicating better alignment with human behavior and preferences.

研究旨在通过融入乘客的情绪状态来提升端到端的自动驾驶系统，这对于舒适性和接受度至关重要。E3AD 是一个情绪感知的视觉-语言-行动框架，使用连续的 VAD 情绪模型和双路径空间推理模块来解释自然语言命令并规划轨迹。该模型展示了改进的视觉定位和航点规划，并在情绪估计上实现了最先进的 VAD 相关性，表明与人类行为和偏好有更好的对齐。

Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild

Authors: Yigui Feng, Qinglin Wang, Haotian Mo, Yang Liu, Ke Liu, Gencheng Liu, Xinhai Chen, Siqi Shen, Songzhu Mei, Jie Liu

First: 2025-12-04T12:13:18+00:00 · Latest: 2025-12-04T12:13:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder that introduces a Status Judgment module to algorithmically suppress ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code has been released.

中文标题/摘要

标题：测量未言说之事：面向真实场景心理分析的解缠模型与基准

针对真实场景对话的生成式心理分析面临两大根本挑战：(1) 现有的视觉-语言模型无法解决发音-情感歧义问题，即视觉上的言语模式模仿情感表达；(2) 由于缺乏能够评估视觉定位和推理深度的可验证评估指标，研究进展受阻。我们提出了一整套生态系统来应对这些挑战。首先，我们引入了多级洞察解缠网络(MIND)，这是一种新颖的分层视觉编码器，引入了状态判断模块，基于时间特征方差以算法方式抑制模糊的唇部特征，实现显式的视觉解缠。其次，我们构建了ConvoInsight-DB，这是一个新的大规模数据集，包含专家标注的微表情和深层次心理推断。第三，我们设计了心理推理洞察评级指标(PRISM)，这是一种自动化的多维度框架，使用专家指导的大语言模型来衡量大型心理视觉模型的多维度性能。在我们的PRISM基准上，MIND显著优于所有基线，微表情检测的性能比此前的最优方法提高了86.95%。消融研究证实，我们的状态判断解缠模块是实现这一性能飞跃的最关键组件。我们的代码已开源。

Summary / 总结

The paper addresses the challenges of analyzing in-the-wild conversations by proposing MIND, a novel hierarchical visual encoder that disentangles articulatory-affective ambiguity. It also introduces ConvoInsight-DB, a new dataset for micro-expressions and deep psychological inference, and PRISM, an automated evaluation metric. MIND outperforms existing methods by 86.95% in micro-expression detection on the PRISM benchmark, with the Status Judgment module being the key component for this improvement.

本文提出MIND，一种分层视觉编码器，用于解决真实场景对话分析中的发音-情感歧义（Articulatory-Affective Ambiguity）问题，并提出PRISM，一种自动评估指标。MIND引入了状态判断模块来抑制模糊的唇部特征，而PRISM使用专家引导的LLM来测量多维度的心理推理性能。实验表明，MIND在微表情检测上的表现比此前的最优方法高出86.95%，状态判断模块是这一改进的关键组成部分。
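
The Status Judgment idea in the MIND abstract (suppressing ambiguous lip features based on their temporal feature variance) can be illustrated with a toy sketch. Everything concrete here is an assumption: the abstract does not say whether high or low variance marks a feature as ambiguous, nor how suppression is applied, so this example zeroes out feature dimensions whose variance over time exceeds a hypothetical `threshold`. The function name and the plain-list feature layout are also hypothetical.

```python
from statistics import pvariance


def suppress_by_temporal_variance(frames, threshold):
    """Zero out feature dimensions whose variance across frames exceeds threshold.

    frames: list of per-frame feature vectors (lists of floats, equal length).
    Returns (suppressed_frames, per_dimension_variances).
    """
    num_dims = len(frames[0])
    # Population variance of each feature dimension along the time axis.
    variances = [pvariance([frame[d] for frame in frames]) for d in range(num_dims)]
    # Assumed rule: high temporal variance is treated as speech-driven motion.
    keep = [v <= threshold for v in variances]
    suppressed = [
        [value if k else 0.0 for value, k in zip(frame, keep)]
        for frame in frames
    ]
    return suppressed, variances


if __name__ == "__main__":
    # Three frames, two lip-feature dimensions: the first is stable over time,
    # the second fluctuates strongly.
    lip_features = [[1.0, 0.0], [1.0, 2.0], [1.0, 4.0]]
    out, var = suppress_by_temporal_variance(lip_features, threshold=1.0)
    print(var)
    print(out)
```

A learned module would presumably replace the hard threshold with a soft gate; the hard mask is used here only to keep the example short.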

EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models

Authors: Haiyang Yu, Mengyang Zhao, Jinghui Lu, Ke Niu, Yanjie Wang, Weijie Yin, Weitao Jia, Teng Fu, Yang Liu, Jun Liu, Hong Chen

First: 2025-03-06T03:19:56+00:00 · Latest: 2025-12-04T11:50:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Video subtitles play a crucial role in short videos and movies, as they not only help models better understand video content but also support applications such as video translation and content retrieval. Existing video subtitle extraction methods typically rely on multi-stage frameworks, where errors accumulate across stages and temporal dependencies are underutilized due to frame-wise processing. Moreover, although some Large Vision-Language Models (LVLMs) possess strong OCR capabilities, predicting accurate timestamps for subtitle texts remains challenging. To this end, we propose an End-to-end Video subtitle Extraction framework based on LVLMs, named EVE, which can output subtitles and their timestamps simultaneously. Specifically, we introduce a dual-branch Spatiotemporal Subtitle-Salient (S³) Module that serves as an adapter for LVLMs, capable of representing subtitle-related content and considering inter-frame correlations using only a small number of tokens. Within this module, the Spatial Semantic Context Aggregate branch aggregates high-level global semantics to provide spatial visual contextual information, while the Temporal Subtitle Token Query branch explicitly queries subtitle-relevant tokens while considering temporal correlation across frames. The small number of tokens retained by the S³ module are fed to the language model, which then directly outputs the subtitle text along with its timestamps. Furthermore, we construct the first large-scale dataset dedicated to video subtitle extraction, ViSa, containing over 2.5M videos with timestamped and bilingual annotation, thereby providing the community with a well-organized training and evaluation benchmark.

中文标题/摘要

标题：EVE：基于视觉语言模型的端到端视频字幕提取

视频字幕在短视频和电影中起着至关重要的作用，它们不仅帮助模型更好地理解视频内容，还支持视频翻译和内容检索等应用。现有的视频字幕提取方法通常依赖多阶段框架，各阶段的错误会累积，且由于逐帧处理，时间依赖性未得到充分利用。此外，尽管一些大型视觉语言模型（LVLMs）具有强大的OCR能力，但预测字幕文本的准确时间戳仍然具有挑战性。为此，我们提出了一种基于LVLMs的端到端视频字幕提取框架EVE，该框架可以同时输出字幕及其时间戳。具体而言，我们引入了一种双分支时空字幕显著性（S³）模块，作为LVLMs的适配器，仅使用少量令牌即可表示与字幕相关的内容并考虑帧间相关性。在该模块中，空间语义上下文聚合分支聚合高层次的全局语义，提供空间视觉上下文信息，而时间字幕令牌查询分支则明确查询与字幕相关的令牌，并考虑帧间的时间相关性。S³模块保留的少量令牌被输入到语言模型中，该模型直接输出字幕文本及其时间戳。此外，我们构建了第一个专门用于视频字幕提取的大规模数据集ViSa，包含超过250万条带有时间戳和双语注释的视频，从而为社区提供了一个组织良好的训练和评估基准。

Summary / 总结

The paper proposes EVE, an end-to-end framework for video subtitle extraction using Large Vision-Language Models (LVLMs). It addresses the limitations of multi-stage frameworks by directly outputting subtitles and timestamps. The dual-branch Spatiotemporal Subtitle-Salient (S³) Module aggregates spatial and temporal information with a small number of tokens, improving subtitle prediction accuracy. The EVE framework is evaluated on a newly constructed dataset, ViSa, which contains over 2.5 million videos with timestamped and bilingual annotations, demonstrating improved performance in subtitle extraction and timestamp prediction.

论文提出了使用大型视觉-语言模型（LVLMs）的端到端框架EVE，用于视频字幕提取。该框架通过直接输出字幕和时间戳来解决多阶段框架的局限性。关键方法是引入了一个双分支时空字幕显著（S³）模块，该模块通过少量的令牌聚合空间和时间信息，提高字幕准确性。该框架在包含超过250万条视频的ViSa数据集上进行了评估，这些视频具有时间戳和双语注释，展示了在字幕提取和时间戳预测方面的改进性能。