arXiv 论文速递

DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Authors: Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, Rui Liu, Hongsheng Li

First: 2025-12-04T18:59:53+00:00 · Latest: 2025-12-04T18:59:53+00:00

Comments: Project Page: https://github.com/CaraJ7/DraCo

Abstract

Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.

中文标题/摘要

标题：DraCo：草图作为CoT以实现文本到图像预览和稀有概念生成

近期统一的多模态大型语言模型（MLLMs）展示了令人印象深刻的性能，通过链式推理（CoT）增强了文本到图像生成能力。然而，现有方法仍然有限，要么仅将模型视为独立生成器，要么依赖抽象的文本规划。为此，我们提出了一种名为Draft-as-CoT（DraCo）的新颖交替推理范式，该范式充分利用了CoT中的文本和视觉内容，以更好地进行规划和验证。我们的方法首先生成低分辨率的草图图像作为预览，提供更具体的视觉规划和指导。然后，我们利用模型的内在理解能力验证草图与输入提示之间潜在的语义不一致，并通过选择性修正进行超分辨率细化。这样，我们的方法解决了两个基本挑战：文本规划的粗粒度性质和生成稀有属性组合的难度。为了支持训练，我们收集了DraCo-240K，旨在增强一般修正、实例操作和布局重组的三种原子能力。借助DraCo-CFG，一种专门的交替推理无分类器引导（CFG）策略，DraCo在GenEval上取得了8%的巨大提升，在Imagine-Bench上提升了0.91，在GenEval++上提升了3%，显著优于直接生成和其他借助CoT增强的生成方法。

Summary / 总结

DraCo proposes a novel interleaved reasoning paradigm called Draft-as-CoT to enhance text-to-image generation by leveraging both textual and visual contents in the chain-of-thought process. It generates a low-resolution draft image as a preview, which provides concrete visual planning and guidance. The model then verifies potential semantic misalignments and refines the draft through selective corrections with super-resolution. DraCo significantly outperforms direct generation and other CoT-empowered methods on GenEval, Imagine-Bench, and GenEval++ by 8%, 0.91, and 3% respectively.

DraCo 提出了一种名为 Draft-as-CoT 的新颖交错推理范式，通过在链式思考过程中利用文本和视觉内容来增强文本到图像的生成。它首先生成一个低分辨率的草图图像以进行视觉规划和指导，然后通过选择性修正和超分辨率进行验证和细化。DraCo 在 GenEval (+8%)、Imagine-Bench (+0.91) 和 GenEval++ (+3%) 上显著优于直接生成和其他受 CoT 支持的方法，解决了粗粒度文本规划和稀有属性生成的挑战。

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Authors: Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang

First: 2025-12-04T18:59:52+00:00 · Latest: 2025-12-04T18:59:52+00:00

Abstract

Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.

中文标题/摘要

标题：ARM-Thinker：通过自主工具使用和视觉推理强化多模态生成奖励模型

奖励模型对于使视觉-语言系统与人类偏好保持一致至关重要，但当前的方法存在幻觉、视觉定位弱以及无法使用工具进行验证的问题，限制了它们在复杂多模态推理任务中的可靠性。我们提出了ARM-Thinker，这是一种自主多模态奖励模型，能够自主调用外部工具（例如，图像裁剪、文档页面检索）来将判断建立在可验证的证据之上，替代静态、非交互式的奖励评分。这使模型能够验证细微的视觉细节，跨参考多页证据，并验证推理声明，而这些能力在现有的奖励模型中是不存在的。我们通过多阶段强化学习训练ARM-Thinker，联合优化工具调用决策和判断准确性。为了评估自主奖励建模，我们引入了ARMBench-VL，包含三个基准测试，分别评估细微的视觉定位（图像级工具）、多页文档理解（检索工具）和指令遵循（文本级验证）。ARM-Thinker 在奖励模型基准测试中平均提高了16.2%，在工具使用任务中提高了9.6%，并在多模态数学和逻辑推理基准测试中优于基线模型。我们的结果表明，自主能力显著提高了奖励模型的准确性和可解释性。

Summary / 总结

ARM-Thinker is an agentic multimodal reward model that uses external tools for verification, addressing issues of hallucination and weak visual grounding in current vision-language systems. It employs multi-stage reinforcement learning to optimize tool-calling decisions and judgment accuracy. ARM-Thinker shows significant improvements, achieving a 16.2% average increase on reward modeling benchmarks and outperforming baselines on multimodal reasoning tasks.

ARM-Thinker 是一种使用外部工具进行验证的自主多模态奖励模型，解决了现有模型中的幻觉和弱视觉定位问题。它通过多阶段强化学习来优化工具调用决策和判断准确性。ARM-Thinker 在奖励模型基准测试中平均提高了 16.2%，并在多模态数学和逻辑推理基准测试中优于基线模型。

TV2TV: A Unified Framework for Interleaved Language and Video Generation

Authors: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan

First: 2025-12-04T18:59:09+00:00 · Latest: 2025-12-04T18:59:09+00:00

Abstract

Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.

中文标题/摘要

标题：TV2TV：一种统一的交错语言和视频生成框架

视频生成模型正在迅速发展，但仍可能在需要大量语义分支或反复进行下一步应该发生什么的高层推理的复杂视频输出上遇到困难。在本文中，我们介绍了一类新的全能视频-文本模型，这些模型结合了最近语言模型推理进展的想法，以应对这一挑战。具体来说，我们提出了TV2TV，这是一种统一的生成建模框架，将视频生成分解为交错的文本和视频生成过程。TV2TV 使用混合的变换器（MoT）架构联合学习语言建模（下一个标记预测）和视频流匹配（下一个帧预测）。在推理时，TV2TV 决定何时在生成文本和视频帧之间交替，使模型能够在“用像素行动”生成帧之前，先“用词思考”后续内容。这种设计将决定下一步应该发生什么的责任大部分转移到了语言建模塔上，从而提高了生成视频的视觉质量和提示对齐。它还实现了精细的可控性，使用户能够在过程中的任何时间点通过文本干预来修改视频生成轨迹。在对视频游戏数据的受控实验中，TV2TV 在视觉质量和可控性方面都取得了显著改进。TV2TV 还扩展到自然视频，正如我们通过使用视觉-语言模型（VLMs）为体育视频添加交错的自然语言动作描述所展示的那样。在该语料库上训练 TV2TV 产生了强大的视觉质量和提示对齐，展示了该模型能够推理和生成复杂的现实世界动作序列的能力。这些结果共同突显了 TV2TV 是朝着具有开放文本推理和控制的视频生成迈出的有希望的一步。

Summary / 总结

TV2TV is a unified generative modeling framework that addresses the challenge of generating complex videos by integrating language and video generation processes. It uses a Mixture-of-Transformers architecture to jointly learn language modeling and video flow matching. TV2TV demonstrates significant improvements in visual quality and controllability in video game data and scales to natural videos, showing strong prompt alignment and the ability to reason about complex real-world action sequences.

TV2TV 是一种统一的生成模型框架，将文本和视频生成交织在一起，以应对复杂视频输出的挑战。它使用混合的变换器架构来联合学习语言建模和视频流匹配。实验表明，TV2TV 在生成视频时提高了视觉质量和可控性，特别是在视频游戏数据中，并且通过文本干预可以对自然视频进行细粒度控制。

Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

Authors: Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim

First: 2025-12-04T18:46:44+00:00 · Latest: 2025-12-04T18:46:44+00:00

Comments: Project Page: https://cvlab-kaist.github.io/DeepForcing/

Abstract

Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
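
The importance-aware KV cache pruning described above can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (total attention mass received from the most recent queries), the function name `participative_prune`, and the parameters `num_sink`, `num_recent`, and `budget` are all assumptions made for illustration.

```python
import numpy as np

def participative_prune(attn, num_sink, num_recent, budget):
    """Return the sorted indices of cached tokens to keep.

    attn:       (num_queries, num_keys) attention weights from the most
                recent queries to every cached key.
    num_sink:   leading tokens that are always kept (persistent sinks).
    num_recent: trailing tokens that are always kept (newest context).
    budget:     total number of tokens to keep after pruning.
    """
    num_keys = attn.shape[1]
    # Sinks and the newest tokens are protected from pruning.
    keep = set(range(num_sink)) | set(range(num_keys - num_recent, num_keys))
    # Assumed importance score: attention mass received from recent queries.
    importance = attn.sum(axis=0)
    candidates = [i for i in range(num_keys) if i not in keep]
    candidates.sort(key=lambda i: importance[i], reverse=True)
    # Fill the remaining budget with the most-attended middle tokens.
    remaining = max(budget - len(keep), 0)
    keep.update(candidates[:remaining])
    return sorted(keep)
```

With two sink tokens, two recent tokens, and a budget of five, only the single most-attended middle token survives; the rest of the history is discarded.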

中文标题/摘要

标题：Deep Forcing：基于Deep Sink与Participative Compression的无需训练长视频生成

近期自回归视频扩散技术的进步使得实时帧流成为可能，但现有解决方案仍然存在时间重复、漂移和运动减速的问题。我们发现，直接将类似于StreamingLLM的注意力池应用于视频扩散会导致保真度下降和运动停滞。为克服这一问题，我们引入了深度强迫，这是一种无需训练的机制，无需任何微调即可解决这一问题。具体来说，1) 深度池化将滑动窗口的一半用于持久池化令牌，并重新对齐其时间RoPE相位以匹配当前时间线，从而在长时间展开过程中稳定全局上下文。2) 参与式压缩执行重要性感知的KV缓存剪枝，仅保留积极参与最近注意力的令牌，同时安全地丢弃冗余和退化的历史记录，从而在生成超出分布长度时最小化误差累积。这些组件结合在一起，使生成能力提高了超过12倍（例如，5秒训练到60秒以上的生成），同时保持了更好的成像质量、更好的美学质量、几乎保持整体一致性，并在动态程度上取得了显著进步，同时保持实时生成。我们的结果表明，无需训练的KV缓存管理可以与基于训练的方法相媲美或超越自回归流式长视频生成。

Summary / 总结

Deep Forcing is a training-free method for long video generation that addresses temporal repetition and motion deceleration issues in existing solutions. It introduces two mechanisms: Deep Sink, which stabilizes global context by re-aligning persistent sink tokens, and Participative Compression, which prunes the KV cache to preserve only active tokens. These mechanisms enable over 12x extrapolation with better imaging and aesthetic quality than previous methods, maintaining consistency and dynamic degree while supporting real-time generation.

Deep Forcing通过引入两种无需训练的机制——Deep Sink和Participative Compression来解决自回归视频扩散中的时间重复和运动减速问题。Deep Sink在长时间生成中稳定全局上下文，而Participative Compression通过保留仅参与近期注意力的令牌来修剪KV缓存，从而减少误差累积。这些机制使得生成能力提高了超过12倍，同时在成像质量和美学质量上优于现有方法，保持了一致性和动态程度，并支持实时生成。

Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

Authors: Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso

Venue: NeurIPS 2025

First: 2025-06-11T19:36:17+00:00 · Latest: 2025-12-04T18:28:33+00:00

Comments: Presented at the Mechanistic Interpretability Workshop at NeurIPS 2025. 34 pages, 26 figures

Abstract

We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box-pushing game Sokoban. We find that the RNN stores future moves (plans) as activations in particular channels of the hidden state, which we call path channels. A high activation in a particular location means that, when a box is in that location, it will get pushed in the channel's assigned direction. We examine the convolutional kernels between path channels and find that they encode the change in position resulting from each possible action, thus representing part of a learned transition model. The RNN constructs plans by starting at the boxes and goals. These kernels extend activations in path channels forwards from boxes and backwards from the goal. Negative values are placed in channels at obstacles. This causes the extension kernels to propagate the negative value in reverse, thus pruning the last few steps and letting an alternative plan emerge; a form of backtracking. Our work shows that a precise understanding of the plan representation allows us to directly understand the bidirectional planning-like algorithm learned by model-free training in more familiar terms.

中文标题/摘要

标题：路径通道和计划扩展核：Sokoban RNN规划的机理描述

我们部分逆向工程了一个通过无模型强化学习训练来玩推箱子游戏Sokoban的卷积递归神经网络（RNN）。我们发现，RNN将未来的动作（计划）以激活的形式存储在隐藏状态的特定通道中，我们称之为路径通道。特定位置的高激活意味着当箱子位于该位置时，它将被推向该通道指定的方向。我们检查了路径通道之间的卷积核，发现它们编码了每种可能动作导致的位置变化，从而代表了学习到的部分转移模型。RNN从箱子和目标出发构建计划。这些核将路径通道中的激活从箱子向前扩展、从目标向后扩展。障碍物处的通道中会被放置负值。这会使扩展核反向传播负值，从而修剪最后几步，让另一种计划浮现；这是一种形式的回溯。我们的工作表明，对计划表示的精确理解使我们能够直接用更熟悉的术语理解无模型训练所学到的类似双向规划的算法。

Summary / 总结

The study partially reverse-engineers a convolutional recurrent neural network (RNN) trained for Sokoban, revealing that the RNN stores future moves as activations in specific channels, termed path channels. These channels indicate the direction a box will be pushed when in a certain position. The convolutional kernels between path channels encode the change in position for each action, representing part of a learned transition model. The RNN constructs plans by extending activations from boxes and goals, with negative values at obstacles causing a backtracking mechanism to emerge, pruning the last few steps and allowing alternative plans to develop.

研究部分反向工程了一个用于解Sokoban的卷积循环神经网络（RNN），发现RNN将未来的移动存储在特定的通道中，称为路径通道。这些通道指示当箱子处于某个位置时将被推的方向。路径通道之间的卷积核编码每个动作的位置变化，代表了学习到的部分转移模型。RNN通过从箱子和目标扩展激活来构建计划，障碍物处的负值导致出现回溯机制，修剪最后几步并允许替代计划的产生。

4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer

Authors: Xianfeng Wu, Yajing Bai, Minghan Li, Xianzu Wu, Xueqi Zhao, Zhongyuan Lai, Wenyu Liu, Xinggang Wang

First: 2025-12-04T18:15:27+00:00 · Latest: 2025-12-04T18:15:27+00:00

Comments: Code: https://github.com/hustvl/4DLangVGGT, Webpage: https://hustvl.github.io/4DLangVGGT

Abstract

Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding that jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, achieving up to 2% gains under per-scene training and 1% improvements under multi-scene training. Our code is released at https://github.com/hustvl/4DLangVGGT

中文标题/摘要

标题：4DLangVGGT：四维语言视觉几何接地变换器

构建四维语言场对于具身人工智能、增强/虚拟现实以及四维场景理解至关重要，因为它们提供了动态环境的丰富语义表示，并在复杂场景中支持开放词汇查询。然而，现有的四维语义场构建方法主要依赖于场景特定的高斯点积，这需要针对每个场景进行优化，表现出有限的泛化能力，并难以扩展到实际应用中。为了解决这些限制，我们提出了4DLangVGGT，这是一种基于变换器的前馈统一框架，用于四维语言接地，该框架在单一架构中联合整合了几何感知和语言对齐。4DLangVGGT有两个关键组件：四维视觉几何变换器StreamVGGT，用于捕获动态场景的时空几何表示；以及语义桥梁解码器（SBD），将几何感知特征投影到语言对齐的语义空间，从而增强语义可解释性并保持结构保真度。与依赖于昂贵的场景特定优化的先前方法不同，4DLangVGGT可以在多个动态场景上联合训练，并在推理时直接应用，实现部署效率和强大的泛化能力。这种设计显著提高了大规模部署的实用性，并建立了开放词汇四维场景理解的新范式。在HyperNeRF和Neu3D数据集上的实验表明，我们的方法不仅泛化效果良好，还实现了最先进的性能，在场景训练下达到2%的提升，在多场景训练下达到1%的提升。我们在https://github.com/hustvl/4DLangVGGT发布了代码。

Towards a unified framework for guided diffusion models

Authors: Yuchen Jiao, Yuxin Chen, Gen Li

First: 2025-12-04T16:55:20+00:00 · Latest: 2025-12-04T16:55:20+00:00

Abstract

Guided or controlled data generation with diffusion models has become a cornerstone of modern generative modeling. Despite substantial advances in diffusion model theory, the theoretical understanding of guided diffusion samplers remains severely limited. We make progress by developing a unified algorithmic and theoretical framework that accommodates both diffusion guidance and reward-guided diffusion. Aimed at fine-tuning diffusion models to improve certain rewards, we propose injecting a reward guidance term, constructed from the difference between the original and reward-reweighted scores, into the backward diffusion process, and rigorously quantify the resulting reward improvement over the unguided counterpart. As a key application, our framework shows that classifier-free guidance (CFG) decreases the expected reciprocal of the classifier probability, providing the first theoretical characterization of the specific performance metric that CFG improves for general target distributions. When applied to reward-guided diffusion, our framework yields a new sampler that is easy to train and requires no full diffusion trajectories during training. Numerical experiments further corroborate our theoretical findings. (Partial preliminary results of this work appeared in the International Conference on Machine Learning 2025 [li2025provable].)
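
For readers unfamiliar with classifier-free guidance, the widely used CFG combination rule can be sketched as follows. It has the same "base score plus scaled difference of two scores" shape as the guidance term mentioned in the abstract, but it is the standard rule rather than the paper's reward-guided sampler; the function name `cfg_score` and the scale parameter `w` are illustrative.

```python
import numpy as np

def cfg_score(score_cond, score_uncond, w):
    """Standard classifier-free guidance combination of two score estimates.

    w = 0 returns the unconditional score, w = 1 the conditional score,
    and w > 1 extrapolates further in the conditional direction.
    """
    score_cond = np.asarray(score_cond, dtype=float)
    score_uncond = np.asarray(score_uncond, dtype=float)
    # Guidance term: scaled difference between the two score estimates.
    return score_uncond + w * (score_cond - score_uncond)
```

In practice the two inputs come from one network evaluated with and without the conditioning signal; here they are plain arrays so the arithmetic can be checked in isolation.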

中文标题/摘要

标题：迈向统一的引导扩散模型框架

带有扩散模型的引导或控制数据生成已成为现代生成建模的基石。尽管在扩散模型理论方面取得了重大进展，但对引导扩散采样器的理论理解仍然非常有限。我们通过开发一个统一的算法和理论框架取得了进展，该框架可以同时容纳扩散引导和奖励引导扩散。旨在微调扩散模型以提高某些奖励，我们提出将奖励引导项——由原始分数和奖励加权分数之差构建——注入反向扩散过程，并严格量化与未引导的对应物相比的奖励改进。作为关键应用，我们的框架表明，无分类器引导（CFG）降低了分类器概率的期望倒数，首次为通用目标分布提供了CFG改进的具体性能指标的理论表征。当应用于奖励引导扩散时，我们的框架产生了一种新的采样器，该采样器易于训练，并且在训练过程中不需要完整的扩散轨迹。数值实验进一步证实了我们的理论发现。

Summary / 总结

The paper aims to develop a unified framework for guided diffusion models to enhance theoretical understanding and practical applications. The authors propose injecting a reward guidance term into the backward diffusion process to improve certain rewards. Key findings include a theoretical characterization of classifier-free guidance (CFG) and the introduction of a new easy-to-train reward-guided diffusion sampler.

论文旨在开发统一框架以指导扩散模型，增强理论理解和实际应用。作者提出在反向扩散过程中注入奖励指导项以提高某些奖励。关键发现包括对分类器无条件指导（CFG）的理论表征，并引入了一种新的易于训练的奖励导向扩散采样器。

Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models

Authors: NaHyeon Park, Namin An, Kunhee Kim, Soyeon Yoon, Jiahao Huo, Hyunjung Shim

First: 2025-12-04T16:52:45+00:00 · Latest: 2025-12-04T16:52:45+00:00

Comments: Project page: https://fairpro-t2i.github.io

Abstract

Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.

中文标题/摘要

标题：对齐但刻板？系统提示对基于LVLM的文本到图像模型中社会偏见的隐性影响

基于大型视觉-语言模型（LVLM）的文本到图像（T2I）系统已成为图像生成的主导范式，但它们是否放大了社会偏见仍不够了解。在本文中，我们展示了基于LVLM的模型生成的社会偏见图像明显多于非LVLM基础模型。我们引入了一个包含四个语言复杂度级别的1024个提示基准，并以系统的方式评估了多个属性上的人口统计学偏见。我们的分析确定系统提示，即指导LVLM的预定义指令，是偏见行为的主要驱动因素。通过解码中间表示、标记概率诊断和嵌入关联分析，我们揭示了系统提示如何编码人口统计学先验并传播到图像合成中。为此，我们提出了FairPro，一种无需训练的元提示框架，使LVLM能够在测试时自我审计并构建公平意识的系统提示。在两个基于LVLM的T2I模型SANA和Qwen-Image上的实验表明，FairPro在保持文本-图像对齐的同时显著减少了人口统计学偏见。我们认为我们的发现提供了对系统提示在偏见传播中核心作用的更深入理解，并提供了一种实用的、可部署的方法来构建更具社会责任感的T2I系统。

Summary / 总结

This paper investigates the social biases in large vision-language model (LVLM)-based text-to-image (T2I) systems and finds that these models produce more socially biased images than non-LVLM-based models. By introducing a 1,024 prompt benchmark, the authors evaluate demographic bias across multiple attributes and identify system prompts as a primary driver. They propose FairPro, a meta-prompting framework that reduces demographic bias while maintaining text-image alignment, offering a practical approach to building more socially responsible T2I systems.

本文研究了大型视觉-语言模型（LVLM）驱动的文本到图像（T2I）系统中的社会偏见问题，发现这些模型生成的图像比非LVLM模型更具有社会偏见。作者引入了一个包含1,024个提示的基准，并系统地评估了人口统计学偏见。他们将系统提示识别为主要的偏见驱动因素，并提出了一种名为FairPro的元提示框架，该框架在保持文本-图像对齐的同时减少了偏见。实验结果表明，FairPro有效地减轻了偏见。

A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models

Authors: X. Y. Han, Yuan Zhong

First: 2025-12-03T16:00:02+00:00 · Latest: 2025-12-04T16:34:28+00:00

Abstract

In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of (costly) GPUs. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure, proposed by DeepSeek's Wang et al. (2024), by casting it as a one-step-per-iteration primal-dual method for an assignment problem. First, in a stylized deterministic setting, our framework yields several insightful structural properties: (i) a monotonic improvement of a Lagrangian objective, (ii) a preference rule that moves tokens from overloaded to underloaded experts, and (iii) an approximate-balancing guarantee. Then, we incorporate the stochastic and dynamic nature of AI training using a generalized online optimization formulation. In the online setting, we derive a strong convexity property of the objective that leads to a logarithmic expected regret bound under certain step-size choices. Additionally, we present real experiments on 1B-parameter DeepSeekMoE models to complement our theoretical findings. Together, these results build a principled framework for analyzing the Auxiliary-Loss-Free Load Balancing of s-MoE in AI models.
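
A minimal sketch of a bias-based balancing loop of the kind the abstract analyzes: each expert carries a bias that is added to the routing scores only when selecting the top-k experts, and after each batch the bias is nudged up for underloaded experts and down for overloaded ones. The sign-based update and the names `route_top_k`, `update_bias`, and `step` are assumptions for illustration, not code from the paper.

```python
import numpy as np

def route_top_k(scores, bias, k):
    """Select k experts per token from bias-adjusted scores.

    scores: (num_tokens, num_experts) router affinities.
    bias:   (num_experts,) per-expert balancing bias.
    """
    biased = scores + bias  # the bias influences selection only
    return np.argsort(-biased, axis=1)[:, :k]

def update_bias(bias, assignment, num_experts, step):
    """One balancing step: move each bias by +/- step toward equal load."""
    load = np.bincount(assignment.ravel(), minlength=num_experts)
    target = load.mean()
    # Underloaded experts (load < target) gain bias, overloaded ones lose it.
    return bias + step * np.sign(target - load)
```

This matches the preference rule in property (ii) above: lowering the bias of an overloaded expert makes tokens on the margin prefer an underloaded one at the next routing step.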

Summary / 总结

The paper provides a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure in Sparse Mixture-of-Experts (s-MoE) layers, which is crucial for efficient GPU utilization in large-scale AI models. The framework, based on a primal-dual method, offers insights into the monotonic improvement of a Lagrangian objective, a preference rule for token routing, and an approximate-balancing guarantee. It also incorporates the stochastic and dynamic nature of AI training, leading to a logarithmic expected regret bound under certain step-size choices. Real experiments on DeepSeekMoE models validate these theoretical findings.

论文提供了一种分析Sparse Mixture-of-Experts (s-MoE)层中无辅助损失负载均衡（ALF-LB）程序的理论框架，这对于大规模AI训练中的高效GPU利用至关重要。该框架基于一个原始对偶方法，表明ALF-LB会单调地提高拉格朗日目标，偏好将令牌从过载专家移动到欠载专家，并提供近似平衡的保证。研究还考虑了AI训练的随机性和动态性，推导出在某些步长选择下的对数期望后悔界。实际实验验证了理论发现。

LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics

Authors: Weiye Shi, Zhaowei Zhang, Shaoheng Yan, Yaodong Yang

First: 2025-12-04T16:26:42+00:00 · Latest: 2025-12-04T16:26:42+00:00

Abstract

Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns, from raw text remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural-language-related tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public domain literary works, comprising thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.

中文标题/摘要

标题：LLMs 知识远超文字：一种涉及句法、隐喻与音韵的体裁研究

大型语言模型（LLMs）在多种语言相关任务中展现出显著潜力，但它们是否能够捕捉到更深层次的语言特性，如句法结构、音素提示和韵律模式，仍然不清楚。为了分析LLMs是否能够有效学习这些特征并应用于重要的自然语言相关任务，我们引入了一个新颖的多语言体裁分类数据集，该数据集源自Project Gutenberg，这是一个提供数千篇公共领域文学作品的大型数字图书馆，包含六种语言（英语、法语、德语、意大利语、西班牙语和葡萄牙语）的数千个句子，每种二元任务（诗歌 vs. 小说；戏剧 vs. 诗歌；戏剧 vs. 小说）。我们为每个任务增加了三个明确的语言特征集（句法树结构、隐喻计数和音韵指标），以评估它们对分类性能的影响。实验表明，尽管LLM分类器可以从原始文本或明确提供的特征中学习潜在的语言结构，但不同特征在不同任务中的贡献不均，这突显了在模型训练过程中整合更复杂语言信号的重要性。

Summary / 总结

The study investigates whether large language models (LLMs) can learn and apply deeper linguistic properties such as syntax, metaphor, and phonetics from raw text. A multilingual genre classification dataset was created using Project Gutenberg, with explicit linguistic features added to evaluate their impact. Experiments show that LLMs can learn these features from both raw text and explicit features, but the contribution varies across different tasks, highlighting the need for incorporating complex linguistic signals during training.

研究探讨了大型语言模型（LLMs）是否可以从原始文本中学习更深层次的语法规则、隐喻和音韵等语言特性。使用Project Gutenberg创建了一个多语言体裁分类数据集，包含六种语言中数千个句子，用于三个二元任务。实验表明，LLMs可以从原始文本或明确提供的语言特征中学习这些特性，但不同任务的贡献不同，强调了在模型训练中需要结合更复杂的语言信号的重要性。

FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via neural Action Tokenization

Authors: Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, Liangtao Zheng, Tao Jiang, Jingjing Gong, Xipeng Qiu, Hang Zhao

First: 2025-12-04T16:21:38+00:00 · Latest: 2025-12-04T16:21:38+00:00

Abstract

Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.

中文标题/摘要

标题：FASTer：通过神经动作分词实现高效自回归视觉语言动作建模

自回归视觉-语言-动作（VLA）模型最近在机器人操作方面展示了强大的能力。然而，它们的核心动作分词过程通常会在重建保真度和推理效率之间进行权衡。我们引入了FASTer，这是一种统一的高效且可泛化的机器人学习框架，该框架结合了一个可学习的分词器和基于其构建的自回归策略。FASTerVQ 将动作片段编码为单通道图像，捕获全局时空依赖关系同时保持高压缩比。FASTerVLA 在此基础上使用块状自回归解码和轻量级动作专家，实现更快的推理和更高的任务性能。广泛的实验表明，FASTerVQ 提供了卓越的重建质量、高分词利用率和强大的跨任务和跨载体泛化能力，而 FASTerVLA 进一步提高了整体能力，在推理速度和任务性能方面均超越了之前的最先进的 VLA 模型。

Summary / 总结

The research aims to improve the efficiency and generalizability of autoregressive vision-language-action models for robotic manipulation. The method involves a learnable tokenizer called FASTerVQ that encodes action chunks as single-channel images, enhancing global spatio-temporal dependencies and compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving faster inference and higher task performance. Experiments show that FASTerVQ outperforms in reconstruction quality and token utilization, and FASTerVLA surpasses previous state-of-the-art models in both inference speed and task performance.

研究旨在提高视觉-语言-动作模型在机器人操作中的效率和通用性。提出了一种统一框架FASTer，包含可学习的分词器和自回归策略。FASTerVQ高效地编码动作片段，保持高压缩率的同时确保良好的重建质量。FASTerVLA在此基础上使用块级自回归解码和轻量级动作专家，进一步提升推理速度和任务性能。实验表明，FASTerVQ在重建质量和跨任务、跨载体泛化方面表现更优，而FASTerVLA在速度和性能上均超越了之前的最先进的视觉-语言-动作模型。

"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments

Authors: Ziyi Zhang, Zhen Sun, Zongmin Zhang, Zifan Peng, Yuemeng Zhao, Zichun Wang, Zeren Luo, Ruiting Zuo, Xinlei He

First: 2025-05-07T15:03:16+00:00 · Latest: 2025-12-04T16:15:45+00:00

Comments: 17 pages

Abs · PDF · Code1 · Code2

Abstract

The visually impaired population faces significant challenges in daily activities. While prior works employ vision language models for assistance, most focus on static content and cannot address real-time perception needs in complex environments. Recent VideoLLMs enable real-time vision and speech interaction, offering promising potential for assistive tasks. In this work, we conduct the first study evaluating their effectiveness in supporting daily life for visually impaired individuals. We first conduct a user survey with visually impaired participants to design the benchmark VisAssistDaily for daily life evaluation. Using VisAssistDaily, we evaluate popular VideoLLMs and find GPT-4o achieves the highest task success rate. We further conduct a user study to reveal concerns about hazard perception. To address this, we propose SafeVid, an environment-awareness dataset, and fine-tune VITA-1.5, improving risk recognition accuracy from 25.00% to 76.00%. We hope this work provides valuable insights and inspiration for future research in this field.

中文标题/摘要

标题："我能看到永远！": 评估实时视频LLM在辅助视觉障碍个体中的效果

视觉障碍人群在日常活动中面临重大挑战。虽然先前的工作使用视觉语言模型进行辅助，但大多数都集中在静态内容上，无法解决复杂环境中实时感知的需求。最近的视频LLM能够实现实时视觉和语音交互，为辅助任务提供了巨大的潜力。在本研究中，我们首次评估了它们在支持视觉障碍个体日常生活的有效性。我们首先对视觉障碍参与者进行了用户调查，设计了用于日常生活的基准测试VisAssistDaily。使用VisAssistDaily，我们评估了流行的视频LLM，发现GPT-4o的任务成功率最高。我们进一步进行了一项用户研究，揭示了对危险感知的担忧。为了解决这一问题，我们提出了SafeVid，一种环境感知数据集，并对VITA-1.5进行了微调，将风险识别准确性从25.00%提高到76.00%。我们希望这项工作为该领域的未来研究提供有价值的见解和灵感。

Multi-Agent Reinforcement Learning for Intraday Operating Rooms Scheduling under Uncertainty

Authors: Kailiang Liu, Ying Chen, Ralf Borndörfer, Thorsten Koch

First: 2025-12-04T15:47:08+00:00 · Latest: 2025-12-04T15:47:08+00:00

Abs · PDF · Code1 · Code2

Abstract

Intraday surgical scheduling is a multi-objective decision problem under uncertainty, balancing elective throughput, urgent and emergency demand, delays, sequence-dependent setups, and overtime. We formulate the problem as a cooperative Markov game and propose a multi-agent reinforcement learning (MARL) framework in which each operating room (OR) is an agent trained with centralized training and decentralized execution. All agents share a policy trained via Proximal Policy Optimization (PPO), which maps rich system states to actions, while a within-epoch sequential assignment protocol constructs conflict-free joint schedules across ORs. A mixed-integer pre-schedule provides reference starting times for electives; we impose type-specific quadratic delay penalties relative to these references and a terminal overtime penalty, yielding a single reward that captures throughput, timeliness, and staff workload. In simulations reflecting a realistic hospital mix (six ORs, eight surgery types, random urgent and emergency arrivals), the learned policy outperforms six rule-based heuristics across seven metrics and three evaluation subsets, and, relative to an ex post MIP oracle, quantifies optimality gaps. Policy analytics reveal interpretable behavior: prioritizing emergencies, batching similar cases to reduce setups, and deferring lower-value electives. We also derive a suboptimality bound for the sequential decomposition under simplifying assumptions. We discuss limitations, including OR homogeneity and the omission of explicit staffing constraints, and outline extensions. Overall, the approach offers a practical, interpretable, and tunable data-driven complement to optimization for real-time OR scheduling.

中文标题/摘要

标题：不确定性条件下日内手术室调度的多智能体强化学习

日内手术调度是一个在不确定性条件下多目标决策问题，需要平衡择期手术量、紧急和急诊需求、延迟、顺序相关的设置以及加班。我们将问题形式化为合作马尔可夫博弈，并提出一个多智能体强化学习（MARL）框架，其中每个手术室（OR）是一个通过集中训练和分散执行训练的智能体。所有智能体共享一个通过近端策略优化（PPO）训练的策略，该策略将丰富的系统状态映射为动作，而每轮内的顺序分配协议构建了OR之间的无冲突联合调度。混合整数预调度提供择期手术的参考开始时间；我们对这些参考施加类型特定的二次延迟惩罚，并施加一个终端加班惩罚，产生一个综合了吞吐量、及时性和工作人员工作量的单一奖励。在反映现实医院情况（六个OR，八种手术类型，随机的紧急和急诊到达）的模拟中，学习到的策略在七个指标和三个评估子集上均优于六种基于规则的启发式方法，并且相对于事后MIP优化器，量化了最优性差距。策略分析揭示了可解释的行为-优先处理紧急情况、批量处理相似病例以减少设置以及推迟低价值的择期手术。我们还在简化假设下推导了顺序分解的次优性界。我们讨论了限制因素，包括OR同质性和未明确包含的人员配置约束，并概述了扩展。总体而言，该方法为实时手术室调度提供了实用、可解释且可调节的数据驱动补充，与优化方法相结合。

Summary / 总结

The paper addresses the complex problem of intraday surgical scheduling, formulating it as a cooperative Markov game and using a multi-agent reinforcement learning (MARL) framework in which each operating room is an agent. The agents are trained with centralized training and decentralized execution and share a policy, trained via Proximal Policy Optimization (PPO), that maps system states to actions. The approach outperforms six rule-based heuristics across seven metrics and three evaluation subsets, and exhibits interpretable behavior such as prioritizing emergencies and deferring lower-value electives. The method offers a practical, interpretable, and tunable solution for real-time OR scheduling, though it has limitations such as OR homogeneity and the omission of explicit staffing constraints.

论文解决了日间手术操作的复杂调度问题，将其形式化为合作马尔可夫游戏，并使用多智能体强化学习（MARL）框架，其中每个手术室是一个智能体。智能体通过集中学习和分散执行使用Proximal Policy Optimization (PPO)来映射系统状态到动作。该方法在七个指标和三个评估子集上优于六种基于规则的启发式方法，并提供了可解释的行为，如优先处理紧急情况和推迟低价值的择期手术。该方法还提供了一种实用、可解释和可调节的实时手术室调度解决方案，尽管存在手术室同质性和未明确考虑人员配置约束等局限性。
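
The reward structure described in the abstract above (type-specific quadratic delay penalties relative to reference start times, plus a terminal overtime penalty) can be illustrated with a minimal Python sketch. The weights, the surgery-type names, and the function names are hypothetical, since the abstract gives no values, and the throughput term of the paper's reward is omitted here.

```python
from dataclasses import dataclass


@dataclass
class Case:
    surgery_type: str
    reference_start: float  # minutes, taken from the pre-schedule
    actual_start: float     # minutes, realized start time


# Hypothetical per-type delay weights (not from the paper).
DELAY_WEIGHT = {"elective": 0.01, "urgent": 0.05, "emergency": 0.2}


def delay_penalty(case):
    """Quadratic penalty on lateness relative to the reference start."""
    delay = max(0.0, case.actual_start - case.reference_start)
    return DELAY_WEIGHT[case.surgery_type] * delay ** 2


def episode_reward(cases, or_close_times, shift_end, overtime_weight=1.0):
    """Negative of total delay cost plus a terminal overtime cost."""
    delay_cost = sum(delay_penalty(c) for c in cases)
    overtime = sum(max(0.0, t - shift_end) for t in or_close_times)
    return -(delay_cost + overtime_weight * overtime)
```

Because the penalty is quadratic, one long delay costs more than several short delays of the same total length, which is one plausible reason for choosing this shape.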

Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment

Authors: Yang Chen, Xiaowei Xu, Shuai Wang, Chenhui Zhu, Ruxue Wen, Xubin Li, Tiezheng Ge, Limin Wang

Venue: AAAI 2026

First: 2025-11-27T11:35:08+00:00 · Latest: 2025-12-04T15:44:45+00:00

Comments: Accepted by AAAI 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF's embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3×, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64×64 and 256×256. Our code is available at https://github.com/MCG-NJU/FlowBack.

中文标题/摘要

标题：逆流而上：通过反向表示对齐改进归一化流

流动模型（NFs）是一类生成模型，以其可逆的架构为特征，其中前向传递将数据转换到潜在空间进行密度估计，而反向传递则从该空间生成新的样本。这一特性在表示学习和数据生成之间创造了内在的协同作用。然而，标准NFs的生成质量受限于从对数似然优化中获得的较差语义表示。为了解决这一问题，我们提出了一种新颖的对齐策略，创造性地利用了NFs的可逆性：而不是正则化前向传递，我们对生成（反向）传递中的中间特征与强视觉基础模型的表示进行对齐，显示出比简单对齐更优越的效果。我们还引入了一种新的无需训练、测试时的优化算法，用于分类，这为NF嵌入的语义知识提供了更内在的评估。全面的实验表明，我们的方法不仅将NFs的训练加速了3.3倍以上，还在生成质量和分类准确性方面取得了显著的改进。在ImageNet 64×64和256×256上，我们建立了NFs的新最佳结果。我们的代码可在https://github.com/MCG-NJU/FlowBack/ 获取。

Summary / 总结

This paper addresses the limitations of standard Normalizing Flows (NFs) in generating high-quality data due to poor semantic representations. It introduces a novel alignment strategy that aligns the intermediate features of the reverse pass with those from a powerful vision foundation model, improving both generative quality and classification accuracy. Experiments show that this approach accelerates training by over 3.3 times and sets new state-of-the-art results on ImageNet 64x64 and 256x256.

本文针对标准归一化流（NFs）因语义表示较差而导致生成高质量数据的局限性，提出了一种新的对齐策略，该策略将生成（逆向）过程的中间特征与强视觉基础模型的表示对齐，从而提高了生成质量和分类准确性。实验表明，训练速度提高了3.3倍，并在ImageNet 64x64和256x256上取得了新的最佳结果。
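
The invertibility that the abstract above relies on can be illustrated with a single affine coupling layer, a common building block of normalizing flows. This is a minimal sketch under assumptions: the abstract does not say which architecture the paper uses, the toy conditioner `_scale_shift` is hypothetical, and the input length is assumed to be even.

```python
import math


def _scale_shift(x1):
    # Toy conditioner (hypothetical): it only ever sees the untouched half,
    # so it does not itself need to be invertible.
    s = [math.tanh(0.5 * v) for v in x1]
    t = [0.1 * v for v in x1]
    return s, t


def forward(x):
    """Data -> latent. Returns the latent vector and log|det J|."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = _scale_shift(x1)
    z2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    return x1 + z2, sum(s)


def reverse(z):
    """Latent -> data. Exact inverse of forward, used for generation."""
    half = len(z) // 2
    z1, z2 = z[:half], z[half:]
    s, t = _scale_shift(z1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(z2, s, t)]
    return z1 + x2
```

The forward and reverse passes share the same parameters, so intermediate features exist in both directions. The paper's strategy, as the abstract describes it, aligns the features of the reverse (generative) pass.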

EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?

Authors: Pierre Adorni, Minh-Tan Pham, Stéphane May, Sébastien Lefèvre

First: 2025-11-26T15:52:56+00:00 · Latest: 2025-12-04T15:22:57+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs. All code and pretrained models are available at https://github.com/pierreadorni/EoS-FM.

中文标题/摘要

标题：EoS-FM：一组专家模型能否充当通用特征提取器？

基础模型在自然语言处理和计算机视觉等领域取得了巨大进展，类似的努力现在也在地球观测领域出现。这些模型旨在在有限监督的情况下泛化到各种任务，减少为每个任务单独训练模型的需要。然而，当前的策略主要集中在扩大模型规模和数据集规模上，这需要巨大的计算和数据资源，限制了其仅对少数大型机构的可用性。此外，这种不断扩大的模型范式与可持续和环境友好型人工智能的原则背道而驰，因为它导致了巨大的碳足迹和资源低效。在本文中，我们提出了一种新颖且高效的替代方案：用于构建遥感基础模型（RSFM）的专家模型组框架。我们的方法将训练过程分解为轻量级、任务特定的ConvNeXtV2专家，可以冻结并重用。这种模块化方法在效率、可解释性和可扩展性方面具有明显优势。此外，它自然支持联邦训练、剪枝和连续专家集成，使其特别适合协作和资源受限的环境。我们的框架为构建可扩展和高效的RSFM指明了新方向。所有代码和预训练模型均可在https://github.com/pierreadorni/EoS-FM获取。

Chameleon: Adaptive Adversarial Agents for Scaling-Based Visual Prompt Injection in Multimodal AI Systems

Authors: M Zeeshan, Saud Satti

First: 2025-12-04T15:22:28+00:00 · Latest: 2025-12-04T15:22:28+00:00

Comments: 5 pages, 2 figures, IEEE Transactions on Dependable and Secure Computing

Abs · PDF · Code1 · Code2

Abstract

Multimodal Artificial Intelligence (AI) systems, particularly Vision-Language Models (VLMs), have become integral to critical applications ranging from autonomous decision-making to automated document processing. As these systems scale, they rely heavily on preprocessing pipelines to handle diverse inputs efficiently. However, this dependency on standard preprocessing operations, specifically image downscaling, creates a significant yet often overlooked security vulnerability. While intended for computational optimization, scaling algorithms can be exploited to conceal malicious visual prompts that are invisible to human observers but become active semantic instructions once processed by the model. Current adversarial strategies remain largely static, failing to account for the dynamic nature of modern agentic workflows. To address this gap, we propose Chameleon, a novel, adaptive adversarial framework designed to expose and exploit scaling vulnerabilities in production VLMs. Unlike traditional static attacks, Chameleon employs an iterative, agent-based optimization mechanism that dynamically refines image perturbations based on the target model's real-time feedback. This allows the framework to craft highly robust adversarial examples that survive standard downscaling operations to hijack downstream execution. We evaluate Chameleon against the Gemini 2.5 Flash model. Our experiments demonstrate that Chameleon achieves an Attack Success Rate (ASR) of 84.5% across varying scaling factors, significantly outperforming static baseline attacks, which average only 32.1%. Furthermore, we show that these attacks effectively compromise agentic pipelines, reducing decision-making accuracy by over 45% in multi-step tasks. Finally, we discuss the implications of these vulnerabilities and propose multi-scale consistency checks as a necessary defense mechanism.

中文标题/摘要

标题：变色龙：基于缩放的视觉提示注入适应性对抗代理在多模态AI系统中的应用

多模态人工智能（AI）系统，特别是视觉-语言模型（VLMs），已成为从自主决策到自动化文档处理等关键应用中的重要组成部分。随着这些系统的扩展，它们依赖于预处理管道来高效处理各种输入。然而，对标准预处理操作，特别是图像缩放的依赖，创造了一个重要的但经常被忽视的安全漏洞。虽然缩放算法旨在进行计算优化，但它们可以被利用来隐藏对人类观察者不可见但被模型处理后成为有效语义指令的恶意视觉提示。当前的对抗策略大多保持静态，未能考虑到现代代理工作流程的动态性。为了解决这一差距，我们提出了变色龙，这是一种新颖的、适应性的对抗框架，旨在揭示并利用生产VLMs中的缩放漏洞。与传统的静态攻击不同，变色龙采用了一种迭代的、基于代理的优化机制，根据目标模型的实时反馈动态细化图像扰动。这使得框架能够生成高度鲁棒的对抗样本，这些样本能够生存下来标准的缩放操作，从而劫持下游执行。我们使用Gemini 2.5 Flash模型对变色龙进行了评估。我们的实验表明，变色龙在不同缩放因子下的攻击成功率（ASR）达到了84.5%，显著优于平均仅32.1%的静态基线攻击。此外，我们展示了这些攻击有效地破坏了代理管道，在多步骤任务中使决策准确性降低了超过45%。最后，我们讨论了这些漏洞的影响，并提出了多尺度一致性检查作为必要的防御机制。

Summary / 总结

Chameleon is an adaptive adversarial framework designed to exploit scaling vulnerabilities in Vision-Language Models (VLMs). Unlike static attacks, Chameleon iteratively refines image perturbations based on real-time feedback from the target model, achieving an Attack Success Rate of 84.5% across different scaling factors, compared with an average of 32.1% for static baseline attacks. The attacks also compromise agentic pipelines, reducing decision-making accuracy by over 45% in multi-step tasks.

Chameleon 是一种适应性对抗框架，旨在利用视觉语言模型（VLM）中的缩放漏洞。与静态攻击不同，Chameleon 会根据目标模型的实时反馈迭代优化图像扰动，其攻击成功率在不同缩放因子下达到 84.5%，远超静态基线攻击的平均成功率 32.1%。此外，Chameleon 显著破坏了自动工作流程，使多步骤任务的决策准确性降低超过 45%。

You Only Train Once (YOTO): A Retraining-Free Object Detection Framework

Authors: Priyanto Hidayatullah, Nurjannah Syakrani, Yudi Widhiyasana, Muhammad Rizqi Sholahuddin, Refdinal Tubagus, Zahri Al Adzani Hidayat, Hanri Fajar Ramadhan, Dafa Alfarizki Pratama, Farhan Muhammad Yasin

First: 2025-12-04T15:15:43+00:00 · Latest: 2025-12-04T15:15:43+00:00

Comments: under review in the Elsevier Engineering Journal

Abs · PDF · Code1 · Code2

Abstract

Object detection constitutes the primary task within the domain of computer vision. It is utilized in numerous domains. Nonetheless, object detection continues to encounter the issue of catastrophic forgetting. The model must be retrained whenever new products are introduced, utilizing not only the new products dataset but also the entirety of the previous dataset. The outcome is obvious: increasing model training expenses and significant time consumption. In numerous sectors, particularly retail checkout, the frequent introduction of new products presents a great challenge. This study introduces You Only Train Once (YOTO), a methodology designed to address the issue of catastrophic forgetting by integrating YOLO11n for object localization with DeIT and Proxy Anchor Loss for feature extraction and metric learning. For classification, we utilize cosine similarity between the embedding features of the target product and those in the Qdrant vector database. In a case study conducted in a retail store with 140 products, the experimental results demonstrate that our proposed framework achieves encouraging accuracy, whether for detecting new or existing products. Furthermore, without retraining, the training duration difference is significant. We achieve almost 3 times the training time efficiency compared to classical object detection approaches. This efficiency escalates as additional new products are added to the product database. The average inference time is 580 ms per image containing multiple products, on an edge device, validating the proposed framework's feasibility for practical use.

中文标题/摘要

标题：你只需训练一次（YOTO）：一种无需重新训练的目标检测框架

目标检测是计算机视觉领域的主要任务，被广泛应用于多个领域。然而，目标检测仍然面临灾难性遗忘的问题。每当引入新产品时，模型必须重新训练，不仅需要使用新产品数据集，还需要使用所有先前的数据集。结果显而易见：增加了模型训练成本和大量时间消耗。在许多领域，尤其是零售结账领域，频繁引入新产品带来了巨大挑战。本研究引入了你只需训练一次（YOTO）的方法，通过将YOLO11n用于目标定位、DeIT和Proxy Anchor Loss用于特征提取和度量学习来解决灾难性遗忘问题。对于分类，我们使用目标产品嵌入特征与Qdrant向量数据库中特征的余弦相似度。在一家拥有140种产品的零售店进行的案例研究中，实验结果表明，我们提出的框架在检测新产品和现有产品时均取得了令人鼓舞的准确性。此外，无需重新训练，训练时间差异显著。我们实现了与经典目标检测方法相比几乎3倍的训练时间效率。随着产品数据库中新增产品的数量增加，这种效率会进一步提高。在边缘设备上，每张包含多个产品的图像平均推理时间为580毫秒，验证了所提框架在实际应用中的可行性。

Summary / 总结

The study addresses the issue of catastrophic forgetting in object detection by proposing You Only Train Once (YOTO), which integrates YOLO11n, DeIT, and Proxy Anchor Loss. The framework demonstrates high accuracy in detecting both new and existing products in a retail setting with 140 products, achieving nearly three times the training time efficiency compared to traditional methods without retraining. The average inference time is 580 ms per image, making it feasible for practical use.

研究通过引入You Only Train Once (YOTO) 方法解决了对象检测中的灾难性遗忘问题，该方法使用YOLO11n 进行定位，DeIT 和 Proxy Anchor Loss 进行特征提取，并使用余弦相似度进行分类。在包含140 种产品的零售店中，所提出的框架在无需重新训练的情况下，对新旧产品均实现了高精度，相比传统方法提高了3 倍的训练效率，每张包含多个产品的图像平均推理时间为580 毫秒，验证了该框架的实际可行性。
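
The classification step described in the abstract above, matching a product embedding against stored embeddings by cosine similarity, can be sketched in plain Python. This is only an illustration under assumptions: a dictionary stands in for the Qdrant vector database, and the `classify` function and its `threshold` value are hypothetical rather than taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(query, gallery, threshold=0.8):
    """Return (label, score) of the most similar stored embedding.

    Falls back to "unknown" when nothing reaches the threshold.
    """
    best_label, best_score = "unknown", -1.0
    for label, embedding in gallery.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return "unknown", best_score
    return best_label, best_score
```

In this sketch, registering a new product only means adding one entry to `gallery`. No model weights change, which is the property that removes the need for retraining.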

SDG-Track: A Heterogeneous Observer-Follower Framework for High-Resolution UAV Tracking on Embedded Platforms

Authors: Jiawen Wen, Yu Hu, Suixuan Qiu, Jinshan Huang, Xiaowen Chu

First: 2025-12-04T15:11:43+00:00 · Latest: 2025-12-04T15:11:43+00:00

Comments: https://github.com/Jeffry-wen/SDG-Track

Abs · PDF · Code1 · Code2 · Code3

Abstract

Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our code is publicly available at https://github.com/Jeffry-wen/SDG-Track.

Summary / 总结

SDG-Track addresses the resolution-speed conflict in real-time UAV tracking on edge devices by using an Observer-Follower architecture. The Observer runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors, while the Follower performs high-frequency trajectory interpolation on the CPU. To handle tracking failures, Dual-Space Recovery combines color histogram matching with geometric consistency constraints. Experiments show SDG-Track achieves 35.1 FPS throughput while retaining 97.2% of the frame-by-frame detection precision, successfully tracking agile FPV drones under real-world conditions.

SDG-Track通过在GPU上以低频率运行高容量检测器来提供准确的位置锚点，同时在CPU上通过稀疏光流进行高频轨迹插值，以解决实时UAV跟踪中的分辨率-速度冲突。它引入了双空间恢复机制，结合颜色直方图匹配和几何一致性约束来处理跟踪失败。实验显示SDG-Track实现了35.1 FPS的吞吐量，同时保持了97.2%的检测精度，并成功在真实世界条件下跟踪敏捷的FPV无人机。
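
The re-acquisition idea named in the abstract above, color histogram matching combined with a geometric consistency constraint, can be sketched as follows. This is a minimal illustration and not the paper's Dual-Space Recovery: histogram intersection is a swapped-in similarity measure, the geometric check is reduced to a maximum jump distance, and `min_sim`, `max_jump`, and all function names are hypothetical.

```python
import math


def histogram(values, bins=8, max_value=256):
    """Normalized histogram of integer intensities in [0, max_value)."""
    counts = [0] * bins
    for v in values:
        counts[min(bins - 1, v * bins // max_value)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def reacquire(template_hist, last_pos, candidates, min_sim=0.7, max_jump=80.0):
    """Pick the candidate that looks like the target and is close enough.

    candidates is a list of (position, pixel_values) pairs.
    Returns (position, similarity) or None if nothing qualifies.
    """
    best = None
    for pos, pixels in candidates:
        if math.dist(pos, last_pos) > max_jump:
            continue  # geometric gate: reject implausible jumps
        sim = intersection(template_hist, histogram(pixels))
        if sim >= min_sim and (best is None or sim > best[1]):
            best = (pos, sim)
    return best
```

The distance gate matters in this sketch because a distractor with a near-identical color distribution would otherwise win on appearance alone.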

Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens

Authors: Ziran Qin, Youru Lv, Mingbao Lin, Zeren Zhang, Chanfan Gan, Tieyuan Chen, Weiyao Lin

First: 2025-12-04T14:41:21+00:00 · Latest: 2025-12-04T14:41:21+00:00

Abs · PDF · Code1 · Code2

Abstract

Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient autoregressive (AR) image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 on LlamaGen-XL and Janus-Pro-1B, while retaining only 1/6 KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.

中文标题/摘要

标题：自回归图像生成仅需几行缓存令牌

自回归（AR）视觉生成已成为图像和多模态合成的强大范式，得益于其可扩展性和通用性。然而，现有的AR图像生成由于解码过程中需要缓存所有之前生成的视觉令牌而遭受严重的内存瓶颈，导致存储需求高且吞吐量低。本文介绍了一种名为LineAR的创新性、无需训练的渐进式键值（KV）缓存压缩管道，用于自回归图像生成。通过充分利用视觉注意力的内在特性，LineAR在二维视图中按行级管理缓存，保留视觉依赖区域的同时，逐步移除对后续行生成无害的不具信息性的令牌，由行间注意力引导。LineAR通过仅使用几行缓存实现高效的自回归（AR）图像生成，同时实现内存节省和吞吐量提升，同时保持或甚至提高生成质量。在六个自回归图像生成模型中，包括类别条件和文本到图像生成的广泛实验验证了其有效性和通用性。LineAR在LlamaGen-XL和Janus-Pro-1B上将ImageNet FID从2.77提高到2.68，COCO FID从23.85降低到22.86，同时仅保留1/6的KV缓存。它还在Lumina-mGPT-768上仅使用1/8的KV缓存提高了DPG。此外，LineAR实现了显著的内存和吞吐量增益，包括在LlamaGen-XL上高达67.61%的内存减少和7.57倍的加速，在Janus-Pro-7B上则为39.66%的内存减少和5.62倍的加速。

Summary / 总结

The paper addresses the memory bottleneck in autoregressive (AR) image generation by introducing LineAR, a training-free KV cache compression pipeline. LineAR uses a 2D view to manage the cache at the line level, preserving visual dependencies while evicting less-informative tokens. This method reduces memory usage and increases throughput while maintaining or improving generation quality. Experiments show that LineAR improves FID scores while reducing memory usage and speeding up generation on various AR models, including LlamaGen-XL and Janus-Pro-1B.

本文提出了一种名为LineAR的方法，通过使用2D视图和跨行注意机制压缩关键值缓存，解决了自回归（AR）图像生成中的内存瓶颈问题。LineAR在行级别管理缓存，保留视觉依赖区域的同时逐步移除无害的低信息量令牌。这种方法使得AR图像生成既高效又节省内存，同时提高了吞吐量，且保持或提升了生成质量。实验表明，该方法在多种AR模型上显著提高了FID分数，并实现了内存和吞吐量的大幅提升。
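
The idea of keeping only a few lines of cached tokens, as described in the abstract above, can be illustrated with a much-simplified sliding window over image rows. This is a sketch under assumptions and not LineAR itself: the paper's eviction is guided by inter-line attention, whereas this hypothetical `LineWindowCache` simply drops the oldest completed line.

```python
from collections import deque


class LineWindowCache:
    """Keeps KV entries for the last max_lines completed rows plus the
    row currently being generated (raster-order decoding assumed)."""

    def __init__(self, line_width, max_lines):
        self.line_width = line_width
        # A full deque with maxlen discards its oldest item on append.
        self.lines = deque(maxlen=max_lines)
        self.current = []

    def append(self, token_kv):
        """Add the KV entry of a newly generated token."""
        self.current.append(token_kv)
        if len(self.current) == self.line_width:
            self.lines.append(self.current)
            self.current = []

    def visible(self):
        """Entries the next token may attend to, oldest first."""
        cached = [kv for line in self.lines for kv in line]
        return cached + self.current
```

With `max_lines` fixed, the cache size is bounded by `(max_lines + 1) * line_width` entries regardless of image height, which is where the memory saving in this simplified picture comes from.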

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

Authors: GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye, Yukun Zhou, Zhehao Dong, Zhenan Wang, Zhichao Liu, Zheng Zhu

First: 2025-10-22T09:57:13+00:00 · Latest: 2025-12-04T14:28:04+00:00

Comments: https://gigabrain0.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.

中文标题/摘要

标题：GigaBrain-0：一种基于世界模型的视觉-语言-行动模型

训练通用机器人视觉-语言-行动（VLA）模型通常需要大量的真实世界机器人数据，这既昂贵又耗时。物理数据收集的低效严重限制了当前VLA系统的可扩展性和泛化能力。为了解决这一挑战，我们引入了GigaBrain-0，这是一种由世界模型生成数据（例如视频生成、真实到真实转移、人类转移、视角转移、模拟到真实转移数据）赋能的新型VLA基础模型。通过利用世界模型生成大规模的多样化数据，GigaBrain-0显著减少了对真实机器人数据的依赖，同时提高了跨任务的泛化能力。我们的方法进一步通过RGBD输入建模和具身思维链（CoT）监督，提高了策略的鲁棒性，使模型在执行任务时能够推理空间几何、物体状态和长时依赖关系，从而在灵巧、长时依赖和移动操作任务上取得了显著的现实世界性能提升。大量实验表明，GigaBrain-0在外观变化（例如纹理、颜色）、物体摆放和摄像机视角等方面实现了卓越的泛化能力。此外，我们还介绍了GigaBrain-0-Small，这是一种优化的轻量级变体，旨在高效运行在NVIDIA Jetson AGX Orin等设备上。

Summary / 总结

GigaBrain-0 is a Vision-Language-Action (VLA) foundation model that uses world models to generate diverse data, reducing the need for expensive real-world robot data. This approach enhances cross-task generalization and policy robustness, leading to better performance on dexterous, long-horizon, and mobile manipulation tasks. GigaBrain-0 achieves superior generalization across various task variations and has an optimized lightweight variant for efficient device deployment.

GigaBrain-0 是一种视觉-语言-行动 (VLA) 基础模型，通过世界模型生成多样化数据，减少了对昂贵的实地机器人数据的需求。这种方法增强了跨任务的一般化和策略的鲁棒性，使其在灵巧、长时序和移动操作任务上表现出色。GigaBrain-0 在各种任务变化中实现了更好的一般化，并有一个轻量级优化版本，适用于设备上的高效部署。

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

Authors: Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li

First: 2025-05-21T12:18:15+00:00 · Latest: 2025-12-04T14:24:47+00:00

Comments: https://github.com/xtong-zhang/Chain-of-Focus

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage the outcome accuracies and formats as rewards to update the Qwen2.5-VL model, further refining the model's search and reasoning strategy without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark that requires strong visual reasoning capability, our model outperforms existing VLMs by 5% across 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating the more efficient deployment of VLMs in practical applications.

中文标题/摘要

标题：基于动态视觉搜索与缩放的自适应焦点链推理方法以提高高效VLMs

视觉语言模型（VLMs）在各种计算机视觉任务中取得了令人印象深刻的性能。然而，现有的模型尚未充分探索其多模态推理能力。本文提出了一种焦点链（CoF）方法，使VLMs能够根据获得的视觉线索和给定的问题，自适应地聚焦并放大关键图像区域，实现高效的多模态推理。为了使VLMs具备这种CoF能力，我们提出了一种两阶段训练管道，包括监督微调（SFT）和强化学习（RL）。在SFT阶段，我们构建了MM-CoF数据集，包含3000个样本，这些样本来自一个视觉代理，该代理能够自适应地识别关键区域以解决不同图像分辨率和问题的视觉任务。我们使用MM-CoF对Qwen2.5-VL模型进行冷启动微调。在RL阶段，我们利用结果准确性和格式作为奖励来更新Qwen2.5-VL模型，从而进一步优化模型的搜索和推理策略，无需人类先验知识。我们的模型在多个基准测试中取得了显著改进。在需要强大视觉推理能力的V*基准测试中，我们的模型在8种图像分辨率（从224到4K）中比现有VLMs提高了5%，证明了所提出的CoF方法的有效性，并促进了VLMs在实际应用中的更高效部署。

Summary / 总结

This paper introduces a Chain-of-Focus (CoF) method for VLMs to perform adaptive focusing and zooming on key image regions based on visual cues and questions. It uses a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL), to improve multimodal reasoning. The model shows significant improvements on multiple benchmarks, outperforming existing VLMs by 5% on the V* benchmark across various image resolutions, highlighting the effectiveness of the CoF method for efficient VLM deployment.

本文提出了一种Chain-of-Focus (CoF) 方法，使VLM能够在获得的视觉线索和问题的基础上，对关键图像区域进行自适应聚焦和放大。该方法采用监督微调（SFT）和强化学习（RL）的两阶段训练流程来提升多模态推理能力。模型在多个基准测试中表现出显著改进，在V*基准测试中，该模型在从224到4K的8种不同图像分辨率下比现有VLMs高出5%，证明了CoF方法的有效性，并促进了VLM在实际应用中的更高效部署。

FreeGen: Feed-Forward Reconstruction-Generation Co-Training for Free-Viewpoint Driving Scene Synthesis

Authors: Shijie Chen, Peixi Peng

First: 2025-12-04T14:14:21+00:00 · Latest: 2025-12-04T14:14:21+00:00

Comments: Novel View Synthesis, Driving Scene, Free Trajectory, Image Generation

Abs · PDF · Code1 · Code2

Abstract

Closed-loop simulation and scalable pre-training for autonomous driving require synthesizing free-viewpoint driving scenes. However, existing datasets and generative pipelines rarely provide consistent off-trajectory observations, limiting large-scale evaluation and training. While recent generative models demonstrate strong visual realism, they struggle to jointly achieve interpolation consistency and extrapolation realism without per-scene optimization. To address this, we propose FreeGen, a feed-forward reconstruction-generation co-training framework for free-viewpoint driving scene synthesis. The reconstruction model provides stable geometric representations to ensure interpolation consistency, while the generation model performs geometry-aware enhancement to improve realism at unseen viewpoints. Through co-training, generative priors are distilled into the reconstruction model to improve off-trajectory rendering, and the refined geometry in turn offers stronger structural guidance for generation. Experiments demonstrate that FreeGen achieves state-of-the-art performance for free-viewpoint driving scene synthesis.

中文标题/摘要

标题：FreeGen：前馈重建-生成联合训练在自由视角驾驶场景合成中的应用

闭环模拟和可扩展预训练需要合成自由视角的驾驶场景。然而，现有的数据集和生成管道很少提供一致的离轨迹观测，限制了大规模评估和训练。尽管最近的生成模型展示了很强的视觉真实性，但在无需场景优化的情况下同时实现插值一致性和外推真实性方面仍存在困难。为了解决这个问题，我们提出了一种前馈重建-生成联合训练框架FreeGen，用于自由视角驾驶场景合成。重建模型提供稳定的几何表示以确保插值一致性，而生成模型则进行几何感知增强以提高在未见视角下的真实性。通过联合训练，生成先验知识被提炼到重建模型中以改善离轨迹渲染，而细化的几何结构反过来为生成提供了更强的结构指导。实验表明，FreeGen 在自由视角驾驶场景合成中达到了最先进的性能。

Summary / 总结

FreeGen is a feed-forward reconstruction-generation co-training framework designed to synthesize free-viewpoint driving scenes. It addresses the limitations of existing datasets and generative models by ensuring interpolation consistency and extrapolation realism. The reconstruction model provides stable geometric representations, while the generation model enhances realism at unseen viewpoints. Co-training improves off-trajectory rendering and structural guidance for generation, achieving state-of-the-art performance in free-viewpoint driving scene synthesis.

研究旨在合成自由视角的驾驶场景，以支持自主驾驶的闭环模拟和大规模预训练。提出的FreeGen框架采用前向重建-生成联合训练方法，其中重建模型确保插值一致性，生成模型增强未见视角的逼真度。实验表明，FreeGen在自由视角驾驶场景合成中表现出色，达到最先进的性能。

Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships

Authors: Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen

Venue: WACV 2026

First: 2024-05-29T05:20:02+00:00 · Latest: 2025-12-04T13:44:07+00:00

Comments: WACV 2026 Accepted. Code available at https://github.com/CyberAgentAI/multimodal-adversarial-training

Abs · PDF · Code1 · Code2 · Code3

Abstract

Pre-trained vision-language (VL) models are highly vulnerable to adversarial attacks. However, existing defense methods primarily focus on image classification, overlooking two key aspects of VL tasks: multimodal attacks, where both image and text can be perturbed, and the one-to-many relationship of images and texts, where a single image can correspond to multiple textual descriptions and vice versa (1:N and N:1). This work is the first to explore defense strategies against multimodal attacks in VL tasks, whereas prior VL defense methods focus on vision robustness. We propose multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities during training, significantly outperforming existing unimodal defenses. Furthermore, we discover that MAT is limited by deterministic one-to-one (1:1) image-text pairs in VL training data. To address this, we conduct a comprehensive study on leveraging one-to-many relationships to enhance robustness, investigating diverse augmentation techniques. Our analysis shows that, for a more effective defense, augmented image-text pairs should be well-aligned, diverse, yet avoid distribution shift -- conditions overlooked by prior research. This work pioneers defense strategies against multimodal attacks, providing insights for building robust VLMs from both optimization and data perspectives. Our code is publicly available at https://github.com/CyberAgentAI/multimodal-adversarial-training.

中文标题/摘要

标题：利用一对多关系的多模态对抗防御方法研究

预训练的视觉-语言（VL）模型对对抗攻击极为敏感。然而，现有的防御方法主要集中在图像分类上，忽视了VL任务中的两个关键方面：多模态攻击，其中图像和文本都可以被扰动，以及一对多关系，即一个图像可以对应多个文本描述，反之亦然（1:N和N:1）。本工作是首次探索VL任务中对抗多模态攻击的防御策略，而之前的VL防御方法主要关注视觉鲁棒性。我们提出了多模态对抗训练（MAT），在训练过程中同时在图像和文本模态中引入对抗扰动，显著优于现有的单模态防御方法。此外，我们发现MAT受限于VL训练数据中确定的一对一（1:1）图像-文本对。为了解决这一问题，我们对利用一对多关系增强鲁棒性进行了全面研究，探讨了多种增强技术。我们的分析表明，为了更有效的防御，增强的图像-文本对应该对齐良好、多样化，但要避免分布偏移——这是先前研究中被忽视的条件。本工作开创了对抗多模态攻击的防御策略，从优化和数据两个角度提供了构建鲁棒VL模型的见解。我们的代码已公开发布在https://github.com/CyberAgentAI/multimodal-adversarial-training。

Summary / 总结

This work addresses the vulnerability of pre-trained vision-language models to adversarial attacks by proposing multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities. The method significantly outperforms existing unimodal defenses. The study also highlights the importance of leveraging one-to-many relationships in image-text pairs to enhance robustness, suggesting that augmented pairs should be well-aligned, diverse, and avoid distribution shift. This research provides new insights for building robust vision-language models from both optimization and data perspectives.

该研究提出了一种多模态对抗训练（MAT）方法，通过在图像和文本模态中引入对抗扰动来解决预训练的视觉语言模型对抗攻击的脆弱性问题。研究强调了利用图像-文本对的一对多关系来增强鲁棒性的重要性，表明增强的图像-文本对应具备对齐良好、多样化且避免分布偏移的特点。MAT方法优于现有的单模态防御方法，并为构建鲁棒的视觉语言模型提供了优化和数据方面的见解。

ASTRIDE: A Security Threat Modeling Platform for Agentic-AI Applications

Authors: Eranga Bandara, Amin Hass, Ross Gore, Sachin Shetty, Ravi Mukkamala, Safdar H. Bouk, Xueping Liang, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan

First: 2025-12-04T13:32:40+00:00 · Latest: 2025-12-04T13:32:40+00:00

Abs · PDF · Code1 · Code2

Abstract

AI agent-based systems are becoming increasingly integral to modern software architectures, enabling autonomous decision-making, dynamic task execution, and multimodal interactions through large language models (LLMs). However, these systems introduce novel and evolving security challenges, including prompt injection attacks, context poisoning, model manipulation, and opaque agent-to-agent communication, that are not effectively captured by traditional threat modeling frameworks. In this paper, we introduce ASTRIDE, an automated threat modeling platform purpose-built for AI agent-based systems. ASTRIDE extends the classical STRIDE framework by introducing a new threat category, A for AI Agent-Specific Attacks, which encompasses emerging vulnerabilities such as prompt injection, unsafe tool invocation, and reasoning subversion, unique to agent-based applications. To automate threat modeling, ASTRIDE combines a consortium of fine-tuned vision-language models (VLMs) with the OpenAI-gpt-oss reasoning LLM to perform end-to-end analysis directly from visual agent architecture diagrams, such as data flow diagrams (DFDs). LLM agents orchestrate the end-to-end threat modeling automation process by coordinating interactions between the VLM consortium and the reasoning LLM. Our evaluations demonstrate that ASTRIDE provides accurate, scalable, and explainable threat modeling for next-generation intelligent systems. To the best of our knowledge, ASTRIDE is the first framework to both extend STRIDE with AI-specific threats and integrate fine-tuned VLMs with a reasoning LLM to fully automate diagram-driven threat modeling in AI agent-based applications.
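
The category extension described in the abstract can be pictured as a small data structure. The following is only an illustrative sketch: the six STRIDE categories are the standard ones, while the names `ThreatCategory`, `AGENT_SPECIFIC_THREATS`, `acronym`, and `is_agent_specific` are hypothetical and not taken from the paper, which automates the analysis with VLMs and a reasoning LLM rather than a lookup table.

```python
from enum import Enum


class ThreatCategory(Enum):
    """The six classical STRIDE categories plus the extra 'A' category."""
    AI_AGENT_SPECIFIC = "A"
    SPOOFING = "S"
    TAMPERING = "T"
    REPUDIATION = "R"
    INFORMATION_DISCLOSURE = "I"
    DENIAL_OF_SERVICE = "D"
    ELEVATION_OF_PRIVILEGE = "E"


# Threats the abstract names as agent-specific. The lookup itself is a
# hypothetical stand-in for the model-driven analysis the paper describes.
AGENT_SPECIFIC_THREATS = {
    "prompt injection",
    "unsafe tool invocation",
    "reasoning subversion",
}


def acronym():
    """Spell the framework name from the category letters, in definition order."""
    return "".join(category.value for category in ThreatCategory)


def is_agent_specific(threat_name):
    """Return True if the named threat falls under the new 'A' category."""
    return threat_name.strip().lower() in AGENT_SPECIFIC_THREATS
```

For example, `is_agent_specific("Prompt Injection")` is true, while a classical threat such as spoofing stays under its original STRIDE letter.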

中文标题/摘要

标题：ASTRIDE：面向代理AI应用的安全威胁建模平台

基于AI代理的系统正逐渐成为现代软件架构中的重要组成部分，通过大型语言模型（LLMs）实现自主决策、动态任务执行和多模态交互。然而，这些系统引入了新型且不断演变的安全挑战，包括提示注入攻击、上下文污染、模型操控和代理间不透明的通信，这些挑战未能被传统的威胁建模框架有效捕捉。在本文中，我们介绍了ASTRIDE，一个专为基于代理的AI系统设计的自动化威胁建模平台。ASTRIDE通过引入一个新的威胁类别A（针对AI代理的特定攻击），扩展了经典的STRIDE框架，该类别涵盖了诸如提示注入、不安全工具调用和推理篡改等新兴漏洞，这些漏洞是代理应用特有的。为了自动化威胁建模，ASTRIDE结合了一个由微调的视觉-语言模型（VLMs）组成的联盟和OpenAI-gpt-oss推理LLM，直接从视觉代理架构图（如数据流图DFDs）进行端到端分析。LLM代理协调整个威胁建模自动化过程，协调VLM联盟与推理LLM之间的交互。我们的评估表明，ASTRIDE能够为下一代智能系统提供准确、可扩展和可解释的威胁建模。据我们所知，ASTRIDE是第一个扩展STRIDE以包含AI特定威胁并结合微调的VLMs与推理LLM以完全自动化基于代理的AI应用中的图驱动威胁建模的框架。

Summary / 总结

ASTRIDE is an automated threat modeling platform designed for AI agent-based systems, extending the classical STRIDE framework to include AI-specific threats like prompt injection and reasoning subversion. It uses a consortium of fine-tuned vision-language models and the OpenAI-gpt-oss reasoning LLM to analyze visual agent architecture diagrams, providing accurate, scalable, and explainable threat modeling. To the best of the authors' knowledge, ASTRIDE is the first framework to integrate fine-tuned VLMs with a reasoning LLM for fully automated diagram-driven threat modeling in AI agent-based applications.

ASTRIDE 是一个自动化威胁建模平台，专为基于AI代理的系统设计，扩展了经典的STRIDE框架，增加了一个新的类别A，针对AI特定的攻击。它使用一个由微调的视觉语言模型组成的联盟和OpenAI-gpt-oss推理LLM来分析视觉代理架构图，自动化威胁建模过程。评估显示，ASTRIDE 提供了准确、可扩展和可解释的威胁建模，是第一个将微调的VLM与推理LLM结合用于基于AI代理的应用程序的图驱动威胁建模的框架。

TTRV: Test-Time Reinforcement Learning for Vision Language Models

Authors: Akshit Singh, Shyam Marjit, Wei Lin, Paul Gavrikov, Serena Yeung-Levy, Hilde Kuehne, Rogerio Feris, Sivan Doveh, James Glass, M. Jehanzeb Mirza

First: 2025-10-08T09:10:31+00:00 · Latest: 2025-12-04T13:17:07+00:00

Abs · PDF · Code1 · Code2

Abstract

Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
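
The two label-free reward signals mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of raw empirical frequency as the reward, and the normalization of the entropy term by `log(n)` are all choices made here for clarity.

```python
import math
from collections import Counter


def frequency_rewards(answers):
    """Reward each sampled answer by its empirical frequency within the group.

    `answers` holds the outputs sampled for one unlabeled test input.
    """
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]


def entropy_bonus(answers):
    """Return 1 minus the normalized entropy of the empirical answer distribution.

    The value is 1.0 when every sample agrees and 0.0 when all samples differ.
    Normalizing by log(n) is an assumption made for this sketch.
    """
    counts = Counter(answers)
    n = len(answers)
    if len(counts) <= 1:
        return 1.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(n)
```

With samples `["cat", "cat", "dog", "cat"]`, the majority answer receives a frequency reward of 0.75 and the minority answer 0.25. How the two signals are weighted and fed into GRPO is not specified in the abstract, so it is left out here.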

中文标题/摘要

标题：TTRV：视觉语言模型的测试时强化学习

现有的强化学习中提取奖励信号的方法通常依赖于标记数据和专门的训练分割，这与人类直接从环境中学习的方式不同。在本工作中，我们提出了TTRV，通过在推理时使模型实时适应，从而增强视觉语言理解，无需任何标记数据。具体而言，我们通过基于基模型输出频率设计奖励，结合多次对每个测试样本进行推理，改进了Group Relative Policy Optimization (GRPO)框架。此外，我们还提出通过同时奖励模型以获得输出经验分布的低熵来控制模型输出的多样性。我们的方法在对象识别和视觉问答（VQA）中均表现出一致的改进，分别提高了52.4%和29.8%，并在16个数据集中平均提高了24.6%和10.0%。值得注意的是，在图像识别方面，TTRV应用于InternVL 8B在8个基准测试中平均优于GPT-4o 2.3%，同时在VQA方面保持高度竞争力，表明测试时的强化学习可以匹配或超越最强的专有模型。最后，我们发现测试时的RL对于VLMs有许多有趣的特性：例如，在极端数据受限的场景中，即使在单个随机选择的未标记测试样本上进行适应，TTRV仍能带来高达5.5%的识别任务改进。

Summary / 总结

TTRV proposes a test-time reinforcement learning approach to enhance vision language models by adapting the model at inference time without labeled data. It runs inference on each test sample multiple times, designs rewards from the frequency of the base model's outputs, and additionally rewards low entropy of the empirical output distribution to control output diversity. TTRV shows consistent improvements in object recognition and visual question answering, with up to 52.4% and 29.8% gains, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. On image recognition, TTRV applied to InternVL 8B outperforms GPT-4o by 2.3% on average across 8 benchmarks while maintaining competitiveness in VQA.

TTRV通过在推理时调整模型而不使用标注数据，利用基模型输出的频率和通过低熵奖励控制输出多样性来增强视觉语言理解。它在物体识别和视觉问答中实现了持续改进，分别达到52.4%和29.8%的提升，以及在16个数据集上的平均提升24.6%和10.0%。TTRV在图像识别基准测试中也超越了GPT-4o，同时在视觉问答上保持竞争力，展示了即使在数据受限的情况下，测试时的强化学习也能匹配或超越强大的专有模型。

MemLoRA: Distilling Expert Adapters for On-Device Memory Systems

Authors: Massimo Bini, Ondrej Bohdal, Umberto Michieli, Zeynep Akata, Mete Ozay, Taha Ceritli

First: 2025-12-04T12:56:30+00:00 · Latest: 2025-12-04T12:56:30+00:00

Abs · PDF · Code1 · Code2

Abstract

Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during prolonged dialogues by storing relevant memories and incorporating them as context. Such memory-based personalization is also key in on-device settings that allow users to keep their conversations and data private. However, memory-augmented systems typically rely on LLMs that are too costly for local on-device deployment. Even though Small Language Models (SLMs) are more suitable for on-device inference than LLMs, they cannot achieve sufficient performance. Additionally, these LLM-based systems lack native visual capabilities, limiting their applicability in multimodal contexts. In this paper, we introduce (i) MemLoRA, a novel memory system that enables local deployment by equipping SLMs with specialized memory adapters, and (ii) its vision extension MemLoRA-V, which integrates small Vision-Language Models (SVLMs) into memory systems, enabling native visual understanding. Following knowledge distillation principles, each adapter is trained separately for a specific memory operation: knowledge extraction, memory update, or memory-augmented generation. Equipped with memory adapters, small models enable accurate on-device memory operations without cloud dependency. On text-only operations, MemLoRA outperforms 10× larger baseline models (e.g., Gemma2-27B) and achieves performance comparable to 60× larger models (e.g., GPT-OSS-120B) on the LoCoMo benchmark. To evaluate visual understanding operations, we extend LoCoMo with challenging Visual Question Answering tasks that require direct visual reasoning. On these tasks, our VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches (81.3 vs. 23.7 accuracy) while keeping strong performance in text-based tasks, demonstrating the efficacy of our method in multimodal contexts.

中文标题/摘要

标题：MemLoRA：为端侧记忆系统蒸馏专家适配器

增强内存的大型语言模型（LLMs）在长时间对话中表现出显著的一致性，通过存储相关记忆并将其作为上下文进行整合。这种基于记忆的个性化在允许用户保持对话和数据隐私的本地设备设置中也至关重要。然而，增强内存的系统通常依赖于成本过高的LLMs，不适合本地设备部署。尽管小型语言模型（SLMs）比LLMs更适合本地推理，但它们无法达到足够的性能。此外，这些基于LLM的系统缺乏原生的视觉能力，限制了它们在多模态环境中的应用。在本文中，我们介绍了(i) MemLoRA，一种新型的内存系统，通过为SLMs配备专门的记忆适配器实现本地部署，以及(ii) 其视觉扩展MemLoRA-V，将小型视觉-语言模型（SVLMs）集成到内存系统中，实现原生的视觉理解。遵循知识蒸馏原则，每个适配器分别针对特定的记忆操作进行训练——知识提取、记忆更新和增强记忆的生成。配备记忆适配器的小型模型能够在没有云依赖的情况下实现准确的本地内存操作。在仅文本操作上，MemLoRA在LoCoMo基准测试中优于10倍更大的基线模型（例如，Gemma2-27B），并在性能上与60倍更大的模型（例如，GPT-OSS-120B）相当。为了评估视觉理解操作，我们扩展了LoCoMo，加入了具有直接视觉推理要求的挑战性视觉问答任务。在这些任务上，我们的VLM集成的MemLoRA-V在准确率上大幅优于基于字幕的方法（81.3 vs. 23.7），同时在基于文本的任务上保持了强大的性能，证明了我们方法在多模态环境中的有效性。

Summary / 总结

The research aims to enable local deployment of memory-augmented systems by equipping Small Language Models (SLMs) with specialized memory adapters, leading to MemLoRA. This system outperforms larger models on text-only operations and shows significant improvements in visual understanding tasks when integrated with Vision-Language Models (VLMs), as demonstrated by the LoCoMo benchmark and Visual Question Answering tasks.

本文介绍了MemLoRA系统，该系统通过为小型语言模型（SLMs）配备专门的记忆适配器来实现本地部署，并引入了MemLoRA-V，该系统集成了小型视觉语言模型（SVLMs）以实现视觉理解。这些适配器分别针对记忆提取、更新和生成进行训练。在文本操作中，MemLoRA的表现优于更大规模的模型，并且在某些情况下可与更大规模的模型相媲美。MemLoRA-V在视觉问答任务中表现出显著的改进，同时在文本任务中保持了强大的性能，证明了其在多模态环境中的有效性。

Jina-VLM: Small Multilingual Vision Language Model

Authors: Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao

First: 2025-12-03T18:13:41+00:00 · Latest: 2025-12-04T12:45:29+00:00

Comments: 18 pages, 1-7 main content, 13-18 appendix for tables and dataset

Abs · PDF · Code1 · Code2 · Code3

Abstract

We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm .
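
To make the idea of an attention-pooling connector concrete, here is a dependency-free sketch of how attention can compress a sequence of patch tokens into fewer tokens. It is an assumption-laden illustration rather than the Jina-VLM connector: the window size, the use of the window mean as the query, and the absence of learned projection matrices are simplifications, and the names `attention_pool` and `pool_sequence` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens, query):
    """Pool a group of token vectors into one vector with scaled dot-product attention."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, t)) / math.sqrt(d) for t in tokens]
    weights = softmax(scores)
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)]


def pool_sequence(tokens, window=4):
    """Replace each run of `window` tokens by one attention-pooled token.

    The query for each window is the mean of its tokens (an assumption made
    here; a real connector would use learned parameters).
    """
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        query = [sum(col) / len(group) for col in zip(*group)]
        pooled.append(attention_pool(group, query))
    return pooled
```

With `window=4`, a sequence of 8 patch tokens is reduced to 2 tokens, which is the sense in which such a connector is token-efficient for the language backbone.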

中文标题/摘要

标题：Jina-VLM：小型多语言视觉语言模型

我们提出了Jina-VLM，这是一种参数量为24亿的视觉-语言模型，在开放的2B规模VLM中实现了最先进的多语言视觉问答效果。该模型通过一种注意力池化连接器将SigLIP2视觉编码器与Qwen3语言骨干网络耦合，从而能够高效处理任意分辨率的图像。该模型在标准VQA基准测试和多语言评估中取得了领先结果，同时保持了竞争力的纯文本性能。模型权重和代码已公开发布在https://huggingface.co/jinaai/jina-vlm 。

Summary / 总结

Jina-VLM is a 2.4B parameter vision-language model that excels in multilingual visual question answering, achieving state-of-the-art results among open 2B-scale models. It uses a SigLIP2 vision encoder and a Qwen3 language backbone connected via an attention-pooling mechanism for efficient image processing. The model performs well on standard VQA benchmarks and multilingual evaluations while maintaining competitive text-only performance. The model weights and code are publicly available.

Jina-VLM 是一个24亿参数的视觉语言模型，专为多语言视觉问答设计，达到了最先进的效果。它通过注意力池化连接器将 SigLIP2 视觉编码器与 Qwen3 语言骨干结合，实现对图像的高效处理。该模型在标准 VQA 基准测试和多语言评估中表现出色，同时保持了有竞争力的纯文本性能。模型权重和代码已公开发布。

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

Authors: Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu

First: 2025-12-04T12:17:25+00:00 · Latest: 2025-12-04T12:17:25+00:00

Abs · PDF · Code1 · Code2

Abstract

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.

中文标题/摘要

标题：E3AD：面向以人为本端到端自动驾驶的情绪感知视觉-语言-行动模型

端到端自动驾驶（AD）系统越来越多地采用视觉-语言-行动（VLA）模型，但通常忽视了乘客的情绪状态，这在舒适性和AD接受度方面至关重要。我们提出了开放域端到端（OD-E2E）自动驾驶，其中自动驾驶车辆（AV）必须解释自由形式的自然语言命令，推断情绪，并规划一个物理上可行的轨迹。我们提出了E3AD，这是一种情绪感知的VLA框架，通过两个认知启发式的组件增强了语义理解：一个连续的愉悦-唤醒-支配（VAD）情绪模型，用于捕捉语言中的语气和紧迫感，以及一个双路径空间推理模块，将第一人称和第三人称视角融合，实现类似人类的空间认知。一种以一致性为导向的训练方案，结合模态预训练与偏好对齐，进一步确保了情绪意图与驾驶行为之间的连贯性。在现实世界数据集上，E3AD 提高了视觉定位和航点规划，并实现了情绪估计的最新技术水平（SOTA）VAD 相关性。这些结果表明，将情绪注入VLA风格的驾驶中，可以实现更符合人类的定位、规划和以人为本的反馈。

Summary / 总结

The research aims to enhance end-to-end autonomous driving systems by incorporating the passenger's emotional state, which is crucial for comfort and acceptance. E3AD, an emotion-aware vision-language-action framework, is proposed to interpret natural-language commands, infer emotions, and plan feasible trajectories. The model uses a continuous VAD emotion model and a dual-pathway spatial reasoning module to improve visual grounding and waypoint planning, achieving state-of-the-art VAD correlation for emotion estimation. This demonstrates that integrating emotion into VLA-style driving can lead to more human-aligned outcomes.

研究旨在通过引入乘客的情绪状态来提升端到端自动驾驶系统的性能，这对于舒适性和接受度至关重要。E3AD 是一个情感感知的视觉-语言-行动框架，能够解析自然语言指令、推断情绪并规划可行的路径。该模型采用连续的VAD情绪模型和双路径空间推理模块，以提高视觉定位和航点规划，实现了情感估计的最新技术水平（SOTA）。这表明将情感融入VLA风格的驾驶可以带来更符合人类行为的成果。

Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild

Authors: Yigui Feng, Qinglin Wang, Haotian Mo, Yang Liu, Ke Liu, Gencheng Liu, Xinhai Chen, Siqi Shen, Songzhu Mei, Jie Liu

First: 2025-12-04T12:13:18+00:00 · Latest: 2025-12-04T12:13:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder that introduces a Status Judgment module to algorithmically suppress ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code has been open-sourced.
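
The variance-based suppression attributed to the Status Judgment module can be illustrated with a toy gate. The abstract states only that ambiguous lip features are suppressed based on their temporal feature variance; everything else below is an assumption made for this sketch, including the hard threshold, the reading that high variance indicates speech, and the names `temporal_variance`, `status_gate`, and `disentangle`.

```python
from statistics import pvariance


def temporal_variance(lip_feats):
    """Mean per-dimension variance of lip features across frames.

    `lip_feats` is a list of per-frame feature vectors (lists of floats).
    """
    dims = list(zip(*lip_feats))
    return sum(pvariance(d) for d in dims) / len(dims)


def status_gate(lip_feats, threshold=0.05):
    """Return 0.0 (suppress) when the lips vary strongly over time, else 1.0.

    Treating high temporal variance as 'speaking' is an assumption of this sketch.
    """
    return 0.0 if temporal_variance(lip_feats) > threshold else 1.0


def disentangle(face_feats, lip_feats, threshold=0.05):
    """Concatenate per-frame face features with gated lip features."""
    gate = status_gate(lip_feats, threshold)
    return [f + [gate * x for x in l] for f, l in zip(face_feats, lip_feats)]
```

A clip whose lip features barely change passes through unchanged, whereas a clip with rapidly changing lip features has that part of the representation zeroed before any emotional inference.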

中文标题/摘要

标题：测量未言说之事：一种心理分析解缠模型及基准

在自然对话中的生成性心理分析面临两大根本挑战：(1) 现有的视觉-语言模型（VLMs）无法解决发音-情感模糊性问题，即视觉上的言语模式模仿情感表达；(2) 缺乏可验证的评估指标阻碍了视觉定位和推理深度的评估。我们提出了一整套生态系统来应对这些挑战。首先，我们引入了多级洞察网络解缠（MIND），这是一种新颖的分层视觉编码器，引入了状态判断模块，基于其时间特征变化算法性地抑制模糊唇部特征，实现显式的视觉解缠。其次，我们构建了ConvoInsight-DB，这是一个新的大规模数据集，包含专家标注的微表情和深层次心理推断。第三，我们设计了心理推理洞察评分指标（PRISM），这是一种自动化的多维度框架，使用专家指导的大规模语言模型来衡量大型心理视觉模型的多维度性能。在我们的PRISM基准上，MIND显著优于所有基线，微表情检测的性能提高了86.95%。消融研究证实，我们的状态判断解缠模块是这一性能飞跃的关键组成部分。我们的代码已开源。

Summary / 总结

This paper addresses the challenges of analyzing in-the-wild conversations by proposing MIND, a novel hierarchical visual encoder that disentangles articulatory-affective ambiguity. It also introduces ConvoInsight-DB, a new dataset for micro-expressions and psychological inference, and PRISM, an automated evaluation metric. MIND outperforms existing methods by 86.95% in micro-expression detection, with the Status Judgment module being the key component for this improvement.

本文提出了一种名为MIND的分层视觉编码器来解决自然场景对话分析中的发音-情感歧义问题，并构建了ConvoInsight-DB数据集，用于微表情和心理推理。作者还引入了PRISM自动评估框架，用于评估大型心智视觉模型的性能。MIND在PRISM基准上的微表情检测性能比现有方法提高了86.95%，关键在于Status Judgment解缠模块的贡献。

EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models

Authors: Haiyang Yu, Mengyang Zhao, Jinghui Lu, Ke Niu, Yanjie Wang, Weijie Yin, Weitao Jia, Teng Fu, Yang Liu, Jun Liu, Hong Chen

First: 2025-03-06T03:19:56+00:00 · Latest: 2025-12-04T11:50:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Video subtitles play a crucial role in short videos and movies, as they not only help models better understand video content but also support applications such as video translation and content retrieval. Existing video subtitle extraction methods typically rely on multi-stage frameworks, where errors accumulate across stages and temporal dependencies are underutilized due to frame-wise processing. Moreover, although some Large Vision-Language Models (LVLMs) possess strong OCR capabilities, predicting accurate timestamps for subtitle texts remains challenging. To this end, we propose an End-to-end Video subtitle Extraction framework based on LVLMs, named EVE, which can output subtitles and their timestamps simultaneously. Specifically, we introduce a dual-branch Spatiotemporal Subtitle-Salient (S³) Module that serves as an adapter for LVLMs, capable of representing subtitle-related content and considering inter-frame correlations using only a small number of tokens. Within this module, the Spatial Semantic Context Aggregate branch aggregates high-level global semantics to provide spatial visual contextual information, while the Temporal Subtitle Token Query branch explicitly queries subtitle-relevant tokens while considering temporal correlation across frames. The small number of tokens retained by the S³ module are fed to the language model, which then directly outputs the subtitle text along with its timestamps. Furthermore, we construct the first large-scale dataset dedicated to video subtitle extraction, ViSa, containing over 2.5M videos with timestamped and bilingual annotation, thereby providing the community with a well-organized training and evaluation benchmark.

中文标题/摘要

标题：EVE：基于视觉语言模型的端到端视频字幕提取

视频字幕在短视频和电影中起着关键作用，不仅有助于模型更好地理解视频内容，还支持视频翻译和内容检索等应用。现有的视频字幕提取方法通常依赖多阶段框架，各阶段的错误会累积，且由于逐帧处理，时间依赖性被严重低估。此外，尽管一些大型视觉语言模型（LVLMs）具有强大的OCR能力，但预测字幕文本的准确时间戳仍然具有挑战性。为此，我们提出了一种基于LVLMs的端到端视频字幕提取框架EVE，该框架可以同时输出字幕及其时间戳。具体而言，我们引入了一种双分支时空字幕显著性（S³）模块，作为LVLMs的适配器，仅使用少量令牌即可表示与字幕相关的内容并考虑帧间相关性。在该模块中，空间语义上下文聚合分支聚合高层次的全局语义以提供空间视觉上下文信息，而时间字幕令牌查询分支则明确查询与字幕相关的令牌并考虑帧间的时间相关性。S³模块保留的少量令牌被送入语言模型，该模型直接输出字幕文本及其时间戳。此外，我们构建了第一个专门用于视频字幕提取的大规模数据集ViSa，包含超过250万条带有时间戳和双语注释的视频，从而为社区提供了一个组织良好的训练和评估基准。

Summary / 总结

The paper proposes EVE, an end-to-end video subtitle extraction framework that uses Large Vision-Language Models (LVLMs) to output subtitles and their timestamps simultaneously. It introduces a dual-branch Spatiotemporal Subtitle-Salient (S³) Module that serves as an adapter for LVLMs, aggregating spatial context and querying subtitle-relevant tokens across frames with only a small number of tokens. The authors also construct ViSa, a large-scale dataset of over 2.5 million videos with timestamped and bilingual annotations, intended as a training and evaluation benchmark for video subtitle extraction.

论文提出了使用大型视觉-语言模型（LVLMs）的端到端框架EVE，用于视频字幕提取。该框架通过直接输出字幕和时间戳来解决多阶段框架的局限性。双分支时空字幕显著（S³）模块增强了LVLMs，使其能够考虑时空相关性，并使用少量的令牌。实验结果表明，EVE在新构建的ViSa数据集上在字幕提取和时间戳预测的准确性上优于现有方法。