arXiv 论文速递

2025-12-01 03:28
Snapshot: 20251201_0328
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Authors: Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, Jiangmiao Pang
First: 2025-11-26T18:59:39+00:00 · Latest: 2025-11-26T18:59:39+00:00
Comments: Code is released at https://github.com/InternRobotics/G2VLM
Abstract
Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.
中文标题/摘要
标题:G$^2$VLM: 基于几何的视觉语言模型,统一三维重建与空间推理
视觉-语言模型(VLMs)在空间智能方面仍然缺乏稳健性,在空间理解和推理任务上表现较差。我们将这一差距归因于缺乏一种能够从二维图像重建三维空间的视觉几何学习过程。我们提出了G$^2$VLM,这是一种基于几何的视觉语言模型,它连接了空间智能的两个基本方面:三维空间重建和空间理解。G$^2$VLM 原生地利用学习到的三维视觉几何特征,直接预测三维属性,并通过上下文学习和交错推理增强空间推理任务。我们的统一设计在空间理解方面具有高度可扩展性:它在丰富的多视角图像和视频数据上进行训练,同时利用通常只能从难以收集的标注中获得的三维视觉先验。实验结果表明,G$^2$VLM 在两类任务上都表现出色,其结果与最先进的前馈三维重建模型相当,并且在空间理解和推理任务上取得了更好的或具有竞争力的结果。通过将语义能力强的 VLM 与低层三维视觉任务统一起来,我们希望 G$^2$VLM 能够为社区提供一个强大的基线,并解锁更多未来的应用,如三维场景编辑。
Summary / 总结
The research aims to enhance the spatial intelligence of Vision-Language Models (VLMs) by addressing their poor performance in spatial understanding and reasoning. G$^2$VLM, a geometry-grounded vision-language model, integrates 3D reconstruction and spatial reasoning, leveraging 3D visual geometry features to improve spatial understanding. Experimental results show that G$^2$VLM performs comparably to state-of-the-art 3D reconstruction models and outperforms or matches other models in spatial understanding and reasoning tasks.
研究旨在通过解决视觉语言模型在空间理解和推理方面的不足,提升其空间智能。G$^2$VLM 是一种基于几何的视觉语言模型,结合了 3D 重建和空间推理,利用 3D 视觉几何特征来提升空间理解能力。实验结果显示,G$^2$VLM 在 3D 重建任务中的表现与最先进的模型相当,并且在空间理解和推理任务中表现出更好的或相当的性能。
Escaping the Verifier: Learning to Reason via Demonstrations
Authors: Locke Cai, Ivan Provilkov
First: 2025-11-26T18:42:52+00:00 · Latest: 2025-11-26T18:42:52+00:00
Abstract
Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization) that learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning. Our method sets up an adversarial interaction between a policy (generator) and a relativistic critic (discriminator): the policy learns to mimic expert answers, while the critic learns to compare and distinguish between policy and expert answers. Our method trains both the policy and the critic jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks -- Countdown, DeepMath, and Poetry Writing -- and enjoys the same robust scaling trends as RL on verifiable tasks. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.
中文标题/摘要
标题:摆脱验证器:通过示范学习推理
训练大型语言模型(LLMs)进行推理通常依赖于带有特定任务验证器的强化学习(RL)。然而,许多现实中的推理密集型任务缺乏验证器,尽管它们提供了大量专家示范,而这些示范在面向推理的训练中仍未得到充分利用。我们提出了RARO(Relativistic Adversarial Reasoning Optimization,相对论式对抗推理优化),通过逆向强化学习仅从专家示范中学习强大的推理能力。我们的方法在策略(生成器)与相对论式评论家(判别器)之间建立对抗性交互:策略学习模仿专家答案,而评论家学习比较并区分策略答案和专家答案。我们的方法通过RL联合且持续地训练策略和评论家,并确定了实现稳健学习所需的关键稳定化技术。实验上,RARO在我们的所有评估任务(Countdown、DeepMath和诗歌创作)上显著优于强大的无验证器基线,并展现出与可验证任务上的RL相同的稳健扩展趋势。这些结果表明,我们的方法能够仅从专家示范中有效激发强大的推理性能,即使在特定任务验证器不可用时也能实现稳健的推理学习。
Summary / 总结
The paper introduces RARO (Relativistic Adversarial Reasoning Optimization), which uses Inverse Reinforcement Learning to train Large Language Models to reason from expert demonstrations without task-specific verifiers. RARO sets up an adversarial interaction between a policy (generator) and a relativistic critic (discriminator), where the policy learns to mimic expert answers and the critic learns to distinguish between policy and expert answers. Empirically, RARO outperforms strong verifier-free baselines on tasks like Countdown, DeepMath, and Poetry Writing, showing that it can effectively elicit strong reasoning performance from expert demonstrations alone.
论文提出了RARO(相对对抗推理优化),该方法使用逆强化学习从专家演示中训练大型语言模型进行推理,而无需特定任务的验证器。RARO 设置了一个对抗交互,其中策略(生成器)学习模仿专家答案,而评论家(判别器)学习区分策略和专家的答案。实验结果显示,RARO 在 Countdown、DeepMath 和诗歌创作等任务上优于无验证器的强基线,表明它可以从专家演示中有效提取出强大的推理性能,即使在没有特定任务验证器的情况下也能实现稳健的推理学习。
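The adversarial setup described above can be illustrated with a toy reward assignment. This is only one possible reading of "the critic learns to compare and distinguish between policy and expert answers": the zero-sum reward scheme, the function name `pairwise_rewards`, and the `critic` callable are all hypothetical and are not taken from the paper.

```python
import random

def pairwise_rewards(critic, question, policy_answer, expert_answer, rng):
    """Toy zero-sum rewards for one policy/expert answer pair.

    `critic(question, a, b)` is a hypothetical callable that returns 0 or 1,
    the position of the answer it believes came from the expert.
    """
    # Shuffle the presentation order so the critic cannot rely on position.
    policy_first = rng.random() < 0.5
    if policy_first:
        first, second = policy_answer, expert_answer
    else:
        first, second = expert_answer, policy_answer
    expert_index = 1 if policy_first else 0

    guess = critic(question, first, second)
    critic_reward = 1.0 if guess == expert_index else 0.0
    # Assumption: the policy is rewarded exactly when the critic is fooled.
    policy_reward = 1.0 - critic_reward
    return policy_reward, critic_reward
```

In this sketch both players would then be updated with any RL algorithm using their respective rewards; the stabilization techniques the abstract mentions are not modeled here.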
Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models
Authors: Naifu Zhang, Wei Tao, Xi Xiao, Qianpu Sun, Yuxin Zheng, Wentao Mo, Peiqiang Wang, Nan Zhang
First: 2025-11-26T18:37:54+00:00 · Latest: 2025-11-26T18:37:54+00:00
Abstract
In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.
中文标题/摘要
标题:注意力引导的视觉-语言-动作模型分块稀疏对抗攻击
近年来,具身智能中的视觉-语言-动作(VLA)模型发展迅速。然而,现有的对抗攻击方法需要昂贵的端到端训练,并且通常会产生明显的扰动补丁。为了解决这些限制,我们提出了ADVLA框架,该框架直接对视觉编码器投影到文本特征空间的特征施加对抗扰动。ADVLA在低幅度约束下高效地破坏下游动作预测,而注意力引导使扰动既集中又稀疏。我们引入了三种策略,分别用于增强敏感性、强制稀疏性和集中扰动。实验表明,在$L_{\infty}=4/255$约束下,ADVLA结合Top-K掩码仅修改不到10%的图像块,同时攻击成功率接近100%。扰动集中在关键区域,在整体图像中几乎不可察觉,单步迭代仅需约0.06秒,显著优于传统的基于补丁的攻击。总之,ADVLA在低幅度和局部稀疏条件下有效地削弱了VLA模型的下游动作预测,避免了传统补丁攻击的高训练成本和明显扰动,并在攻击VLA特征空间方面展示了独特的有效性和实用价值。
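To make the combination of an $L_{\infty}=4/255$ budget and Top-K patch masking concrete, here is a minimal sketch of a single signed-gradient step restricted to the most-attended patches. It assumes the perturbation is applied to pixel values in [0, 1] and that gradients and per-patch attention scores are already available; the function names and data layout are hypothetical, and this is not the authors' implementation.

```python
def topk_patch_mask(attention, k):
    """Return a 0/1 mask that keeps the k patches with the highest attention."""
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(attention))]

def sparse_linf_step(patches, grads, attention, k, eps=4 / 255):
    """One signed-gradient step of size eps, applied only to the top-k patches.

    patches, grads: lists of patches, each patch a list of floats in [0, 1].
    attention: one score per patch (assumed to come from the model).
    """
    mask = topk_patch_mask(attention, k)
    perturbed = []
    for patch, grad, m in zip(patches, grads, mask):
        new_patch = []
        for x, g in zip(patch, grad):
            sign = (g > 0) - (g < 0)          # -1, 0 or +1
            value = x + m * eps * sign        # masked patches stay unchanged
            new_patch.append(min(1.0, max(0.0, value)))
        perturbed.append(new_patch)
    return perturbed, mask
```

With k set to about 10% of the patch count, every changed pixel moves by at most 4/255 and the remaining patches are untouched, which mirrors the sparsity and amplitude constraints quoted in the abstract.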
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Authors: Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin
First: 2025-11-20T17:48:21+00:00 · Latest: 2025-11-26T18:30:04+00:00
Comments: Project page: https://xuboshen.github.io/TimeViper; Code: https://github.com/xiaomi-research/timeviper
Abstract
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
中文标题/摘要
标题:TimeViper:一种混合Mamba-Transformer视觉语言模型,用于高效理解长视频
我们介绍了TimeViper,一种混合视觉语言模型,旨在解决长视频理解的挑战。处理长视频需要高效的模型架构和有效的机制来处理长时间上下文。为此,TimeViper采用了一种混合Mamba-Transformer骨干,结合了状态空间模型的效率和注意力机制的表达能力。通过这种混合设计,我们揭示了视觉到文本信息聚合的现象,其中信息随着LLM深度增加,从视觉标记逐渐流向文本标记,导致视觉标记冗余严重。受此观察的启发,我们提出了TransV,一种标记信息传输模块,将视觉标记转换并压缩为指令标记,同时保持多模态理解能力。这种设计使TimeViper能够处理超过10,000帧的长达一小时的视频。在多个基准上的广泛实验表明,TimeViper在与最先进的模型竞争的同时,扩展了帧数。我们进一步分析了Mamba和Transformer层的注意力行为,提供了关于混合模型可解释性的新见解。这项工作代表了开发、解释和压缩混合Mamba-Transformer架构的初步步骤。
Summary / 总结
TimeViper is a hybrid Mamba-Transformer model designed for efficient long video understanding. It combines the efficiency of state-space models with the expressivity of attention mechanisms. The model reveals a vision-to-text information aggregation phenomenon and proposes TransV, a token information transfer module, to compress vision tokens while maintaining multimodal understanding. Experiments show that TimeViper can process hour-long videos and competes with state-of-the-art models. The work also provides insights into the attention behaviors of hybrid models.
TimeViper 是一种结合了状态空间模型效率和注意力机制表达能力的混合 Mamba-Transformer 模型,旨在高效处理长视频理解任务。该模型揭示了视觉信息向文本信息聚合的现象,并提出了一种 Token 信息转移模块 TransV,将视觉 Token 压缩为指令 Token,同时保持多模态理解能力。实验表明,TimeViper 可以处理超过 10,000 帧的小时级长视频,并在多个基准上与最先进的模型表现相当。该工作还提供了关于混合模型可解释性的新见解。
Qwen3-VL Technical Report
Authors: Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, Ke Zhu
First: 2025-11-26T17:59:08+00:00 · Latest: 2025-11-26T17:59:08+00:00
Comments: 42 pages
Abstract
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
中文标题/摘要
标题:Qwen3-VL技术报告
我们介绍了Qwen系列中最强大的视觉-语言模型Qwen3-VL,它在多种跨模态基准测试中表现出色。该模型原生支持多达256K个令牌的交错上下文,无缝集成文本、图像和视频。模型系列包括密集型(2B/4B/8B/32B)和专家混合型(30B-A3B/235B-A22B)变体,以适应不同的延迟-质量权衡。Qwen3-VL提供三个核心支柱:(i)显著增强的纯文本理解,在某些情况下超越了可比的纯文本骨干模型;(ii)强大的长上下文理解,具有原生256K个令牌窗口,适用于文本和交错多模态输入,能够忠实保留、检索和跨长文档和视频进行交叉引用;(iii)跨单图像、多图像和视频任务的高级多模态推理,展示了在MMMU和视觉数学基准测试(如MathVista和MathVision)中的领先性能。从架构上看,我们引入了三个关键升级:(i)增强的交错-MRoPE,以增强图像和视频中的空间-时间建模;(ii)DeepStack集成,有效利用多级ViT特征以加强视觉-语言对齐;(iii)基于文本的时间对齐,从T-RoPE发展为显式的文本时间戳对齐,以实现更精确的时间定位。在可比的令牌预算和延迟约束下,Qwen3-VL在密集型和专家混合型(MoE)架构中均表现出色。我们设想Qwen3-VL将成为图像驱动推理、自主决策和多模态代码智能在实际工作流程中的基础引擎。
Summary / 总结
Qwen3-VL is the most advanced vision-language model in the Qwen series, enhancing pure-text understanding, long-context comprehension, and multimodal reasoning. It supports up to 256K tokens and includes both dense and mixture-of-experts variants. Key upgrades include enhanced interleaved-MRoPE, DeepStack integration, and text-based time alignment for video. Qwen3-VL outperforms comparable models in various benchmarks and architectures.
Qwen3-VL 是 Qwen 系列中最先进的视觉语言模型,增强了纯文本理解、长上下文理解和多模态推理。它支持多达 256K 个令牌,并包括密集型和混合专家型变体。关键升级包括增强的 interleaved-MRoPE、DeepStack 集成以及视频中的文本时间对齐。Qwen3-VL 在各种基准和架构中表现出色,超越了同类模型。
Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
Authors: Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
First: 2025-11-26T16:53:05+00:00 · Latest: 2025-11-26T16:53:05+00:00
Abstract
The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
中文标题/摘要
标题:和谐:通过跨任务协同实现音频和视频生成的同步
同步音视频内容的合成是生成式AI中的一个关键挑战,开源模型在稳健的音频-视频对齐方面仍存在困难。我们的分析表明,这一问题源于联合扩散过程中的三个基本挑战:(1)对应关系漂移,即同时演化的带噪潜变量阻碍了对齐的稳定学习;(2)低效的全局注意力机制,无法捕捉细粒度的时间线索;(3)传统无分类器引导(CFG)的模态内偏差,它增强了条件性,但未增强跨模态同步。为克服这些挑战,我们提出了Harmony,一种从机制上强制实现音视频同步的新框架。我们首先提出了一种跨任务协同训练范式,利用音频驱动的视频生成和视频驱动的音频生成任务提供的强监督信号来减轻漂移。然后,我们设计了一种全局-局部解耦交互模块,以实现高效而精确的时间-风格对齐。最后,我们提出了一种新的同步增强CFG(SyncCFG),在推理过程中显式地分离并放大对齐信号。大量实验表明,Harmony达到了新的最先进水平,在生成保真度上显著优于现有方法,更关键的是,在实现细粒度音视频同步方面也显著优于现有方法。
Summary / 总结
The paper addresses the challenge of robust audio-video alignment in generative AI by introducing Harmony, a framework that tackles three key issues: correspondence drift, inefficient global attention, and intra-modal bias. It proposes a Cross-Task Synergy training paradigm, a Global-Local Decoupled Interaction Module, and a Synchronization-Enhanced Classifier-Free Guidance (SyncCFG) to mitigate these issues. Experiments show that Harmony improves generation fidelity and achieves better fine-grained audio-visual synchronization compared to existing methods.
论文通过引入Harmony框架应对生成式AI中稳健的音频-视频对齐问题,该框架针对三个关键问题:对应关系漂移、全局注意力效率低下以及模态内偏差。Harmony使用跨任务协同训练范式、全局-局部解耦交互模块以及同步增强的无分类器引导(SyncCFG)来缓解这些问题。实验表明,Harmony在生成保真度和细粒度的音频-视频同步方面均优于现有方法。
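For context, the conventional Classifier-Free Guidance that the abstract contrasts with SyncCFG is usually written as below. This is the standard formulation only; the exact form of SyncCFG is not given in the abstract and is not reproduced here. The symbols ($\epsilon_\theta$ for the denoiser, $c$ for the condition, $w$ for the guidance scale) are conventional notation, not taken from the paper.

```latex
% Standard classifier-free guidance: extrapolate from the unconditional
% prediction towards the conditional one with guidance scale w.
\hat{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)
```

The guidance direction is the difference between a conditional and an unconditional prediction, so it amplifies whatever the condition $c$ contributes. The abstract's claim is that in the conventional setting this difference does not single out the cross-modal alignment signal.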
VacuumVLA: Boosting VLA Capabilities via a Unified Suction and Gripping Tool for Complex Robotic Manipulation
Authors: Hui Zhou, Siyuan Huang, Minxing Li, Hao Zhang, Lue Fan, Shaoshuai Shi
First: 2025-11-26T16:29:24+00:00 · Latest: 2025-11-26T16:29:24+00:00
Comments: 8 pages
Abstract
Vision-Language-Action (VLA) models have significantly advanced general-purpose robotic manipulation by harnessing large-scale pretrained vision and language representations. Most current VLA systems employ parallel two-finger grippers as their default end effectors. However, such grippers face inherent limitations in handling certain real-world tasks, such as wiping glass surfaces or opening drawers without handles, due to insufficient contact area or lack of adhesion. To overcome these challenges, we present a low-cost, integrated hardware design that combines a mechanical two-finger gripper with a vacuum suction unit, enabling dual-mode manipulation within a single end effector. Our system supports flexible switching or synergistic use of both modalities, expanding the range of feasible tasks. We validate the efficiency and practicality of our design within two state-of-the-art VLA frameworks: DexVLA and Pi0. Experimental results demonstrate that with the proposed hybrid end effector, robots can successfully perform multiple complex tasks that are infeasible for conventional two-finger grippers alone. All hardware designs and controlling systems will be released.
中文标题/摘要
标题:VacuumVLA:通过统一吸盘和夹持工具提升VLA能力以实现复杂机器人操作
视觉-语言-动作(VLA)模型通过利用大规模预训练的视觉和语言表示,显著提升了通用机器人操作的能力。在现有方法中,大多数VLA系统默认使用平行双指夹具作为末端执行器。然而,由于接触面积不足或缺乏附着力,这种夹具在处理某些实际任务(例如擦拭玻璃表面或打开无把手的抽屉)时存在固有局限。为克服这些挑战,我们提出了一种低成本的集成硬件设计,将机械双指夹具与真空吸附单元结合,在单一末端执行器内实现双模式操作。我们的系统支持灵活切换或协同使用这两种模式,从而扩大了可行任务的范围。我们在两个最先进的VLA框架(DexVLA和Pi0)中验证了该设计的效率和实用性。实验结果表明,使用所提出的混合末端执行器,机器人可以成功完成仅靠传统双指夹具无法完成的多个复杂任务。所有硬件设计和控制系统都将被发布。
Summary / 总结
The research aims to enhance robotic manipulation capabilities by addressing the limitations of parallel two-finger grippers. The study introduces a hybrid end effector combining a mechanical two-finger gripper with a vacuum suction unit, allowing flexible switching or synergistic use of both modes. Experiments conducted within DexVLA and Pi0 frameworks show that this hybrid design enables robots to perform complex tasks that are not feasible with traditional grippers alone, thereby expanding the range of tasks robots can handle.
研究旨在通过将真空吸盘单元与机械两指夹爪集成到单个末端执行器中,以增强机器人操作能力,解决传统夹爪在擦拭玻璃表面或打开无把手抽屉等任务中的局限性。该研究在DexVLA和Pi0框架下验证了这种混合末端执行器,展示了其在执行传统夹爪无法完成的复杂任务方面的改进性能。该设计成本效益高,并允许在两种操作模式之间灵活切换。
TinyChemVL: Advancing Chemical Vision-Language Models via Efficient Visual Token Reduction and Complex Reaction Tasks
Authors: Xuanle Zhao, Shuxin Zeng, Xinyuan Cai, Xiang Cheng, Duzhen Zhang, Xiuyi Chen, Bo Xu
Venue: AAAI 2026
First: 2025-11-09T08:37:18+00:00 · Latest: 2025-11-26T16:22:26+00:00
Comments: Accepted by AAAI 2026
Abstract
While Vision Language Models (VLMs) have demonstrated remarkable capabilities in general visual understanding, their application in the chemical domain has been limited, with previous works predominantly focusing on text and thus overlooking critical visual information, such as molecular structures. Current approaches that directly adopt standard VLMs for chemical tasks suffer from two primary issues: (i) the computational inefficiency of processing entire chemical images with non-informative backgrounds, and (ii) a narrow focus on molecular-level tasks that restricts progress in chemical reasoning. In this work, we propose TinyChemVL, an efficient and powerful chemical VLM that leverages visual token reduction and reaction-level tasks to improve model efficiency and reasoning capacity. We also propose ChemRxn-V, a reaction-level benchmark for assessing vision-based reaction recognition and prediction tasks. Directly predicting reaction products from molecular images poses a non-trivial challenge, as it requires models to integrate both recognition and reasoning capacities. Our results demonstrate that with only 4B parameters, TinyChemVL achieves superior performance on both molecular and reaction tasks while demonstrating faster inference and training speeds compared to existing models. Notably, TinyChemVL outperforms ChemVLM while utilizing only 1/16th of the visual tokens. This work builds efficient yet powerful VLMs for chemical domains by co-designing model architecture and task complexity.
Summary / 总结
TinyChemVL is designed to enhance chemical vision-language models by reducing visual tokens and focusing on reaction-level tasks, addressing the inefficiency and narrow scope issues of previous models. It achieves superior performance on both molecular and reaction tasks with only 4B parameters, demonstrating faster inference and training speeds. Notably, TinyChemVL outperforms ChemVLM using only 1/16th of the visual tokens.
TinyChemVL 通过减少视觉标记并专注于反应级任务来提升化学视觉语言模型,解决了之前模型的效率低和任务范围狭窄问题。它仅使用40亿参数就在分子和反应任务上取得了优越性能,并且具有更快的推理和训练速度。值得注意的是,它使用视觉标记的1/16就超过了ChemVLM。
Video Generation Models Are Good Latent Reward Models
Authors: Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Fan Tang
First: 2025-11-26T16:14:18+00:00 · Latest: 2025-11-26T16:14:18+00:00
Abstract
Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.
中文标题/摘要
标题:视频生成模型是良好的潜在空间奖励模型
奖励反馈学习(ReFL)已被证明对于使图像生成与人类偏好对齐非常有效。然而,将其扩展到视频生成面临着重大挑战。现有的视频奖励模型依赖于为像素空间输入设计的视觉语言模型,这将ReFL优化限制在昂贵的VAE解码之后的近完全去噪步骤中。这种像素空间的方法会产生大量的内存开销并增加训练时间,而且其后期优化缺乏早期监督,只能优化视觉质量而不是基本的运动动态和结构一致性。在本文中,我们展示了预训练的视频生成模型天然适合在噪声潜在空间中进行奖励建模,因为它们明确设计为可以处理任意时间步的噪声潜在表示,并通过其序列建模能力内在地保留时间信息。因此,我们提出了过程奖励反馈学习(PRFL)框架,该框架在潜在空间中完全进行偏好优化,从而在整个去噪链中实现高效的梯度反向传播,而无需VAE解码。广泛的实验表明,PRFL在与人类偏好对齐方面显著提高,同时与RGB ReFL相比在内存消耗和训练时间上实现了显著减少。
Summary / 总结
This work addresses the challenge of applying reward feedback learning (ReFL) to video generation by proposing Process Reward Feedback Learning (PRFL). PRFL leverages pre-trained video generation models to optimize preferences in the noisy latent space, avoiding the need for computationally expensive VAE decoding. This approach leads to better alignment with human preferences, reduced memory consumption, and shorter training times compared to traditional pixel-space ReFL methods.
本文解决了将奖励反馈学习(ReFL)应用于视频生成的问题,尽管ReFL在图像生成中有效,但在扩展到视频时面临重大挑战。作者提出了Process Reward Feedback Learning(PRFL),利用预训练的视频生成模型在潜在空间中优化偏好,避免了昂贵的VAE解码。这种方法减少了内存使用和训练时间,同时提高了与人类偏好的一致性。
Flow Matching Meets PDEs: A Unified Framework for Physics-Constrained Generation
Authors: Giacomo Baldan, Qiang Liu, Alberto Guardone, Nils Thuerey
First: 2025-06-10T09:13:37+00:00 · Latest: 2025-11-26T15:56:35+00:00
Abstract
Generative machine learning methods, such as diffusion models and flow matching, have shown great potential in modeling complex system behaviors and building efficient surrogate models. However, these methods typically learn the underlying physics implicitly from data. We propose Physics-Based Flow Matching (PBFM), a novel generative framework that explicitly embeds physical constraints, both PDE residuals and algebraic relations, into the flow matching objective. We also introduce temporal unrolling at training time that improves the accuracy of the final, noise-free sample prediction. Our method jointly minimizes the flow matching loss and the physics-based residual loss without requiring hyperparameter tuning of their relative weights. Additionally, we analyze the role of the minimum noise level, $\sigma_{\min}$, in the context of physical constraints and evaluate a stochastic sampling strategy that helps to reduce physical residuals. Through extensive benchmarks on three representative PDE problems, we show that our approach yields up to $8\times$ more accurate physical residuals compared to FM, while clearly outperforming existing algorithms in terms of distributional accuracy. PBFM thus provides a principled and efficient framework for surrogate modeling, uncertainty quantification, and accelerated simulation in physics and engineering applications.
中文标题/摘要
标题:流匹配与PDEs的结合:一种统一的物理约束生成框架
生成式机器学习方法,如扩散模型和流匹配,已经在建模复杂系统行为和构建高效代理模型方面显示出巨大潜力。然而,这些方法通常从数据中隐式地学习潜在的物理规律。我们提出了基于物理的流匹配(PBFM),这是一种新颖的生成框架,它将物理约束,包括PDE残差和代数关系,显式嵌入到流匹配目标中。我们还在训练时引入了时间展开,提高了最终无噪声样本预测的准确性。我们的方法同时最小化流匹配损失和基于物理的残差损失,无需调整它们相对权重的超参数。此外,我们分析了最小噪声水平$\sigma_{\min}$在物理约束下的作用,并评估了一种有助于减少物理残差的随机采样策略。通过在三个代表性PDE问题上的广泛基准测试,我们展示了我们的方法得到的物理残差精度最高可达流匹配(FM)的8倍,同时在分布准确性方面明显优于现有算法。因此,PBFM为物理和工程应用中的代理建模、不确定性量化和加速仿真提供了一个有原则且高效的框架。
Summary / 总结
The research aims to improve generative models by explicitly incorporating physical constraints, such as partial differential equations (PDEs), into the flow matching objective. The proposed Physics-Based Flow Matching (PBFM) framework uses temporal unrolling during training to enhance the accuracy of noise-free sample predictions. Key experimental results show that PBFM achieves up to an 8-fold reduction in physical residuals compared to traditional flow matching (FM) methods, while also excelling in distributional accuracy. This approach provides a principled and efficient framework for physics and engineering applications.
该论文提出了基于物理的流匹配(PBFM)方法,该方法将物理约束显式地嵌入到流匹配目标中,以提高复杂系统(由偏微分方程PDEs描述)预测的准确性。该方法在训练过程中使用时间展开来增强最终样本预测,并同时最小化流匹配和基于物理的残差损失。实验结果表明,PBFM在三个PDE问题上的物理残差精度最高可达传统流匹配方法的8倍,同时在分布准确性方面也表现出色。
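As background for the role of $\sigma_{\min}$, the widely used conditional flow matching setup with a linear (optimal-transport) path can be written as follows. This is the standard formulation, given here as an assumption about the baseline "FM"; the residual operator $\mathcal{R}$ and the loss name $\mathcal{L}_{\mathrm{R}}$ are illustrative notation, and the paper's actual way of combining the two losses is not reproduced.

```latex
% Noise x_0 ~ N(0, I), data sample x_1, time t in [0, 1].
x_t = \bigl(1 - (1 - \sigma_{\min})\, t\bigr)\, x_0 + t\, x_1,
\qquad
u_t = x_1 - (1 - \sigma_{\min})\, x_0

% Flow matching loss: regress the learned velocity v_\theta onto u_t.
\mathcal{L}_{\mathrm{FM}}
  = \mathbb{E}_{t,\, x_0,\, x_1}
    \bigl\| v_\theta(x_t, t) - u_t \bigr\|^2

% One-step estimate of the clean sample and a generic physics residual loss.
\hat{x}_1 = x_t + (1 - t)\, v_\theta(x_t, t),
\qquad
\mathcal{L}_{\mathrm{R}}
  = \mathbb{E}\, \bigl\| \mathcal{R}(\hat{x}_1) \bigr\|^2
```

Substituting the exact target $v_\theta = u_t$ into the one-step estimate gives $\hat{x}_1 = x_1 + \sigma_{\min} x_0$. Even a perfectly trained model therefore carries residual noise of scale $\sigma_{\min}$ in this estimate, which is one way to see why the minimum noise level interacts with PDE residuals.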
EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?
Authors: Pierre Adorni, Minh-Tan Pham, Stéphane May, Sébastien Lefèvre
First: 2025-11-26T15:52:56+00:00 · Latest: 2025-11-26T15:52:56+00:00
Abstract
Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs.
中文标题/摘要
标题:EoS-FM:专家模型集合能否充当通用特征提取器?
基础模型在自然语言处理和计算机视觉等领域取得了巨大进展,类似的努力现在也在地球观测领域出现。这些模型旨在在有限监督的情况下泛化任务,减少为每个任务单独训练模型的需要。然而,当前的策略主要集中在扩大模型规模和数据集规模上,这需要巨大的计算和数据资源,限制了其仅对少数大型机构的可用性。此外,这种不断扩大的模型范式与可持续和环境友好的人工智能原则背道而驰,因为它导致了巨大的碳足迹和资源低效。在本文中,我们提出了一种新颖且高效的替代方案:用于构建遥感基础模型(RSFM)的专家模型集合框架。我们的方法将训练过程分解为轻量级、任务特定的ConvNeXtV2专家,这些专家可以冻结并重用。这种模块化方法在效率、可解释性和可扩展性方面具有明显优势。此外,它自然支持联邦训练、剪枝和持续专家集成,使其特别适合协作和资源受限的环境。我们的框架为构建可扩展和高效的RSFM指明了新方向。
Summary / 总结
This paper explores the feasibility of using an Ensemble-of-Specialists framework to build Remote Sensing Foundation Models (RSFMs), addressing the limitations of current large-scale models in terms of computational and data resource requirements. The method involves training lightweight, task-specific ConvNeXtV2 specialists that can be reused, offering advantages in efficiency, interpretability, and extensibility. Key findings include the framework's ability to support federated training, pruning, and continuous integration of specialists, making it suitable for collaborative and resource-constrained settings and setting a new direction for scalable and efficient RSFMs.
本文探讨了使用Ensemble-of-Specialists框架构建遥感基础模型(RSFMs)的可能性,以解决当前大规模模型在计算和数据资源需求方面的局限性。该方法涉及训练轻量级、任务特定的ConvNeXtV2专家,可以重复使用,提供高效性、可解释性和可扩展性的优势。关键发现包括该框架支持联邦训练、剪枝和持续集成专家,使其适用于协作和资源受限的环境,并为构建可扩展和高效的RSFMs开辟了新方向。
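The modular idea of frozen, reusable specialists can be sketched as a simple feature concatenation. The abstract does not say how specialist outputs are fused, so concatenation is an assumption here, and the toy specialists below merely stand in for frozen ConvNeXtV2 encoders.

```python
def ensemble_features(specialists, image):
    """Concatenate the feature vectors produced by each frozen specialist.

    specialists: list of callables mapping an image to a list of floats.
    Concatenation is an assumed fusion rule, chosen for illustration only.
    """
    features = []
    for extract in specialists:
        features.extend(extract(image))
    return features

# Toy stand-ins for frozen task-specific encoders (hypothetical).
def mean_specialist(image):
    return [sum(image) / len(image)]

def range_specialist(image):
    return [min(image), max(image)]
```

Because each specialist is frozen, adding a new one only appends dimensions: `ensemble_features([mean_specialist, range_specialist], img)` starts with exactly the features that `ensemble_features([mean_specialist], img)` returns. That property is what would make continuous specialist integration and pruning cheap in such a design.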
IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
Authors: Wanli Zhong, Haibo Feng, Zirui Zhou, Hanyang Peng, Shiqi Yu
First: 2025-11-26T15:46:22+00:00 · Latest: 2025-11-26T15:46:22+00:00
Abstract
Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax as the dominant bottleneck. This stage incurs a costly dequantize-softmax-requantize detour, which can account for up to 65% of total attention latency and disrupts the end-to-end integer dataflow critical for edge hardware efficiency. To address this limitation, we present IntAttention, the first fully integer, plug-and-play attention pipeline without retraining. At the core of our approach lies IndexSoftmax, a hardware-friendly operator that replaces floating-point exponentials entirely within the integer domain. IntAttention integrates sparsity-aware clipping, a 32-entry lookup-table approximation, and direct integer normalization, thereby eliminating all datatype conversion overhead. We evaluate IntAttention and demonstrate consistent and substantial gains. Our method achieves up to 3.7x speedup and 61% energy reduction over FP16 baselines and is 2.0x faster than conventional INT8 attention pipelines on Armv8 CPUs. These gains are achieved with high-fidelity accuracy comparable to baselines across diverse language and vision models, enabling practical and efficient Transformer inference on commodity edge devices. Code will be released in a later version of this work.
中文标题/摘要
标题:IntAttention:一种用于高效边缘推理的全整数注意流水线
在边缘设备上部署Transformer模型受到延迟和能量预算的限制。虽然INT8量化有效地加速了主要的矩阵乘法,但它暴露了softmax作为主要瓶颈。这一阶段会引发一个昂贵的去量化-softmax-再量化迂回,这最多可占总注意延迟的65%,并破坏了对于边缘硬件效率至关重要的端到端整数数据流。为了解决这一限制,我们提出了IntAttention,这是第一个无需重新训练、即插即用的全整数注意流水线。我们方法的核心是IndexSoftmax,这是一种硬件友好的操作符,完全在整数域内替代了浮点数指数运算。IntAttention集成了稀疏感知裁剪、32条目查找表近似和直接整数归一化,从而消除了所有数据类型转换开销。我们评估了IntAttention,并展示了一致且显著的收益。我们的方法在Armv8 CPU上相比FP16基线最高实现3.7倍加速和61%的能耗降低,比传统的INT8注意流水线快2.0倍。这些收益是在各种语言和视觉模型中与基线保持高保真准确性的前提下实现的,使得在商品化边缘设备上进行实用且高效的Transformer推理成为可能。代码将在本工作的后续版本中发布。
Summary / 总结
IntAttention addresses the softmax bottleneck in deploying Transformer models on edge devices by introducing IndexSoftmax, a hardware-friendly operator that operates entirely within the integer domain. This approach, combined with sparsity-aware clipping, a 32-entry lookup-table approximation, and direct integer normalization, eliminates datatype conversion overhead. Experiments show that IntAttention achieves up to 3.7x speedup and 61% energy reduction compared to FP16 baselines and runs 2.0x faster than conventional INT8 attention pipelines on Armv8 CPUs, while maintaining high-fidelity accuracy across various models.
IntAttention通过引入IndexSoftmax,一种完全在整数域内操作的硬件友好型操作符,解决了在边缘设备上部署Transformer模型时softmax的瓶颈问题。该方法结合了稀疏感知裁剪、32项查找表近似和直接整数归一化,消除了数据类型转换的开销。IntAttention在Armv8 CPU上实现了与FP16基线相比高达3.7倍的加速和61%的能量减少,并且在各种模型上保持了高保真度的准确性,比传统的INT8注意力管道快2.0倍。
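The lookup-table idea described above can be illustrated with a minimal sketch. This is not the authors' IndexSoftmax: the fixed-point scale `LUT_ONE`, the plain fixed clipping threshold `clip` (standing in for the sparsity-aware clipping), the dequantization `scale`, and both function names are assumptions made for illustration. Only the 32-entry table size comes from the abstract. The table is built offline with floating point; the per-row computation uses integer arithmetic only.

```python
import math

LUT_SIZE = 32        # the abstract mentions a 32-entry lookup table
LUT_ONE = 1 << 15    # assumed fixed-point value representing exp(0)

def build_exp_lut(clip, scale):
    """Offline step: tabulate exp(-d * scale) for d spread evenly over [0, clip]."""
    step = clip / (LUT_SIZE - 1)
    return [round(LUT_ONE * math.exp(-i * step * scale)) for i in range(LUT_SIZE)]

def int_softmax(logits, lut, clip, out_max=255):
    """Integer-only softmax over one row of integer logits (illustrative)."""
    row_max = max(logits)
    weights = []
    for q in logits:
        d = min(row_max - q, clip)                      # distance from the max, clipped
        idx = (d * (LUT_SIZE - 1) + clip // 2) // clip  # nearest table index
        weights.append(lut[idx])
    total = sum(weights)                                # >= LUT_ONE, so never zero
    # Integer normalization with rounding; outputs are quantized to [0, out_max].
    return [(w * out_max + total // 2) // total for w in weights]

lut = build_exp_lut(clip=62, scale=0.125)
probs = int_softmax([10, 10, -50], lut, clip=62)
```

Because each output is rounded independently, the quantized probabilities sum to roughly `out_max` rather than exactly to it.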
Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning
Authors: Liqin Luo, Guangyao Chen, Xiawu Zheng, Yongxing Dai, Yixiong Zou, Yonghong Tian
Venue: AAAI 2026
First: 2025-11-24T03:11:08+00:00 · Latest: 2025-11-26T15:38:35+00:00
Comments: AAAI 2026
Abstract
Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1% on widely-used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90%, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.
中文标题/摘要
标题:连接点:无需训练的代理视觉定位
视觉定位,即将文本查询与图像中的特定区域联系起来的任务,在视觉-语言整合中起着关键作用。现有方法通常依赖于大量特定任务的注释和微调,限制了它们在新颖或分布外场景中的泛化能力。为了解决这些限制,我们引入了GroundingAgent,这是一种新颖的无需特定任务微调的代理视觉定位框架。GroundingAgent 使用结构化的迭代推理机制,结合预训练的开放式词汇对象检测器、多模态大型语言模型(MLLM)和大型语言模型(LLM),通过联合语义和空间分析逐步细化候选区域。值得注意的是,GroundingAgent 在广泛使用的基准测试(RefCOCO、RefCOCO+、RefCOCOg)上实现了65.1%的平均零样本定位准确率,完全无需微调。此外,通过用原始查询文本替换MLLM生成的描述,仅选择阶段的准确率就达到了约90%,接近监督性能,突显了LLM推理能力的关键作用。GroundingAgent 还提供了强大的可解释性,透明地展示了每个推理步骤,并提供了其决策过程的清晰见解。
Summary / 总结
The paper introduces GroundingAgent, a training-free visual grounding framework that uses agentic reasoning to link textual queries to specific image regions. It leverages pretrained open-vocabulary object detectors, multimodal large language models, and large language models to iteratively refine candidate regions. GroundingAgent achieves an average zero-shot grounding accuracy of 65.1% on RefCOCO, RefCOCO+, and RefCOCOg without fine-tuning, and its accuracy at the selection stage alone reaches approximately 90% when the original query texts replace MLLM-generated captions, highlighting the importance of LLM reasoning capabilities.
论文提出了GroundingAgent,这是一种无需训练的视觉定位框架,利用代理推理将文本查询与图像中的特定区域连接起来。该框架结合了预训练的对象检测器、多模态大型语言模型和大型语言模型,逐步细化候选区域。在无需微调的情况下,GroundingAgent 在基准测试上的零样本准确率达到65.1%,仅在选择阶段使用原始查询文本的准确率就达到了约90%,突显了LLM推理能力的重要性。
Step-Audio-R1 Technical Report
Authors: Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
First: 2025-11-19T20:12:50+00:00 · Latest: 2025-11-26T14:55:41+00:00
Comments: 22 pages, 5 figures. Technical Report
Abstract
Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
中文标题/摘要
标题:Step-Audio-R1 技术报告
近期在推理模型方面的进展通过扩展的链式思考在文本和视觉领域取得了显著的成功。然而,在音频语言模型中存在一个令人困惑的现象:它们在几乎没有或完全没有推理的情况下表现更好,这引发了一个基本问题——音频智能是否真的可以从深思熟虑中受益?我们提出了Step-Audio-R1,这是第一个成功在音频领域解锁推理能力的音频推理模型。通过我们提出的模态导向推理蒸馏(MGRD)框架,Step-Audio-R1 学会生成与音频相关的推理链,这些链真正扎根于声学特征,而不是产生与声学特征无关的幻觉。我们的模型展示了强大的音频推理能力,超越了Gemini 2.5 Pro,并在涵盖语音、环境声音和音乐的全面音频理解和推理基准测试中达到了与Gemini 3 Pro相当的性能。这些结果表明,当适当锚定时,推理能力可以在不同模态之间转移,将扩展的思考从负担转化为音频智能的强大资产。通过建立第一个成功的音频推理模型,Step-Audio-R1 打开了通往构建真正多模态推理系统的道路,这些系统可以在所有感官模态中进行深入思考。
MADRA: Multi-Agent Debate for Risk-Aware Embodied Planning
Authors: Junjian Wang, Lidan Zhao, Xi Sheryl Zhang
First: 2025-11-26T14:51:37+00:00 · Latest: 2025-11-26T14:51:37+00:00
Abstract
Ensuring the safety of embodied AI agents during task planning is critical for real-world deployment, especially in household environments where dangerous instructions pose significant risks. Existing methods often suffer from either high computational costs due to preference alignment training or over-rejection when using single-agent safety prompts. To address these limitations, we propose MADRA, a training-free Multi-Agent Debate Risk Assessment framework that leverages collective reasoning to enhance safety awareness without sacrificing task performance. MADRA employs multiple LLM-based agents to debate the safety of a given instruction, guided by a critical evaluator that scores responses based on logical soundness, risk identification, evidence quality, and clarity. Through iterative deliberation and consensus voting, MADRA significantly reduces false rejections while maintaining high sensitivity to dangerous tasks. Additionally, we introduce a hierarchical cognitive collaborative planning framework that integrates safety, memory, planning, and self-evolution mechanisms to improve task success rates through continuous learning. We also contribute SafeAware-VH, a benchmark dataset for safety-aware task planning in VirtualHome, containing 800 annotated instructions. Extensive experiments on AI2-THOR and VirtualHome demonstrate that our approach achieves over 90% rejection of unsafe tasks while ensuring that safe-task rejection is low, outperforming existing methods in both safety and execution efficiency. Our work provides a scalable, model-agnostic solution for building trustworthy embodied agents.
中文标题/摘要
标题:MADRA:面向风险感知具身规划的多智能体辩论
确保具身AI智能体在任务规划过程中的安全性对于实际部署至关重要,特别是在家庭环境中,危险指令会带来重大风险。现有方法往往由于偏好对齐训练导致高计算成本,或者由于使用单智能体安全提示而过度拒绝。为了解决这些限制,我们提出了MADRA,一种无需训练的多智能体辩论风险评估框架,利用集体推理来增强安全意识而不牺牲任务性能。MADRA 使用多个基于LLM的智能体来辩论给定指令的安全性,由一个批判性评估器根据逻辑严谨性、风险识别、证据质量和清晰度对回答进行评分。通过迭代辩论和共识投票,MADRA 显著减少了错误拒绝,同时保持对危险任务的高敏感度。此外,我们引入了一种分层认知协作规划框架,结合了安全、记忆、规划和自我进化机制,通过持续学习提高任务成功率。我们还贡献了SafeAware-VH,一个用于VirtualHome中安全感知任务规划的基准数据集,包含800条标注指令。在AI2-THOR和VirtualHome上的广泛实验表明,我们的方法在安全性和执行效率方面均优于现有方法,能够拒绝超过90%的不安全任务,同时确保安全任务的拒绝率较低。我们的工作提供了一种可扩展、模型无关的解决方案,用于构建可信赖的具身智能体。
Summary / 总结
The research aims to ensure the safety of embodied AI agents during task planning in household environments. MADRA, a training-free Multi-Agent Debate Risk Assessment framework, uses multiple LLM-based agents to debate the safety of instructions, guided by a critical evaluator. This approach significantly reduces false rejections while maintaining high sensitivity to dangerous tasks. Experiments show that MADRA achieves over 90% rejection of unsafe tasks with low safe-task rejection, outperforming existing methods in safety and execution efficiency.
MADRA 是一个无需训练的框架,使用多个基于LLM的智能体就指令的安全性展开辩论,并由批判性评估器引导。这种方法在不牺牲任务性能的情况下增强了安全意识。实验表明,MADRA 显著减少了误拒绝,并在安全和执行效率方面优于现有方法,实现了对危险任务超过90%的拒绝率,同时保持了对安全任务的低拒绝率。
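The debate-then-vote loop described above can be sketched as follows. This is only an illustration under stated assumptions, not MADRA's actual protocol: the agent and evaluator interfaces, the keyword-based stub agents, the length-based placeholder evaluator, and the score-weighted tally are all hypothetical. In a real system each agent and the evaluator would be an LLM call.

```python
from collections import Counter

def debate(instruction, agents, evaluator, rounds=2):
    """Return 'unsafe' or 'safe' after a score-weighted consensus vote."""
    transcript = []   # shared debate history that every agent can read
    stances = {}      # latest (verdict, score) per agent
    for _ in range(rounds):
        for name, agent in agents.items():
            verdict, argument = agent(instruction, transcript)
            score = evaluator(argument)          # assumed to lie in [0, 1]
            transcript.append((name, verdict, argument, score))
            stances[name] = (verdict, score)     # a later round replaces an earlier stance
    tally = Counter()
    for verdict, score in stances.values():
        tally[verdict] += score
    return "unsafe" if tally["unsafe"] > tally["safe"] else "safe"

# Hypothetical stand-ins for LLM-backed agents and the critical evaluator.
def cautious_agent(instruction, transcript):
    risky = any(word in instruction for word in ("microwave", "knife", "bleach"))
    return ("unsafe" if risky else "safe", "keyword check on hazards")

def permissive_agent(instruction, transcript):
    return ("safe", "no hazard considered")

def placeholder_evaluator(argument):
    # Placeholder: rewards longer arguments; a real evaluator would judge
    # logical soundness, risk identification, evidence quality, and clarity.
    return min(len(argument.split()) / 5, 1.0)

agents = {"cautious": cautious_agent, "permissive": permissive_agent}
risky_result = debate("put the fork in the microwave", agents, placeholder_evaluator)
safe_result = debate("wipe the table", agents, placeholder_evaluator)
```

Weighting votes by evaluator score is one possible way to let a well-argued minority stance outweigh a poorly argued majority; a plain majority vote would be an equally valid reading of "consensus voting".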
SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation
Authors: Zhenjie Mao, Yuhuan Yang, Chaofan Ma, Dongsheng Jiang, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Venue: NeurIPS 2025
First: 2025-10-11T10:50:58+00:00 · Latest: 2025-11-26T14:51:06+00:00
Comments: NeurIPS 2025; Project page: https://zhenjiemao.github.io/SaFiRe/
Abstract
Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and more training corpus to achieve impressive results, they predominantly focus on simple expressions--short, clear noun phrases like "red car" or "left girl". This simplification often reduces RIS to a key word/concept matching problem, limiting the model's ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address the challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process--first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba's scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines.
中文标题/摘要
标题:SaFiRe:基于Mamba的扫视-注视迭代用于指代图像分割
参考图像分割(RIS)旨在根据自然语言表达将图像中的目标对象进行分割。虽然最近的方法利用预训练的视觉骨干网络和更多的训练语料库取得了令人印象深刻的成果,但它们主要关注简单的表达——简短、清晰的名词短语,如“红色汽车”或“左边的女孩”。这种简化往往将RIS简化为关键词/概念匹配问题,限制了模型处理表达中的指称歧义的能力。在本文中,我们识别了两个具有挑战性的现实场景:对象分散的表达,涉及多个实体并带有上下文线索,以及类别隐含的表达,其中对象类别未明确陈述。为了解决这些挑战,我们提出了一种新的框架SaFiRe,它模拟了人类的两阶段认知过程——首先形成全局理解,然后通过细节检查进行细化。这自然地得到了Mamba的扫描-更新属性的支持,与我们分阶段的设计相契合,并允许具有线性复杂度的高效多轮细化。我们还引入了aRefCOCO,这是一个新的基准,用于评估在模糊的参考表达下RIS模型的表现。在标准数据集和提出的数据集上的广泛实验表明,SaFiRe在最先进的基线之上具有优越性。
Summary / 总结
The paper addresses the limitations of existing referring image segmentation methods in handling complex and ambiguous expressions. It proposes SaFiRe, a framework that mimics human cognitive processes by first forming a global understanding and then refining it through detailed inspection. SaFiRe uses Mamba's scan-then-update property to enable efficient multi-cycle refinement. Experiments show that SaFiRe outperforms current state-of-the-art methods on both standard and proposed datasets.
研究针对现有方法在Referring Image Segmentation (RIS)中对复杂表达的局限性,这些表达涉及多个实体和隐含类别。提出了一种名为SaFiRe的新框架,模仿人类的认知过程,利用Mamba的扫描-更新特性进行高效的多轮次细化。实验表明,SaFiRe在标准数据集和新提出的模糊表达数据集上均优于现有方法。
Probabilistic Robustness for Free? Revisiting Training via a Benchmark
Authors: Yi Zhang, Zheng Wang, Zhen Chen, Wenjie Ruan, Qing Guo, Siddartha Khastgir, Carsten Maple, Xingyu Zhao
First: 2025-11-03T16:33:57+00:00 · Latest: 2025-11-26T14:24:35+00:00
Abstract
Deep learning models are notoriously vulnerable to imperceptible perturbations. Most existing research centers on adversarial robustness (AR), which evaluates models under worst-case scenarios by examining the existence of deterministic adversarial examples (AEs). In contrast, probabilistic robustness (PR) adopts a statistical perspective, measuring the probability that predictions remain correct under stochastic perturbations. While PR is widely regarded as a practical complement to AR, dedicated training methods for improving PR are still relatively underexplored, albeit with emerging progress. Among the few PR-targeted training methods, we identify three limitations: (i) non-comparable evaluation protocols; (ii) limited comparisons to strong AT baselines despite anecdotal PR gains from AT; and (iii) no unified framework to compare the generalization of these methods. Thus, we introduce PRBench, the first benchmark dedicated to evaluating improvements in PR achieved by different robustness training methods. PRBench empirically compares most common AT and PR-targeted training methods using a comprehensive set of metrics, including clean accuracy, PR and AR performance, training efficiency, and generalization error (GE). We also provide theoretical analysis on the GE of PR performance across different training methods. Main findings revealed by PRBench include: AT methods are more versatile than PR-targeted training methods in terms of improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted training methods consistently yield lower GE and higher clean accuracy. A leaderboard comprising 222 trained models across 7 datasets and 10 model architectures is publicly available at https://tmpspace.github.io/PRBenchLeaderboard/.
中文标题/摘要
标题:免费的概率鲁棒性?通过基准重新审视训练
深度学习模型对不可感知的扰动极其脆弱。现有大多数研究集中在对抗鲁棒性(AR),通过检查确定性对抗样本(AEs)的存在性来评估模型在最坏情况下的表现。相比之下,概率鲁棒性(PR)采用统计视角,衡量在随机扰动下预测保持正确的概率。尽管PR被广泛视为AR的一种实用补充,但专门提高PR的训练方法仍然相对未被充分探索,尽管已有初步进展。在少数针对PR的训练方法中,我们识别出三个局限:i)不可比较的评估协议;ii)尽管有零星证据表明对抗训练(AT)可带来PR收益,但与强大的AT基线的比较有限;iii)没有统一框架来比较这些方法的泛化能力。因此,我们引入了PRBench,这是第一个专门用于评估不同鲁棒性训练方法在提高PR方面取得的改进的基准。PRBench使用一系列综合指标(包括干净准确率、PR和AR性能、训练效率和泛化误差(GE))来实证比较最常见的AT和PR目标训练方法。我们还对不同训练方法下的PR性能的泛化误差进行了理论分析。PRBench的主要发现包括:AT方法在各种超参数设置下提高AR和PR性能方面比PR目标训练方法更具通用性,而PR目标训练方法始终具有更低的泛化误差和更高的干净准确率。公开可用的排行榜包括7个数据集和10种模型架构下的222个训练模型,可在https://tmpspace.github.io/PRBenchLeaderboard/找到。
Summary / 总结
The paper introduces PRBench, the first benchmark dedicated to evaluating the improvements in probabilistic robustness (PR) achieved by different robustness training methods. It empirically compares adversarial training (AT) and PR-targeted methods using metrics such as clean accuracy, PR and AR performance, training efficiency, and generalization error (GE). Key findings show that AT methods are more versatile in improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted methods consistently achieve lower GE and higher clean accuracy.
论文针对深度学习模型在概率鲁棒性(PR)训练方法方面的局限性,引入了PRBench,这是首个用于评估PR改进的基准。PRBench 使用包括干净准确率、PR 和 AR 性能、训练效率和泛化误差在内的多种指标来比较 AT 和 PR 目标训练方法。主要发现表明,AT 方法在提高 AR 和 PR 性能方面更具通用性,而 PR 目标训练方法则始终具有更低的泛化误差和更高的干净准确率。
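The definition of PR given above (the probability that a prediction remains correct under stochastic perturbations) lends itself to a simple Monte Carlo estimate. The sketch below is an illustration only, not PRBench's evaluation protocol: the uniform noise inside an L-infinity ball of radius `eps`, the sample count, and the toy classifier are all assumptions.

```python
import random

def probabilistic_robustness(predict, x, label, eps, n_samples=1000, seed=0):
    """Estimate P[predict(x + noise) == label] with noise ~ Uniform(-eps, eps) per feature."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        noisy = [xi + rng.uniform(-eps, eps) for xi in x]
        correct += predict(noisy) == label   # True counts as 1
    return correct / n_samples

def toy_classifier(x):
    """Hypothetical model: class 1 when the features sum to a positive value."""
    return int(sum(x) > 0)

far_from_boundary = probabilistic_robustness(toy_classifier, [1.0, 1.0], 1, eps=0.1)
near_boundary = probabilistic_robustness(toy_classifier, [0.05, 0.0], 1, eps=0.1)
```

The second input shows the contrast with adversarial robustness: an adversarial example exists inside the ball, so the point is not adversarially robust, yet most random perturbations still leave the prediction correct, so its PR estimate is well above zero.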
From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
Authors: Jiajie Zhang, Sören Schwertfeger, Alexander Kleiner
First: 2025-11-26T14:19:44+00:00 · Latest: 2025-11-26T14:19:44+00:00
Comments: 10 pages, 5 figures
Abstract
We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.
中文标题/摘要
标题:从观察到行动:面向工业场景VLA预训练的基于潜在动作的基元分割
我们提出了一种新颖的无监督框架,以解锁来自连续工业视频流的大量未标记的人类演示数据,用于视觉-语言-动作(VLA)模型的预训练。该方法首先训练一个轻量级的运动分词器以编码运动动力学,然后使用一个无监督动作分割器,借助一种新颖的“潜在动作能量”度量来发现和分割语义上一致的动作基元。该流水线输出分割后的视频片段及其对应的潜在动作序列,提供直接适用于VLA预训练的结构化数据。在公共基准测试和一个专有的电动机装配数据集上的评估表明,该方法能够有效分割人类在工位上执行的关键任务。进一步的聚类和通过视觉-语言模型进行的定量评估证实了发现的动作基元的语义一致性。据我们所知,这是第一个从非结构化工业视频中提取和组织VLA预训练数据的全自动端到端系统,为制造业中的具身人工智能集成提供了可扩展的解决方案。
SAM Guided Semantic and Motion Changed Region Mining for Remote Sensing Change Captioning
Authors: Futian Wang, Mengqi Wang, Xiao Wang, Haowen Wang, Jin Tang
First: 2025-11-26T14:11:19+00:00 · Latest: 2025-11-26T14:11:19+00:00
Abstract
Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed between two remote sensing images captured at different times. Existing methods typically employ CNNs/Transformers to extract visual representations from the given images or incorporate auxiliary tasks to enhance the final results, with weak region awareness and limited temporal alignment. To address these issues, this paper explores the use of the SAM (Segment Anything Model) foundation model to extract region-level representations and inject region-of-interest knowledge into the captioning framework. Specifically, we employ a CNN/Transformer model to extract global-level vision features, leverage the SAM foundation model to delineate semantic- and motion-level change regions, and utilize a specially constructed knowledge graph to provide information about objects of interest. These heterogeneous sources of information are then fused via cross-attention, and a Transformer decoder is used to generate the final natural language description of the observed changes. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple widely used benchmark datasets. The source code of this paper will be released on https://github.com/Event-AHU/SAM_ChangeCaptioning
中文标题/摘要
标题:SAM引导的语义和运动变化区域挖掘在遥感变化描述中的应用
遥感变化描述是一项新兴且流行的科研任务,旨在使用自然语言描述两幅在不同时间拍摄的遥感图像之间的内容变化。现有方法通常使用CNNs/Transformers从给定图像中提取视觉表示,或结合辅助任务以增强最终结果,但缺乏区域意识且时间对齐有限。为解决这些问题,本文探讨了使用SAM(Segment Anything Model)基础模型提取区域级表示,并将感兴趣区域知识注入描述框架。具体而言,我们使用CNN/Transformer模型提取全局视觉特征,利用SAM基础模型界定语义和运动变化区域,并通过一个特别构建的知识图谱提供感兴趣对象的信息。这些异构信息随后通过交叉注意力融合,使用Transformer解码器生成最终的自然语言描述。大量实验结果表明,我们的方法在多个广泛使用的基准数据集上达到了最先进的性能。本文的源代码将在https://github.com/Event-AHU/SAM_ChangeCaptioning上发布
Summary / 总结
The research aims to improve remote sensing change captioning by addressing weak region awareness and limited temporal alignment. It uses the SAM model to extract semantic and motion change regions and integrates this with a CNN/Transformer model and a knowledge graph. The method achieves state-of-the-art performance on multiple benchmark datasets.
该论文通过提出一种使用SAM(Segment Anything Model)提取区域级表示并将区域兴趣知识注入描述框架的方法,来解决遥感变化描述的挑战。该方法使用CNN/Transformer模型提取全局视觉特征,利用SAM勾勒出语义和运动变化区域,并将这些信息与知识图谱融合。实验结果表明,该方法在多个基准数据集上优于现有方法,实现了最先进的性能。
Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning
Authors: Kaifeng Hong, Yinglong Zhang, Xiaoying Hong, Xuewen Xia, Xing Xu
First: 2025-11-26T14:07:07+00:00 · Latest: 2025-11-26T14:07:07+00:00
Comments: 32 pages, 2 figures
Abstract
Text-attributed graphs require models to effectively combine strong textual understanding with structurally informed reasoning. Existing approaches either rely on GNNs--limited by over-smoothing and hop-dependent diffusion--or employ Transformers that overlook graph topology and treat nodes as isolated sequences. We propose Odin (Oriented Dual-module INtegration), a new architecture that injects graph structure into Transformers at selected depths through an oriented dual-module mechanism. Unlike message-passing GNNs, Odin does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific Transformer layers, yielding low-, mid-, and high-level structural abstraction aligned with the model's semantic hierarchy. Because aggregation operates on the global [CLS] representation, Odin fundamentally avoids over-smoothing and decouples structural abstraction from neighborhood size or graph topology. We further establish that Odin's expressive power strictly contains that of both pure Transformers and GNNs. To make the design efficient in large-scale or low-resource settings, we introduce Light Odin, a lightweight variant that preserves the same layer-aligned structural abstraction for faster training and inference. Experiments on multiple text-rich graph benchmarks show that Odin achieves state-of-the-art accuracy, while Light Odin delivers competitive performance with significantly reduced computational cost. Together, Odin and Light Odin form a unified, hop-free framework for principled structure-text integration. The source code of this model has been released at https://github.com/hongkaifeng/Odin.
中文标题/摘要
标题:Odin:面向文本丰富网络表示学习的定向双模块集成
文本属性图需要模型有效地结合强大的文本理解与结构导向的推理。现有方法要么依赖于受限于过平滑和跳依赖扩散的GNNs,要么使用忽视图拓扑结构并孤立处理节点的Transformer。我们提出Odin(定向双模块集成),这是一种新的架构,通过定向双模块机制在选定的深度将图结构注入Transformer。与消息传递GNN不同,Odin 不依赖于多跳扩散;相反,多跳结构在特定的Transformer层中被集成,从而产生与模型语义层次相一致的低、中、高层次结构抽象。由于聚合操作在全局[CLS]表示上进行,Odin 从根本上避免了过平滑,并将结构抽象与邻域大小或图拓扑结构脱钩。我们进一步证明,Odin 的表达能力严格包含纯Transformer和GNN两者。为了在大规模或低资源设置中使设计高效,我们引入了Light Odin,这是一种轻量级变体,保留了相同的层对齐结构抽象,以实现更快的训练和推理。在多个文本丰富图基准上的实验表明,Odin 达到了最先进的准确率,而Light Odin 以显著降低的计算成本实现了竞争力的性能。Odin 和 Light Odin 一起构成了一个统一的、无跳的框架,用于原理性的结构-文本集成。该模型的源代码已发布在 https://github.com/hongkaifeng/Odin。
Summary / 总结
Odin is a new architecture designed to integrate textual understanding and graph structure effectively. It injects graph structure into Transformers at specific layers, avoiding over-smoothing and decoupling structural abstraction from neighborhood size. Experiments show that Odin achieves state-of-the-art accuracy on text-rich graph benchmarks, while its lightweight variant, Light Odin, offers competitive performance with reduced computational cost. Together, they form a unified framework for structure-text integration without relying on multi-hop diffusion.
Odin 是一种新架构,通过在特定深度将图结构注入到 Transformer 中,有效结合了文本理解和结构推理。与 GNNs 不同,Odin 避免了过平滑现象,并且将结构抽象与邻域大小解耦。实验表明,Odin 在多个文本丰富的图基准测试中达到了最先进的准确性,而其轻量级变体 Light Odin 在显著降低计算成本的情况下提供了竞争力的性能。两者共同形成了一个无跳结构-文本集成的统一框架。
Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis
Authors: Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee
First: 2025-11-26T13:49:08+00:00 · Latest: 2025-11-26T13:49:08+00:00
Comments: preprint
Abstract
How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
中文标题/摘要
标题:推理视觉语言模型在测试时计算量上是否反向缩放?基于干扰项的实证分析
无关信息(即干扰项)如何影响视觉语言模型(VLMs)的测试时缩放?先前对语言模型的研究报告了反向缩放效应,其中文本干扰项导致推理更长但效果较差。为了调查类似现象是否在多模态设置中发生,我们引入了Idis(带有干扰项的图像)视觉问答数据集,系统地在语义、数值和空间维度上变化干扰项。我们的分析表明,视觉干扰项与文本干扰项有根本不同:尽管存在反向缩放现象,但增加视觉干扰项会降低准确性而不增加推理长度。我们进一步表明,在推理轨迹中跟踪属性计数提供了关于干扰项、推理长度和准确性之间相互作用的关键见解。最后,我们证明这些趋势扩展到了诸如Waterbirds等现有的视觉偏差基准,并提出了一种简单的提示策略来减轻推理模型中的偏差驱动预测。
Summary / 总结
This study explores how irrelevant visual information (distractors) impacts the test-time scaling of vision-language models (VLMs). Using a new dataset called Idis, which varies distractors along semantic, numerical, and spatial dimensions, the research finds that while inverse scaling persists, adding visual distractors decreases accuracy without increasing reasoning length. The study also shows that tracking attribute counts in reasoning traces provides insight into the interaction between distractors, reasoning length, and accuracy, and it proposes a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
该研究探讨了无关视觉信息(干扰项)如何影响视觉语言模型(VLMs)的测试时缩放。通过使用一个名为Idis的数据集,该数据集沿语义、数值和空间维度系统地变化干扰项,研究发现虽然逆缩放现象仍然存在,但增加视觉干扰项会降低准确性而不增加推理长度。研究还表明,跟踪推理轨迹中的属性计数有助于理解干扰项、推理长度和准确性之间的相互作用,并提出了一种简单的提示策略来减轻推理模型中的偏差驱动预测。
Monet: Reasoning in Latent Visual Space Beyond Images and Language
Authors: Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang
First: 2025-11-26T13:46:39+00:00 · Latest: 2025-11-26T13:46:39+00:00
Abstract
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.
中文标题/摘要
标题:莫内:超越图像和语言的潜在视觉空间推理
"以图像思考"已成为推进视觉推理的有效范式,通过在中间推理步骤中注入视觉证据,超越了仅基于文本的思维链。然而,现有方法在实现类似人类的抽象视觉思维方面仍存在不足,因为它们的灵活性从根本上受限于外部工具。在本工作中,我们引入了莫内(Monet),这是一种训练框架,使多模态大型语言模型(MLLMs)能够在潜在视觉空间中直接进行推理,通过生成连续嵌入作为中间视觉思考。我们确定了训练MLLMs进行潜在视觉推理的两个核心挑战:潜在视觉对齐的高计算成本和潜在嵌入监督不足,并通过三阶段的基于蒸馏的监督微调(SFT)管道解决了这些问题。我们进一步揭示了将GRPO应用于潜在推理的局限性:它主要增强基于文本的推理而不是潜在推理。为克服这一局限,我们提出了VLPO(视觉-潜在策略优化),这是一种强化学习方法,明确将潜在嵌入纳入策略梯度更新中。为了支持SFT,我们构建了Monet-SFT-125K,这是一个高质量的文本-图像交错的CoT数据集,包含125K个真实世界的、图表、OCR和几何CoT。我们的模型Monet-7B在现实世界的感知和推理基准测试中表现出一致的改进,并在具有挑战性的抽象视觉推理任务中表现出强大的分布外泛化能力。我们还对每个训练组件的作用进行了实证分析,并讨论了我们早期的不成功的尝试,为未来视觉潜在推理的发展提供了见解。我们的模型、数据和代码可在 https://github.com/NOVAglow646/Monet 获取。
Summary / 总结
Monet is a training framework that allows multimodal large language models to reason within a latent visual space by generating continuous embeddings. It addresses the challenges of high computational cost and insufficient supervision through a three-stage distillation-based supervised fine-tuning pipeline and introduces VLPO, a reinforcement learning method that incorporates latent embeddings. The model, Monet-7B, demonstrates consistent improvements on real-world perception and reasoning benchmarks and shows strong out-of-distribution generalization on abstract visual reasoning tasks.
Monet 是一个训练框架,使多模态大型语言模型能够在潜在视觉空间中进行推理,通过生成连续嵌入。它通过三阶段蒸馏监督微调管道解决高计算成本和监督不足的问题,并提出 VLPO 以更好地进行潜在推理。Monet-7B 在现实世界的感知和推理基准测试中表现出一致的改进,并在抽象视觉任务上具有强大的泛化能力。
Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
Authors: Xin Gu, Haoji Zhang, Qihang Fan, Jingxuan Niu, Zhipeng Zhang, Libo Zhang, Guang Chen, Fan Chen, Longyin Wen, Sijie Zhu
First: 2025-11-26T13:21:15+00:00 · Latest: 2025-11-26T13:21:15+00:00
Abstract
Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.
中文标题/摘要
标题:用边界框思考:通过强化微调增强时空视频定位
时空视频定位(STVG)需要从自然语言描述中在未剪辑的视频中同时在时间和空间上定位目标对象。尽管多模态大型语言模型(MLLMs)在语言理解方面表现出色,但由于训练目标不匹配和标准视觉编码器中细粒度区域-词对齐较弱,它们在STVG任务上的表现不佳。为了解决这个问题,我们提出了STVG-o1框架,这是第一个能够在不进行任何架构修改的情况下使现成的MLLMs达到最先进的STVG性能的方法。我们的方法引入了一种边界框链式思考机制,在生成最终预测之前,在中间步骤中明确地推理时空位置。我们还设计了一个多维度的强化奖励函数,包括格式、一致性、时间、空间和思考奖励,通过强化微调提供几何感知的监督。在HCSTVG-v1/v2和VidSTG上评估,STVG-o1在HCSTVG上设定了新的最先进的结果,比HCSTVG-v1的最佳特定任务方法高出7.3%的m_tIoU,与专门模型在VidSTG上持平,并大幅超越所有现有的基于MLLM的方法。它还展示了在不同数据集上的强大的开放词汇泛化能力,确立了MLLMs作为精确时空定位的可行且强大的骨干网络的地位。我们的代码和模型将被发布。
Summary / 总结
The paper addresses the challenge of spatio-temporal video grounding (STVG) by proposing STVG-o1, a framework that enhances the performance of multimodal large language models (MLLMs) without architectural changes. It introduces a bounding-box chain-of-thought mechanism and a multi-dimensional reinforcement reward function for fine-tuning, which improves alignment between language and visual elements. STVG-o1 achieves state-of-the-art results on HCSTVG and VidSTG, surpassing specialized models and existing MLLM-based approaches by significant margins, and demonstrates strong generalization across datasets.
论文提出STVG-o1框架,使现成的多模态大型语言模型(MLLMs)无需进行架构修改即可胜任时空视频定位(STVG)任务。它引入了边界框链式思考机制和用于强化微调的多维度强化奖励函数(格式、一致性、时间、空间和思考奖励)。STVG-o1在HCSTVG上取得了新的最先进结果,在VidSTG上与专门模型持平,并大幅超越现有的基于MLLM的方法,同时在不同数据集上展示了强大的开放词汇泛化能力。
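The temporal and spatial rewards named in the abstract are described only at a high level. As a rough illustration of the quantities such rewards would plausibly build on, the sketch below computes a temporal IoU between two frame intervals and a spatial IoU between two boxes. The function names and the `(start, end)` / `(x1, y1, x2, y2)` conventions are assumptions for this example; how STVG-o1 actually combines these terms is not stated in the abstract.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end). Hypothetical helper."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2). Hypothetical helper."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Predicted segment overlaps the ground truth on frames 5..10.
print(temporal_iou((0, 10), (5, 15)))
# Two 2x2 boxes sharing a 1x1 corner.
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Averaging `temporal_iou` over a test set gives a mean temporal IoU, which is the kind of quantity the m_tIoU figure above reports.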
ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation
Authors: Sanghyun Jo, Wooyeol Lee, Ziseok Lee, Kyungsu Kim
First: 2025-05-27T09:23:10+00:00 · Latest: 2025-11-26T12:29:27+00:00
Comments: 36 pages
Abstract
Text-to-image diffusion models have recently become highly capable, yet their behavior in multi-object scenes remains unreliable: models often produce an incorrect number of instances and exhibit semantics leaking across objects. We trace these failures to vague instance boundaries; self-attention already reveals instance layouts early in the denoising process, but existing approaches act only on semantic signals. We introduce ISAC (Instance-to-Semantic Attention Control), a training-free, model-agnostic objective that performs hierarchical attention control by first carving out instance layouts from self-attention and then binding semantics to these instances. In Phase 1, ISAC clusters self-attention into the number of instances and repels overlaps, establishing an instance-level structural hierarchy; in Phase 2, it injects these instance cues into cross-attention to obtain instance-aware semantic masks and decomposes mixing semantics by tying attributes within each instance. ISAC yields consistent gains on T2I-CompBench, HRS-Bench, and IntraCompBench, our new benchmark for intra-class compositions where failures are most frequent, with improvements of at least 50% in multi-class accuracy and 7% in multi-instance accuracy on IntraCompBench, without any fine-tuning or external models. Beyond text-to-image setups, ISAC also strengthens layout-to-image controllers under overlapping boxes by refining coarse box layouts into dense instance masks, indicating that hierarchical decoupling of instance formation and semantic assignment is a key principle for robust, controllable multi-object generation. Code will be released upon publication.
中文标题/摘要
标题:ISAC:无需训练的实例到语义注意力控制以提高多实例生成
文本到图像的扩散模型最近变得非常强大,但在多对象场景中的行为仍然不可靠:模型经常生成错误数量的实例,并且在对象之间表现出语义泄漏。我们把这些失败归因于模糊的实例边界;自我注意力已经在去噪过程中早期揭示了实例布局,但现有方法仅作用于语义信号。我们引入了ISAC(实例到语义注意力控制),这是一种无需训练、模型无关的目标,通过首先从自我注意力中分割出实例布局,然后将语义绑定到这些实例来进行分层注意力控制。在第一阶段,ISAC将自我注意力聚类为实例数量,并排斥重叠,建立实例级别的结构层次;在第二阶段,它将这些实例提示注入交叉注意力以获得实例感知的语义掩码,并通过在每个实例内部绑定属性来分解混合语义。ISAC在T2I-CompBench、HRS-Bench和我们新提出的IntraCompBench(用于内类别组合,失败最频繁)上均取得了稳定收益,IntraCompBench上的多类别准确性和多实例准确性分别提高了至少50%和7%,无需任何微调或外部模型。除了文本到图像设置外,ISAC还通过细化粗略的框布局为密集实例掩码,增强了重叠框下的布局到图像控制器,表明实例形成和语义分配的分层解耦是实现稳健、可控的多对象生成的关键原则。代码将在发表后发布。
Summary / 总结
ISAC is a training-free, model-agnostic method that improves multi-instance generation in text-to-image diffusion models through hierarchical attention control. It first carves out instance layouts from self-attention and then binds semantics to these instances via cross-attention. ISAC yields consistent gains on T2I-CompBench, HRS-Bench, and IntraCompBench; on IntraCompBench it improves multi-class accuracy by at least 50% and multi-instance accuracy by 7%, without fine-tuning or external models.
论文提出了ISAC(Instance-to-Semantic Attention Control),这是一种无需训练的方法,用于增强文本到图像扩散模型中的多对象生成。ISAC首先从自注意力中划分出实例布局,然后将语义绑定到这些实例,从而实现分层注意力控制。它在T2I-CompBench、HRS-Bench和IntraCompBench上均取得了稳定提升,其中在IntraCompBench上多类别准确率至少提高50%,多实例准确率提高7%,无需微调或外部模型。
Think Visually, Reason Textually: Vision-Language Synergy in ARC
Authors: Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
First: 2025-11-19T18:59:04+00:00 · Latest: 2025-11-26T12:23:11+00:00
Abstract
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code is released at https://github.com/InternLM/ARC-VL.
中文标题/摘要
标题:视觉思考,文本推理:在ARC中的视觉-语言协同作用
从少量示例中进行抽象推理仍然是前沿基础模型如GPT-5和Grok 4的核心未解问题。这些模型仍然无法从少量示例中推断出结构化的转换规则,这是人类智能的关键标志之一。人工通用智能抽象和推理语料库(ARC-AGI)为这种能力提供了一个严格的测试平台,要求概念规则归纳和向新任务的迁移。大多数现有方法将ARC-AGI视为纯粹的文本推理任务,忽视了人类在解决此类谜题时高度依赖视觉抽象的事实。然而,我们的初步实验揭示了一个悖论:简单地将ARC-AGI网格渲染为图像会因规则执行不精确而导致性能下降。这导致我们的核心假设是视觉和语言在不同的推理阶段具有互补的优势:视觉支持全局模式的抽象和验证,而语言则专门负责符号规则的制定和精确执行。基于这一见解,我们提出了两种协同策略:(1)视觉-语言协同推理(VLSR),将ARC-AGI分解为模态对齐的子任务;(2)模态切换自我校正(MSSC),利用视觉验证文本推理以进行内在错误校正。广泛的实验表明,我们的方法在多种旗舰模型和多个ARC-AGI任务上相对于纯文本基线提高了高达4.33%。我们的研究结果表明,将视觉抽象与语言推理统一起来是未来基础模型实现可泛化的、类人的智能的关键一步。源代码发布在https://github.com/InternLM/ARC-VL。
Summary / 总结
The paper addresses the challenge of abstract reasoning from minimal examples, a key aspect of human intelligence, using the ARC-AGI testbed. It proposes a vision-language synergy approach that decomposes the task into modality-aligned subtasks (VLSR) and uses visual verification to correct text-based reasoning errors (MSSC). Experiments show up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks.
研究旨在通过抽象和推理语料库(ARC-AGI)解决从少量示例进行抽象推理的挑战,这是人类智能的关键方面。研究提出了一种视觉-语言协同方法,将任务分解为模态对齐的子任务,并利用视觉验证来纠正文本推理中的错误。实验表明,这种方法在各种任务上的性能比纯文本模型提高了最多4.33%。
HTTM: Head-wise Temporal Token Merging for Faster VGGT
Authors: Weitian Wang, Lukas Meiner, Rai Shubham, Cecilia De La Parra, Akash Kumar
First: 2025-11-26T12:04:03+00:00 · Latest: 2025-11-26T12:04:03+00:00
Abstract
The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to 7x acceleration with negligible performance drops in a GPU-based inference.
中文标题/摘要
标题:HTTM:头部级时间令牌合并以加速VGGT
视觉几何导向变换器(VGGT)在3D场景重建方面取得了重大突破,它是第一个能够一次性联合推断所有关键3D属性(相机姿态、深度和密集几何)的模型。然而,这种联合推断机制需要全局注意力层在所有视图的令牌之间进行全对全的注意力计算。对于包含长序列输入的大场景重建,这会导致显著的延迟瓶颈。在本文中,我们提出了一种无需训练的3D令牌合并方法——头部级时间合并(HTTM),以加速VGGT。现有的合并技术在不同的注意力头之间均匀合并令牌,导致层输出中的令牌相同,这阻碍了模型的表示能力。HTTM通过在多头粒度上合并令牌来解决这一问题,这在头部连接后保留了特征令牌的独特性。此外,这使得HTTM能够利用头部级别观察到的空间局部性和时间对应关系,以较低的合并成本实现更高的合并比率。因此,HTTM在基于GPU的推理中实现了最高可达7倍的加速,且性能下降可以忽略不计。
Summary / 总结
The paper proposes HTTM (Head-wise Temporal Token Merging) to address the latency bottleneck in VGGT (Visual Geometry Grounded Transformer) for 3D scene reconstruction. HTTM merges tokens in a head-wise manner, preserving the uniqueness of feature tokens and leveraging spatial and temporal correspondences to achieve higher merging ratios with lower costs. This method accelerates VGGT by up to 7x with negligible performance drops during GPU-based inference.
论文提出了一种头级时间合并(HTTM)方法,以解决Visual Geometry Grounded Transformer(VGGT)在3D场景重建中的延迟问题。HTTM在多头粒度上合并令牌,保持特征令牌的独特性,并利用空间和时间对应关系,以较低的成本实现更高的合并比率。这种方法在GPU基础上的推理中最多可加速7倍,且性能损失可以忽略不计。
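To make the idea of merging at head granularity concrete, here is a minimal NumPy sketch in which each attention head merges its own most similar token pairs independently, so different heads can merge different tokens. It uses a generic bipartite similarity matching (in the style of earlier token-merging work), not the HTTM algorithm itself; the function names, the even/odd split, and the averaging rule are assumptions for illustration, and the temporal correspondence and spatial locality that HTTM exploits are not modelled.

```python
import numpy as np


def merge_one_head(x, r):
    """Merge r tokens of one head. x: (tokens, dim) -> (tokens - r, dim)."""
    a, b = x[0::2], x[1::2]                      # split tokens into two sets
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_n @ b_n.T                            # cosine similarity, (|A|, |B|)
    partner = sim.argmax(axis=1)                 # best match in B for each A token
    order = np.argsort(-sim.max(axis=1))         # most similar A tokens first
    merged, kept = order[:r], np.sort(order[r:])
    b_sum = b.copy()
    counts = np.ones(len(b))
    np.add.at(b_sum, partner[merged], a[merged])  # accumulate merged A tokens into B
    np.add.at(counts, partner[merged], 1)
    return np.concatenate([a[kept], b_sum / counts[:, None]], axis=0)


def merge_tokens_per_head(x, r):
    """x: (heads, tokens, dim). Each head chooses its own merges."""
    return np.stack([merge_one_head(head, r) for head in x])


x = np.random.default_rng(0).normal(size=(4, 16, 8))
out = merge_tokens_per_head(x, r=4)
print(out.shape)  # (4, 12, 8)
```

Because all-to-all attention cost grows quadratically with the token count, reducing 16 tokens to 12 per head already cuts the pairwise work to (12/16)^2, roughly 56%, of the original in this toy setting.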
ConStellaration: A dataset of QI-like stellarator plasma boundaries and optimization benchmarks
Authors: Santiago A. Cadena, Andrea Merlo, Emanuel Laude, Alexander Bauer, Atul Agrawal, Maria Pascu, Marija Savtchouk, Enrico Guiraud, Lukas Bonauer, Stuart Hudson, Markus Kaiser
First: 2025-06-24T12:49:00+00:00 · Latest: 2025-11-26T11:13:25+00:00
Abstract
Stellarators are magnetic confinement devices under active development to deliver steady-state carbon-free fusion energy. Their design involves a high-dimensional, constrained optimization problem that requires expensive physics simulations and significant domain expertise. Recent advances in plasma physics and open-source tools have made stellarator optimization more accessible. However, broader community progress is currently bottlenecked by the lack of standardized optimization problems with strong baselines and datasets that enable data-driven approaches, particularly for quasi-isodynamic (QI) stellarator configurations, considered as a promising path to commercial fusion due to their inherent resilience to current driven disruptions. Here, we release an open dataset of diverse QI-like stellarator plasma boundary shapes, paired with their ideal magnetohydrodynamic (MHD) equilibria and performance metrics. We generated this dataset by sampling a variety of QI fields and optimizing corresponding stellarator plasma boundaries. We introduce three optimization benchmarks of increasing complexity: (1) a single objective geometric optimization problem, (2) a "simple-to-build" QI stellarator, and (3) a multi-objective ideal-MHD stable QI stellarator that investigates trade-offs between compactness and coil simplicity. For every benchmark, we provide reference code, evaluation scripts, and strong baselines based on classical optimization techniques. Finally, we show how learned models trained on our dataset can efficiently generate novel, feasible configurations without querying expensive physics oracles. By openly releasing the dataset along with benchmark problems and baselines, we aim to lower the entry barrier for optimization and machine learning researchers to engage in stellarator design and to accelerate cross-disciplinary progress toward bringing fusion energy to the grid.
中文标题/摘要
标题:ConStellaration:类QI仿星器等离子体边界数据集及优化基准
仿星器是正在积极开发的磁约束装置,旨在提供稳态的无碳核聚变能源。其设计涉及高维约束优化问题,需要昂贵的物理模拟和大量的专业知识。近期的等离子体物理进展和开源工具使得仿星器优化变得更加可行。然而,当前社区的更广泛进展受限于缺乏带有强基线的标准化优化问题以及支持数据驱动方法的数据集,特别是对于准等动力(QI)仿星器位形;由于其对电流驱动破裂具有固有的抵抗能力,这类位形被认为是通往商业核聚变的一条有前景的路径。在此,我们发布了一个开放的数据集,包含多样化的类QI仿星器等离子体边界形状,以及与之配对的理想磁流体动力学(MHD)平衡和性能指标。我们通过采样各种QI场并优化相应的仿星器等离子体边界生成了此数据集。我们引入了三个复杂度递增的优化基准:(1)单目标几何优化问题,(2)“易于建造”的QI仿星器,(3)多目标理想MHD稳定QI仿星器,探讨紧凑性和线圈简单性之间的权衡。对于每个基准,我们提供了参考代码、评估脚本和基于经典优化技术的强基线。最后,我们展示了基于我们数据集训练的模型如何高效地生成新的可行配置,而无需查询昂贵的物理预言机(oracle)。通过公开发布数据集、基准问题和基线,我们旨在降低优化和机器学习研究人员参与仿星器设计的门槛,并加速跨学科进展,以将核聚变能源引入电网。
Summary / 总结
The paper introduces ConStellaration, an open dataset of QI-like stellarator plasma boundaries paired with their ideal-MHD equilibria and performance metrics, addressing the lack of standardized benchmarks in stellarator design. The dataset is generated by sampling a variety of QI fields and optimizing the corresponding plasma boundaries. The authors provide three optimization benchmarks of increasing complexity, each with reference code, evaluation scripts, and strong classical-optimization baselines, and show that models trained on the dataset can generate novel, feasible configurations without querying expensive physics oracles. The release aims to lower the entry barrier for optimization and machine learning researchers working on stellarator design.
论文介绍了ConStellaration数据集,用于优化类QI仿星器的等离子体边界,解决仿星器设计中缺乏标准化基准的问题。方法包括通过采样QI磁场并进行优化,生成多样化的等离子体边界及其对应的MHD平衡。论文提供了三个复杂度递增的优化基准,并附带参考代码和强基线;作者还展示了在该数据集上训练的模型能够高效生成新的可行配置,而无需查询昂贵的物理预言机(oracle)。该数据集旨在促进更广泛的社区在仿星器优化和机器学习应用方面的进展。
Co-Training Vision Language Models for Remote Sensing Multi-task Learning
Authors: Qingyun Li, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou, Fengxiang Wang, Xudong Lu, Xiaoxing Wang, Xin He, Yushi Chen, Xue Yang, Junchi Yan
First: 2025-11-26T10:55:07+00:00 · Latest: 2025-11-26T10:55:07+00:00
Comments: 14 pages, 6 figures
Abstract
With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. Firstly, we create the data curation engine, including data acquisition, offline processing and integrating, as well as online loading and weighting. This data engine effectively addresses the complex RS data environment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burdens. Additionally, we significantly enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluating tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.
中文标题/摘要
标题:遥感多任务学习中联合训练视觉语言模型
随着Transformer在遥感(RS)单一任务上的卓越表现,我们正接近通过多任务学习(MTL)实现跨多个任务的统一模型。与单一任务方法相比,MTL方法在泛化能力、扩展性和实用性方面具有优势。最近,视觉语言模型(VLMs)在RS图像理解、视觉定位和超高分辨率(UHR)图像推理方面取得了令人鼓舞的结果。此外,统一的文本界面展示了MTL的巨大潜力。因此,在这项工作中,我们提出了RSCoVLM,这是一种简单而灵活的RS MTL VLM基线。首先,我们创建了数据整理引擎,包括数据获取、离线处理和集成,以及在线加载和加权。该数据引擎有效地解决了复杂RS数据环境的问题,并生成了灵活的视觉-语言对话。此外,我们提出了一种统一的动态分辨率策略,以应对RS图像中固有的不同图像尺度。对于UHR图像,我们引入了缩放链机制及其相应的数据集LRS-VQA-Zoom。这些策略灵活且有效地减轻了计算负担。此外,我们显著增强了模型的物体检测能力,并提出了一种新的评估协议,以确保VLMs与传统检测模型之间的公平比较。广泛的实验表明,RSCoVLM在多种任务上达到了最先进的性能,优于现有的RS VLMs,甚至与专门的专家模型相媲美。所有训练和评估工具、模型权重和数据集均已完全开源,以支持可再现性。我们期望这一基线将促进通用RS模型的进一步发展。
Summary / 总结
This paper introduces RSCoVLM, a vision language model designed for remote sensing multi-task learning. The authors address the challenges of diverse image scales in remote sensing imagery by proposing a unified dynamic-resolution strategy and the Zoom-in Chain mechanism. They also develop a data curation engine and a novel evaluation protocol. Experimental results show that RSCoVLM outperforms existing remote sensing vision language models and even matches specialized expert models across various tasks.
本文介绍了RSCoVLM,一种用于遥感多任务学习的视觉语言模型。作者通过提出统一的动态分辨率策略和缩放链机制来应对遥感图像中多样化的图像尺度挑战。他们还开发了一个数据编目引擎和一个新的评估协议。实验结果表明,RSCoVLM 在各种任务中优于现有的遥感视觉语言模型,并且甚至能够匹敌专门的专家模型。
QiMeng-CRUX: Narrowing the Gap between Natural Language and Verilog via Core Refined Understanding eXpression
Authors: Lei Huang, Rui Zhang, Jiaming Guo, Yang Zhang, Di Huang, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen
First: 2025-11-25T09:17:32+00:00 · Latest: 2025-11-26T10:46:28+00:00
Comments: Accepted by the AAAI26 Conference Main Track
Abstract
Large language models (LLMs) have shown promising capabilities in hardware description language (HDL) generation. However, existing approaches often rely on free-form natural language descriptions that are often ambiguous, redundant, and unstructured, which poses significant challenges for downstream Verilog code generation. We treat hardware code generation as a complex transformation from an open-ended natural language space to a domain-specific, highly constrained target space. To bridge this gap, we introduce Core Refined Understanding eXpression (CRUX), a structured intermediate space that captures the essential semantics of user intent while organizing the expression for precise Verilog code generation. We further design a two-stage training framework, comprising Joint Expression Modeling and Dual-Space Optimization, to enhance the quality of both CRUX and Verilog code. Experiments across multiple Verilog generation benchmarks demonstrate that our model, CRUX-V, achieves state-of-the-art performance among general models, particularly under challenging design tasks. Furthermore, the CRUX space proves transferable and beneficial when used as input prompts for other code models, highlighting its effectiveness in narrowing the gap between free-form natural language descriptions and precise Verilog generation.
中文标题/摘要
标题:QiMeng-CRUX:通过核心精炼理解表达缩小自然语言与Verilog之间的差距
大型语言模型(LLMs)在硬件描述语言(HDL)生成方面展示了有希望的能力。然而,现有方法通常依赖于自由形式的自然语言描述,这些描述往往是模糊的、冗余的和无结构的,这给后续的Verilog代码生成带来了重大挑战。我们将硬件代码生成视为从开放的自然语言空间到特定领域、高度受限的目标空间的复杂转换。为了弥合这一差距,我们引入了核心精炼理解表达(CRUX),这是一种结构化的中间空间,能够捕捉用户意图的核心语义并组织表达以实现精确的Verilog代码生成。我们进一步设计了一种两阶段训练框架,包括联合表达建模和双空间优化,以提高CRUX和Verilog代码的质量。在多个Verilog生成基准测试中的实验表明,我们的模型CRUX-V在一般模型中达到了最先进的性能,特别是在具有挑战性的设计任务中。此外,CRUX空间在作为其他代码模型输入提示时具有可转移性和有效性,突显了其在缩小自由形式自然语言描述与精确Verilog生成之间差距方面的有效性。
Summary / 总结
The research aims to improve the accuracy and efficiency of generating Verilog code from natural language descriptions by introducing a structured intermediate space called Core Refined Understanding eXpression (CRUX). The method involves a two-stage training framework that enhances both CRUX and Verilog code quality. Experiments show that the proposed model, CRUX-V, outperforms existing general models, especially in complex design tasks, and the CRUX space is beneficial for other code models as input prompts.
研究旨在通过引入一个结构化的中间空间Core Refined Understanding eXpression (CRUX),提高从自然语言描述生成Verilog代码的准确性和效率。方法包括一个两阶段训练框架,以提升CRUX和Verilog代码的质量。实验表明,提出的模型CRUX-V在复杂设计任务中优于现有的一般模型,并且CRUX空间作为其他代码模型的输入提示时具有良好的转移性和有效性。
Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining
Authors: Mikey Shechter, Yair Carmon
First: 2025-03-11T18:34:12+00:00 · Latest: 2025-11-26T10:25:06+00:00
Abstract
We introduce Filter Like You Test (FLYT), an algorithm for curating large-scale vision-language datasets that learns the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example's features using gradient signals from downstream tasks' training sets. Based on FLYT, we implement Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods as features, and learns to unify them into a single score. FLYT naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy for obtaining a filtered pretraining dataset from per-example probabilities that samples examples while preventing over-representation through a repetition penalty. Using these methods, we achieve 40.1% ImageNet zero-shot accuracy on the DataComp medium scale filtering benchmark, a 2% absolute accuracy increase over all previous results and a 5.5% increase over results that, like ours, use only public resources. Our approach also yields 37.7% on the average of 38 DataComp evaluation tasks, outperforming previous public-resource approaches by 0.4%.
中文标题/摘要
标题:Filter Like You Test:基于数据驱动的数据过滤方法用于CLIP预训练
我们介绍了Filter Like You Test (FLYT),一种用于整理大规模视觉-语言数据集的算法,该算法能够学习每个数据点作为预训练示例的有用性。FLYT训练了一个评分模型,该模型能够使用下游任务训练集中的梯度信号来学习衡量每个示例特征的重要性。基于FLYT,我们实现了Mixing-FLYT (M-FLYT),该方法将不同评分方法生成的每个示例得分作为特征,并学习将它们统一为一个得分。FLYT自然地产生了一个训练示例的分布,我们通过Soft Cap Sampling (SCS)策略利用这一分布,这是一种从每个示例的概率中获取过滤预训练数据集的策略,该策略在采样示例时通过重复惩罚防止过度代表。使用这些方法,我们在DataComp中等规模过滤基准测试中实现了40.1%的ImageNet零样本准确率,比所有先前结果绝对提高了2%,比同样仅使用公共资源的结果提高了5.5%。我们的方法在38个DataComp评估任务的平均得分上也达到了37.7%,比先前仅使用公共资源的方法高出0.4%。
Summary / 总结
The research introduces FLYT, an algorithm that learns the usefulness of each data point for pretraining CLIP models by training a scoring model with gradient signals from downstream tasks. M-FLYT further combines scores from different methods into a unified score. SCS is used to sample examples while preventing over-representation. The methods achieve 40.1% ImageNet zero-shot accuracy, a 2% increase over previous results and a 5.5% increase over approaches using only public resources. The approach also outperforms previous public-resource approaches on the average of 38 DataComp evaluation tasks by 0.4%.
研究引入了FLYT算法,该算法通过下游任务的梯度信号学习每个数据点在视觉-语言数据集中的预训练有用性。它训练了一个评分模型。Mixing-FLYT将不同方法生成的评分统一成一个评分。使用Soft Cap Sampling策略从每个示例的概率中过滤预训练数据集。该方法在ImageNet零样本准确性上达到了40.1%,比之前的结果提高了2%,比仅使用公共资源的结果提高了5.5%。在DataComp基准测试中,它在38个评估任务的平均得分上也超过了之前的公共资源方法0.4%。
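The abstract describes Soft Cap Sampling only as sampling from per-example probabilities with a repetition penalty. The sketch below shows one simple way such a scheme could work: each time an example is drawn, its weight is multiplied by a penalty factor so that high-probability examples are not repeated without bound. The multiplicative form, the `penalty` value, and the function name are assumptions for illustration and may differ from the authors' actual procedure.

```python
import numpy as np


def soft_cap_sample(probs, n_samples, penalty=0.5, seed=0):
    """Draw n_samples indices; each draw shrinks the drawn example's weight.

    penalty is an assumed multiplicative factor in (0, 1]; 1.0 disables it.
    Returns how many times each example was selected.
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(probs, dtype=float).copy()
    counts = np.zeros(len(weights), dtype=int)
    for _ in range(n_samples):
        p = weights / weights.sum()          # renormalize after penalties
        i = rng.choice(len(weights), p=p)
        counts[i] += 1
        weights[i] *= penalty                # repetition penalty on the drawn example
    return counts


counts = soft_cap_sample([0.7, 0.2, 0.1], n_samples=30)
print(counts, counts.sum())
```

With `penalty=1.0` this reduces to ordinary sampling with replacement; smaller values spread the selected copies more evenly across examples.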