arXiv 论文速递

2025-11-26 03:29
Snapshot: 20251126_0329
Mixture of Horizons in Action Chunking
Authors: Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun, Yunchao Yao, Zhenyu Wei, Yunhui Liu, Zhiwu Lu, Mingyu Ding
First: 2025-11-24T18:59:51+00:00 · Latest: 2025-11-24T18:59:51+00:00
Comments: 15 pages, 14 figures
Abstract
Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying that any fixed choice of a single horizon is suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based policies $π_0$, $π_{0.5}$, and one-step regression policy $π_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains in both simulation and real-world tasks. Notably, under the mixed-task setting, $π_{0.5}$ with MoH reaches a new state-of-the-art with 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons
中文标题/摘要
标题:行动分段中的视野混合
视觉-语言-行动(VLA)模型在机器人操作方面展现了显著的能力,但它们的表现对训练期间使用的$\textbf{行动分段长度}$(称为$\textbf{视野}$)非常敏感。我们的实证研究表明,存在固有的权衡:较长的视野提供更强的全局预见性,但会降低细粒度的准确性,而较短的视野则能增强局部控制,但在长期任务上却力不从心,这表明固定选择单一视野是次优的。为了缓解这种权衡,我们提出了一种$\textbf{视野混合(MoH)}$策略。MoH将行动分段重新排列为具有不同视野的多个段落,使用共享的动作变换器并行处理这些段落,并通过轻量级线性门融合输出。它具有三个吸引人的优点。1)MoH在单一模型中同时利用长期预见性和短期精确性,从而提高性能和对复杂任务的泛化能力。2)MoH可以无缝集成到全注意动作模块中,且几乎不增加训练或推理开销。3)MoH支持动态推理和自适应视野,通过跨视野共识选择稳定动作,与基线相比,吞吐量提高2.5倍,同时保持了优越的性能。广泛的实验表明,MoH在模拟和实际任务中都取得了持续且显著的改进。值得注意的是,在混合任务设置下,$π_{0.5}$在应用MoH后,在LIBERO上仅经过30000次训练迭代后,平均成功率达到了99%。项目页面:https://github.com/Timsty1/MixtureOfHorizons
Summary / 总结
The study addresses the trade-off between global foresight and local precision in VLA models by proposing a mixture of horizons (MoH) strategy. MoH splits the action chunk into segments with different horizons, processes them in parallel with a shared action transformer, and fuses the outputs with a light linear gate, improving both performance and generalizability. Experiments on $π_0$, $π_{0.5}$, and $π_{\text{reg}}$ show consistent gains in simulation and real-world tasks; under the mixed-task setting, $π_{0.5}$ with MoH reaches a 99% average success rate on LIBERO after 30k training iterations.
研究通过提出混合视界(MoH)策略解决了VLA模型中全局预见性和局部精确性之间的权衡问题。MoH在单一模型中结合不同的动作片段长度,提升了性能和泛化能力。实验结果显示,MoH在多种策略和任务中表现出一致且显著的改进,在LIBERO上仅经过30k次训练迭代后,成功率达到99%。
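The gated fusion described in the MoH abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `fuse_horizons`, the scalar actions, and the per-horizon scalar gate logits are all assumptions made for illustration (a learned linear gate would compute the logits from features). The sketch only shows the idea that each action step is a weighted mixture of the predictions from every horizon long enough to cover that step.

```python
import math

def fuse_horizons(preds, gate_logits):
    """Fuse per-horizon action predictions step by step (illustrative only).

    preds: dict mapping horizon h -> list of h predicted actions
           (plain floats here; real actions would be vectors).
    gate_logits: dict mapping horizon h -> scalar gate logit
           (hypothetical stand-in for the output of a learned linear gate).
    Returns a list of fused actions with length max(horizons).
    """
    max_h = max(preds)
    fused = []
    for t in range(max_h):
        # Only horizons long enough to cover step t take part in the mixture.
        active = [h for h in preds if h > t]
        # Softmax over the active horizons' logits (shifted for stability).
        top = max(gate_logits[h] for h in active)
        weights = {h: math.exp(gate_logits[h] - top) for h in active}
        total = sum(weights.values())
        fused.append(sum(weights[h] / total * preds[h][t] for h in active))
    return fused

# Two horizons with equal gate logits: steps 0-1 average both predictions,
# steps 2-3 are covered by the long horizon alone.
preds = {2: [1.0, 1.0], 4: [3.0, 3.0, 3.0, 3.0]}
fused = fuse_horizons(preds, {2: 0.0, 4: 0.0})
```

With equal logits the early steps are the mean of the short- and long-horizon predictions, while the later steps fall back to the long horizon alone, which mirrors the short-term precision versus long-term foresight trade-off the abstract describes.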
Cognitive Foundations for Reasoning and Their Manifestation in LLMs
Authors: Priyanka Kargupta, Shuyue Stella Li, Haocheng Wang, Jinu Lee, Shan Chen, Orevaoghene Ahia, Dean Light, Thomas L. Griffiths, Max Kleiman-Weiner, Jiawei Han, Asli Celikyilmaz, Yulia Tsvetkov
First: 2025-11-20T18:59:00+00:00 · Latest: 2025-11-24T18:59:30+00:00
Comments: 40 pages, 4 tables, 6 figures
Abstract
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. To understand this gap, we synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning & knowledge, and transformation operations. We introduce a fine-grained evaluation framework and conduct the first large-scale empirical analysis of 192K traces from 18 models across text, vision, and audio, complemented by 54 human think-aloud traces, which we make publicly available. We find that models under-utilize cognitive elements correlated with success, narrowing to rigid sequential processing on ill-structured problems where diverse representations and meta-cognitive monitoring are critical. Human traces show more abstraction and conceptual processing, while models default to surface-level enumeration. Meta-analysis of 1.6K LLM reasoning papers reveals that the research community concentrates on easily quantifiable elements (sequential organization: 55%, decomposition: 60%) but neglects meta-cognitive controls (self-awareness: 16%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 66.7% on complex problems. By establishing a shared vocabulary between cognitive science and LLM research, our framework enables systematic diagnosis of reasoning failures and principled development of models that reason through robust cognitive mechanisms rather than spurious shortcuts, while providing tools to test theories of human cognition at scale.
中文标题/摘要
标题:认知基础与推理及其在大语言模型中的表现
大型语言模型(LLMs)能够解决复杂问题,但在简单变体上却失败了,这表明它们通过与人类推理本质上不同的机制实现了正确输出。为了理解这种差距,我们综合认知科学的研究成果,构建了一个包含28个认知元素的分类体系,这些元素涵盖了推理不变量、元认知控制、组织推理与知识的表示以及转换操作。我们引入了一种精细的评估框架,并首次对18个模型的192,000条轨迹进行了大规模实证分析,涵盖了文本、视觉和音频领域,同时补充了54条人类口头思考轨迹,这些数据已公开提供。我们发现,模型未能充分利用与成功相关的认知元素,而是倾向于在结构不良的问题上进行僵化的顺序处理,而多样化的表示和元认知监控是至关重要的。人类轨迹显示了更多的抽象和概念处理,而模型则默认进行表面层次的枚举。对1,600篇LLM推理论文的元分析显示,研究社区集中在易于量化的元素上(顺序组织:55%,分解:60%),但忽视了与成功相关的元认知控制(自我意识:16%)。模型具备与成功相关的行为模式,但未能自发地运用它们。利用这些模式,我们开发了一种测试时的推理指导,能够自动搭建成功的结构,从而在复杂问题上将性能最高提升66.7%。通过在认知科学与LLM研究之间建立共享词汇,我们的框架能够系统地诊断推理失败,并基于稳健的认知机制有原则地开发模型,而不是依赖虚假的捷径,同时提供了大规模测试人类认知理论的工具。
Summary / 总结
The study aims to understand the differences between human reasoning and the mechanisms used by large language models (LLMs) to solve problems. It introduces a taxonomy of 28 cognitive elements and a fine-grained evaluation framework, analyzing 192K model traces and 54 human think-aloud traces. Key findings show that models under-utilize cognitive elements critical for success, especially in ill-structured problems, and tend to default to surface-level processing. The research also highlights the need for more attention to meta-cognitive controls in LLMs. By developing reasoning guidance, the study improves model performance on complex problems by up to 66.7%. This work bridges cognitive science and LLM research, enabling better diagnosis and development of reasoning capabilities in models.
研究旨在理解人类推理与大型语言模型(LLMs)解决问题时所用机制之间的差异。引入了28个认知元素的分类,并开发了一种细粒度的评估框架,分析了192K条模型轨迹和54条人类口头思考轨迹。关键发现表明,模型在关键认知元素的使用上存在不足,尤其是在结构不良的问题上,而人类则表现出更多的抽象和概念处理。基于这些发现,研究开发了测试时推理指导,在复杂问题上将性能最高提升66.7%。该框架提供了诊断推理失败和开发通过稳健认知机制推理的模型的工具。
Flow Map Distillation Without Data
Authors: Shangyuan Tong, Nanye Ma, Saining Xie, Tommi Jaakkola
First: 2025-11-24T18:58:55+00:00 · Latest: 2025-11-24T18:58:55+00:00
Abstract
State-of-the-art flow models achieve remarkable quality but require slow, iterative sampling. To accelerate this, flow maps can be distilled from pre-trained teachers, a procedure that conventionally requires sampling from an external dataset. We argue that this data-dependency introduces a fundamental risk of Teacher-Data Mismatch, as a static dataset may provide an incomplete or even misaligned representation of the teacher's full generative capabilities. This leads us to question whether this reliance on data is truly necessary for successful flow map distillation. In this work, we explore a data-free alternative that samples only from the prior distribution, a distribution the teacher is guaranteed to follow by construction, thereby circumventing the mismatch risk entirely. To demonstrate the practical viability of this philosophy, we introduce a principled framework that learns to predict the teacher's sampling path while actively correcting for its own compounding errors to ensure high fidelity. Our approach surpasses all data-based counterparts and establishes a new state-of-the-art by a significant margin. Specifically, distilling from SiT-XL/2+REPA, our method reaches an impressive FID of 1.45 on ImageNet 256x256, and 1.49 on ImageNet 512x512, both with only 1 sampling step. We hope our work establishes a more robust paradigm for accelerating generative models and motivates the broader adoption of flow map distillation without data.
中文标题/摘要
标题:无需数据的流图蒸馏
最先进的流模型在质量上取得了显著成就,但需要缓慢的迭代采样。为了加速这一过程,可以从预训练的教师中蒸馏出流图,这一过程通常需要从外部数据集中进行采样。我们认为这种数据依赖性引入了教师数据不匹配的基本风险,因为静态数据集可能无法提供教师生成能力的完整或甚至正确的表示。这使我们质疑这种数据依赖性是否真正对于成功的流图蒸馏是必要的。在本文中,我们探索了一种无需数据的替代方案,仅从先验分布中进行采样,这是一种教师在构造上必然遵循的分布,从而完全避免了不匹配的风险。为了证明这一理念的实用性,我们引入了一个原理性的框架,该框架能够学习预测教师的采样路径,并积极纠正自身的累积错误以确保高保真度。我们的方法超越了所有基于数据的对应方法,并在显著的范围内建立了新的最先进的水平。具体而言,从SiT-XL/2+REPA蒸馏,我们的方法在ImageNet 256x256上达到了令人印象深刻的FID为1.45,在ImageNet 512x512上达到了1.49,两者仅需一次采样步骤。我们希望我们的工作能够建立一个更稳健的加速生成模型的范式,并促进无需数据的流图蒸馏的更广泛采用。
Summary / 总结
This work questions whether flow map distillation, a technique for accelerating generative models, needs external data. Instead of relying on external datasets, the authors propose a data-free method that samples only from the prior distribution, thus avoiding the risk of Teacher-Data Mismatch. The method introduces a framework that predicts the teacher's sampling path and corrects its own compounding errors to ensure high fidelity. Experiments show that this approach outperforms data-based methods when distilling from SiT-XL/2+REPA, achieving an FID of 1.45 and 1.49 on ImageNet 256x256 and 512x512 respectively, with just one sampling step. This work demonstrates a more robust paradigm for accelerating generative models without the need for external data.
该研究解决了流图蒸馏中对数据的需求问题,这是一种用于加速生成模型的技术。作者提出了一种无数据的方法,仅从先验分布中采样,从而避免了教师数据不匹配的风险。他们的方法通过主动纠正错误,实现了在ImageNet 256x256上的新最佳FID分数1.45和在ImageNet 512x512上的1.49,仅需一次采样步骤,超越了所有基于数据的方法。
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Authors: Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu, Trevor Darrell, Xudong Wang
First: 2025-11-24T18:55:19+00:00 · Latest: 2025-11-24T18:55:19+00:00
Comments: Project page: https://wakalsprojectpage.github.io/comt-website/
Abstract
Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens: compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16% and demonstrates that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.
中文标题/摘要
标题:视觉思维链:通过连续视觉标记提高VLMs的视觉理解和推理能力
视觉-语言模型(VLMs)在语言推理方面表现出色,但在需要密集视觉感知的感知理解方面存在局限,例如空间推理和几何意识。这种局限性源于当前VLMs在捕捉跨空间维度的密集视觉信息方面的机制有限。我们提出了视觉思维链(COVT),这是一种框架,使VLMs不仅在语言中进行推理,还能通过连续视觉标记进行推理——这些紧凑的潜在表示编码丰富的感知线索。在大约20个标记的预算内,COVT从轻量级视觉专家中提炼知识,捕捉诸如2D外观、3D几何、空间布局和边缘结构等互补属性。在训练过程中,带有COVT的VLM自回归地预测这些视觉标记以重建密集监督信号(例如,深度、分割、边缘和DINO特征)。在推理时,模型直接在连续视觉标记空间中进行推理,保持效率的同时可选地解码密集预测以提高可解释性。在CV-Bench、MMVP、RealWorldQA、MMStar、WorldMedQA和HRBench等多个多样化的感知基准测试中评估,将COVT整合到强大的VLMs如Qwen2.5-VL和LLaVA中,性能提高了3%到16%,表明紧凑的连续视觉思维能够实现更精确、更具体的多模态智能。
Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
Authors: James Y. Huang, Sheng Zhang, Qianchu Liu, Guanghui Qin, Tinghui Zhu, Tristan Naumann, Muhao Chen, Hoifung Poon
First: 2025-11-24T18:55:16+00:00 · Latest: 2025-11-24T18:55:16+00:00
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision) often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e., text-only DeepSeek-R1 equipped with a Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.
中文标题/摘要
标题:Be My Eyes: 通过多智能体协作将大型语言模型扩展到新模态
大型语言模型(LLMs)在具有挑战性的、知识密集型的推理任务中展现了卓越的能力。然而,将LLMs扩展到感知和推理新的模态(例如,视觉),通常需要开发大规模的视觉语言模型(VLMs),以LLMs作为基础。较小的VLMs更高效、更适应,但往往缺乏前沿LLMs的广泛知识和推理能力。在本工作中,我们提出了一种名为BeMyEyes的模块化多智能体框架,通过协调感知智能体(高效的、适应性强的VLMs)和推理智能体(强大的LLMs)之间的对话合作,将LLMs扩展到多模态推理。我们随后介绍了一种数据合成和监督微调管道,以训练感知智能体有效地与推理智能体协作。通过结合感知和推理智能体的互补优势,BeMyEyes避免了训练大规模多模态模型的需要,保留了LLMs的泛化和推理能力,并允许灵活扩展到新的领域和模态。实验表明,我们的框架为LLMs解锁了多模态推理能力,提供了一个轻量级且完全开源的解决方案,即仅用文本的DeepSeek-R1配备Qwen2.5-VL-7B感知器,能够超越诸如GPT-4o等大规模专有VLMs在一系列知识密集型多模态任务中的表现。这些结果证明了我们多智能体方法的有效性、模块化和可扩展性,用于构建未来的多模态推理系统。
Summary / 总结
BeMyEyes is a modular multi-agent framework that extends large language models (LLMs) to multimodal reasoning by orchestrating conversations between efficient vision language models (VLMs) acting as perceivers and powerful LLMs acting as reasoners. A data synthesis and supervised fine-tuning pipeline trains the perceiver VLM to collaborate effectively with the reasoner LLM. Experiments show that a lightweight, fully open-source pairing (text-only DeepSeek-R1 with a Qwen2.5-VL-7B perceiver) outperforms large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks, demonstrating the effectiveness and scalability of the multi-agent approach.
BeMyEyes 提出了一种模块化的多代理框架,通过合作将大型语言模型(LLMs)扩展到多模态推理,结合高效的视觉语言模型(VLMs)和强大的 LLMs。该框架使用数据合成和监督微调管道来训练 VLM 有效与 LLM 协作。实验表明,这种方法在各种知识密集型多模态任务上优于大型专有 VLMs,展示了多代理方法的有效性、模块化和可扩展性。
UISearch: Graph-Based Embeddings for Multimodal Enterprise UI Screenshots Retrieval
Authors: Maroun Ayli, Youssef Bakouny, Tushar Sharma, Nader Jalloul, Hani Seifeddine, Rima Kilany
First: 2025-11-24T18:20:08+00:00 · Latest: 2025-11-24T18:20:08+00:00
Comments: 12 pages, 2 figures, 3 algorithms, 4 tables
Abstract
Enterprise software companies maintain thousands of user interface screens across products and versions, creating critical challenges for design consistency, pattern discovery, and compliance check. Existing approaches rely on visual similarity or text semantics, lacking explicit modeling of structural properties fundamental to user interface (UI) composition. We present a novel graph-based representation that converts UI screenshots into attributed graphs encoding hierarchical relationships and spatial arrangements, potentially generalizable to document layouts, architectural diagrams, and other structured visual domains. A contrastive graph autoencoder learns embeddings preserving multi-level similarity across visual, structural, and semantic properties. The comprehensive analysis demonstrates that our structural embeddings achieve better discriminative power than state-of-the-art Vision Encoders, representing a fundamental advance in the expressiveness of the UI representation. We implement this representation in UISearch, a multi-modal search framework that combines structural embeddings with semantic search through a composable query language. On 20,396 financial software UIs, UISearch achieves 0.92 Top-5 accuracy with 47.5ms median latency (P95: 124ms), scaling to 20,000+ screens. The hybrid indexing architecture enables complex queries and supports fine-grained UI distinction impossible with vision-only approaches.
中文标题/摘要
标题:UISearch:基于图的嵌入表示在多模态企业UI屏幕截图检索中的应用
企业软件公司维护着成千上万的产品和版本中的用户界面屏幕,这为企业设计一致性、模式发现和合规检查带来了关键挑战。现有方法依赖于视觉相似性或文本语义,缺乏对用户界面(UI)组成中至关重要的结构属性的显式建模。我们提出了一种新颖的基于图的表示方法,将UI屏幕截图转换为编码层次关系和空间布局的带属性图,该方法可能适用于文档布局、建筑图纸和其他结构化视觉领域。对比图自编码器学习保留多级视觉、结构和语义属性相似性的嵌入表示。全面的分析表明,我们的结构嵌入在区分能力上优于最先进的视觉编码器,代表了UI表示表达能力上的根本性进步。我们在此表示中实现了UISearch,这是一种多模态搜索框架,通过可组合查询语言结合结构嵌入和语义搜索。在20,396个金融软件UI上,UISearch实现了0.92的Top-5准确率,中位延迟为47.5毫秒(P95:124毫秒),可扩展到20,000多个屏幕。混合索引架构能够支持复杂查询,并支持与仅视觉方法无法实现的精细粒度的UI区分。
Summary / 总结
The research addresses the challenges of design consistency, pattern discovery, and compliance in enterprise software UIs by proposing a graph-based representation that captures hierarchical and spatial relationships in UI screenshots. A contrastive graph autoencoder is used to learn embeddings that preserve multi-level similarities. The method outperforms state-of-the-art Vision Encoders in discriminative power and is implemented in UISearch, achieving 0.92 Top-5 accuracy with low latency on 20,396 financial software UIs.
UISearch 通过基于图的表示法来表示 UI 截图,捕捉层次和空间关系。对比图自编码器学习保留多级相似性的嵌入。实验表明,其结构嵌入在辨别能力上优于最先进的视觉编码器,并在 20,396 个金融软件 UI 上实现了 0.92 的 Top-5 准确率和较低的延迟。
SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
Authors: Chaitat Utintu, Pinaki Nath Chowdhury, Aneeshan Sain, Subhadeep Koley, Ayan Kumar Bhunia, Yi-Zhe Song
First: 2024-05-29T02:53:59+00:00 · Latest: 2025-11-24T18:15:06+00:00
Comments: Project Page: https://chaitron.github.io/SketchDeco/
Abstract
We introduce SketchDeco, a training-free approach to sketch colourisation that bridges the gap between professional design needs and intuitive, region-based control. Our method empowers artists to use simple masks and colour palettes for precise spatial and chromatic specification, avoiding both the tediousness of manual assignment and the ambiguity of text-based prompts. We reformulate this task as a novel, training-free composition problem. Our core technical contribution is a guided latent-space blending process: we first leverage diffusion inversion to precisely ``paint'' user-defined colours into specified regions, and then use a custom self-attention mechanism to harmoniously blend these local edits with a globally consistent base image. This ensures both local colour fidelity and global harmony without requiring any model fine-tuning. Our system produces high-quality results in 15--20 inference steps on consumer GPUs, making professional-quality, controllable colourisation accessible.
中文标题/摘要
标题:SketchDeco:无需训练的隐空间组合用于精确素描着色
我们介绍了SketchDeco,一种无需训练的素描着色方法,填补了专业设计需求与直观的区域控制之间的空白。我们的方法使艺术家能够使用简单的蒙版和颜色调色板进行精确的空间和色彩指定,避免了手动分配的繁琐和基于文本提示的模糊性。我们将此任务重新定义为一种新颖的、无需训练的组合问题。我们的核心技术贡献是一种引导式的隐空间混合过程:我们首先利用扩散反演精确地“绘制”用户定义的颜色到指定区域,然后使用自定义的注意力机制将这些局部编辑和谐地与全局一致的基础图像融合。这确保了局部色彩的忠实性和全局和谐,而无需任何模型微调。我们的系统在消费级GPU上进行15-20步推理即可生成高质量的结果,使专业级别的、可控的着色变得可行。
Summary / 总结
SketchDeco is a training-free approach for sketch colorization that allows artists to use simple masks and color palettes for precise spatial and chromatic specification. The method reformulates the task as a composition problem and uses a guided latent-space blending process involving diffusion inversion and self-attention to ensure local color fidelity and global harmony. Results are generated in 15-20 inference steps on consumer GPUs, making professional-quality, controllable colorization accessible.
SketchDeco 是一种无需训练的方法,用于素描着色,允许艺术家使用简单的遮罩和颜色调色板进行精确的空间和色彩指定。该方法将任务重新表述为一个合成问题,并使用引导的潜空间混合过程,结合扩散反演将用户定义的颜色精确地“绘制”到指定区域,然后使用自定义的注意力机制将这些局部编辑和谐地与全局一致的基础图像混合。这在无需模型微调的情况下实现了高质量、可控的着色,并在消费级 GPU 上仅需 15-20 个推理步骤即可完成。
ALMAS: an Autonomous LLM-based Multi-Agent Software Engineering Framework
Authors: Vali Tawosi, Keshav Ramani, Salwa Alamir, Xiaomo Liu
First: 2025-10-03T19:35:23+00:00 · Latest: 2025-11-24T18:11:57+00:00
Comments: Accepted to MAS-GAIN Workshop at ASE 2025
Abstract
Multi-agent Large Language Model (LLM) systems have been leading the way in applied LLM research across a number of fields. One notable area is software development, where researchers have advanced the automation of code implementation, code testing, code maintenance, inter alia, using LLM agents. However, software development is a multifaceted environment that extends beyond just code. As such, a successful LLM system must factor in multiple stages of the software development life-cycle (SDLC). In this paper, we propose a vision for ALMAS, an Autonomous LLM-based Multi-Agent Software Engineering framework, which follows the above SDLC philosophy such that it may work within an agile software development team to perform several tasks end-to-end. ALMAS aligns its agents with agile roles, and can be used in a modular fashion to seamlessly integrate with human developers and their development environment. We showcase the progress towards ALMAS through our published works and a use case demonstrating the framework, where ALMAS is able to seamlessly generate an application and add a new feature.
中文标题/摘要
标题:ALMAS:基于自主LLM的多智能体软件工程框架
多智能体大型语言模型(LLM)系统在多个领域推动了应用LLM研究的进展。特别是在软件开发领域,研究人员利用LLM代理实现了代码实现、代码测试、代码维护等自动化。然而,软件开发是一个多方面的环境,不仅仅局限于代码。因此,成功的LLM系统必须考虑软件开发生命周期(SDLC)的多个阶段。在本文中,我们提出了ALMAS的愿景,这是一种基于自主LLM的多智能体软件工程框架,遵循上述SDLC理念,可以在敏捷软件开发团队中完成多个任务。ALMAS将智能体与敏捷角色对齐,并可以模块化地无缝集成到人类开发人员及其开发环境中。我们通过已发表的作品和一个使用案例展示了ALMAS的进展,其中ALMAS能够无缝生成应用程序并添加新功能。
Summary / 总结
The paper presents a vision for ALMAS, an Autonomous LLM-based Multi-Agent Software Engineering framework designed to cover multiple stages of the software development life-cycle rather than code alone. It aligns its agents with agile roles and can be used modularly alongside human developers and their development environment. The paper reports no quantitative experiments; progress is shown through the authors' published works and a use case in which ALMAS generates an application and adds a new feature.
论文介绍了ALMAS,一种自主的基于LLM的多代理软件工程框架,旨在自动化软件开发生命周期中的各个阶段。该框架将代理与敏捷角色对齐,并能够无缝集成到人类开发者的开发环境中。实验结果表明,ALMAS能够自主生成应用程序并添加新功能,展示了其在敏捷开发环境中的潜力。
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Authors: Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, Qi Tian
First: 2025-11-24T17:59:06+00:00 · Latest: 2025-11-24T17:59:06+00:00
Comments: Project Page: https://zehong-ma.github.io/DeCo. Code Repository: https://github.com/Zehong-Ma/DeCo
Abstract
Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of the VAE in two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Code is publicly available at https://github.com/Zehong-Ma/DeCo.
中文标题/摘要
标题:DeCo:解耦频域像素扩散以实现端到端图像生成
像素扩散旨在以端到端的方式直接在像素空间生成图像。这种方法避免了VAE在两阶段潜在扩散中的局限性,提供了更高的模型容量。现有的像素扩散模型在训练和推理速度上较慢,因为它们通常在一个扩散变换器(DiT)中同时建模高频信号和低频语义。为了追求更高效的像素扩散范式,我们提出了解耦频域像素扩散框架。基于分离高频和低频成分生成的直觉,我们利用一个轻量级的像素解码器,在DiT的语义引导下生成高频细节,从而让DiT专注于建模低频语义。此外,我们引入了一种频率感知的流匹配损失,强调视觉显著的频率,同时抑制不重要的频率。大量实验表明,DeCo在像素扩散模型中表现出更优的性能,在ImageNet上达到FID为1.62(256x256)和2.22(512x512),接近潜在扩散方法的性能。此外,我们的预训练文本到图像模型在系统级比较中获得了GenEval的领先综合得分0.86。代码可在https://github.com/Zehong-Ma/DeCo公开获取。
Summary / 总结
DeCo proposes a frequency-decoupled pixel diffusion framework to generate images directly in pixel space. It decouples high-frequency details and low-frequency semantics, using a lightweight pixel decoder for high-frequency generation and a frequency-aware flow-matching loss to emphasize visually salient frequencies. Experiments show DeCo outperforms other pixel diffusion models, achieving FID scores of 1.62 and 2.22 on ImageNet, and a leading GenEval score of 0.86 in system-level comparison.
论文提出了DeCo,一种频率解耦的像素扩散框架,旨在直接在像素空间生成图像。该框架将高频细节与低频语义解耦,使用轻量级像素解码器生成高频细节,而扩散变换器专注于低频语义。作者还引入了一种频率感知的流匹配损失,以增强视觉显著性。实验表明,DeCo 在 ImageNet 上的 FID 分数分别为 1.62 和 2.22,且在系统级比较中以 0.86 的 GenEval 分数领先其他模型。
The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
Authors: Dante Francisco Wasmuht, Otto Brookes, Maximillian Schall, Pablo Palencia, Chris Beirne, Tilo Burghardt, Majid Mirmehdi, Hjalmar Kühl, Mimi Arandjelovic, Sam Pottie, Peter Bermant, Brandon Asheim, Yi Jin Toh, Adam Elzinga, Jason Holmberg, Andrew Whitworth, Eleanor Flatt, Laura Gustafson, Chaitanya Ryali, Yuan-Ting Hu, Baishan Guo, Andrew Westbury, Kate Saenko, Didac Suris
First: 2025-11-19T17:07:08+00:00 · Latest: 2025-11-24T17:02:04+00:00
Abstract
Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multianimal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.
中文标题/摘要
标题:SA-FARI数据集:在动物视频中进行任何动物的分割与识别
自动视频分析对于野生动物保护至关重要。该领域的一个基础任务是多动物跟踪(MAT),它支撑着个体再识别和行为识别等应用。然而,现有数据集在规模、物种范围或时空多样性方面存在局限,没有适合训练适用于野生动物种群的通用MAT模型的基准。为解决这一问题,我们引入了SA-FARI,这是最大的野生动物开源MAT数据集。它包含从四大洲741个地点收集的约10年(2014-2024)的11,609个相机陷阱视频,覆盖99个物种类别。每个视频都进行了详尽标注,最终包含约46小时密集标注的视频片段,包含16,224个掩码身份和942,702个个体边界框、分割掩码和物种标签。除了特定任务的标注,我们还发布了每个视频的匿名相机陷阱位置。最后,我们使用最先进的视觉-语言模型在SA-FARI上进行了检测和跟踪基准测试,包括SAM 3,使用了物种特定和通用动物提示进行评估。我们还与专门为野生动物分析开发的仅视觉方法进行了比较。SA-FARI是首个结合高物种多样性、多区域覆盖和高质量时空标注的大规模数据集,为推动野外多动物跟踪的泛化提供了新的基础。数据集可在https://www.conservationxlabs.com/sa-fari/获取。
Summary / 总结
The research aims to enhance automated video analysis for wildlife conservation by introducing SA-FARI, the largest open-source dataset for multi-animal tracking (MAT) of wild animals. The dataset includes 11,609 camera trap videos collected over roughly 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories, with about 46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 bounding boxes, segmentation masks, and species labels. The authors benchmark state-of-the-art vision-language models, including SAM 3, using both species-specific and generic animal prompts, and compare them against vision-only methods developed for wildlife analysis. SA-FARI fills the gap in existing datasets by providing a diverse and geographically extensive resource for training general-purpose MAT models. The dataset is available at https://www.conservationxlabs.com/sa-fari.
研究引入了SA-FARI数据集,这是一个用于野生动物多动物追踪的大规模数据集,解决了现有数据集在规模、物种多样性以及时间和地理覆盖范围方面的局限性。该数据集包含来自4大洲741个地点的11,609个相机陷阱视频,涵盖99个物种类别,并附有详细的注释。使用最先进的视觉-语言模型进行了全面的基准测试,并与专门用于野生动物分析的视觉方法进行了比较,展示了该数据集在野生动物保护中多动物追踪方面的潜在应用价值。数据集可在https://www.conservationxlabs.com/sa-fari 获取。
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Authors: Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Zhenheng Yang, Yang You
First: 2025-10-31T08:41:13+00:00 · Latest: 2025-11-24T16:40:06+00:00
Abstract
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs. Code is available at https://github.com/NUS-HPC-AI-Lab/FOCUS.
中文标题/摘要
标题:FOCUS:长视频理解中的高效关键帧选择
多模态大型语言模型(MLLMs)将图像和视频帧表示为视觉标记。从单张图像扩展到一小时长的视频,标记预算急剧膨胀到实际限制之上。因此,流行的流水线要么均匀下采样,要么使用较小的视觉语言模型的检索式评分进行关键帧选择。然而,这些关键帧选择方法仍然依赖于在选择之前进行预筛选以降低推理成本,可能会错过最有信息性的时刻。我们提出了FOCUS(Frame-Optimistic Confidence Upper-bound Selection),一种无需训练、模型无关的关键帧选择模块,在严格的标记预算下选择查询相关的帧。FOCUS将关键帧选择形式化为多臂老虎机中的组合纯探索(CPE)问题:将短时间片段视为臂,并使用经验均值和伯恩斯坦(Bernstein)置信半径来识别信息区域,同时保留对不确定区域的探索。由此产生的两阶段探索-利用过程由一个具有理论保证的顺序策略简化而来,首先识别高价值的时间区域,然后在每个区域选择最高评分的关键帧。在两个长视频问答基准上,FOCUS在处理不到2%的视频帧的情况下实现了显著的准确率提升。对于超过20分钟的视频,它在LongVideoBench上实现了11.9%的准确率提升,证明了其作为关键帧选择方法的有效性,并为使用MLLMs进行可扩展的长视频理解提供了简单且通用的解决方案。代码可在https://github.com/NUS-HPC-AI-Lab/FOCUS/ 获取。
Summary / 总结
FOCUS is a training-free, model-agnostic keyframe selection method that picks query-relevant frames under a strict token budget for long-video understanding. It formulates keyframe selection as a combinatorial pure-exploration problem in multi-armed bandits: short temporal clips are the arms, and empirical means with a Bernstein confidence radius identify informative regions while preserving exploration of uncertain ones. On two long-video question-answering benchmarks, FOCUS improves accuracy while processing less than 2% of video frames, with an 11.9% accuracy gain on LongVideoBench for videos longer than 20 minutes.
FOCUS 是一种无需训练、模型无关的关键帧选择方法,能够在严格的 token 预算下选择与查询相关的帧,用于长视频理解。它将关键帧选择问题表述为多臂老虎机中的组合纯探索问题:以短时间片段为臂,利用经验均值和伯恩斯坦置信半径识别信息丰富的区域,同时保留对不确定区域的探索。在两个长视频问答基准测试中,FOCUS 在仅处理不到 2% 视频帧的情况下提高了准确率,在 LongVideoBench 上对于超过 20 分钟的视频,其准确率提高了 11.9%。
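The two-stage procedure described for FOCUS can be illustrated with a small sketch. This is not the authors' implementation: the exact confidence bound, the sampling schedule, and all names below (`bernstein_radius`, `select_keyframes`, `toy_relevance`) are assumptions made for illustration. The radius uses a standard empirical Bernstein form for scores assumed to lie in [0, 1].

```python
import math
import random


def bernstein_radius(samples, delta=0.05, value_range=1.0):
    """Empirical Bernstein confidence radius (one standard form, assumed here)."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * variance * log_term / n) + 3.0 * value_range * log_term / n


def select_keyframes(score_fn, num_frames, clip_len=10, rounds=50,
                     top_clips=1, frames_per_clip=3, seed=0):
    """Hypothetical two-stage selection: find promising clips, then top frames."""
    rng = random.Random(seed)
    # Each arm is a short temporal clip, i.e. a contiguous run of frame indices.
    clips = [list(range(start, min(start + clip_len, num_frames)))
             for start in range(0, num_frames, clip_len)]
    samples = [[] for _ in clips]
    cache = {}  # frame index -> relevance score, so no frame is scored twice

    def score(frame):
        if frame not in cache:
            cache[frame] = score_fn(frame)
        return cache[frame]

    def pull(arm):
        # Pulling an arm scores one randomly chosen frame from that clip.
        samples[arm].append(score(rng.choice(clips[arm])))

    # Two initial pulls per arm so that a variance estimate exists.
    for arm in range(len(clips)):
        pull(arm)
        pull(arm)

    # Stage 1: optimistic exploration using mean + confidence radius.
    for _ in range(rounds):
        ucb = [sum(s) / len(s) + bernstein_radius(s) for s in samples]
        pull(max(range(len(clips)), key=ucb.__getitem__))

    # Stage 2: keep the clips with the best empirical means, then take the
    # top-scoring frames inside each (this sketch scores every frame in them).
    means = [sum(s) / len(s) for s in samples]
    best = sorted(range(len(clips)), key=means.__getitem__, reverse=True)[:top_clips]
    chosen = []
    for arm in best:
        ranked = sorted(clips[arm], key=score, reverse=True)
        chosen.extend(ranked[:frames_per_clip])
    return sorted(chosen)


def toy_relevance(frame):
    # Toy stand-in for a query-frame relevance model: frames 40-49 are relevant.
    return 0.9 if 40 <= frame < 50 else 0.1


print(select_keyframes(toy_relevance, num_frames=100))
```

In a real pipeline, `score_fn` would be replaced by a query-conditioned relevance score from a small vision-language model, and the number of pulls would be set by the token budget.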
InfoScale: Unleashing Training-free Variable-scaled Image Generation via Effective Utilization of Information
Authors: Guohui Zhang, Jiangtong Tan, Linjiang Huang, Zhonghang Yuan, Mingde Yao, Jie Huang, Feng Zhao
First: 2025-09-01T12:27:04+00:00 · Latest: 2025-11-24T16:26:00+00:00
Abstract
Diffusion models (DMs) have become dominant in visual generation but suffer performance drop when tested on resolutions that differ from the training scale, whether lower or higher. In fact, the key challenge in generating variable-scale images lies in the differing amounts of information across resolutions, which requires information conversion procedures to be varied for generating variable-scaled images. In this paper, we investigate the issues of three critical aspects in DMs for a unified analysis in variable-scaled generation: dilated convolution, attention mechanisms, and initial noise. Specifically, 1) dilated convolution in DMs for the higher-resolution generation loses high-frequency information. 2) Attention for variable-scaled image generation struggles to adjust the information aggregation adaptively. 3) The spatial distribution of information in the initial noise is misaligned with variable-scaled image. To solve the above problems, we propose \textbf{InfoScale}, an information-centric framework for variable-scaled image generation by effectively utilizing information from three aspects correspondingly. For information loss in 1), we introduce Progressive Frequency Compensation module to compensate for high-frequency information lost by dilated convolution in higher-resolution generation. For information aggregation inflexibility in 2), we introduce Adaptive Information Aggregation module to adaptively aggregate information in lower-resolution generation and achieve an effective balance between local and global information in higher-resolution generation. For information distribution misalignment in 3), we design Noise Adaptation module to re-distribute information in initial noise for variable-scaled generation. Our method is plug-and-play for DMs and extensive experiments demonstrate the effectiveness in variable-scaled image generation.
中文标题/摘要
标题:InfoScale:通过有效利用信息释放无需训练的可变比例图像生成
扩散模型(DMs)在视觉生成中已成为主流,但在测试不同训练比例的分辨率时会表现出性能下降,无论是较低还是较高分辨率。实际上,生成可变比例图像的关键挑战在于不同分辨率下的信息量不同,这需要信息转换程序的变化来生成可变比例的图像。在本文中,我们研究了DMs中三个关键方面的问题,以统一分析可变比例生成:膨胀卷积、注意力机制和初始噪声。具体来说,1)DMs中的膨胀卷积在高分辨率生成中会丢失高频信息。2)注意力机制在生成可变比例图像时难以适应性地聚合信息。3)初始噪声中的信息空间分布与可变比例图像不一致。为了解决上述问题,我们提出了**InfoScale**,这是一种信息为中心的框架,通过从三个方面有效利用信息来实现可变比例图像生成。对于1)中的信息损失,我们引入了渐进频率补偿模块,以补偿膨胀卷积在高分辨率生成中丢失的高频信息。对于2)中的信息聚合灵活性差,我们引入了自适应信息聚合模块,以适应性地聚合低分辨率生成中的信息,并在高分辨率生成中实现局部和全局信息的有效平衡。对于3)中的信息分布不一致,我们设计了噪声适应模块,以重新分配初始噪声中的信息,实现可变比例生成。我们的方法适用于DMs,广泛的实验表明其在可变比例图像生成中的有效性。
Summary / 总结
InfoScale addresses the performance drop of diffusion models (DMs) when generating images at resolutions different from their training scale. It proposes an information-centric framework that includes a Progressive Frequency Compensation module to recover high-frequency information, an Adaptive Information Aggregation module to adaptively aggregate information, and a Noise Adaptation module to re-distribute initial noise information. Experiments show that InfoScale effectively improves variable-scaled image generation across different resolutions without retraining the model.
InfoScale 解决了扩散模型(DMs)在生成与训练尺度不同的分辨率图像时性能下降的问题。它提出了一种以信息为中心的框架,包括用于恢复高频信息的渐进频率补偿模块、用于适应性聚合信息的自适应信息聚合模块以及用于重新分配初始噪声信息的噪声适应模块。实验表明,InfoScale 在不同分辨率的图像生成中有效提升了性能,无需重新训练模型。
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
Authors: Dhiman Paul, Md Rizwan Parvez, Nabeel Mohammed, Shafin Rahman
First: 2024-12-02T14:45:53+00:00 · Latest: 2025-11-24T16:17:57+00:00
Abstract
Prevailing joint prediction transformers for Video Highlight Detection and Moment Retrieval (HD/MR) exhibit deficiencies in handling cross-task dynamics, achieving robust video-text alignment, and utilizing effective attention mechanisms, with the potential of Large Language/Vision-Language Models (LLMs/LVLMs) being largely untapped. This paper introduces VideoLights, a novel HD/MR framework addressing these limitations by incorporating: (i) Convolutional Projection and Feature Refinement modules with an alignment loss for enhanced video-text feature congruity; (ii) a Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware representations; (iii) a Uni-directional joint-task feedback mechanism for synergistic task improvement; (iv) hard positive/negative losses for adaptive learning; and (v) the leveraging of LVLMs (e.g., BLIP-2) for superior multimodal feature integration and intelligent pre-training with synthetic data. Comprehensive evaluations on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that VideoLights significantly surpasses existing baselines, establishing new state-of-the-art performances. Codes and model checkpoints are available at https://github.com/dpaul06/VideoLights .
中文标题/摘要
标题:VideoLights:联合视频高光检测和时刻检索的特征精炼和跨任务对齐变换器
现有的联合预测变换器在视频高光检测和时刻检索(HD/MR)中存在跨任务动态处理不足、实现稳健的视频-文本对齐以及利用有效的注意力机制的问题,大型语言/视觉语言模型(LLMs/LVLMs)的潜力尚未充分挖掘。本文提出VideoLights,一种解决这些限制的新颖HD/MR框架,通过引入:(i) 卷积投影和特征精炼模块以及对齐损失以增强视频-文本特征一致性;(ii) 双向跨模态融合网络以生成强关联的查询感知表示;(iii) 单向联合任务反馈机制以实现任务协同改进;(iv) 硬正负样本损失以实现自适应学习;以及(v) 利用LVLM(如BLIP-2)进行更优的多模态特征集成和智能预训练。在QVHighlights、TVSum和Charades-STA基准上的全面评估表明,VideoLights 显著超越现有基线,建立了新的性能基准。代码和模型检查点可在 https://github.com/dpaul06/VideoLights 获取。
Summary / 总结
VideoLights is a novel framework for joint video highlight detection and moment retrieval that addresses the limitations of previous transformers by incorporating convolutional projection and feature refinement, a bi-directional cross-modal fusion network, a uni-directional joint-task feedback mechanism, hard positive/negative losses, and the use of Large Vision-Language Models. Comprehensive evaluations show that VideoLights outperforms existing methods on QVHighlights, TVSum, and Charades-STA benchmarks, establishing new state-of-the-art performances.
VideoLights 是一种新颖的框架,用于联合视频高光检测和时刻检索,解决了处理跨任务动态和实现稳健的视频-文本对齐的局限性。它包含卷积投影和特征精炼模块、双向跨模态融合网络、单向联合任务反馈机制以及硬正/负样本损失。此外,它利用大型视觉-语言模型(如BLIP-2)进行多模态特征集成。在QVHighlights、TVSum和Charades-STA基准上的实验表明,VideoLights 在性能上超过了现有方法,并建立了新的最先进的性能。
LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models
Authors: Shuai Wang, Daoan Zhang, Tianyi Bai, Shitong Shao, Jiebo Luo, Jiaheng Wei
First: 2025-11-24T16:13:26+00:00 · Latest: 2025-11-24T16:13:26+00:00
Abstract
Humans can perceive and understand 3D space and long videos from sequential visual observations. But can vision-language models (VLMs) do the same? Recent work demonstrates that even state-of-the-art VLMs still struggle to understand 3D space and long videos, although they are powerful in typical vision-language tasks. Current methods often rely on specialized architectural designs to improve performance for 3D tasks and video understanding tasks separately. In contrast, we propose LAST, short for LeArn to Think in Space and Time, to jointly improve 3D spatial and long video understanding for general VLMs with only a set of 2D images as inputs. LAST makes VLMs think in space and time rather than only with text before giving the final answer, building visual thinking trajectories in 3D space and the temporal dimension. We demonstrate the effectiveness of LAST in two scenarios: 1) zero-shot, where we directly prompt proprietary models; and 2) fine-tuning general VLMs with data that include thinking trajectories in 3D space and time. We show that LAST brings substantial gains on various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks. Notably, LAST yields a 15.8% gain on EgoSchema with GPT-4o in a zero-shot manner and a gain of 8.3 on VSI-Bench over Qwen2.5-VL-7B.
中文标题/摘要
标题:LAST:面向通用视觉语言模型的时空思维学习
人类能够从连续的视觉观察中感知和理解三维空间和长时间视频。但视觉语言模型(VLMs)能做到吗?最近的研究表明,即使是最先进的VLMs在理解三维空间和长时间视频方面仍然存在困难,尽管它们在典型的视觉语言任务中表现强大。当前的方法通常依赖于专门的架构设计,分别提高三维任务和视频理解任务的性能。相比之下,我们提出了LAST,即LeArn to Think in Space and Time,仅通过一组2D图像作为输入,就可联合提高通用VLMs的三维空间理解和长时间视频理解能力。LAST使VLMs在给出最终答案之前,在三维空间和时间维度中进行视觉思考,而不是仅仅依赖于文本。我们通过两种场景展示了LAST的有效性:1)零样本场景,我们直接提示专有模型;2)微调通用VLMs,使用包含三维空间和时间思考轨迹的数据。我们展示了LAST在各种基准测试中的显著改进,包括3项空间理解任务、4项视频理解任务和3项图像理解任务。值得注意的是,在零样本方式下,LAST在使用GPT-4o的EgoSchema上带来了15.8%的提升,在VSI-Bench上相对于Qwen2.5-VL-7B带来了8.3的提升。
Summary / 总结
The research aims to improve vision-language models' understanding of 3D space and long videos. It proposes LAST, a method that has VLMs build visual thinking trajectories in space and time, rather than reasoning only in text, before giving the final answer. LAST improves performance on spatial, video, and image understanding benchmarks, with a 15.8% gain on EgoSchema with GPT-4o in a zero-shot manner and a gain of 8.3 on VSI-Bench over Qwen2.5-VL-7B.
研究旨在通过开发LAST,使视觉语言模型能够更好地理解3D空间和长视频,鼓励模型在空间和时间维度上思考,而不是仅依赖文本。LAST在多种基准测试中表现出色,在零样本设置下使用GPT-4o在EgoSchema上获得15.8%的提升,并在VSI-Bench上相对于Qwen2.5-VL-7B获得8.3的提升。
Medusa: Cross-Modal Transferable Adversarial Attacks on Multimodal Medical Retrieval-Augmented Generation
Authors: Yingjia Shang, Yi Liu, Huimin Wang, Furong Li, Wenfang Sun, Wu Chengyu, Yefeng Zheng
Venue: KDD 2026
First: 2025-11-24T16:11:01+00:00 · Latest: 2025-11-24T16:11:01+00:00
Comments: Accepted at KDD 2026 First Cycle (full version). Authors marked with * contributed equally. Yi Liu is the lead author
Abstract
With the rapid advancement of retrieval-augmented vision-language models, multimodal medical retrieval-augmented generation (MMed-RAG) systems are increasingly adopted in clinical decision support. These systems enhance medical applications by performing cross-modal retrieval to integrate relevant visual and textual evidence for tasks, e.g., report generation and disease diagnosis. However, their complex architecture also introduces underexplored adversarial vulnerabilities, particularly via visual input perturbations. In this paper, we propose Medusa, a novel framework for crafting cross-modal transferable adversarial attacks on MMed-RAG systems under a black-box setting. Specifically, Medusa formulates the attack as a perturbation optimization problem, leveraging a multi-positive InfoNCE loss (MPIL) to align adversarial visual embeddings with medically plausible but malicious textual targets, thereby hijacking the retrieval process. To enhance transferability, we adopt a surrogate model ensemble and design a dual-loop optimization strategy augmented with invariant risk minimization (IRM). Extensive experiments on two real-world medical tasks, including medical report generation and disease diagnosis, demonstrate that Medusa achieves over 90% average attack success rate across various generation models and retrievers under appropriate parameter configuration, while remaining robust against four mainstream defenses, outperforming state-of-the-art baselines. Our results reveal critical vulnerabilities in the MMed-RAG systems and highlight the necessity of robustness benchmarking in safety-critical medical applications. The code and data are available at https://anonymous.4open.science/r/MMed-RAG-Attack-F05A.
中文标题/摘要
标题:Medusa:跨模态可转移对抗攻击在多模态医疗检索增强生成系统中的应用
随着检索增强视觉语言模型的迅速发展,多模态医疗检索增强生成(MMed-RAG)系统在临床决策支持中的应用越来越广泛。这些系统通过跨模态检索将相关视觉和文本证据整合到任务中,例如报告生成和疾病诊断,从而增强医疗应用。然而,其复杂的架构也引入了未被充分探索的对抗性漏洞,特别是通过视觉输入扰动。在本文中,我们提出了一种名为Medusa的新框架,用于在黑盒设置下对MMed-RAG系统进行跨模态可转移对抗攻击。具体而言,Medusa将攻击形式化为扰动优化问题,利用多正样本InfoNCE损失(MPIL)对齐对抗性视觉嵌入与医学上合理但恶意的文本目标,从而劫持检索过程。为了增强可转移性,我们采用了替代模型集成,并设计了一种结合不变风险最小化(IRM)的双循环优化策略。在两个实际医疗任务上的广泛实验,包括医疗报告生成和疾病诊断,表明在适当参数配置下,Medusa在各种生成模型和检索器下的平均攻击成功率超过90%,并且对四种主流防御措施具有鲁棒性,优于最先进的基线。我们的结果揭示了MMed-RAG系统中的关键漏洞,并强调了在安全关键的医疗应用中进行鲁棒性基准测试的必要性。代码和数据可在https://anonymous.4open.science/r/MMed-RAG-Attack-F05A/获取。
Summary / 总结
Medusa is a novel framework for crafting cross-modal transferable adversarial attacks on MMed-RAG systems, which integrate visual and textual evidence for medical tasks. It formulates the attack as a perturbation optimization problem using a multi-positive InfoNCE loss to align adversarial visual embeddings with malicious textual targets. Medusa achieves over 90% average attack success rate across various generation models and retrievers, and remains robust against mainstream defenses, outperforming state-of-the-art baselines. This work highlights critical vulnerabilities in MMed-RAG systems and the need for robustness benchmarking in medical applications.
Medusa 是一种针对 MMed-RAG 系统的新型跨模态可转移对抗攻击框架,该系统将视觉和文本证据整合用于医疗任务。它将攻击问题表述为使用多正样本 InfoNCE 损失的扰动优化问题,以使对抗视觉嵌入与恶意文本目标对齐。Medusa 在各种生成模型和检索器上的平均攻击成功率超过 90%,并且能够抵御主流防御措施,优于最先进的基线。这项工作揭示了 MMed-RAG 系统中的关键漏洞,并强调了在医疗应用中进行鲁棒性基准测试的必要性。
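To make the multi-positive InfoNCE objective mentioned above more concrete, here is a minimal sketch of one common way to write such a loss. The abstract does not give Medusa's exact formulation, so the function names, the cosine similarity, the temperature value, and the choice to sum over positives in the numerator are all assumptions for illustration. The loss shrinks as the query embedding moves toward the set of positive targets and away from the negatives.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multi_positive_info_nce(query, positives, negatives, temperature=0.07):
    """Hypothetical multi-positive InfoNCE: -log(sum_pos / sum_all)."""
    pos_logits = [cosine(query, p) / temperature for p in positives]
    neg_logits = [cosine(query, n) / temperature for n in negatives]
    all_logits = pos_logits + neg_logits
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(all_logits)
    log_numerator = m + math.log(sum(math.exp(l - m) for l in pos_logits))
    log_denominator = m + math.log(sum(math.exp(l - m) for l in all_logits))
    return log_denominator - log_numerator


# Toy 2-D embeddings: two target texts (positives) and two unrelated texts.
positives = [[1.0, 0.0], [0.9, 0.1]]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
aligned = multi_positive_info_nce([1.0, 0.0], positives, negatives)
misaligned = multi_positive_info_nce([0.0, 1.0], positives, negatives)
print(aligned < misaligned)
```

In an attack setting, the query would be the embedding of the perturbed image and the perturbation would be updated to reduce this loss; that optimization loop is outside the scope of this sketch.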
Optimization-Free Style Transfer for 3D Gaussian Splats
Authors: Raphael Du Sablon, David Hart
First: 2025-08-07T19:35:01+00:00 · Latest: 2025-11-24T16:09:54+00:00
Abstract
The task of style transfer for 3D Gaussian splats has been explored in many previous works, but these require reconstructing or fine-tuning the splat while incorporating style information or optimizing a feature extraction network on the splat representation. We propose a reconstruction- and optimization-free approach to stylizing 3D Gaussian splats, allowing for direct stylization on a .ply or .splat file without requiring the original camera views. This is done by generating a graph structure across the implicit surface of the splat representation. A feed-forward, surface-based stylization method is then used and interpolated back to the individual splats in the scene. This also allows for fast stylization of splats with no additional training, achieving speeds under 2 minutes even on CPU-based consumer hardware. We demonstrate the quality results this approach achieves and compare to other 3D Gaussian splat style transfer methods. Code is publicly available at https://github.com/davidmhart/FastSplatStyler.
中文标题/摘要
标题:无需优化的3D高斯斑点风格迁移
许多先前的工作已经探索了3D高斯斑点的风格迁移任务,但这些方法需要重建或微调斑点以结合风格信息或在斑点表示上优化特征提取网络。我们提出了一种无需重建和优化的3D高斯斑点风格迁移方法,允许直接对.ply或.splat文件进行风格化,而无需要求原始相机视图。这通过在斑点表示的隐式曲面上生成图结构来实现。然后使用基于表面的前馈风格化方法,并将其插值回场景中的个别斑点。这种方法还允许在无需额外训练的情况下快速风格化斑点,即使在基于CPU的消费级硬件上,速度也低于2分钟。我们展示了该方法实现的质量结果,并将其与其他3D高斯斑点风格迁移方法进行了比较。代码可在https://github.com/davidmhart/FastSplatStyler上公开获取。
Summary / 总结
This paper proposes a method for stylizing 3D Gaussian splats that requires neither reconstruction nor optimization. It generates a graph structure across the implicit surface of the splat representation, applies a feed-forward, surface-based stylization method, and interpolates the result back to the individual splats. This enables direct stylization of .ply or .splat files without the original camera views and with no additional training, in under 2 minutes even on CPU-based consumer hardware. The authors demonstrate the quality of the results and compare them with other 3D Gaussian splat style transfer methods.
该论文提出了一种无需重建或优化的新方法,用于3D高斯斑点的风格迁移。该方法在斑点表示的隐式曲面上生成图结构,使用基于表面的前馈风格化方法,并将结果插值回各个斑点,可以直接在.ply或.splat文件上进行风格化,无需原始相机视图。这种方法无需额外训练,即使在基于CPU的消费级硬件上也能在2分钟内完成风格化。作者展示了该方法的结果质量,并将其与其他3D高斯斑点风格迁移方法进行了比较。
Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving
Authors: Jianhua Han, Meng Tian, Jiangtong Zhu, Fan He, Huixin Zhang, Sitong Guo, Dechang Zhu, Hao Tang, Pei Xu, Yuze Guo, Minzhe Niu, Haojie Zhu, Qichao Dong, Xuechao Yan, Siyuan Dong, Lu Hou, Qingqiu Huang, Xiaosong Jia, Hang Xu
First: 2025-11-24T15:28:25+00:00 · Latest: 2025-11-24T15:28:25+00:00
Abstract
Autonomous driving heavily relies on accurate and robust spatial perception. Many failures arise from inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision-language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability. To address these challenges, we introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM). Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens, which encode both spatial coordinates and confidence. We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can output perception results and trajectory control outputs directly. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection. When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 in PMDS on NAVSIM. Qualitative results further highlight its strong open-vocabulary and long-tail generalization.
中文标题/摘要
标题:Percept-WAM:增强感知的世界意识行动模型以实现稳健的端到端自动驾驶
自动驾驶高度依赖于准确和稳健的空间感知。许多失败源于空间感知的不准确性和不稳定性,尤其是在长尾场景和复杂交互中。然而,当前的视觉-语言模型在空间定位和理解方面较弱,因此基于它们构建的VLA系统在感知和定位能力方面表现有限。为应对这些挑战,我们提出了Percept-WAM,这是一种增强感知的世界意识行动模型,首次在单一视觉-语言模型(VLM)中隐式地整合了2D/3D场景理解能力。Percept-WAM 不依赖于问答式的空间推理,而是将2D/3D感知任务统一为World-PV和World-BEV标记,这些标记编码了空间坐标和置信度。我们提出了一种基于网格条件的预测机制,用于密集对象感知,结合了IoU感知评分和并行自回归解码,提高了在长尾、远距离和小目标场景中的稳定性。此外,Percept-WAM 利用预训练的VLM参数保留了一般智能(例如逻辑推理),可以直接输出感知结果和轨迹控制输出。实验表明,Percept-WAM 在下游感知基准测试中与经典检测器和分割器相当或超越,分别在COCO 2D检测和nuScenes BEV 3D检测中达到51.7/58.9的mAP。当与轨迹解码器集成时,它进一步提高了nuScenes和NAVSIM上的规划性能,例如在NAVSIM上的PMDS上超越DiffusionDrive 2.1。定性结果进一步突显了其强大的开放词汇和长尾泛化能力。
Summary / 总结
Percept-WAM aims to make end-to-end autonomous driving more robust by integrating 2D/3D scene understanding within a single vision-language model. It introduces World-PV and World-BEV tokens that encode spatial coordinates and confidence, and uses a grid-conditioned prediction mechanism with IoU-aware scoring and parallel autoregressive decoding for dense object perception. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on perception benchmarks (51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection) and improves planning performance on nuScenes and NAVSIM, surpassing DiffusionDrive by 2.1 in PMDS on NAVSIM.
Percept-WAM 是一种通过在单一视觉语言模型中整合 2D/3D 场景理解来增强自动驾驶系统的空间感知和定位能力的方法。它引入了 World-PV 和 World-BEV 标记来编码空间坐标和置信度,并使用网格条件预测机制进行密集对象感知。实验表明,Percept-WAM 在感知基准测试中达到或超越经典检测器和分割器,并提高了 nuScenes 和 NAVSIM 上的规划性能,在 NAVSIM 的 PMDS 上超越 DiffusionDrive 2.1。
Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering
Authors: Federico Felizzi, Olivia Riccomi, Michele Ferramola, Francesco Andrea Causio, Manuel Del Medico, Vittorio De Vita, Lorenzo De Mori, Alessandra Piscitelli, Pietro Eric Risuleo, Bianca Destro Castaniti, Antonio Cristiano, Alessia Longo, Luigi De Angelis, Mariapia Vassalli, Marcello Di Pumpo
First: 2025-11-24T15:26:58+00:00 · Latest: 2025-11-24T15:26:58+00:00
Comments: Accepted at the Workshop on Multimodal Representation Learning for Healthcare (MMRL4H), EurIPS 2025
Abstract
Large vision language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding with a 27.9pp accuracy drop (83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5pp, 2.4pp, and 5.6pp respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.
中文标题/摘要
标题:大型视觉语言模型在医学图像中是否真正接地?来自意大利临床视觉问答的证据
大型视觉语言模型(VLMs)在医学视觉问答基准测试中取得了令人印象深刻的性能,但它们对视觉信息的依赖性仍然不清楚。我们通过测试四种最先进的模型:Claude Sonnet 4.5、GPT-4o、GPT-5-mini 和 Gemini 2.0 flash exp,来调查这些模型在回答意大利医学问题时是否真正展示了视觉接地。我们使用了来自欧洲MedQA意大利数据集的60个问题,这些问题是明确需要图像解释的。我们用空白占位符替换正确的医学图像,以测试模型是否真正整合了视觉和文本信息。我们的结果显示了视觉依赖性的显著差异:GPT-4o 表现出了最强的视觉接地,准确率下降了27.9个百分点(从83.2% [74.6%, 91.7%] 降至55.3% [44.1%, 66.6%]),而GPT-5-mini、Gemini 和 Claude 的准确率下降幅度分别为8.5个百分点、2.4个百分点和5.6个百分点。对模型生成的推理分析显示,所有模型都自信地解释了虚构的视觉解释,这表明它们在不同程度上依赖于文本捷径而非真正的视觉分析。这些发现突显了模型稳健性的重要差异,并强调在临床部署前需要进行严格的评估。
Summary / 总结
This study tests whether large vision language models (VLMs) genuinely rely on images when answering Italian medical questions by replacing the correct medical images with blank placeholders. Four state-of-the-art models (Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp) were evaluated on 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation. Visual dependency varied widely: GPT-4o showed the strongest grounding, with accuracy falling 27.9 percentage points (83.2% to 55.3%), while GPT-5-mini, Gemini, and Claude dropped by only 8.5, 2.4, and 5.6 points respectively. All models produced confident explanations for fabricated visual interpretations, suggesting varying reliance on textual shortcuts and underscoring the need for rigorous evaluation before clinical deployment.
研究评估了大型视觉语言模型(VLMs)在意大利医学图像问题上的视觉接地能力。通过用占位符替换正确图像,研究发现模型在视觉依赖性上存在显著差异,GPT-4o表现出最强的视觉接地能力,而GPT-5-mini、Gemini和Claude尽管有轻微的性能下降,但仍保持了高准确性。模型生成的推理分析表明,它们依赖于文本捷径,强调了在临床应用前需要进行严格的评估。
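The "pp" figures in this entry are percentage-point differences between accuracy with the real images and accuracy with blank placeholders. The short sketch below reproduces the arithmetic for GPT-4o, the only model whose before and after accuracies are quoted in the abstract. The function names are illustrative, and the relative drop is an extra derived quantity that the abstract does not report.

```python
def accuracy_drop_pp(with_images, without_images):
    """Absolute drop in percentage points, rounded to one decimal place."""
    return round(with_images - without_images, 1)


def relative_drop(with_images, without_images):
    """Drop expressed as a fraction of the original accuracy."""
    return (with_images - without_images) / with_images


# GPT-4o accuracies quoted in the abstract (percent).
gpt4o_with, gpt4o_blank = 83.2, 55.3
print(accuracy_drop_pp(gpt4o_with, gpt4o_blank))
print(relative_drop(gpt4o_with, gpt4o_blank))
```

A 27.9-point drop from 83.2% means GPT-4o lost roughly a third of its original accuracy once the images were removed. A 27.9% relative drop would be a smaller effect, which is why the unit matters.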
MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis in Chest X-Ray
Authors: Yitong Li, Morteza Ghahremani, Christian Wachinger
First: 2025-05-27T19:37:51+00:00 · Latest: 2025-11-24T15:19:38+00:00
Abstract
Recent vision-language foundation models deliver state-of-the-art results in natural image classification, but falter in medical images due to pronounced domain shifts. Training a medical foundation model also requires substantial resources, including extensive annotated data and high computational capacity. To bridge this gap with minimal overhead, we introduce MedBridge, a lightweight multimodal adaptation framework that flexibly re-purposes arbitrary pre-trained foundation VLMs for medical image diagnosis. MedBridge comprises three novel core components. First, a Focal Sampling module that subsamples and extracts high-resolution local regions to capture subtle pathological features, compensating for the limited input resolution of foundation VLMs. Second, a Query-Encoder model with a small set of learnable queries to align the feature maps of frozen VLMs with medical semantics, without requiring retraining of the backbone layers. Third, a Mixture of Experts mechanism, driven by learnable queries, harnesses the complementary strength of various VLMs to maximize diagnostic performance. We evaluate MedBridge on five chest radiograph benchmarks in three key adaptation tasks, demonstrating its superior performance in both cross-domain and in-domain adaptation settings under varying levels of training data availability. MedBridge achieved an improvement of 6-15% in AUC compared to state-of-the-art VLM adaptation methods in multi-label thoracic disease diagnosis, underscoring its effectiveness in leveraging diverse foundation models for accurate and data-efficient medical diagnosis. Our project and code are available at https://github.com/ai-med/MedBridge.
中文标题/摘要
标题:MedBridge:将基础视觉-语言模型与胸部X光医学图像诊断对接
近期的基础视觉-语言模型在自然图像分类中取得了最先进的成果,但在医学图像中由于领域差异显著而表现不佳。训练医学基础模型还需要大量的资源,包括大量的标注数据和高性能的计算能力。为了在最小的开销下弥合这一差距,我们引入了MedBridge,这是一种轻量级的多模态适应框架,可以灵活地重新利用任意预训练的基础视觉-语言模型(VLM)进行医学图像诊断。MedBridge 包含三个新颖的核心组件。首先,一个焦点采样模块,用于采样和提取高分辨率的局部区域,以捕捉细微的病理特征,弥补基础VLM输入分辨率有限的问题。其次,一个具有少量可学习查询的查询编码器模型,用于将冻结的VLM的特征图与医学语义对齐,无需重新训练骨干层。第三,一个由可学习查询驱动的专家混合机制,利用各种VLM的互补优势,最大化诊断性能。我们在三个关键适应任务中的五个胸部X光基准上评估了MedBridge,展示了其在不同训练数据可用性下的跨域和域内适应设置中的优越性能。在多标签胸腔疾病诊断中,MedBridge 的AUC相比最先进的VLM适应方法提高了6-15%,突显了其利用多种基础模型进行准确和数据高效医学诊断的有效性。我们的项目和代码可在https://github.com/ai-med/MedBridge/ 获取。
Summary / 总结
MedBridge is a lightweight multimodal adaptation framework that repurposes pre-trained vision-language models for medical image diagnosis, particularly in chest X-rays. It includes a Focal Sampling module for extracting high-resolution local regions, a Query-Encoder model to align feature maps with medical semantics, and a Mixture of Experts mechanism to leverage the strengths of different models. MedBridge outperforms state-of-the-art VLM adaptation methods by 6-15% in AUC for multi-label thoracic disease diagnosis across various data availability settings.
MedBridge 是一个轻量级的多模态适应框架,旨在利用预训练的视觉-语言模型进行医学影像诊断,特别是在胸部X光片上。它包括一个焦点采样模块来捕捉细微的病理特征,一个查询编码器模型来对齐特征图与医学语义,以及一个专家混合机制来结合不同模型的优势。MedBridge 在多标签胸腔疾病诊断的各种基准和数据可用性场景中,相比最先进的 VLM 调整方法在 AUC 上提高了 6-15%。
Distributionally Robust Free Energy Principle for Decision-Making
Authors: Allahkaram Shafiei, Hozefa Jesawada, Karl Friston, Giovanni Russo
First: 2025-03-17T14:36:08+00:00 · Latest: 2025-11-24T15:19:30+00:00
Comments: Contains main text and supplementary information. Supplementary movie is at the paper repository
Abstract
Despite their groundbreaking performance, autonomous agents can misbehave when training and environmental conditions become inconsistent, with minor mismatches leading to undesirable behaviors or even catastrophic failures. Robustness towards these training-environment ambiguities is a core requirement for intelligent agents and its fulfillment is a long-standing challenge towards their real-world deployments. Here, we introduce a Distributionally Robust Free Energy model (DR-FREE) that instills this core property by design. Combining a robust extension of the free energy principle with a resolution engine, DR-FREE wires robustness into the agent decision-making mechanisms. Across benchmark experiments, DR-FREE enables the agents to complete the task even when, in contrast, state-of-the-art models fail. This milestone may inspire both deployments in multi-agent settings and, at a perhaps deeper level, the quest for an explanation of how natural agents -- with little or no training -- survive in capricious environments.
中文标题/摘要
标题:分布鲁棒自由能原理在决策中的应用
尽管自主代理在性能上取得了突破性进展,但在训练和环境条件不一致时,它们可能会表现出不良行为,即使是微小的不匹配也可能导致不良行为或灾难性故障。对这些训练-环境模糊性的鲁棒性是智能代理的核心要求,其实现是将其部署到现实世界中的长期挑战。在此,我们提出了一种分布鲁棒自由能模型(DR-FREE),通过设计赋予其这一核心特性。结合鲁棒扩展的自由能原理与解决引擎,DR-FREE 将鲁棒性嵌入到代理的决策机制中。在基准实验中,DR-FREE 使代理即使在最先进的模型失败的情况下也能完成任务。这一里程碑可能激发多代理环境中的部署,并在更深层次上激励对自然代理如何在多变环境中生存的解释探索。
Summary / 总结
The research addresses the issue of autonomous agents misbehaving due to minor mismatches between training and environmental conditions. It introduces DR-FREE, a Distributionally Robust Free Energy model that combines a robust extension of the free energy principle with a resolution engine. The model ensures robust decision-making mechanisms, enabling agents to complete tasks even when state-of-the-art models fail. This robustness is demonstrated across benchmark experiments, potentially advancing the deployment of agents in complex and unpredictable environments.
研究针对自主代理因训练与环境条件微小不匹配而表现出异常行为的问题。引入了DR-FREE模型,该模型结合了扩展后的自由能原理和解决引擎,确保代理的决策机制具有鲁棒性。该模型使代理能够在任务中表现良好,即使最先进的模型会失败。这种鲁棒性在基准实验中得到了验证,可能推动代理在复杂和不可预测环境中部署的进步。
Can Modern Vision Models Understand the Difference Between an Object and a Look-alike?
Authors: Itay Cohen, Ethan Fetaya, Amir Rosenfeld
First: 2025-11-24T15:09:32+00:00 · Latest: 2025-11-24T15:09:32+00:00
Abstract
Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired "real"/"lookalike" prompts. We then estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual12M, and also enhances captions produced by a CLIP prefix captioner.
中文标题/摘要
标题:现代视觉模型能否理解对象与其相似物之间的区别?
计算机视觉领域的最新进展产生了在识别基准测试中表现出色的模型;然而,在与人类感知的比较中仍存在显著差距。一种微妙的能力是判断一张图片看起来像某个对象但又不是该对象的实例。我们研究视觉-语言模型(如CLIP)是否捕捉到这种区别。我们整理了一个名为RoLA(真实或相似物)的数据集,包含多个类别中的真实和相似物实例(例如,玩具、雕像、绘画、拟态现象),并首先评估基于提示的基本方法,使用配对的“真实”/“相似物”提示。然后我们在CLIP的嵌入空间中估计一个方向,该方向在真实和相似物之间移动表示。将此方向应用于图像和文本嵌入,在Conceptual12M的跨模态检索中提高了区分能力,并且也增强了由CLIP前缀生成的描述。
Summary / 总结
The study investigates whether modern vision models, specifically CLIP, can distinguish between an object and its look-alike. A dataset named RoLA was created to evaluate this ability. The research first tested a prompt-based approach and then identified a direction in CLIP's embedding space to enhance the distinction between real and lookalike objects. This improvement was observed in cross-modal retrieval and caption generation tasks.
研究探讨了现代视觉模型CLIP是否能够区分一个物体与其相似物。研究人员创建了一个名为RoLA的数据集来评估这一能力。首先测试了基于提示的方法,然后在CLIP的嵌入空间中确定了一个方向来增强对真实物体与相似物区别的区分。这种改进在跨模态检索和生成描述中得到了体现。
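The idea of a direction in embedding space that moves representations between "real" and "lookalike" can be sketched as follows. The abstract does not say how the direction is estimated; taking the difference of class means is an assumption made here for illustration, and the function names and toy 2-D vectors are hypothetical stand-ins for CLIP embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def estimate_direction(real_embs, lookalike_embs):
    """Unit vector pointing from the 'real' mean to the 'lookalike' mean (assumed estimator)."""
    real_mean = mean_vector(real_embs)
    lookalike_mean = mean_vector(lookalike_embs)
    return normalize([l - r for l, r in zip(lookalike_mean, real_mean)])


def shift(embedding, direction, alpha):
    """Move an embedding along the direction, then re-normalize it."""
    return normalize([e + alpha * d for e, d in zip(embedding, direction)])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy embeddings standing in for image or text features.
real = [[1.0, 0.0], [0.8, 0.2]]
lookalike = [[0.0, 1.0], [0.2, 0.8]]
direction = estimate_direction(real, lookalike)
original = normalize([1.0, 0.0])
shifted = shift(original, direction, alpha=1.0)
print(dot(original, direction), dot(shifted, direction))
```

A positive `alpha` pushes an embedding toward the lookalike side and a negative one toward the real side; how the shifted embeddings are then used for retrieval or captioning is described only at a high level in the abstract.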
Test-Time Preference Optimization for Image Restoration
Authors: Bingchen Li, Xin Li, Jiaqi Xu, Jiaming Guo, Wenbo Li, Renjing Pei, Zhibo Chen
First: 2025-11-24T14:32:27+00:00 · Latest: 2025-11-24T14:32:27+00:00
Comments: Accepted by AAAI26
Abstract
Image restoration (IR) models are typically trained to recover high-quality images using L1 or LPIPS loss. To handle diverse unknown degradations, zero-shot IR methods have also been introduced. However, existing pre-trained and zero-shot IR approaches often fail to align with human preferences, resulting in restored images that may not be favored. This highlights the critical need to enhance restoration quality and adapt flexibly to various image restoration tasks or backbones without requiring model retraining and ideally without labor-intensive preference data collection. In this paper, we propose the first Test-Time Preference Optimization (TTPO) paradigm for image restoration, which enhances perceptual quality, generates preference data on-the-fly, and is compatible with any IR model backbone. Specifically, we design a training-free, three-stage pipeline: (i) generate candidate preference images online using diffusion inversion and denoising based on the initially restored image; (ii) select preferred and dispreferred images using automated preference-aligned metrics or human feedback; and (iii) use the selected preference images as reward signals to guide the diffusion denoising process, optimizing the restored image to better align with human preferences. Extensive experiments across various image restoration tasks and models demonstrate the effectiveness and flexibility of the proposed pipeline.
中文标题/摘要
标题:图像恢复中的测试时偏好优化
图像恢复(IR)模型通常使用L1或LPIPS损失来训练以恢复高质量图像。为了处理各种未知退化,还引入了零样本IR方法。然而,现有的预训练和零样本IR方法往往无法与人类偏好对齐,导致恢复的图像可能不受青睐。这突显了提高恢复质量和灵活适应各种图像恢复任务或骨干网络的迫切需要,而无需重新训练模型,最好也不需要大量的人类偏好数据收集。在本文中,我们提出了第一个用于图像恢复的测试时偏好优化(TTPO)范式,该范式提高了感知质量,能够实时生成偏好数据,并与任何IR模型骨干兼容。具体而言,我们设计了一个无需训练的三阶段管道:(i) 使用扩散反转和去噪基于初始恢复的图像在线生成候选偏好图像;(ii) 使用自动偏好对齐的度量或人类反馈选择偏好和非偏好图像;(iii) 使用所选的偏好图像作为奖励信号来指导扩散去噪过程,优化恢复的图像以更好地与人类偏好对齐。在各种图像恢复任务和模型上的广泛实验表明,所提出的管道的有效性和灵活性。
Summary / 总结
This paper addresses the issue of image restoration models not aligning with human preferences, proposing a Test-Time Preference Optimization (TTPO) paradigm. The method involves a three-stage pipeline: generating candidate preference images using diffusion inversion and denoising, selecting preferred and dispreferred images using automated metrics or human feedback, and optimizing the restored image using the selected preference images as reward signals. Experiments show the effectiveness and flexibility of the proposed approach across various image restoration tasks and models.
本文针对图像恢复模型与人类偏好不一致的问题,提出了测试时偏好优化(TTPO)范式。方法包括三个阶段:使用扩散反演和去噪生成候选偏好图像,使用自动化指标或人工反馈选择偏好和非偏好图像,并以选定的偏好图像作为奖励信号优化恢复图像。实验表明,该方法在不同的图像恢复任务和模型上均有效且灵活。
EEG-VLM: A Hierarchical Vision-Language Model with Multi-Level Feature Alignment and Visually Enhanced Language-Guided Reasoning for EEG Image-Based Sleep Stage Prediction
Authors: Xihe Qiu, Gengchen Ma, Haoyu Wang, Chen Zhan, Xiaoyu Tan, Shuo Li
First: 2025-11-24T14:23:42+00:00 · Latest: 2025-11-24T14:23:42+00:00
Abstract
Sleep stage classification based on electroencephalography (EEG) is fundamental for assessing sleep quality and diagnosing sleep-related disorders. However, most traditional machine learning methods rely heavily on prior knowledge and handcrafted features, while existing deep learning models still struggle to jointly capture fine-grained time-frequency patterns and achieve clinical interpretability. Recently, vision-language models (VLMs) have made significant progress in the medical domain, yet their performance remains constrained when applied to physiological waveform data, especially EEG signals, due to their limited visual understanding and insufficient reasoning capability. To address these challenges, we propose EEG-VLM, a hierarchical vision-language framework that integrates multi-level feature alignment with visually enhanced language-guided reasoning for interpretable EEG-based sleep stage classification. Specifically, a specialized visual enhancement module constructs high-level visual tokens from intermediate-layer features to extract rich semantic representations of EEG images. These tokens are further aligned with low-level CLIP features through a multi-level alignment mechanism, enhancing the VLM's image-processing capability. In addition, a Chain-of-Thought (CoT) reasoning strategy decomposes complex medical inference into interpretable logical steps, effectively simulating expert-like decision-making. Experimental results demonstrate that the proposed method significantly improves both the accuracy and interpretability of VLMs in EEG-based sleep stage classification, showing promising potential for automated and explainable EEG analysis in clinical settings.
中文标题/摘要
标题:EEG-VLM:一种用于基于EEG图像的睡眠阶段预测的分层视觉-语言模型,具有多级特征对齐和视觉增强的语言引导推理
基于脑电图(EEG)的睡眠阶段分类是评估睡眠质量和诊断睡眠相关障碍的基础。然而,大多数传统的机器学习方法依赖于先验知识和手工特征,而现有的深度学习模型仍然难以同时捕捉细微的时间-频率模式并实现临床可解释性。最近,视觉-语言模型(VLMs)在医疗领域取得了显著进展,但在应用于生理波形数据,尤其是EEG信号时,由于其有限的视觉理解和不足的推理能力,其性能仍然受到限制。为了解决这些挑战,我们提出了一种分层视觉-语言框架EEG-VLM,该框架结合了多级特征对齐和视觉增强的语言引导推理,以实现可解释的基于EEG的睡眠阶段分类。具体而言,一个专门的视觉增强模块从中间层特征中构建高级视觉标记,以提取EEG图像的丰富语义表示。这些标记通过多级对齐机制进一步与低级CLIP特征对齐,增强VLM的图像处理能力。此外,一种链式思考(CoT)推理策略将复杂的医学推理分解为可解释的逻辑步骤,有效地模拟专家级决策。实验结果表明,所提出的方法在基于EEG的睡眠阶段分类中显著提高了VLM的准确性和可解释性,展示了在临床环境中实现自动化和可解释的EEG分析的潜力。
Summary / 总结
The research aims to improve the accuracy and interpretability of sleep stage classification using EEG data. It proposes EEG-VLM, a hierarchical vision-language model that integrates multi-level feature alignment and visually enhanced language-guided reasoning. The model constructs high-level visual tokens from intermediate-layer features to extract rich semantic representations of EEG images, aligns these tokens with low-level CLIP features, and employs a Chain-of-Thought reasoning strategy to enhance interpretability. Experimental results show that EEG-VLM significantly improves the accuracy and interpretability of VLMs in EEG-based sleep stage classification.
研究旨在通过将分层视觉语言模型(VLM)与多级特征对齐和视觉增强的语言引导推理相结合,提高基于EEG的睡眠阶段分类的准确性和可解释性。方法包括一个视觉增强模块,该模块从中间层特征中构建高级视觉标记以提取EEG图像的丰富语义表示,然后通过多级对齐机制与低级CLIP特征对齐。此外,采用链式思考推理策略将复杂的医学推理分解为可解释的逻辑步骤。实验结果表明,所提出的EEG-VLM方法在睡眠阶段分类中的准确性和可解释性显著提高,展示了在临床环境中实现自动化和可解释的EEG分析的潜力。
Collaborative Learning with Multiple Foundation Models for Source-Free Domain Adaptation
Authors: Huisoo Lee, Jisu Han, Hyunsouk Cho, Wonjun Hwang
First: 2025-11-24T14:12:22+00:00 · Latest: 2025-11-24T14:12:22+00:00
Comments: 15 pages, 8 figures
Abstract
Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to an unlabeled target domain without access to source data. Recent advances in Foundation Models (FMs) have introduced new opportunities for leveraging external semantic knowledge to guide SFDA. However, relying on a single FM is often insufficient, as it tends to bias adaptation toward a restricted semantic coverage, failing to capture diverse contextual cues under domain shift. To overcome this limitation, we propose a Collaborative Multi-foundation Adaptation (CoMA) framework that jointly leverages two different FMs (e.g., CLIP and BLIP) with complementary properties to capture both global semantics and local contextual cues. Specifically, we employ a bidirectional adaptation mechanism that (1) aligns different FMs with the target model for task adaptation while maintaining their semantic distinctiveness, and (2) transfers complementary knowledge from the FMs to the target model. To ensure stable adaptation under mini-batch training, we introduce Decomposed Mutual Information (DMI) that selectively enhances true dependencies while suppressing false dependencies arising from incomplete class coverage. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art SFDA methods across four benchmarks, including Office-31, Office-Home, DomainNet-126, and VisDA, under the closed-set setting, while also achieving best results on partial-set and open-set variants.
中文标题/摘要
标题:面向无源域适应的多基础模型协作学习
无源域适应(SFDA)旨在无需访问源数据的情况下,将预训练的源模型适应未标记的目标域。基础模型(FMs)的最新进展引入了利用外部语义知识指导SFDA的新机会。然而,依赖单一FM往往不够,因为它倾向于将适应偏向于有限的语义覆盖,无法捕捉域转移下的多样上下文线索。为克服这一限制,我们提出了一种协作多基础模型适应(CoMA)框架,该框架联合利用两种具有互补特性的不同FM(例如,CLIP和BLIP),以捕捉全局语义和局部上下文线索。具体而言,我们采用双向适应机制,(1)将不同FM与目标模型对齐以进行任务适应,同时保持其语义独特性,(2)将FM中的互补知识转移到目标模型。为了确保在小批量训练下的稳定适应,我们引入了分解互信息(DMI),以选择性地增强真实依赖关系并抑制由于类别覆盖不完整而产生的虚假依赖关系。广泛的实验表明,在闭集设置下,我们的方法在Office-31、Office-Home、DomainNet-126和VisDA四个基准测试中始终优于现有最先进的SFDA方法,同时在部分集和开放集变体中也取得了最佳结果。
Summary / 总结
The paper addresses Source-Free Domain Adaptation (SFDA) by proposing a Collaborative Multi-foundation Adaptation (CoMA) framework that uses two different Foundation Models (FMs) with complementary properties to capture both global semantics and local contextual cues. The method employs a bidirectional adaptation mechanism and introduces Decomposed Mutual Information (DMI) to keep adaptation stable under mini-batch training. Experiments show that CoMA outperforms existing SFDA methods on Office-31, Office-Home, DomainNet-126, and VisDA under the closed-set setting, and also achieves the best results on partial-set and open-set variants.
论文提出了一种协作多基础模型适应(CoMA)框架,以解决无源域适应(SFDA)的问题。该框架利用两种具有互补特性的基础模型来捕捉全局语义和局部上下文线索。方法采用了双向适应机制,并引入了分解互信息(DMI)以确保在小批量训练下的稳定适应。实验表明,CoMA在多个基准测试中,包括Office-31、Office-Home、DomainNet-126和VisDA,在闭集和部分集/开放集设置下均优于现有SFDA方法。
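As background for the CoMA entry above, the sketch below estimates ordinary mutual information between the soft predictions of two models over one mini-batch. It is not the paper's Decomposed Mutual Information: the abstract does not give the decomposition, so this shows only the plain quantity such an objective would start from. The function name and the toy batches are assumptions made for illustration.

```python
import math

def mutual_information(p_batch, q_batch, eps=1e-12):
    """Plain mutual information between two models' soft predictions.

    p_batch, q_batch: lists of per-sample class-probability lists (same shape).
    The joint distribution is estimated by averaging outer products over the batch.
    """
    n = len(p_batch)
    k = len(p_batch[0])
    joint = [[0.0] * k for _ in range(k)]
    for p, q in zip(p_batch, q_batch):
        for a in range(k):
            for b in range(k):
                joint[a][b] += p[a] * q[b] / n
    marg_p = [sum(row) for row in joint]
    marg_q = [sum(joint[a][b] for a in range(k)) for b in range(k)]
    mi = 0.0
    for a in range(k):
        for b in range(k):
            if joint[a][b] > eps:  # skip empty cells, where 0 * log(0) is taken as 0
                mi += joint[a][b] * math.log(joint[a][b] / (marg_p[a] * marg_q[b]))
    return mi

# Toy batches (hypothetical): two models that agree perfectly on two balanced classes,
# and a second model that is uninformative.
agree = mutual_information([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
uninformative = mutual_information([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]])
print(agree, uninformative)
```

Because the joint is estimated from a single mini-batch, classes absent from the batch contribute empty rows and columns, which is the incomplete class coverage the abstract names as a source of false dependencies.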
ABM-LoRA: Activation Boundary Matching for Fast Convergence in Low-Rank Adaptation
Authors: Dongha Lee, Jinhee Park, Minjun Kim, Junseok Kwon
First: 2025-11-24T14:09:42+00:00 · Latest: 2025-11-24T14:09:42+00:00
Comments: 16 pages, 5 figures, under review
Abstract
We propose Activation Boundary Matching for Low-Rank Adaptation (ABM-LoRA), a principled initialization strategy that substantially accelerates the convergence of low-rank adapters. While LoRA offers high parameter efficiency, its random initialization restricts gradient updates to a mismatched tangent space, causing significant information loss and hindering early convergence. Our ABM-LoRA addresses this by aligning the adapter's activation boundaries with those of the pretrained model before downstream training, thereby maximizing the projection of full-parameter gradients into the adapter subspace. This alignment sharply reduces information loss at initialization, yields a lower starting loss, and accelerates convergence. We demonstrate ABM-LoRA's effectiveness across diverse architectures and tasks: language understanding (T5-Base on GLUE), dialogue generation (LLaMA2-7B on WizardLM), and vision recognition (ViT-B/16 on VTAB-1K). On VTAB-1K, it achieves the highest accuracy among all methods, with strong gains on structured reasoning tasks requiring geometric understanding.
中文标题/摘要
标题:ABM-LoRA:低秩适应中的激活边界匹配以实现快速收敛
我们提出了低秩适应中的激活边界匹配(ABM-LoRA),这是一种原理性的初始化策略,显著加速了低秩适配器的收敛速度。尽管LoRA具有高参数效率,但其随机初始化限制了梯度更新到一个不匹配的切空间中,导致大量信息丢失并阻碍了早期收敛。我们的ABM-LoRA通过在下游训练前使适配器的激活边界与预训练模型的激活边界对齐,从而最大化全参数梯度在适配器子空间中的投影。这种对齐在初始化时大幅减少了信息丢失,降低了初始损失,并加速了收敛。我们在多种架构和任务上展示了ABM-LoRA的有效性:语言理解(T5-Base在GLUE上),对话生成(LLaMA2-7B在WizardLM上),以及视觉识别(ViT-B/16在VTAB-1K上)。在VTAB-1K上,它在所有方法中达到了最高的准确性,并在需要几何理解的结构化推理任务上取得了显著的提升。
Summary / 总结
ABM-LoRA initializes low-rank adapters by aligning their activation boundaries with those of the pretrained model before downstream training, which reduces information loss at initialization and accelerates convergence. Experiments span language understanding (T5-Base on GLUE), dialogue generation (LLaMA2-7B on WizardLM), and vision recognition (ViT-B/16 on VTAB-1K). On VTAB-1K it achieves the highest accuracy among all compared methods, with strong gains on structured reasoning tasks that require geometric understanding.
ABM-LoRA 在下游训练前使低秩适配器的激活边界与预训练模型的激活边界对齐,从而减少初始化时的信息损失并加速收敛。实验涵盖语言理解(T5-Base,GLUE)、对话生成(LLaMA2-7B,WizardLM)和视觉识别(ViT-B/16,VTAB-1K)。在 VTAB-1K 上,它在所有对比方法中取得了最高准确率,并在需要几何理解的结构化推理任务上提升明显。
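For context on the ABM-LoRA entry above, the sketch below shows the standard LoRA forward pass with its usual initialization (random `A`, zero `B`), which is the baseline the abstract contrasts against. It does not implement activation boundary matching, whose procedure the abstract does not specify. The dimensions, scaling value, and helper names are assumptions for illustration.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    """Standard LoRA layer: y = W x + (alpha / r) * B (A x), with W kept frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

rng = random.Random(0)
d_out, d_in, r = 3, 4, 2                      # illustrative sizes only
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen weight
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # random init
B = [[0.0] * r for _ in range(d_out)]                                # zero init
x = [1.0, -2.0, 0.5, 3.0]

y = lora_forward(W, A, B, x, alpha=16, r=r)
y_base = matvec(W, x)
print(y, y_base)
```

With `B` at zero the adapter contributes nothing at the first step, so the layer starts exactly at the pretrained output; the direction of early updates is then determined by the random `A`.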
MonoSR: Open-Vocabulary Spatial Reasoning from Monocular Images
Authors: Qirui Wang, Jingyi He, Yining Pan, Si Yong Yeo, Xulei Yang, Shijie Li
First: 2025-11-24T13:49:17+00:00 · Latest: 2025-11-24T13:49:17+00:00
Abstract
Spatial reasoning (SR), the ability to infer 3D spatial information from 2D inputs, is essential for real-world applications such as embodied AI and autonomous driving. However, existing research primarily focuses on indoor environments and typically relies on multi-view observations, which limits their generalizability to outdoor scenarios and constrains their applicability to monocular images, the most common real-world setting. In this work, we propose MonoSR, a large-scale monocular spatial reasoning dataset that spans diverse scenarios including indoor, outdoor, and object-centric settings, and supports multiple question types. MonoSR provides a path toward open-world monocular spatial reasoning. Beyond introducing the dataset, we evaluate advanced vision-language models to reveal their limitations on this challenging task. We further analyze whether auxiliary information is crucial for monocular spatial reasoning and offer practical guidance for designing future models. These contributions collectively establish a foundation for advancing monocular spatial reasoning in real-world, open-world environments.
中文标题/摘要
标题:MonoSR:基于单目图像的开放词汇空间推理
空间推理(SR),从二维输入推断三维空间信息的能力,对于实际应用如具身AI和自动驾驶至关重要。然而,现有研究主要集中在室内环境,并通常依赖多视角观测,这限制了其在室外场景中的泛化能力和对单目图像的适用性,单目图像是最常见的实际应用场景。在本文中,我们提出了MonoSR,这是一个大规模的单目空间推理数据集,涵盖了包括室内、室外和对象中心在内的多种场景,并支持多种问题类型。MonoSR 为开放世界单目空间推理提供了途径。除了介绍数据集外,我们还评估了先进的视觉-语言模型,揭示了它们在这一具有挑战性的任务上的局限性。我们进一步分析了辅助信息对于单目空间推理是否至关重要,并提供了设计未来模型的实用指导。这些贡献共同为推进单目空间推理在实际、开放世界环境中的发展奠定了基础。
Summary / 总结
The research addresses a limitation of existing spatial reasoning work, which focuses mainly on indoor environments and multi-view observations and therefore transfers poorly to outdoor scenarios and monocular images. The authors propose MonoSR, a large-scale monocular spatial reasoning dataset covering indoor, outdoor, and object-centric settings with multiple question types, and evaluate advanced vision-language models on it, revealing their limitations. The study also analyzes whether auxiliary information is crucial for monocular spatial reasoning and provides guidance for future model design.
研究动机是解决现有空间推理方法主要集中在室内环境和多视角观察上的局限性,这限制了它们在户外场景和单目图像中的泛化能力。主要方法是创建一个涵盖室内、室外和以对象为中心等多种场景的大型单目空间推理数据集MonoSR,并评估先进的视觉-语言模型在该任务上的表现以识别其局限性。研究还分析了辅助信息对单目空间推理是否至关重要,并为未来模型设计提供了实用指导。
DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection
Authors: Hai Ci, Ziheng Peng, Pei Yang, Yingxin Xuan, Mike Zheng Shou
First: 2025-11-24T13:43:54+00:00 · Latest: 2025-11-24T13:43:54+00:00
Comments: 16 pages, 10 figures
Abstract
Diffusion-based editing enables realistic modification of local image regions, making AI-generated content harder to detect. Existing AIGC detection benchmarks focus on classifying entire images, overlooking the localization of diffusion-based edits. We introduce DiffSeg30k, a publicly available dataset of 30k diffusion-edited images with pixel-level annotations, designed to support fine-grained detection. DiffSeg30k features: 1) In-the-wild images--we collect images or image prompts from COCO to reflect real-world content diversity; 2) Diverse diffusion models--local edits using eight SOTA diffusion models; 3) Multi-turn editing--each image undergoes up to three sequential edits to mimic real-world sequential editing; and 4) Realistic editing scenarios--a vision-language model (VLM)-based pipeline automatically identifies meaningful regions and generates context-aware prompts covering additions, removals, and attribute changes. DiffSeg30k shifts AIGC detection from binary classification to semantic segmentation, enabling simultaneous localization of edits and identification of the editing models. We benchmark three baseline segmentation approaches, revealing significant challenges in semantic segmentation tasks, particularly concerning robustness to image distortions. Experiments also reveal that segmentation models, despite being trained for pixel-level localization, emerge as highly reliable whole-image classifiers of diffusion edits, outperforming established forgery classifiers while showing great potential in cross-generator generalization. We believe DiffSeg30k will advance research in fine-grained localization of AI-generated content by demonstrating the promise and limitations of segmentation-based methods. DiffSeg30k is released at: https://huggingface.co/datasets/Chaos2629/Diffseg30k
中文标题/摘要
标题:DiffSeg30k:基于多轮扩散编辑的局部AIGC检测基准
基于扩散的编辑能够对局部图像区域进行逼真的修改,使AI生成的内容更难被检测。现有的AIGC检测基准主要集中在对整个图像进行分类,忽视了扩散编辑的定位。我们引入了DiffSeg30k,这是一个包含30,000张带有像素级注释的扩散编辑图像的公开数据集,旨在支持精细粒度的检测。DiffSeg30k特点:1) 野外图像——我们从COCO收集图像或图像提示以反映现实世界的多样性;2) 多种扩散模型——使用八种SOTA扩散模型进行局部编辑;3) 多轮编辑——每张图像最多进行三次连续编辑以模拟实际的连续编辑;4) 现实的编辑场景——基于视觉语言模型(VLM)的流水线自动识别有意义的区域并生成涵盖添加、删除和属性更改的上下文感知提示。DiffSeg30k将AIGC检测从二分类任务转变为语义分割,使编辑的定位和编辑模型的识别同时成为可能。我们基准测试了三种基本的分割方法,揭示了语义分割任务中的重大挑战,特别是在图像失真鲁棒性方面。实验还表明,尽管分割模型是为像素级定位训练的,但它们在扩散编辑的整个图像分类中表现出高度的可靠性,超越了现有的伪造分类器,并显示出在跨生成器泛化的巨大潜力。我们相信,DiffSeg30k将通过展示基于分割方法的潜力和局限性,推动对AI生成内容的精细定位研究。DiffSeg30k发布于:https://huggingface.co/datasets/Chaos2629/Diffseg30k
Summary / 总结
The research introduces DiffSeg30k, a dataset of 30k diffusion-edited images with pixel-level annotations, to support fine-grained detection of AI-generated content. The dataset includes in-the-wild content drawn from COCO, local edits by eight diffusion models, multi-turn editing (up to three sequential edits per image), and realistic editing scenarios produced by a VLM-based pipeline. It shifts AIGC detection from binary classification to semantic segmentation, enabling simultaneous localization of edits and identification of the editing models. Benchmarks of three baseline segmentation approaches reveal significant challenges, particularly in robustness to image distortions. Segmentation models, although trained for pixel-level localization, also prove highly reliable as whole-image classifiers of diffusion edits, outperforming established forgery classifiers and showing potential for cross-generator generalization.
研究引入了包含30k扩散编辑图像和像素级注释的DiffSeg30k数据集,旨在支持对AI生成内容的精细检测。该数据集包括野外图像、多种扩散模型、多轮编辑和现实编辑场景。研究对三种分割方法进行了基准测试,突出了语义分割的挑战,并展示了分割模型作为扩散编辑可靠分类器的潜力。该数据集旨在推进对AI生成内容局部化的研究。
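To make the DiffSeg30k framing above concrete, the sketch below treats a predicted label map as the output of a segmentation model, where `0` means unedited and a positive integer identifies the editing model. It computes an edit-region IoU and derives a whole-image decision from the same map. The label encoding, the any-pixel decision rule, and the function names are assumptions; the abstract does not say how image-level predictions are obtained from the segmentation output.

```python
def image_level_prediction(label_map, min_pixels=1):
    """Derive a whole-image decision from a per-pixel label map.

    Returns (edited, model_ids): edited is True when at least `min_pixels`
    pixels carry a non-zero label; model_ids lists the labels that appear.
    """
    counts = {}
    for row in label_map:
        for label in row:
            if label != 0:
                counts[label] = counts.get(label, 0) + 1
    edited = sum(counts.values()) >= min_pixels
    return edited, sorted(counts)

def edit_iou(pred, target):
    """Intersection-over-union of the edited (non-zero) regions of two maps."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += (p != 0 and t != 0)
            union += (p != 0 or t != 0)
    return inter / union if union else 1.0

# Hypothetical 3x3 maps; label 2 stands for "edited by model 2".
pred = [[0, 0, 2],
        [0, 2, 2],
        [0, 0, 0]]
target = [[0, 0, 2],
          [0, 0, 2],
          [0, 0, 2]]

edited, models = image_level_prediction(pred)
iou = edit_iou(pred, target)
print(edited, models, iou)
```

A single label map thus answers three questions at once: whether the image was edited, where, and by which model, which is the shift from binary classification the entry describes.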
Neural Scaling Laws for Deep Regression
Authors: Tilen Cadez, Kyoung-Min Kim
First: 2025-09-12T06:49:19+00:00 · Latest: 2025-11-24T13:26:06+00:00
Comments: Supplementary Information will be provided with the published manuscript
Abstract
Neural scaling laws--power-law relationships between generalization errors and characteristics of deep learning models--are vital tools for developing reliable models while managing limited resources. Although the success of large language models highlights the importance of these laws, their application to deep regression models remains largely unexplored. Here, we empirically investigate neural scaling laws in deep regression using a parameter estimation model for twisted van der Waals magnets. We observe power-law relationships between the loss and both training dataset size and model capacity across a wide range of values, employing various architectures--including fully connected networks, residual networks, and vision transformers. Furthermore, the scaling exponents governing these relationships range from 1 to 2, with specific values depending on the regressed parameters and model details. The consistent scaling behaviors and their large scaling exponents suggest that the performance of deep regression models can improve substantially with increasing data size.
中文标题/摘要
标题:深度回归的神经缩放定律
神经缩放定律--泛化误差与深度学习模型特征之间的幂律关系--是开发可靠模型、同时管理有限资源的重要工具。尽管大型语言模型的成功突显了这些定律的重要性,但它们在深度回归模型中的应用仍鲜有探索。在这里,我们使用扭曲范德瓦尔斯磁体的参数估计模型,实证研究了深度回归中的神经缩放定律。我们观察到损失与训练数据集大小和模型容量之间存在幂律关系,涵盖了各种架构--包括全连接网络、残差网络和视觉变换器。此外,这些关系的缩放指数范围从1到2,具体值取决于回归参数和模型细节。一致的缩放行为及其较大的缩放指数表明,随着数据量的增加,深度回归模型的性能可以显著提高。
Summary / 总结
This study explores neural scaling laws in deep regression models by using a parameter estimation model for twisted van der Waals magnets. The research finds power-law relationships between the loss and both training dataset size and model capacity, with scaling exponents ranging from 1 to 2. These consistent scaling behaviors indicate that deep regression models can significantly improve their performance with larger datasets.
研究通过使用扭曲范德瓦尔斯磁体的参数估计模型,探索了深度回归模型中的神经缩放定律。研究发现损失与训练数据集大小和模型容量之间存在幂律关系,缩放指数范围从1到2。这些一致的缩放行为表明,通过使用更大的数据集,深度回归模型的性能可以显著提高。
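A power law of the kind described above, loss ≈ c · N^(−α), becomes a straight line in log-log space, so the exponent can be read off with an ordinary least-squares fit. The sketch below does this on synthetic, noise-free data; the constant 3.0 and exponent 1.5 are invented for illustration (1.5 simply falls inside the 1 to 2 range the abstract reports) and are not values from the paper.

```python
import math

def fit_power_law(sizes, losses):
    """Fit log(loss) = log(c) - alpha * log(size) by least squares.

    Returns (alpha, c) for the model loss = c * size ** (-alpha).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

# Synthetic data generated from loss = 3.0 * N^(-1.5); not measurements from the paper.
sizes = [1_000, 2_000, 4_000, 8_000, 16_000]
losses = [3.0 * n ** -1.5 for n in sizes]

alpha, c = fit_power_law(sizes, losses)
print(f"alpha = {alpha:.3f}, c = {c:.3f}")
```

With an exponent of 1.5, doubling the dataset divides the loss by 2^1.5 ≈ 2.8, which is why exponents in this range imply substantial gains from more data.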
Granular Computing-driven SAM: From Coarse-to-Fine Guidance for Prompt-Free Segmentation
Authors: Qiyang Yu, Yu Fang, Tianrui Li, Xuemei Cao, Yan Chen, Jianghao Li, Fan Min, Yi Zhang
First: 2025-11-24T12:55:02+00:00 · Latest: 2025-11-24T12:55:02+00:00
Comments: 19 pages, 7 figures
Abstract
Prompt-free image segmentation aims to generate accurate masks without manual guidance. Typical pre-trained models, notably Segment Anything Model (SAM), generate prompts directly at a single granularity level. However, this approach has two limitations: (1) Localizability, lacking mechanisms for autonomous region localization; (2) Scalability, limited fine-grained modeling at high resolution. To address these challenges, we introduce Granular Computing-driven SAM (Grc-SAM), a coarse-to-fine framework motivated by Granular Computing (GrC). First, the coarse stage adaptively extracts high-response regions from features to achieve precise foreground localization and reduce reliance on external prompts. Second, the fine stage applies finer patch partitioning with sparse local swin-style attention to enhance detail modeling and enable high-resolution segmentation. Third, refined masks are encoded as latent prompt embeddings for the SAM decoder, replacing handcrafted prompts with an automated reasoning process. By integrating multi-granularity attention, Grc-SAM bridges granular computing with vision transformers. Extensive experimental results demonstrate that Grc-SAM outperforms baseline methods in both accuracy and scalability. It offers a unique granular computational perspective for prompt-free segmentation.
中文标题/摘要
标题:粒度计算驱动的SAM:从粗到细的无提示分割引导
无提示图像分割旨在生成准确的掩码而无需手动指导。典型的预训练模型,尤其是分割一切模型(SAM),在单一粒度级别直接生成提示。然而,这种方法有两个局限性:(1)定位能力不足,缺乏自主区域定位机制;(2)可扩展性,高分辨率下的细粒度建模受限。为了解决这些挑战,我们引入了粒度计算驱动的SAM(Grc-SAM),这是一种受粒度计算(GrC)启发的从粗到细框架。首先,粗阶段自适应地从特征中提取高响应区域,以实现精确的前景定位并减少对外部提示的依赖。其次,细阶段应用更精细的局部分块,并使用稀疏局部Swin风格注意力增强细节建模,实现高分辨率分割。第三,细化的掩码被编码为SAM解码器的潜在提示嵌入,用自动推理过程替代手工制作的提示。通过集成多粒度注意力,Grc-SAM将粒度计算与视觉变换器相结合。广泛的实验结果表明,Grc-SAM在准确性和可扩展性方面均优于基线方法。它为无提示分割提供了独特的粒度计算视角。
Summary / 总结
The research aims to improve prompt-free image segmentation by addressing limitations in localizability and scalability. Grc-SAM, a coarse-to-fine framework, introduces multi-granularity attention to achieve precise foreground localization and enhance detail modeling. The coarse stage extracts high-response regions for foreground localization, while the fine stage uses sparse local swin-style attention for high-resolution segmentation. Refined masks are encoded as latent prompt embeddings, reducing reliance on external prompts. Experiments show Grc-SAM outperforms baseline methods in both accuracy and scalability.
研究旨在通过解决局部化能力和扩展性问题来改进无提示图像分割。方法Granular Computing-driven SAM (Grc-SAM) 引入了一种粗细两级框架。首先使用粗级阶段提取高响应区域以实现精确的前景定位,然后应用细级阶段的稀疏局部Swin风格注意力以增强细节建模。精炼的掩码被编码为潜在提示嵌入,实现自动化推理过程。实验表明Grc-SAM在准确性和扩展性方面优于基线方法。
MedSAM3: Delving into Segment Anything with Medical Concepts
Authors: Anglin Liu, Rundong Xue, Xu R. Cao, Yifan Shen, Yi Lu, Xiang Li, Qianqian Chen, Jintai Chen
First: 2025-11-24T12:34:38+00:00 · Latest: 2025-11-24T12:34:38+00:00
Abstract
Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical application. Here, we propose MedSAM-3, a text promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.
中文标题/摘要
标题:MedSAM3:深入探讨带有医学概念的分割一切
医学图像分割是生物医学发现的基础。现有方法缺乏泛化能力,并且需要为新的临床应用进行大量耗时的手动标注。在此,我们提出MedSAM-3,一种用于医学图像和视频分割的可文本提示医学分割模型。通过在医学图像与语义概念标签配对的数据上微调Segment Anything Model (SAM) 3架构,我们的MedSAM-3实现了医学提示概念分割(PCS),允许通过开放词汇的文本描述精确瞄准解剖结构,而不仅仅是几何提示。我们还引入了MedSAM-3代理,这是一种框架,将多模态大型语言模型(MLLMs)集成到代理在环的工作流中,以进行复杂推理和迭代细化。跨多种医学成像模态(包括X射线、MRI、超声、CT和视频)的全面实验表明,我们的方法在性能上显著优于现有专家模型和基础模型。我们将在https://github.com/Joey-S-Liu/MedSAM3/发布我们的代码和模型。
Summary / 总结
The research aims to improve the generalizability of medical image segmentation models and reduce the need for extensive manual annotation. It proposes MedSAM-3, a text-promptable model for medical image and video segmentation, obtained by fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels. The model supports Promptable Concept Segmentation (PCS): anatomical structures are targeted through open-vocabulary text descriptions rather than geometric prompts alone. The MedSAM-3 Agent adds Multimodal Large Language Models for complex reasoning and iterative refinement in an agent-in-the-loop workflow. Experiments across X-ray, MRI, ultrasound, CT, and video show that the approach significantly outperforms existing specialist and foundation models.
研究旨在提高医学图像分割模型的泛化能力并减少大量手动标注的需求。通过在配有语义概念标签的医学图像上微调Segment Anything Model (SAM) 3架构,提出了MedSAM-3,该模型能够通过开放词汇的文本描述精确分割解剖结构。MedSAM-3 Agent 进一步集成多模态大型语言模型,在代理在环工作流中进行复杂推理和迭代细化。跨X射线、MRI、超声、CT和视频等多种医学成像模态的实验表明,该方法显著优于现有的专家模型和基础模型。