Advancing site-specific disease and pest management in precision agriculture: From reasoning-driven foundation models to adaptive, feedback-based learning
Authors: Nitin Rai, Daeun Choi, Nathan S. Boyd, Arnold W. Schumann
First: 2025-10-28T17:16:47+00:00 · Latest: 2025-10-28T17:16:47+00:00
Comments: 26 pages, 8 figures, and 2 tables
Abstract
Site-specific disease management (SSDM) in crops has advanced rapidly through
machine and deep learning (ML and DL) for real-time computer vision. Research
evolved from handcrafted feature extraction to large-scale automated feature
learning. With foundation models (FMs), crop disease datasets are now processed
in fundamentally new ways. Unlike traditional neural networks, FMs integrate
visual and textual data, interpret symptoms in text, reason about
symptom-management relationships, and support interactive QA for growers and
educators. Adaptive and imitation learning in robotics further enables
field-based disease management. This review screened approx. 40 articles on FM
applications for SSDM, focusing on large language models (LLMs) and
vision-language models (VLMs), and discussing their role in adaptive learning
(AL), reinforcement learning (RL), and digital twin frameworks for targeted
spraying. Key findings: (a) FMs are gaining traction with surging literature in
2023-24; (b) VLMs outpace LLMs, with a 5-10x increase in publications; (c) RL
and AL are still nascent for smart spraying; (d) digital twins with RL can
simulate targeted spraying virtually; (e) addressing the sim-to-real gap is
critical for real-world deployment; (f) human-robot collaboration remains
limited, especially in human-in-the-loop approaches where robots detect early
symptoms and humans validate uncertain cases; (g) multi-modal FMs with
real-time feedback will drive next-gen SSDM. For updates and resources, or to
submit papers, code, or datasets, visit
https://github.com/nitin-dominic/AgriPathogenDatabase.
中文标题/摘要
标题:精准农业中特定地点病害和害虫管理的进步:从推理驱动的基础模型到适应性、反馈学习
作物的特定地点病害管理(SSDM)通过机器学习和深度学习(ML和DL)的实时计算机视觉技术得到了迅速发展。研究从手工特征提取演进到大规模自动特征学习。借助基础模型(FMs),作物病害数据集现在以根本不同的方式处理。与传统神经网络不同,FMs整合了视觉和文本数据,解释文本中的症状,推理症状管理关系,并支持种植者和教育者的交互式问答。机器人领域的适应性和模仿学习进一步使基于田间的病害管理成为可能。本文综述了约40篇关于FMs在SSDM应用的文章,重点关注大型语言模型(LLMs)和视觉语言模型(VLMs),并讨论了它们在适应性学习(AL)、强化学习(RL)和数字孪生框架中的作用,以实现精准喷洒。主要发现:(a) FMs在2023-24年获得了越来越多的文献支持;(b) VLMs超越了LLMs,出版物增加了5-10倍;(c) RL和AL在智能喷洒中仍处于初级阶段;(d) 带有RL的数字孪生可以虚拟模拟精准喷洒;(e) 缩小模拟与现实之间的差距对于实际部署至关重要;(f) 人机协作仍然有限,尤其是在人机在环的方法中,机器人检测早期症状,人类验证不确定的案例;(g) 多模态FMs结合实时反馈将推动下一代SSDM。欲获取更新、资源和贡献,请访问https://github.com/nitin-dominic/AgriPathogenDatabase,提交论文、代码或数据集。
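The human-in-the-loop pattern in finding (f), where the robot detects symptoms and a person validates uncertain cases, can be sketched as simple confidence-based routing. This is only an illustration: the function name `route_detections`, the queue names, and the two thresholds are assumptions and do not come from the review.

```python
from typing import Dict, List, Tuple

Detection = Tuple[str, float]  # (plant_id, disease confidence in [0, 1])


def route_detections(detections: List[Detection],
                     low: float = 0.4,
                     high: float = 0.8) -> Dict[str, List[str]]:
    """Split detections into spray, human-review, and skip queues.

    The thresholds are hypothetical; a real system would calibrate them.
    """
    queues: Dict[str, List[str]] = {"spray": [], "review": [], "skip": []}
    for plant_id, confidence in detections:
        if confidence >= high:
            queues["spray"].append(plant_id)    # confident: treat automatically
        elif confidence >= low:
            queues["review"].append(plant_id)   # uncertain: ask a human
        else:
            queues["skip"].append(plant_id)     # likely healthy
    return queues
```

Human decisions on the `review` queue could later serve as labels for the feedback-based learning the review discusses.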
VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree
Authors: Wenlong Li, Yifei Xu, Yuan Rao, Zhenhua Wang, Shuiguang Deng
Venue: NeurIPS 2025 poster
First: 2025-10-26T14:36:15+00:00 · Latest: 2025-10-28T16:57:22+00:00
Comments: NeurIPS 2025 poster
Abstract
Video anomaly detection (VAD) focuses on identifying anomalies in videos.
Supervised methods demand substantial in-domain training data and fail to
deliver clear explanations for anomalies. In contrast, training-free methods
leverage the knowledge reserves and language interactivity of large pre-trained
models to detect anomalies. However, the current fixed-length temporal window
sampling approaches struggle to accurately capture anomalies with varying
temporal spans. Therefore, we propose VADTree, which uses a Hierarchical
Granularity-Aware Tree (HGTree) structure for flexible sampling in VAD. VADTree
leverages the knowledge embedded in a pre-trained Generic Event Boundary
Detection (GEBD) model to characterize potential anomaly event boundaries.
Specifically, VADTree decomposes the video into generic event nodes based on
boundary confidence, and performs adaptive coarse-fine hierarchical structuring
and redundancy removal to construct the HGTree. Then, the multi-dimensional
priors are injected into the visual language models (VLMs) to enhance the
node-wise anomaly perception, and anomaly reasoning for generic event nodes is
achieved via large language models (LLMs). Finally, an inter-cluster node
correlation method is used to integrate the multi-granularity anomaly scores.
Extensive experiments on three challenging datasets demonstrate that VADTree
achieves state-of-the-art performance in training-free settings while
drastically reducing the number of sampled video segments. The code will be
available at https://github.com/wenlongli10/VADTree.
中文标题/摘要
标题:VADTree:基于层次粒度感知树的无训练视频异常检测
视频异常检测(VAD)专注于识别视频中的异常。
监督方法需要大量领域内的训练数据,并且无法为异常提供清晰的解释。相比之下,无训练方法利用大型预训练模型的知识储备和语言互动性来检测异常。然而,当前固定长度的时间窗口采样方法难以准确捕捉具有不同时间跨度的异常。因此,我们提出了VADTree,利用层次粒度感知树(HGTree)结构进行灵活的VAD采样。VADTree利用预训练的通用事件边界检测(GEBD)模型来表征潜在的异常事件边界。具体来说,VADTree基于边界置信度将视频分解为通用事件节点,并进行自适应粗细层次结构构建和冗余去除以构建HGTree。然后,将多维先验注入视觉语言模型(VLMs)以增强节点级别的异常感知,并通过大型语言模型(LLMs)实现通用事件节点的异常推理。最后,使用跨簇节点相关方法整合多粒度异常评分。在三个具有挑战性的数据集上的广泛实验表明,VADTree在无训练设置中实现了最先进的性能,同时大幅减少了采样的视频片段数量。代码将在https://github.com/wenlongli10/VADTree上提供。
Summary / 总结
VADTree is proposed to address the limitations of training-free video anomaly detection methods by utilizing a Hierarchical Granularity-Aware Tree (HGTree) structure. It decomposes videos into generic event nodes and constructs an HGTree for adaptive coarse-fine hierarchical structuring. VADTree leverages a pre-trained Generic Event Boundary Detection (GEBD) model to detect potential anomaly event boundaries and integrates multi-granularity anomaly scores. Experiments show that VADTree outperforms existing methods in training-free settings while reducing the number of sampled video segments.
VADTree 通过利用层次粒度感知树(HGTree)结构来解决无训练视频异常检测方法的局限性。它将视频分解为通用事件节点,并构建 HGTree 进行自适应粗细层次结构化。VADTree 利用预训练的通用事件边界检测(GEBD)模型来检测潜在的异常事件边界,并通过跨簇节点相关性方法整合多粒度异常得分。实验表明,VADTree 在无训练设置中优于现有方法,同时显著减少了采样的视频片段数量。
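The coarse-to-fine decomposition described in the VADTree abstract can be illustrated with a small sketch. It assumes per-frame boundary confidences are already available from some GEBD model. The names `Node`, `split_at`, and `build_tree`, the two thresholds, and the redundancy rule are all illustrative assumptions, not the authors' HGTree construction.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    start: int                      # first frame index (inclusive)
    end: int                        # last frame index (exclusive)
    children: List["Node"] = field(default_factory=list)


def split_at(conf: List[float], start: int, end: int, thr: float) -> List[Node]:
    """Cut [start, end) at every interior frame whose confidence >= thr."""
    cuts = [i for i in range(start + 1, end) if conf[i] >= thr]
    edges = [start] + cuts + [end]
    return [Node(a, b) for a, b in zip(edges, edges[1:])]


def build_tree(conf: List[float],
               coarse_thr: float = 0.8,
               fine_thr: float = 0.5) -> Node:
    """Two-level tree: strong boundaries give coarse nodes, weaker ones refine them."""
    root = Node(0, len(conf))
    for coarse in split_at(conf, 0, len(conf), coarse_thr):
        fine = split_at(conf, coarse.start, coarse.end, fine_thr)
        # Redundancy removal (assumed rule): a single child that equals
        # its parent adds no information, so it is dropped.
        if len(fine) > 1:
            coarse.children = fine
        root.children.append(coarse)
    return root
```

Each node would then be scored by a VLM/LLM, with scores from different levels merged afterwards; that scoring step is outside this sketch.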
TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models
Authors: Jiahao Wang, Mingyue Cheng, Qingyang Mao, Yitong Zhou, Daoyu Wang, Qi Liu, Feiyang Xu, Xin Li
First: 2024-11-24T07:02:32+00:00 · Latest: 2025-10-28T16:23:53+00:00
Abstract
Large language models (LLMs) have demonstrated their effectiveness in
multivariate time series classification (MTSC). Effective adaptation of LLMs
for MTSC necessitates informative data representations. Existing LLM-based
methods directly encode embeddings for time series within the latent space of
LLMs from scratch to align with semantic space of LLMs. Despite their
effectiveness, we reveal that these methods conceal three inherent bottlenecks:
(1) they struggle to encode temporal and channel-specific information in a
lossless manner, both of which are critical components of multivariate time
series; (2) it is very difficult to align the learned representation space with
the semantic space of the LLMs; (3) they require task-specific retraining,
which is both computationally expensive and labor-intensive. To bridge these
gaps, we propose TableTime, which reformulates MTSC as a table understanding
task. Specifically, TableTime introduces the following strategies: (1) convert
multivariate time series into a tabular form, thus minimizing information loss
to the greatest extent; (2) represent tabular time series in text format to
achieve natural alignment with the semantic space of LLMs; (3) design a
reasoning framework that integrates contextual text information, neighborhood
assistance, multi-path inference and problem decomposition to enhance the
reasoning ability of LLMs and realize zero-shot classification. Extensive
experiments performed on 10 representative public datasets from the UEA archive
verify the superiority of TableTime.
中文标题/摘要
标题:TableTime:将多变量时间序列分类重新定义为无需训练的大语言模型表理解
大型语言模型(LLMs)在多变量时间序列分类(MTSC)中展示了其有效性。将LLMs有效适应MTSC需要信息性的数据表示。现有的基于LLM的方法直接从零开始在LLM的潜在空间中编码时间序列嵌入,以与LLM的语义空间对齐。尽管这些方法有效,但我们发现它们隐藏了三个固有的瓶颈:(1)它们难以以无损方式编码时间和通道特定的信息,这两种信息都是多变量时间序列的关键组成部分;(2)学习到的表示空间与LLM的语义空间对齐非常困难;(3)它们需要特定任务的重新训练,这既耗费计算资源又耗时。为了弥合这些差距,我们提出了TableTime,将其重新定义为一个表理解任务。具体而言,TableTime 引入了以下策略:(1)将多变量时间序列转换为表格形式,从而最大限度地减少信息损失;(2)以文本格式表示表格时间序列,以实现自然与LLM语义空间的对齐;(3)设计一个推理框架,结合上下文文本信息、邻域协助、多路径推理和问题分解,以增强LLM的推理能力和实现零样本分类。在UEA存档中的10个公开代表性数据集上进行的广泛实验验证了TableTime的优越性。
Summary / 总结
The research aims to address the limitations of existing large language model (LLM)-based methods in multivariate time series classification (MTSC). TableTime reformulates MTSC as a table understanding task, converting time series into tabular form and representing them in text to align with the semantic space of LLMs. Experiments show that TableTime outperforms existing methods on 10 representative public datasets from the UEA archive, demonstrating superior zero-shot classification capabilities.
TableTime 将多变量时间序列分类重新表述为表格理解任务,解决现有基于大语言模型(LLM)的方法的局限性。它将时间序列转换为表格形式以最小化信息损失,以文本格式表示表格数据以自然地与LLM的语义空间对齐,并设计了一个推理框架以增强LLM的推理能力。在UEA档案中的10个数据集上的实验验证了TableTime在零样本分类中的优越性能。
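Strategies (1) and (2) of TableTime, turning a multivariate series into a table and then into text, can be sketched as follows. The pipe-delimited layout, the function names `series_to_table_text` and `build_prompt`, and the prompt wording are assumptions made for illustration; the abstract does not specify the exact format used.

```python
from typing import List, Sequence


def series_to_table_text(series: Sequence[Sequence[float]],
                         channels: List[str],
                         precision: int = 2) -> str:
    """Render a (time x channel) series as a pipe-delimited text table."""
    if any(len(row) != len(channels) for row in series):
        raise ValueError("every time step needs one value per channel")
    header = "| t | " + " | ".join(channels) + " |"
    rows = [
        f"| {t} | " + " | ".join(f"{v:.{precision}f}" for v in row) + " |"
        for t, row in enumerate(series)
    ]
    return "\n".join([header] + rows)


def build_prompt(table: str, labels: List[str]) -> str:
    """Wrap the table in a zero-shot classification instruction (hypothetical wording)."""
    return (
        "The table below is a multivariate time series; rows are time steps "
        "and columns are channels.\n"
        f"{table}\n"
        f"Classify it as one of: {', '.join(labels)}."
    )
```

Keeping the time index and channel names explicit in the text is what preserves the temporal and channel-specific information the abstract highlights.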
Superpowering Open-Vocabulary Object Detectors for X-ray Vision
Authors: Pablo Garcia-Fernandez, Lorenzo Vaquero, Mingxuan Liu, Feng Xue, Daniel Cores, Nicu Sebe, Manuel Mucientes, Elisa Ricci
Venue: ICCV 2025
First: 2025-03-21T11:54:16+00:00 · Latest: 2025-10-28T15:20:36+00:00
Comments: Accepted at ICCV 2025
Abstract
Open-vocabulary object detection (OvOD) is set to revolutionize security
screening by enabling systems to recognize any item in X-ray scans. However,
developing effective OvOD models for X-ray imaging presents unique challenges
due to data scarcity and the modality gap that prevents direct adoption of
RGB-based solutions. To overcome these limitations, we propose RAXO, a
training-free framework that repurposes off-the-shelf RGB OvOD detectors for
robust X-ray detection. RAXO builds high-quality X-ray class descriptors using
a dual-source retrieval strategy. It gathers relevant RGB images from the web
and enriches them via a novel X-ray material transfer mechanism, eliminating
the need for labeled databases. These visual descriptors replace text-based
classification in OvOD, leveraging intra-modal feature distances for robust
detection. Extensive experiments demonstrate that RAXO consistently improves
OvOD performance, providing an average mAP increase of up to 17.0 points over
base detectors. To further support research in this emerging field, we also
introduce DET-COMPASS, a new benchmark featuring bounding box annotations for
over 300 object categories, enabling large-scale evaluation of OvOD in X-ray.
Code and dataset available at: https://github.com/PAGF188/RAXO.
中文标题/摘要
标题:为X射线视觉增强开放词汇对象检测
开放词汇对象检测(OvOD)有望通过使系统能够识别X射线扫描中的任何物品来革新安全筛查。然而,由于数据稀缺性和模态差距,开发适用于X射线成像的高效OvOD模型面临独特挑战,这阻碍了直接采用基于RGB的解决方案。为克服这些限制,我们提出了一种无需训练的RAXO框架,该框架重新利用现成的RGB OvOD检测器以实现稳健的X射线检测。RAXO使用双源检索策略构建高质量的X射线类别描述符。它从网络收集相关RGB图像并通过一种新颖的X射线材料转移机制丰富它们,从而消除对标注数据库的需要。这些视觉描述符取代了OvOD中的基于文本的分类,利用模态内特征距离实现稳健检测。大量实验表明,RAXO持续提升OvOD性能,相对于基础检测器平均mAP提升高达17.0个百分点。为了进一步支持这一新兴领域的研究,我们还引入了DET-COMPASS基准,该基准包含超过300个对象类别的边界框注释,使OvOD在X射线中的大规模评估成为可能。代码和数据集可在:https://github.com/PAGF188/RAXO/ 获取。
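The idea of replacing text-based classification with visual descriptors and intra-modal feature distances, as described for RAXO, can be sketched with plain cosine similarity. The embeddings here are toy vectors. Averaging embeddings into a prototype and the names `class_descriptor` and `classify` are assumptions, and the X-ray material transfer step is not modeled.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def class_descriptor(embeddings: List[Sequence[float]]) -> List[float]:
    """Average several image embeddings into one class prototype (assumed rule)."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


def classify(region: Sequence[float],
             descriptors: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Label a detected region by its most similar visual descriptor."""
    return max(((name, cosine(region, d)) for name, d in descriptors.items()),
               key=lambda pair: pair[1])
```

Because both the region and the descriptors live in the image-embedding space, the comparison stays within one modality rather than crossing from image to text.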
Iterative Critique-Refine Framework for Enhancing LLM Personalization
Authors: Durga Prasad Maram, Dhruvin Gandhi, Zonghai Yao, Gayathri Akkinapalli, Franck Dernoncourt, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed
First: 2025-10-28T14:36:22+00:00 · Latest: 2025-10-28T14:36:22+00:00
Abstract
Personalized text generation requires models not only to produce coherent
text but also to align with a target user's style, tone, and topical focus.
Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich
profiles with user and neighbor histories, but they stop at generation and
often yield outputs that drift in tone, topic, or style. We present PerFine, a
unified, training-free critique-refine framework that enhances personalization
through iterative, profile-grounded feedback. In each iteration, an LLM
generator produces a draft conditioned on the retrieved profile, and a critic
LLM - also conditioned on the same profile - provides structured feedback on
tone, vocabulary, sentence structure, and topicality. The generator then
revises, while a novel knockout strategy retains the stronger draft across
iterations. We further study additional inference-time strategies such as
Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp,
Goodreads, and Amazon datasets, PerFine consistently improves personalization
over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5
refinement iterations, and scalability with increasing critic size. These
results highlight that post-hoc, profile-aware feedback offers a powerful
paradigm for personalized LLM generation that is both training-free and
model-agnostic.
中文标题/摘要
标题:迭代批判-完善框架以增强LLM个性化
个性化文本生成不仅要求模型生成连贯的文本,还要求与目标用户的风格、语气和主题焦点相一致。现有的检索增强方法,如LaMP和PGraphRAG,通过用户和邻居历史丰富个人资料,但它们仅停留在生成阶段,往往导致输出在语气、主题或风格上漂移。我们提出PerFine,这是一种统一的、无需训练的批判-完善框架,通过迭代、基于个人资料的反馈来增强个性化。在每次迭代中,LLM生成器根据检索到的个人资料生成草稿,而另一个也基于相同个人资料的批判LLM提供结构化的反馈,包括语气、词汇、句子结构和主题性。生成器随后进行修订,而一种新颖的淘汰策略保留了更强的草稿。我们还研究了在推理时的其他策略,如Best-of-N和主题提取,以平衡质量和效率。在Yelp、Goodreads和Amazon数据集上,PerFine在GEval上的一致性改进超过了PGraphRAG,稳态改进在3-5次完善迭代中持续进行,并且随着批判者规模的增加而可扩展。这些结果表明,事后、基于个人资料的反馈为个性化LLM生成提供了一种强大的范式,该范式既无需训练又模型无关。
Summary / 总结
The research aims to enhance the personalization of text generation by ensuring coherence and alignment with a user's style, tone, and topic. PerFine, a critique-refine framework, iteratively generates drafts and receives structured feedback from a critic model, which then refines the drafts. Across Yelp, Goodreads, and Amazon datasets, PerFine outperforms PGraphRAG with a 7-13% improvement in GEval scores, showing consistent gains over 3-5 iterations and scalability with larger critic models.
研究旨在通过确保连贯性和与用户风格、语气和主题的对齐来增强文本生成的个性化。PerFine是一种批判-修正框架,通过迭代生成草稿并从批评模型接收结构化反馈,然后修正草稿。在Yelp、Goodreads和Amazon数据集上,PerFine在GEval评分上优于PGraphRAG,提高了7-13%,并且在3-5次迭代中表现出一致的改进,并且随着批评模型规模的增加而具有可扩展性。
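The PerFine loop of generate, critique, revise, and knockout can be outlined as below. The three callables stand in for LLM calls, and their signatures, along with the name `critique_refine`, are hypothetical. How the paper decides which draft is stronger is not stated in the abstract, so `pick_better` is left as a pluggable assumption.

```python
from typing import Callable

# Hypothetical interfaces standing in for LLM calls.
Generator = Callable[[str, str, str], str]  # (profile, previous_draft, feedback) -> draft
Critic = Callable[[str, str], str]          # (profile, draft) -> structured feedback
Judge = Callable[[str, str, str], str]      # (profile, draft_a, draft_b) -> stronger draft


def critique_refine(profile: str,
                    generate: Generator,
                    critique: Critic,
                    pick_better: Judge,
                    iterations: int = 3) -> str:
    """Iteratively revise a draft, keeping the stronger one each round."""
    best = generate(profile, "", "")          # initial draft, no feedback yet
    for _ in range(iterations):
        feedback = critique(profile, best)    # critic sees the same profile
        candidate = generate(profile, best, feedback)
        # Knockout: the revision replaces the incumbent only if it wins.
        best = pick_better(profile, best, candidate)
    return best
```

The knockout step means a bad revision cannot make the final output worse than an earlier draft, given a reliable judge.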
Mano Technical Report
Authors: Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang
First: 2025-09-22T03:13:58+00:00 · Latest: 2025-10-28T14:31:14+00:00
Abstract
Graphical user interfaces (GUIs) are the primary medium for human-computer
interaction, yet automating GUI interactions remains challenging due to the
complexity of visual elements, dynamic environments, and the need for
multi-step reasoning. Existing methods based on vision-language models (VLMs)
often suffer from limited resolution, domain mismatch, and insufficient
sequential decision-making capability. To address these issues, we propose Mano,
a robust GUI agent built upon a multi-modal foundation model pre-trained on
extensive web and computer system data. Our approach integrates a novel
simulated environment for high-fidelity data generation, a three-stage training
pipeline (supervised fine-tuning, offline reinforcement learning, and online
reinforcement learning), and a verification module for error recovery. Mano
demonstrates state-of-the-art performance on multiple GUI benchmarks, including
Mind2Web and OSWorld, achieving significant improvements in success rate and
operational accuracy. Our work provides new insights into the effective
integration of reinforcement learning with VLMs for practical GUI agent
deployment, highlighting the importance of domain-specific data, iterative
training, and holistic reward design.
中文标题/摘要
标题:Mano 技术报告
图形用户界面(GUI)是人机交互的主要媒介,但由于视觉元素的复杂性、动态环境以及多步推理的需求,自动化GUI交互仍然具有挑战性。现有的基于视觉-语言模型(VLMs)的方法往往受到分辨率有限、领域不匹配和序列决策能力不足的限制。为了解决这些问题,我们提出了一种名为Mano的稳健的GUI代理,该代理基于在大量网络和计算机系统数据上预训练的多模态基础模型构建。我们的方法结合了一个新颖的模拟环境以生成高保真数据、三阶段训练管道(监督微调、离线强化学习和在线强化学习)以及一个验证模块以实现错误恢复。Mano在多个GUI基准测试中表现出最先进的性能,包括Mind2Web和OSWorld,显著提高了成功率和操作准确性。我们的工作为强化学习与VLMs的有效集成提供了新的见解,强调了领域特定数据、迭代训练和整体奖励设计的重要性。
Summary / 总结
The research aims to improve the automation of graphical user interface (GUI) interactions by addressing the limitations of existing vision-language models. Mano, a robust GUI agent, is proposed, leveraging a multi-modal foundation model pre-trained on extensive web and computer system data. It uses a three-stage training pipeline and a verification module to enhance performance. Mano shows state-of-the-art results on benchmarks like Mind2Web and OSWorld, with significant improvements in success rate and operational accuracy.
论文针对自动化图形用户界面(GUI)交互的挑战,如复杂性和动态环境。它提出了Mano,一种基于多模态基础模型的稳健GUI代理,包括数据生成的模拟环境、三阶段训练管道和验证模块。Mano在Mind2Web和OSWorld等基准测试中表现出色,显著提高了成功率和操作准确性。这项工作强调了领域特定数据、迭代训练和整体奖励设计对于实际GUI代理部署的重要性。
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Authors: Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, Zhiyong Wu, Zhuosheng Zhang, Ben Kao, Lingpeng Kong
First: 2025-10-28T13:22:39+00:00 · Latest: 2025-10-28T13:22:39+00:00
Comments: work in progress
Abstract
Computer-using agents powered by Vision-Language Models (VLMs) have
demonstrated human-like capabilities in operating digital environments like
mobile platforms. While these agents hold great promise for advancing digital
automation, their potential for unsafe operations, such as system compromise
and privacy leakage, is raising significant concerns. Detecting these safety
concerns across the vast and complex operational space of mobile environments
presents a formidable challenge that remains critically underexplored. To
establish a foundation for mobile agent safety research, we introduce
MobileRisk-Live, a dynamic sandbox environment accompanied by a safety
detection benchmark comprising realistic trajectories with fine-grained
annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety
detection framework that synergistically combines a Formal Verifier for
detecting explicit system-level violations with a VLM-based Contextual Judge
for assessing contextual risks and agent actions. Experiments show that
OS-Sentinel achieves 10%-30% improvements over existing approaches across
multiple metrics. Further analysis provides critical insights that foster the
development of safer and more reliable autonomous mobile agents.
中文标题/摘要
标题:OS-Sentinel:通过混合验证在现实工作流中提升移动GUI代理的安全性
由视觉-语言模型(VLMs)驱动的计算机使用代理在操作数字环境如移动平台方面展示了类似人类的能力。尽管这些代理在推进数字自动化方面具有巨大潜力,但它们进行不安全操作的可能性,如系统破坏和隐私泄露,引发了重大担忧。在移动环境复杂且庞大的操作空间中检测这些安全问题是一项艰巨的挑战,目前仍严重未被探索。为了建立移动代理安全研究的基础,我们引入了MobileRisk-Live,一个动态沙盒环境,附带一个包含现实轨迹和细粒度注释的安全检测基准。在此基础上,我们提出了OS-Sentinel,一种新颖的混合安全检测框架,该框架将形式验证器与基于VLM的上下文评估器相结合,用于检测系统级违规并评估上下文风险和代理行为。实验结果显示,OS-Sentinel在多个指标上比现有方法提高了10%-30%。进一步的分析提供了关键见解,促进了更安全、更可靠的自主移动代理的发展。
Summary / 总结
The research aims to enhance the safety of mobile GUI agents powered by Vision-Language Models (VLMs) by addressing the risk of unsafe operations such as system compromise and privacy leakage. The study introduces OS-Sentinel, a hybrid safety detection framework that combines a Formal Verifier for explicit system-level violations with a VLM-based Contextual Judge for contextual risk assessment. Experiments demonstrate that OS-Sentinel outperforms existing methods by 10%-30% across multiple metrics, providing valuable insights for developing safer autonomous mobile agents.
研究旨在通过解决潜在的不安全操作问题,增强由视觉-语言模型(VLM)驱动的移动GUI代理的安全性。研究引入了OS-Sentinel,这是一种结合了形式验证器和基于VLM的上下文评估器的混合安全检测框架,用于检测系统级违规和评估上下文风险。实验表明,OS-Sentinel在多个指标上比现有方法提高了10%-30%,为开发更安全可靠的移动代理提供了宝贵的见解。
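A hybrid check in the spirit of OS-Sentinel pairs deterministic rules with a model-based judgment. In the sketch below, the `Step` type, the `FORBIDDEN` rule set, the threshold, and the choice to flag a trajectory when either component objects are all assumptions; the abstract does not say how the two verdicts are combined.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Step:
    action: str
    target: str


# Hypothetical hard rules: (action, target) pairs that are never allowed.
FORBIDDEN = {("delete", "/system"), ("grant", "root")}


def formal_check(trajectory: List[Step]) -> Optional[Step]:
    """Return the first step that violates a hard rule, or None."""
    for step in trajectory:
        if (step.action, step.target) in FORBIDDEN:
            return step
    return None


def is_unsafe(trajectory: List[Step],
              contextual_judge: Callable[[List[Step]], float],
              threshold: float = 0.5) -> bool:
    """Flag a trajectory if a rule fires or the judge's risk score is high."""
    if formal_check(trajectory) is not None:
        return True                      # explicit system-level violation
    return contextual_judge(trajectory) >= threshold
```

Running the cheap rule check first avoids a model call whenever an explicit violation already settles the verdict.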
Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs
Authors: Xuannan Liu, Zekun Li, Zheqi He, Peipei Li, Shuhan Xia, Xing Cui, Huaibo Huang, Xi Yang, Ran He
Venue: NeurIPS 2025
First: 2025-05-17T05:06:38+00:00 · Latest: 2025-10-28T12:44:07+00:00
Comments: Accepted by NeurIPS 2025 Dataset and Benchmark Track, Project page:
https://liuxuannan.github.io/Video-SafetyBench.github.io/
Abstract
The increasing deployment of Large Vision-Language Models (LVLMs) raises
safety concerns under potential malicious inputs. However, existing multimodal
safety evaluations primarily focus on model vulnerabilities exposed by static
image inputs, ignoring the temporal dynamics of video that may induce distinct
safety risks. To bridge this gap, we introduce Video-SafetyBench, the first
comprehensive benchmark designed to evaluate the safety of LVLMs under
video-text attacks. It comprises 2,264 video-text pairs spanning 48
fine-grained unsafe categories, each pairing a synthesized video with either a
harmful query, which contains explicit malice, or a benign query, which appears
harmless but triggers harmful behavior when interpreted alongside the video. To
generate semantically accurate videos for safety evaluation, we design a
controllable pipeline that decomposes video semantics into subject images (what
is shown) and motion text (how it moves), which jointly guide the synthesis of
query-relevant videos. To effectively evaluate uncertain or borderline harmful
outputs, we propose RJScore, a novel LLM-based metric that incorporates the
confidence of judge models and human-aligned decision threshold calibration.
Extensive experiments show that benign-query video composition achieves average
attack success rates of 67.2%, revealing consistent vulnerabilities to
video-induced attacks. We believe Video-SafetyBench will catalyze future
research into video-based safety evaluation and defense strategies.
中文标题/摘要
标题:Video-SafetyBench:视频LVLM安全性评估基准
随着大型视觉-语言模型(LVLMs)的广泛应用,潜在恶意输入可能引发的安全问题日益突出。然而,现有的多模态安全性评估主要集中在静态图像输入暴露的模型漏洞上,忽视了视频中的时间动态可能带来的独特安全风险。为解决这一问题,我们提出了Video-SafetyBench,这是首个旨在评估视频-文本攻击下LVLMs安全性的全面基准。它包含2,264个视频-文本配对,覆盖48个细粒度的不安全类别,每个配对包含一个合成视频和一个有害查询或一个看似无害但实际上与视频结合后会引发有害行为的良性查询。为了生成适合安全评估的语义准确视频,我们设计了一个可控的流水线,将视频语义分解为主题图像(显示什么)和运动文本(如何移动),两者共同指导生成与查询相关的视频。为了有效评估不确定或边缘有害输出,我们提出了RJScore,这是一种新颖的基于LLM的度量标准,结合了法官模型的信心和人类对决策阈值的校准。大量实验表明,良性查询视频组合的平均攻击成功率达到了67.2%,揭示了LVLMs对视频诱导攻击的一致性漏洞。我们相信Video-SafetyBench将促进未来基于视频的安全评估和防御策略的研究。
Summary / 总结
The paper introduces Video-SafetyBench, a benchmark for evaluating the safety of Large Vision-Language Models (LVLMs) under video-text attacks. It includes 2,264 video-text pairs covering 48 unsafe categories, with each pair consisting of a synthesized video and either a harmful or benign query. The study reveals that benign-query videos achieve an average attack success rate of 67.2%, highlighting LVLM vulnerabilities to video-induced attacks. The benchmark aims to advance research in video-based safety evaluation and defense strategies.
Video-SafetyBench 是一个基准,旨在评估大型视觉-语言模型(LVLM)在视频-文本攻击下的安全性,填补了现有跨模态安全评估主要关注静态图像的空白。它包含2,264个视频-文本对,涵盖48个不安全类别,并使用可控的管道根据主题图像和运动文本合成视频。基准还引入了RJScore,这是一种新型的基于LLM的度量标准,用于评估不确定或边缘有害的输出。实验表明,良性查询视频组合的平均攻击成功率达到了67.2%,揭示了LVLM对视频诱导攻击的一致性漏洞。
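Attack success rates such as the 67.2% reported for Video-SafetyBench are, in general, the fraction of responses judged harmful. The sketch below computes that rate per query type from judge confidences and a decision threshold. It is not RJScore itself: the record format, the function name, and the default threshold are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attack_success_rates(records: Iterable[Tuple[str, float]],
                         threshold: float = 0.5) -> Dict[str, float]:
    """Per-query-type share of responses judged harmful.

    Each record is (query_type, judge confidence that the output is harmful).
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for query_type, confidence in records:
        totals[query_type] += 1
        hits[query_type] += int(confidence >= threshold)
    return {key: hits[key] / totals[key] for key in totals}
```

Moving the threshold trades false alarms against missed harmful outputs, which is why the abstract stresses calibrating it against human judgments.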
Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning
Authors: Amit Peleg, Naman Deep Singh, Matthias Hein
Venue: NeurIPS 2025
First: 2025-05-30T10:04:00+00:00 · Latest: 2025-10-28T12:08:40+00:00
Comments: Accepted at NeurIPS 2025
Abstract
Vision-language models like CLIP have demonstrated remarkable zero-shot
capabilities in classification and retrieval. However, these models often
struggle with compositional reasoning - the ability to understand the
relationships between concepts. A recent benchmark, SugarCrepe++, reveals that
previous works on improving compositionality have mainly improved lexical
sensitivity but neglected semantic understanding. In addition, downstream
retrieval performance often deteriorates, although one would expect that
improving compositionality should enhance retrieval. In this work, we introduce
CLIC (Compositionally-aware Learning in CLIP), a fine-tuning method based on a
novel training technique combining multiple images and their associated
captions. CLIC improves compositionality across architectures as well as
differently pre-trained CLIP models, both in terms of lexical and semantic
understanding, and achieves consistent gains in retrieval performance. This
even applies to the recent CLIPS, which achieves SOTA retrieval performance.
Nevertheless, the short fine-tuning with CLIC leads to an improvement in
retrieval and to the best compositional CLIP model on SugarCrepe++. All our
models and code are available at https://clic-compositional-clip.github.io
中文标题/摘要
标题:利用高效微调提升CLIP的组合理解能力
像CLIP这样的视觉-语言模型在分类和检索方面展示了惊人的零样本能力。然而,这些模型在组合推理方面常常遇到困难——即理解概念间关系的能力。最近的基准测试SugarCrepe++表明,之前提高组合性的研究主要提高了词汇敏感性,但忽视了语义理解。此外,下游检索性能通常会下降,尽管人们期望提高组合性会增强检索。在本文中,我们引入了CLIC(组合理解中的CLIP学习),这是一种基于结合多张图像及其相关描述的新颖训练技术的微调方法。CLIC在不同架构和不同预训练的CLIP模型中都提高了组合性,从词汇和语义理解上都取得了改进,并且在检索性能上也实现了持续的提升。即使对于最近的CLIPS,它也达到了SOTA的检索性能。然而,使用CLIC进行短暂微调仍能提高检索性能,并在SugarCrepe++上实现了最佳的组合CLIP模型。我们的所有模型和代码均可在https://clic-compositional-clip.github.io获取
Summary / 总结
This work addresses the limitation of CLIP models in compositional reasoning, which is crucial for understanding relationships between concepts. The authors introduce CLIC, a fine-tuning method that combines multiple images and their captions to enhance both lexical and semantic understanding. CLIC consistently improves retrieval performance and achieves the best compositional CLIP model on the SugarCrepe++ benchmark, even with short fine-tuning. The method is applicable across different CLIP architectures and pre-trained models.
该研究通过引入CLIC方法,结合多张图像及其描述,解决了CLIP模型在组成性推理方面的局限性。CLIC在不同CLIP架构和预训练模型中提高了词汇和语义理解能力,一致地提升了检索性能,甚至对最新的CLIPS模型也是如此。CLIC的短时微调过程增强了检索性能,并在SugarCrepe++基准测试中达到了最佳的组成性CLIP模型。
What do vision-language models see in the context? Investigating multimodal in-context learning
Authors: Gabriel O. dos Santos, Esther Colombini, Sandra Avila
First: 2025-10-28T11:55:24+00:00 · Latest: 2025-10-28T11:55:24+00:00
Abstract
In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks
from demonstration examples without parameter updates. Although it has been
extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs)
remains underexplored. In this work, we present a systematic study of ICL in
VLMs, evaluating seven models spanning four architectures on three image
captioning benchmarks. We analyze how prompt design, architectural choices, and
training strategies influence multimodal ICL. To our knowledge, we are the
first to analyze how attention patterns in VLMs vary with an increasing number
of in-context demonstrations. Our results reveal that training on image-text
interleaved data enhances ICL performance but does not imply effective
integration of visual and textual information from demonstration examples. In
contrast, instruction tuning improves instruction-following but can reduce
reliance on in-context demonstrations, suggesting a trade-off between
instruction alignment and in-context adaptation. Attention analyses further
show that current VLMs primarily focus on textual cues and fail to leverage
visual information, suggesting a limited capacity for multimodal integration.
These findings highlight key limitations in the ICL abilities of current VLMs
and provide insights for enhancing their ability to learn from multimodal
in-context examples.
中文标题/摘要
标题:视觉语言模型在上下文中的视见是什么?探究多模态上下文学习
上下文学习(ICL)使大型语言模型(LLMs)能够在不更新参数的情况下从演示示例中学习任务。尽管它在LLMs中的研究非常广泛,但在视觉语言模型(VLMs)中的有效性仍然未被充分探索。在本研究中,我们对VLMs中的ICL进行了系统研究,评估了四种架构下的七个模型在三个图像字幕基准上的表现。我们分析了提示设计、架构选择和训练策略如何影响多模态ICL。据我们所知,我们是第一个分析视觉语言模型中的注意力模式如何随着上下文演示数量的增加而变化的研究。我们的结果表明,使用图像-文本交错数据进行训练可以提高ICL性能,但并不意味着能够有效地整合演示示例中的视觉和文本信息。相反,指令调优可以提高指令遵循能力,但可能会减少对上下文演示的依赖,这表明指令对齐和上下文适应之间存在权衡。进一步的注意力分析表明,当前的VLMs主要关注文本线索,未能利用视觉信息,这表明它们在多模态整合方面的能力有限。这些发现突显了当前VLMs在ICL能力方面的关键局限性,并为提高其从多模态上下文示例中学习的能力提供了见解。
Summary / 总结
This study investigates in-context learning (ICL) in Vision-Language Models (VLMs) by evaluating seven models across four architectures on three image captioning benchmarks. The research finds that training on image-text interleaved data enhances ICL performance but does not necessarily integrate visual and textual information effectively. Instruction tuning improves instruction-following but may reduce reliance on in-context demonstrations, indicating a trade-off between instruction alignment and in-context adaptation. Attention analyses reveal that current VLMs primarily focus on textual cues and underutilize visual information, highlighting limitations in their multimodal integration capabilities.
本研究通过在三个图像字幕基准上评估七种模型(涵盖四种架构)来探讨Vision-Language模型(VLM)中的在上下文学习(ICL)。研究发现,使用图像-文本交错数据进行训练可以提升ICL性能,但并不能有效地将演示中的视觉和文本信息整合起来。指令调优虽然提高了指令遵循能力,但减少了对上下文演示的依赖,表明指令对齐与上下文适应之间存在权衡。注意力分析显示,当前的VLM主要依赖文本线索,而未能充分利用视觉信息,这表明需要增强其多模态整合能力。
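Multimodal in-context learning for captioning amounts to placing a few (image, caption) demonstrations before the query image. The sketch below builds such an interleaved prompt as text. The `[IMG:...]` placeholder and the function name `build_icl_prompt` are hypothetical; real VLMs each define their own image tokens and chat templates.

```python
from typing import List, Tuple


def build_icl_prompt(demos: List[Tuple[str, str]],
                     query_image: str,
                     shots: int) -> str:
    """Interleave `shots` (image, caption) demonstrations before the query."""
    if shots < 0 or shots > len(demos):
        raise ValueError("shots must be between 0 and len(demos)")
    lines = [f"[IMG:{path}] Caption: {caption}" for path, caption in demos[:shots]]
    lines.append(f"[IMG:{query_image}] Caption:")   # the model completes this line
    return "\n".join(lines)
```

Sweeping `shots` from 0 upward is one way to measure how much a model actually relies on the demonstrations.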
Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning
Authors: Ivica Dimitrovski, Vlatko Spasev, Ivan Kitanovski
First: 2025-10-28T11:39:22+00:00 · Latest: 2025-10-28T11:39:22+00:00
Abstract
Remote sensing applications increasingly rely on deep learning for scene
classification. However, their performance is often constrained by the scarcity
of labeled data and the high cost of annotation across diverse geographic and
sensor domains. While recent vision-language models like CLIP have shown
promise by learning transferable representations at scale by aligning visual
and textual modalities, their direct application to remote sensing remains
suboptimal due to significant domain gaps and the need for task-specific
semantic adaptation. To address this critical challenge, we systematically
explore prompt learning as a lightweight and efficient adaptation strategy for
few-shot remote sensing image scene classification. We evaluate several
representative methods, including Context Optimization, Conditional Context
Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating
Constraints. These approaches reflect complementary design philosophies: from
static context optimization to conditional prompts for enhanced generalization,
multi-modal prompts for joint vision-language adaptation, and semantically
regularized prompts for stable learning without forgetting. We benchmark these
prompt-learning methods against two standard baselines: zero-shot CLIP with
hand-crafted prompts and a linear probe trained on frozen CLIP features.
Through extensive experiments on multiple benchmark remote sensing datasets,
including cross-dataset generalization tests, we demonstrate that prompt
learning consistently outperforms both baselines in few-shot scenarios.
Notably, Prompting with Self-Regulating Constraints achieves the most robust
cross-domain performance. Our findings underscore prompt learning as a scalable
and efficient solution for bridging the domain gap in satellite and aerial
imagery, providing a strong foundation for future research in this field.
中文标题/摘要
标题:基于CLIP和提示学习的少量样本遥感图像场景分类
遥感应用越来越多地依赖深度学习进行场景分类。然而,其性能往往受限于标注数据的稀缺性和跨不同地理和传感器领域的高标注成本。虽然像CLIP这样的视觉-语言模型通过大规模地对齐视觉和文本模态来学习可迁移的表示显示出前景,但它们直接应用于遥感领域仍因领域差距显著和需要特定任务的语义适应而效果不佳。为应对这一关键挑战,我们系统地探索了提示学习作为一种轻量级和高效的适应策略,用于少量样本遥感图像场景分类。我们评估了几种代表性方法,包括上下文优化、条件上下文优化、多模态提示学习和自调节约束提示。这些方法反映了互补的设计理念:从静态上下文优化到增强泛化的条件提示,从联合视觉-语言适应的多模态提示到通过语义正则化提示实现稳定学习而不遗忘。我们通过在多个基准遥感数据集上的广泛实验,包括跨数据集泛化测试,证明提示学习在少量样本场景中始终优于两种标准基线:零样本CLIP配以手工设计的提示和在冻结CLIP特征上训练的线性探针。我们的研究结果强调了提示学习作为一种可扩展和高效的解决方案,用于弥合卫星和航空图像的领域差距,为该领域的未来研究奠定了坚实的基础。
Summary / 总结
The paper addresses the challenge of few-shot remote sensing image scene classification by exploring prompt learning as an efficient adaptation strategy. Various methods such as Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints were evaluated. The study demonstrates that prompt learning methods outperform zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features, with Prompting with Self-Regulating Constraints showing the most robust cross-domain performance.
论文通过利用CLIP和提示学习来解决遥感图像场景的少样本分类问题,评估了包括上下文优化、条件上下文优化、多模态提示学习和自调节约束提示在内的多种提示学习方法。这些方法与手工构建提示的零样本CLIP和冻结CLIP特征训练的线性探针进行了比较。实验结果显示,提示学习方法优于基线方法,其中自调节约束提示在跨域性能上表现最为稳健。
VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
Authors: Walid Bousselham, Hilde Kuehne, Cordelia Schmid
First: 2025-10-27T16:32:12+00:00 · Latest: 2025-10-28T11:09:37+00:00
Comments: www.walidbousselham.com/VOLD/
Abstract
Training vision-language models (VLMs) for complex reasoning remains a
challenging task, i.a. due to the scarcity of high-quality image-text reasoning
data. Conversely, text-based reasoning resources are abundant and scalable, but
it is still an open question how to leverage them for VLM reasoning. To
address this problem, we propose VOLD, a framework to transfer reasoning
capabilities from text-only teacher models to VLM student models. To this end,
VOLD combines reinforcement learning via Group Relative Policy Optimization
(GRPO) with on-policy distillation, which allows the student reasoning traces
to be guided by the teacher model, resulting in a significant gain over using
GRPO alone. We further show that a cold-start alignment is essential for an
effective transfer during the online training phase in this scenario and that
without sufficient distributional alignment between teacher and student,
on-policy distillation fails to provide meaningful guidance. We evaluate VOLD
across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and
LogicVista, showing that VOLD outperforms the baseline model significantly and
improves over the state of the art by a margin. Our ablation shows the
importance of a cold-start alignment via SFT for on-policy distillation with a
text-only teacher.
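As a rough illustration of the group-relative part of GRPO mentioned in the abstract, the sketch below normalizes the rewards of a group of responses sampled for the same prompt by the group's mean and standard deviation. It covers only this advantage computation, not VOLD's on-policy distillation term or the policy update; the function name and the `eps` value are assumptions made for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    # eps (an assumed constant) avoids division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four sampled responses with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean receive positive advantages and the rest negative ones, so no separate value network is needed for this step.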
中文标题/摘要
标题:VOLD:通过策略优化蒸馏从语言模型向视觉语言模型的知识迁移
训练视觉语言模型(VLMs)进行复杂推理仍然是一个具有挑战性的任务,主要是由于高质量图像文本推理数据的稀缺。相反,基于文本的推理资源丰富且可扩展,但如何利用它们来增强VLM推理仍然是一个开放的问题。为了解决这个问题,我们提出了VOLD,这是一种将仅文本教师模型的推理能力转移到VLM学生模型的框架。为此,VOLD结合了组相对策略优化(GRPO)的强化学习与策略优化蒸馏,这使得学生推理轨迹能够受到教师模型的引导,从而显著优于单独使用GRPO。我们进一步表明,在这种场景下的在线训练阶段,冷启动对齐对于有效的知识迁移至关重要,如果没有足够的分布对齐,策略优化蒸馏将无法提供有意义的指导。我们在包括MMMU-Pro、MathVision、MathVista和LogicVista在内的多种基准上评估了VOLD,结果显示VOLD显著优于基线模型,并且在某些方面超越了现有技术。我们的消融实验表明,通过仅文本教师进行的SFT对于策略优化蒸馏的冷启动对齐至关重要。
ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
Authors: Juntian Zhang, Song Jin, Chuanqi Cheng, Yuhan Liu, Yankai Lin, Xun Zhang, Yufei Zhang, Fei Jiang, Guojun Yin, Wei Lin, Rui Yan
First: 2025-10-28T10:42:57+00:00 · Latest: 2025-10-28T10:42:57+00:00
Abstract
The limited capacity for fine-grained visual perception presents a critical
bottleneck for Vision-Language Models (VLMs) in real-world applications.
Addressing this is challenging due to the scarcity of high-quality data and the
limitations of existing methods: supervised fine-tuning (SFT) often compromises
general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual
reasoning over visual perception. To bridge this gap, we propose a novel
two-stage task that structures visual perception learning as a coarse-to-fine
progressive process. Based on this task formulation, we develop ViPER, a
self-bootstrapping framework specifically designed to enable iterative
evolution through self-critiquing and self-prediction. By synergistically
integrating image-level and instance-level reconstruction with a two-stage
reinforcement learning strategy, ViPER establishes a closed-loop training
paradigm, where internally synthesized data directly fuel the enhancement of
perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the
Qwen-Viper series. With an average gain of 1.7% on seven comprehensive
benchmarks spanning various tasks and up to 6.0% on fine-grained perception,
Qwen-Viper consistently demonstrates superior performance across different
vision-language scenarios while maintaining generalizability. Beyond enabling
self-improvement in perceptual capabilities, ViPER provides concrete evidence
for the reciprocal relationship between generation and understanding, a
breakthrough to developing more autonomous and capable VLMs.
中文标题/摘要
标题:ViPER:赋能视觉感知能力自我演进的视觉语言模型
视觉语言模型(VLMs)在现实应用中面临细粒度视觉感知能力有限的瓶颈。由于高质量数据稀缺和现有方法的局限性,解决这一问题颇具挑战性:监督微调(SFT)通常会牺牲通用能力,而强化微调(RFT)则侧重于文本推理而非视觉感知。为弥合这一差距,我们提出了一种新的两阶段任务,将视觉感知学习构建成从粗到细的渐进过程。基于此任务框架,我们开发了ViPER,这是一种自强化框架,旨在通过自我批判和自我预测实现迭代进化。通过将图像级和实例级重建与两阶段强化学习策略协同整合,ViPER建立了一个闭环训练范式,其中内部合成数据直接促进了感知能力的提升。应用于Qwen2.5-VL家族,ViPER产生了Qwen-Viper系列。在七个涵盖不同任务的综合基准测试中,Qwen-Viper平均提高了1.7%,在细粒度感知方面最高提高了6.0%,在不同视觉语言场景中表现出优越性能并保持了泛化能力。除了赋能感知能力的自我提升,ViPER还为生成与理解之间的相互关系提供了实证证据,为开发更自主和强大的VLMs开辟了新途径。
Summary / 总结
The paper proposes ViPER, a self-bootstrapping framework for Vision-Language Models (VLMs) to enhance their visual perception abilities. It addresses the challenge of limited fine-grained visual perception by structuring learning as a coarse-to-fine process and using a two-stage reinforcement learning strategy. ViPER integrates image-level and instance-level reconstruction to create a closed-loop training paradigm, enabling iterative self-improvement. Experiments on the Qwen2.5-VL family show that the resulting Qwen-Viper series achieves an average gain of 1.7% on seven benchmarks and up to 6.0% on fine-grained perception tasks, demonstrating superior performance and generalizability in various vision-language scenarios.
论文提出了ViPER框架,用于增强视觉语言模型(VLMs)的视觉感知能力。ViPER采用两阶段任务逐步提升视觉感知,结合图像级和实例级重建与强化学习。这种方法在七个基准测试中带来了平均1.7%的提升,最高可达6.0%的细粒度感知任务提升,同时保持了泛化能力。
Training-free Source Attribution of AI-generated Images via Resynthesis
Authors: Pietro Bongini, Valentina Molinari, Andrea Costanzo, Benedetta Tondi, Mauro Barni
First: 2025-10-28T10:39:04+00:00 · Latest: 2025-10-28T10:39:04+00:00
Comments: 14 pages, 4 figures, 1 table, accepted at "The 17th IEEE
INTERNATIONAL WORKSHOP ON INFORMATION FORENSICS AND SECURITY (WIFS2025)",
Perth, Australia
Abstract
Synthetic image source attribution is a challenging task, especially in data
scarcity conditions requiring few-shot or zero-shot classification
capabilities. We present a new training-free one-shot attribution method based
on image resynthesis. A prompt describing the image under analysis is
generated, then it is used to resynthesize the image with all the candidate
sources. The image is attributed to the model which produced the resynthesis
closest to the original image in a proper feature space. We also introduce a
new dataset for synthetic image attribution consisting of face images from
commercial and open-source text-to-image generators. The dataset provides a
challenging attribution framework, useful for developing new attribution models
and testing their capabilities on different generative architectures. The
dataset structure allows testing approaches based on resynthesis and comparing
them to few-shot methods. Results from state-of-the-art few-shot approaches and
other baselines show that the proposed resynthesis method outperforms existing
techniques when only a few samples are available for training or fine-tuning.
The experiments also demonstrate that the new dataset is a challenging one and
represents a valuable benchmark for developing and evaluating future few-shot
and zero-shot methods.
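The attribution step described in the abstract (pick the candidate whose resynthesis lies closest to the original in a feature space) can be sketched as a nearest-neighbour lookup. This is a minimal sketch under stated assumptions: the abstract does not name the feature space or the distance, so cosine distance, the toy feature vectors, and the generator names below are all hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def attribute_source(original_feat, resynth_feats):
    """Return the candidate whose resynthesis is closest to the original image."""
    return min(
        resynth_feats,
        key=lambda name: cosine_distance(original_feat, resynth_feats[name]),
    )

# Toy features; in practice they would come from an image feature extractor
# applied to the original image and to each candidate's resynthesis.
original = [0.9, 0.1, 0.3]
resyntheses = {
    "generator_a": [0.1, 0.9, 0.2],  # hypothetical candidate source
    "generator_b": [0.8, 0.2, 0.3],  # hypothetical candidate source
}
predicted = attribute_source(original, resyntheses)
```

Because the decision is a plain distance comparison, no classifier has to be trained, which matches the training-free framing of the method.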
中文标题/摘要
标题:基于重塑的无训练AI生成图像源归属
合成图像源归属是一个具有挑战性的任务,尤其是在数据稀缺条件下需要实现少样本或零样本分类能力。我们提出了一种新的无训练的一次性归属方法,基于图像重塑。生成一个描述待分析图像的提示,然后使用该提示将图像与所有候选源重新合成。图像被归属给在适当特征空间中与原始图像最接近的重塑模型。我们还引入了一个新的合成图像归属数据集,包含来自商业和开源文本到图像生成器的面部图像。该数据集提供了一个具有挑战性的归属框架,有助于开发新的归属模型并测试其在不同生成架构上的能力。数据集结构允许测试基于重塑的方法,并将其与少样本方法进行比较。最先进的少样本方法和其他基线的结果表明,当仅有少量样本用于训练或微调时,提出的重塑方法优于现有技术。实验还表明,新数据集具有挑战性,是开发和评估未来少样本和零样本方法的宝贵基准。
Summary / 总结
The paper addresses the challenge of source attribution for AI-generated images in data-scarce conditions. It introduces a training-free one-shot attribution method using image resynthesis. By generating a prompt and resynthesizing the image with candidate sources, the method attributes the image to the model that produces the closest resynthesis to the original in a specific feature space. Experiments show that this resynthesis method outperforms state-of-the-art few-shot approaches when only a few samples are available, and the new dataset is challenging and valuable for testing few-shot and zero-shot methods.
该论文旨在解决在数据稀缺条件下识别AI生成图像来源的挑战。它提出了一种基于图像重合成的无训练一-shot归属方法。通过为待分析的图像生成提示并使用候选来源重新合成图像,该方法确定了产生与原始图像最接近重合成的来源模型。实验表明,该重合成方法在样本稀缺时优于现有技术,展示了其在少样本和零样本场景中的有效性。该数据集还被证明具有挑战性,并且对于评估未来归属模型具有宝贵的基准作用。
Enabling Near-realtime Remote Sensing via Satellite-Ground Collaboration of Large Vision-Language Models
Authors: Zihan Li, Jiahao Yang, Yuxin Zhang, Zhe Chen, Yue Gao
First: 2025-10-28T09:48:26+00:00 · Latest: 2025-10-28T09:48:26+00:00
Comments: 15 pages, 11 figures
Abstract
Large vision-language models (LVLMs) have recently demonstrated great
potential in remote sensing (RS) tasks (e.g., disaster monitoring) conducted by
low Earth orbit (LEO) satellites. However, their deployment in real-world LEO
satellite systems remains largely unexplored, hindered by limited onboard
computing resources and brief satellite-ground contacts. We propose Grace, a
satellite-ground collaborative system designed for near-realtime LVLM inference
in RS tasks. Accordingly, we deploy a compact LVLM on satellites for realtime
inference, but larger ones on ground stations (GSs) to guarantee end-to-end
performance. Grace comprises two main phases: asynchronous
satellite-GS Retrieval-Augmented Generation (RAG) and a task dispatch
algorithm. First, we distill the knowledge archive of the GS RAG into the satellite
archive with a tailored adaptive update algorithm during the limited satellite-ground
data exchange period. Second, we propose a confidence-based test algorithm that
either processes the task onboard the satellite or offloads it to the GS.
Extensive experiments based on real-world satellite orbital data show that
Grace reduces the average latency by 76-95% compared to state-of-the-art
methods, without compromising inference accuracy.
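The confidence-based dispatch idea can be sketched as a simple threshold rule: answer onboard when the compact model is confident, otherwise offload to the ground station. This is an illustrative sketch, not Grace's actual test algorithm; the threshold value, the `(answer, confidence)` interface, and the toy model are assumptions.

```python
def dispatch(task, onboard_model, threshold=0.8):
    """Answer onboard when the compact model is confident, else offload."""
    answer, confidence = onboard_model(task)
    if confidence >= threshold:
        return {"route": "satellite", "answer": answer}
    # Low confidence: defer to the larger model at the ground station.
    return {"route": "ground_station", "answer": None}

def toy_onboard_model(task):
    # Hypothetical stand-in for a compact LVLM returning (answer, confidence).
    if "flood" in task:
        return ("flood detected", 0.93)
    return ("unknown", 0.41)

confident = dispatch("flood extent in tile 12", toy_onboard_model)
uncertain = dispatch("crop type in tile 7", toy_onboard_model)
```

Raising the threshold sends more tasks to the ground station, trading latency for the larger model's accuracy.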
中文标题/摘要
标题:通过卫星-地面协作实现大型视觉-语言模型的近实时遥感
大型视觉-语言模型(LVLMs)最近在由低地球轨道(LEO)卫星执行的遥感(例如,灾害监测)任务中展示了巨大的潜力。然而,它们在实际LEO卫星系统中的部署仍然鲜有探索,受限于有限的机载计算资源和短暂的卫星-地面通信时间。我们提出Grace,一种为遥感任务设计的卫星-地面协作系统,用于实现近实时的LVLM推理。因此,我们将在卫星上部署紧凑的LVLM以实现实时推理,而在地面站(GSs)上部署更大的模型以保证端到端性能。Grace由两个主要阶段组成,即异步卫星-GS检索增强生成(RAG)和任务调度算法。首先,在有限的卫星-地面数据交换期间,我们使用定制的自适应更新算法将GS RAG的知识库传输到卫星档案。其次,我们提出了一种基于置信度的测试算法,该算法要么在卫星上处理任务,要么将其卸载到GS。基于实际卫星轨道数据的广泛实验表明,与最先进的方法相比,Grace将平均延迟降低了76-95%,而不影响推理准确性。
Summary / 总结
Grace is a satellite-ground collaborative system for near-realtime LVLM inference in remote sensing tasks. It deploys compact LVLMs on satellites for real-time inference and larger models on ground stations to ensure performance. Experiments show that Grace reduces the average latency by 76-95% compared to existing methods while maintaining inference accuracy.
研究旨在通过卫星-地面协作的大视觉-语言模型(LVLM)实现近实时遥感,以解决机载计算资源有限和卫星-地面通信短暂的问题。Grace 系统将紧凑的 LVLM 部署在卫星上进行实时推理,并将较大的模型部署在地面站以确保端到端性能。实验表明,与最先进的方法相比,Grace 将平均延迟降低了 76-95%,同时不牺牲推理准确性。
MC-SJD: Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration
Authors: Junhyuk So, Hyunho Kook, Chaeyeon Jang, Eunhyeok Park
First: 2025-10-28T09:26:27+00:00 · Latest: 2025-10-28T09:26:27+00:00
Abstract
While autoregressive (AR) modeling has recently emerged as a new paradigm in
visual generation, its practical adoption is severely constrained by the slow
inference speed of per-token generation, which often requires thousands of
steps to produce a single sample. To address this challenge, we propose MC-SJD,
a training-free, lossless parallel decoding framework designed to accelerate AR
visual generation by extending the recently introduced Speculative Jacobi
Decoding (SJD). Although SJD shows strong potential for accelerating AR
generation, we demonstrate that token instability across iterations
significantly reduces the acceptance rate, a limitation that primarily arises
from the independent sampling process used during draft token generation. To
overcome this, we introduce MC-SJD, an information-theoretic approach based on
coupling, which substantially accelerates standard SJD by maximizing the
probability of sampling identical draft tokens across consecutive iterations,
all while preserving its lossless property. Remarkably, this method requires
only a single-line modification to the existing algorithm, yet achieves
substantial performance gains, delivering up to a ~4.2x acceleration in image
generation and ~13.3x acceleration in video generation compared to standard AR
decoding, without any degradation in output quality.
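To make the coupling idea concrete, the sketch below shows the textbook maximal coupling of two categorical distributions: it draws a pair (x, y) with x distributed as p and y as q while making x equal y as often as possible, with probability equal to the sum of min(p_i, q_i). It is a generic illustration of the technique, not the authors' one-line modification to SJD; the function names and the toy distributions are assumptions.

```python
import random

def sample_categorical(probs, rng):
    """Draw an index from a categorical distribution by inverse CDF."""
    u = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def maximal_coupling(p, q, rng):
    """Sample (x, y) with x ~ p, y ~ q and P(x == y) as large as possible."""
    x = sample_categorical(p, rng)
    # Keep the same token with probability min(1, q[x] / p[x]).
    if rng.random() * p[x] < q[x]:
        return x, x
    # Otherwise draw y from the normalized residual max(q - p, 0).
    residual = [max(qi - pi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    return x, sample_categorical([r / total for r in residual], rng)

rng = random.Random(0)
p = [0.5, 0.3, 0.2]  # toy draft distribution at one iteration
q = [0.4, 0.4, 0.2]  # toy draft distribution at the next iteration
pairs = [maximal_coupling(p, q, rng) for _ in range(20000)]
agreement = sum(x == y for x, y in pairs) / len(pairs)
overlap = sum(min(pi, qi) for pi, qi in zip(p, q))
```

With independent sampling the two draws would agree far less often than `overlap`, which is the kind of token instability the abstract attributes to standard SJD.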
中文标题/摘要
标题:MC-SJD:最大耦合推测雅可比解码加速自回归视觉生成
虽然自回归(AR)建模最近已成为视觉生成的新范式,但由于每次生成一个词元的缓慢推理速度,其实际应用受到严重限制,通常需要数千步才能生成一个样本。为了解决这一挑战,我们提出了MC-SJD,这是一种无需训练、无损并行解码框架,旨在通过扩展最近引入的推测雅可比解码(SJD)来加速AR视觉生成。尽管SJD在加速AR生成方面显示出强大的潜力,但我们证明,迭代过程中词元的不稳定性显著降低了接受率,这一限制主要源于在草稿词元生成过程中使用的独立采样过程。为克服这一问题,我们引入了MC-SJD,这是一种基于耦合的信息论方法,通过最大化连续迭代中采样相同草稿词元的概率,显著加速了标准SJD,同时保持其无损特性。令人惊讶的是,这种方法只需要对现有算法进行一行修改,就能实现显著的性能提升,图像生成加速高达约4.2倍,视频生成加速高达约13.3倍,而输出质量没有任何下降。
Summary / 总结
The research aims to address the slow inference speed of autoregressive (AR) modeling in visual generation by proposing MC-SJD, a training-free, lossless parallel decoding framework. MC-SJD builds upon Speculative Jacobi Decoding (SJD) to accelerate AR visual generation by maximizing the probability of sampling identical draft tokens across iterations, thereby overcoming the token instability issue. The method achieves up to a 4.2x acceleration in image generation and 13.3x in video generation without compromising output quality, with only a single-line modification to the existing algorithm.
研究旨在通过解决逐令牌生成的缓慢推理速度问题,加速自回归(AR)视觉生成。提出了一个无需训练、无损并行解码框架MC-SJD,以克服Speculative Jacobi Decoding(SJD)中的令牌不稳定性问题。通过最大化连续迭代中采样相同草稿令牌的概率,MC-SJD 实现了图像生成高达4.2倍、视频生成高达13.3倍的加速,同时不降低输出质量。
V-SAT: Video Subtitle Annotation Tool
Authors: Arpita Kundu, Joyita Chakraborty, Anindita Desarkar, Aritra Sen, Srushti Anil Patil, Vishwanathan Raman
First: 2025-10-28T08:34:27+00:00 · Latest: 2025-10-28T08:34:27+00:00
Abstract
The surge of audiovisual content on streaming platforms and social media has
heightened the demand for accurate and accessible subtitles. However, existing
subtitle generation methods, primarily speech-based transcription or OCR-based
extraction, suffer from several shortcomings, including poor synchronization,
incorrect or harmful text, inconsistent formatting, inappropriate reading
speeds, and the inability to adapt to dynamic audio-visual contexts. Current
approaches often address isolated issues, leaving post-editing as a
labor-intensive and time-consuming process. In this paper, we introduce V-SAT
(Video Subtitle Annotation Tool), a unified framework that automatically
detects and corrects a wide range of subtitle quality issues. By combining
Large Language Models (LLMs), Vision-Language Models (VLMs), Image Processing,
and Automatic Speech Recognition (ASR), V-SAT leverages contextual cues from
both audio and video. Subtitle quality improved, with the SUBER score reduced
from 9.6 to 3.54 after resolving all language mode issues and F1-scores of
~0.80 for image mode issues. Human-in-the-loop validation ensures high-quality
results, providing the first comprehensive solution for robust subtitle
annotation.
中文标题/摘要
标题:V-SAT:视频字幕标注工具
流媒体平台和社交媒体上音频视频内容的激增提高了对准确和可访问字幕的需求。然而,现有的字幕生成方法主要依赖于语音转录或OCR提取,存在同步差、错误或有害文本、格式不一致、不合适的朗读速度以及无法适应动态音频视频环境等缺点。当前的方法往往只解决孤立的问题,导致后期编辑工作量大且耗时。本文介绍了一种名为V-SAT(视频字幕标注工具)的统一框架,该框架能够自动检测和纠正多种字幕质量问题。通过结合大型语言模型(LLMs)、视觉语言模型(VLMs)、图像处理和自动语音识别(ASR),V-SAT 利用了来自音频和视频的上下文线索。在解决所有语言模式问题后,SUBER评分从9.6降至3.54,图像模式问题的F1分数约为0.80。人工在环验证确保了高质量的结果,提供了首个全面的稳健字幕标注解决方案。
Summary / 总结
V-SAT is a unified framework designed to improve subtitle quality for audiovisual content by automatically detecting and correcting various issues such as synchronization, text accuracy, and formatting. It integrates LLMs, VLMs, image processing, and ASR to leverage audio and video contextual cues. Experimental results show a significant improvement in the SUBER score from 9.6 to 3.54 after resolving language issues, with F1-scores of approximately 0.80 for image mode issues. Human-in-the-loop validation ensures high-quality results, providing a comprehensive solution for robust subtitle annotation.
论文介绍了V-SAT,这是一种统一框架,用于自动检测和纠正音频视频内容中的各种字幕质量问题。通过整合LLMs、VLMs、图像处理和ASR,V-SAT提高了字幕质量,SUBER分数从9.6降低到3.54,图像模式问题的F1分数约为0.80。人工在环验证确保了高质量的结果,提供了一种全面的 robust 字幕注释解决方案。
Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images
Authors: Jinsol Song, Jiamu Wang, Anh Tien Nguyen, Keunho Byeon, Sangjeong Ahn, Sung Hak Lee, Jin Tae Kwak
Venue: ICCV 2025
First: 2025-08-21T05:40:23+00:00 · Latest: 2025-10-28T08:08:14+00:00
Comments: Accepted to ICCV 2025. Code is available at:
https://github.com/QuIIL/ICCV2025_Ano-NAViLa
Abstract
Anomaly detection in computational pathology aims to identify rare and scarce
anomalies where disease-related data are often limited or missing. Existing
anomaly detection methods, primarily designed for industrial settings, face
limitations in pathology due to computational constraints, diverse tissue
structures, and lack of interpretability. To address these challenges, we
propose Ano-NAViLa, a Normal and Abnormal pathology knowledge-augmented
Vision-Language model for Anomaly detection in pathology images. Ano-NAViLa is
built on a pre-trained vision-language model with a lightweight trainable MLP.
By incorporating both normal and abnormal pathology knowledge, Ano-NAViLa
enhances accuracy and robustness to variability in pathology images and
provides interpretability through image-text associations. Evaluated on two
lymph node datasets from different organs, Ano-NAViLa achieves
state-of-the-art performance in anomaly detection and localization,
outperforming competing models.
中文标题/摘要
标题:正常与异常病理知识增强的视觉-语言模型在病理图像异常检测中的应用
在计算病理学中的异常检测旨在识别罕见且稀缺的异常情况,其中与疾病相关的数据往往有限或缺失。现有的异常检测方法主要针对工业环境设计,由于计算限制、多样的组织结构和缺乏可解释性,在病理学中面临局限性。为了解决这些挑战,我们提出了一种名为Ano-NAViLa的模型,这是一种基于预训练视觉-语言模型的正常与异常病理知识增强的异常检测模型。通过结合正常和异常病理知识,Ano-NAViLa提高了对病理图像变异性的准确性和鲁棒性,并通过图像-文本关联提供了可解释性。在两个不同器官的淋巴结数据集上进行评估,Ano-NAViLa在异常检测和定位方面达到了最先进的性能,超越了竞争对手的模型。
Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning
Authors: Aodi Wu, Xubo Luo
Venue: IROS 2025
First: 2025-10-28T07:43:30+00:00 · Latest: 2025-10-28T07:43:30+00:00
Comments: RoboSense Challenge with IROS 2025
Abstract
This technical report presents our solution for the RoboSense Challenge at
IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving
scene understanding across perception, prediction, planning, and corruption
detection tasks. We propose a systematic framework built on four core
components. First, a Mixture-of-Prompts router classifies questions and
dispatches them to task-specific expert prompts, eliminating interference
across diverse question types. Second, task-specific prompts embed explicit
coordinate systems, spatial reasoning rules, role-playing,
Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to
each task. Third, a visual assembly module composes multi-view images with
object crops, magenta markers, and adaptive historical frames based on question
requirements. Fourth, we configure model inference parameters (temperature,
top-p, message roles) per task to optimize output quality. Implemented on
Qwen2.5-VL-72B, our approach achieves 70.87% average accuracy on Phase-1 (clean
data) and 72.85% on Phase-2 (corrupted data), demonstrating that structured
prompting and spatial grounding substantially enhance VLM performance on
safety-critical autonomous driving tasks. Code and prompts are available at
https://github.com/wuaodi/UCAS-CSU-phase2.
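The routing and per-task configuration described in the abstract can be sketched as a lookup from a question category to a prompt and inference parameters. This is a minimal sketch under stated assumptions: the keyword rules, prompt texts, and parameter values are invented for illustration, and the authors' router may classify questions quite differently.

```python
# Hypothetical per-task prompts and inference parameters.
TASK_CONFIGS = {
    "corruption": {"prompt": "Identify the image corruption type.", "temperature": 0.0},
    "prediction": {"prompt": "Predict the future motion of the object.", "temperature": 0.2},
    "planning": {"prompt": "Recommend a safe ego-vehicle action.", "temperature": 0.2},
    "perception": {"prompt": "Describe the referenced object.", "temperature": 0.1},
}

# Hypothetical keyword rules standing in for the question classifier.
KEYWORDS = {
    "corruption": ("blur", "noise", "corrupt"),
    "prediction": ("will", "next", "future"),
    "planning": ("should", "safe to", "plan"),
}

def route(question):
    """Return (task, config) for a question, defaulting to perception."""
    text = question.lower()
    for task, words in KEYWORDS.items():
        if any(word in text for word in words):
            return task, TASK_CONFIGS[task]
    return "perception", TASK_CONFIGS["perception"]

task, config = route("What will the pedestrian do next?")
```

Keeping prompts and sampling parameters in one table per task makes it easy to tune each task in isolation, which is the interference-avoidance argument made in the abstract.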
中文标题/摘要
标题:通过任务特定提示和空间推理增强视觉语言模型在自动驾驶中的应用
本技术报告介绍了我们在IROS 2025 RoboSense挑战中的解决方案,该挑战评估视觉语言模型(VLMs)在自动驾驶场景理解方面的表现,涵盖感知、预测、规划和损坏检测任务。我们提出了一种基于四个核心组件的系统框架。首先,混合提示路由器将问题分类并分配给特定任务的专家提示,消除不同问题类型之间的干扰。其次,特定任务的提示嵌入了明确的坐标系统、空间推理规则、角色扮演、链式/树式推理以及针对每个任务定制的少量示例。第三,视觉组装模块根据问题要求组合多视图图像、物体剪辑、洋红色标记和自适应历史帧。第四,我们根据任务配置模型推理参数(温度、top-p、消息角色),以优化输出质量。在Qwen2.5-VL-72B上实现,我们的方法在第一阶段(干净数据)的平均准确率为70.87%,第二阶段(损坏数据)为72.85%,表明结构化提示和空间定位显著提升了VLM在安全关键的自动驾驶任务中的性能。代码和提示可在https://github.com/wuaodi/UCAS-CSU-phase2/ 获取。
Summary / 总结
This technical report introduces a framework for enhancing Vision-Language Models (VLMs) in autonomous driving tasks. The approach uses a Mixture-of-Prompts router to dispatch questions to task-specific expert prompts, a visual assembly module to compose images, and task-specific prompts with spatial reasoning rules. On the RoboSense Challenge at IROS 2025, the model achieved 70.87% accuracy on clean data and 72.85% on corrupted data, showing significant improvement in VLM performance for autonomous driving tasks.
该技术报告提出了针对IROS 2025 RoboSense挑战的解决方案,旨在提升视觉-语言模型(VLMs)在自动驾驶中的应用。该方法包含四个核心组件:混合提示路由器、任务特定提示、视觉装配模块以及任务特定的模型推理配置。该框架在干净数据上达到了70.87%的准确率,在受污染数据上达到了72.85%,展示了结构化提示和空间定位在安全关键的自动驾驶任务中对VLM性能的显著提升。
From Observability Data to Diagnosis: An Evolving Multi-agent System for Incident Management in Cloud Systems
Authors: Yu Luo, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Xidao Wen, Yongqian Sun, Shenglin Zhang, Jielong Huang, Nan Qi, Dan Pei
First: 2025-10-28T07:38:15+00:00 · Latest: 2025-10-28T07:38:15+00:00
Abstract
Incident management (IM) is central to the reliability of large-scale cloud
systems. Yet manual IM, where on-call engineers examine metrics, logs, and
traces, is labor-intensive and error-prone in the face of massive and
heterogeneous observability data. Existing automated IM approaches often
struggle to generalize across systems, provide limited interpretability, and
incur high deployment costs, which hinders adoption in practice. In this paper,
we present OpsAgent, a lightweight, self-evolving multi-agent system for IM
that employs a training-free data processor to convert heterogeneous
observability data into structured textual descriptions, along with a
multi-agent collaboration framework that makes diagnostic inference transparent
and auditable. To support continual capability growth, OpsAgent also introduces
a dual self-evolution mechanism that integrates internal model updates with
external experience accumulation, thereby closing the deployment loop.
Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art
performance and show that OpsAgent is generalizable, interpretable,
cost-efficient, and self-evolving, making it a practically deployable and
sustainable solution for long-term operation in real-world cloud systems.
中文标题/摘要
标题:从可观测数据到诊断:云系统故障管理的 evolving 多智能体系统
故障管理(IM)是大规模云系统可靠性的核心。然而,手动IM,其中当班工程师检查指标、日志和跟踪,面对庞大的异构可观测数据时是劳动密集型且容易出错的。现有的自动化IM方法往往难以在系统之间泛化,提供有限的可解释性,并且部署成本高,这阻碍了其实用中的采用。在本文中,我们提出了OpsAgent,这是一种轻量级、自我演化的多智能体系统,用于IM,它采用无需训练的数据处理器将异构可观测数据转换为结构化的文本描述,并采用多智能体协作框架使诊断推理透明且可审计。为了支持持续能力增长,OpsAgent 还引入了一种双重自我演化机制,将内部模型更新与外部经验积累相结合,从而关闭部署循环。在OPENRCA基准上的全面实验表明,OpsAgent 的性能处于先进水平,并且证明OpsAgent 是通用的、可解释的、成本效益高的和自我演化的,使其成为在实际云系统中长期运行的可部署和可持续的解决方案。
Summary / 总结
OpsAgent is a lightweight multi-agent system designed for incident management in cloud systems, addressing the challenges of manual monitoring and existing automated approaches. It uses a training-free data processor to convert heterogeneous observability data into structured textual descriptions and employs a multi-agent collaboration framework for transparent and auditable diagnostic inference. OpsAgent also features a dual self-evolution mechanism, enabling continual capability growth and closing the deployment loop. Experimental results on the OPENRCA benchmark show that OpsAgent performs at state-of-the-art levels, is generalizable, interpretable, and cost-efficient, making it a practical and sustainable solution for cloud systems.
论文介绍了OpsAgent,这是一种轻量级的多智能体系统,用于云系统的故障管理。它使用无训练的数据处理器将异构可观测数据转换为结构化的文本描述,并采用多智能体协作框架进行透明的诊断推理。OpsAgent还包含一种双重自我进化机制,将内部模型更新与外部经验积累相结合,以实现持续的能力增长。实验表明,OpsAgent在性能上达到最先进的水平,具有通用性、可解释性和成本效益,是实际部署和可持续的解决方案。
Compositional Image Synthesis with Inference-Time Scaling
Authors: Minsuk Ji, Sanghyeok Lee, Namhyuk Ahn
First: 2025-10-28T07:16:21+00:00 · Latest: 2025-10-28T07:16:21+00:00
Comments: project page: https://github.com/gcl-inha/ReFocus
Abstract
Despite their impressive realism, modern text-to-image models still struggle
with compositionality, often failing to render accurate object counts,
attributes, and spatial relations. To address this challenge, we present a
training-free framework that combines an object-centric approach with
self-refinement to improve layout faithfulness while preserving aesthetic
quality. Specifically, we leverage large language models (LLMs) to synthesize
explicit layouts from input prompts, and we inject these layouts into the image
generation process, where an object-centric vision-language model (VLM) judge
reranks multiple candidates to select the most prompt-aligned outcome
iteratively. By unifying explicit layout-grounding with self-refine-based
inference-time scaling, our framework achieves stronger scene alignment with
prompts compared to recent text-to-image models. The code is available at
https://github.com/gcl-inha/ReFocus.
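The judge-based reranking loop can be sketched as iterative best-of-N selection: generate several candidates, score each against the prompt, keep the best, and repeat. This is a toy sketch, not the ReFocus implementation; the `generate` and `judge` callables, the integer "images", and the round counts are assumptions chosen so the example runs without any model.

```python
def refine(prompt, generate, judge, rounds=3, candidates_per_round=4):
    """Keep the highest-scoring candidate across several generation rounds."""
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        for candidate in generate(prompt, best, candidates_per_round):
            score = judge(prompt, candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score

# Toy stand-ins: a "candidate" is just the number of objects rendered.
def toy_generate(prompt, previous_best, n):
    base = 0 if previous_best is None else previous_best
    return [base + i for i in range(n)]

def toy_judge(prompt, candidate):
    target = 3  # the prompt "three apples" asks for three objects
    return -abs(candidate - target)

result = refine("three apples", toy_generate, toy_judge)
```

Spending more rounds or candidates raises inference cost in exchange for a better chance of a prompt-aligned output, which is the inference-time scaling trade-off named in the title.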
中文标题/摘要
标题:基于推理时缩放的组成性图像合成
尽管现代文本到图像模型具有惊人的逼真度,但在组成性方面仍然存在困难,经常无法准确渲染物体数量、属性和空间关系。为了解决这一挑战,我们提出了一种无需训练的框架,该框架结合了以对象为中心的方法与自我精炼,以提高布局忠实度同时保持美学质量。具体而言,我们利用大型语言模型(LLMs)从输入提示中合成显式布局,并将这些布局注入到图像生成过程中,其中以对象为中心的视觉语言模型(VLM)评估并重新排序多个候选方案,以迭代选择最符合提示的输出。通过将显式布局接地与基于自我精炼的推理时缩放统一起来,我们的框架在场景对齐方面比最近的文本到图像模型表现更佳。代码可在https://github.com/gcl-inha/ReFocus获取。
Summary / 总结
This paper addresses the issue of compositionality in text-to-image models by proposing a training-free framework that combines an object-centric approach with self-refinement. The framework uses large language models to generate explicit layouts from input prompts, which are then injected into the image generation process. An object-centric vision-language model (VLM) iteratively reranks and selects the most prompt-aligned outcome. This approach improves scene alignment with prompts compared to recent text-to-image models.
研究旨在通过解决文本到图像模型在对象数量、属性和空间关系方面的不足,提高其组成准确性。方法是使用大型语言模型从输入提示生成明确的布局,然后将这些布局注入图像生成过程。一个以对象为中心的视觉语言模型会迭代地细化这些布局,使其更好地与输入提示对齐,从而比最近的模型实现更强的场景对齐。
ETC: training-free diffusion models acceleration with Error-aware Trend Consistency
Authors: Jiajian Xie, Hubery Yin, Chen Li, Zhou Zhao, Shengyu Zhang
First: 2025-10-28T07:08:09+00:00 · Latest: 2025-10-28T07:08:09+00:00
Comments: 17 pages, 10 figures
Abstract
Diffusion models have achieved remarkable generative quality but remain
bottlenecked by costly iterative sampling. Recent training-free methods
accelerate the diffusion process by reusing model outputs. However, these methods
ignore denoising trends and lack error control for model-specific tolerance,
leading to trajectory deviations under multi-step reuse and exacerbating
inconsistencies in the generated results. To address these issues, we introduce
Error-aware Trend Consistency (ETC), a framework that (1) introduces a
consistent trend predictor that leverages the smooth continuity of diffusion
trajectories, projecting historical denoising patterns into stable future
directions and progressively distributing them across multiple approximation
steps to achieve acceleration without deviating; (2) proposes a model-specific
error tolerance search mechanism that derives corrective thresholds by
identifying transition points from volatile semantic planning to stable quality
refinement. Experiments show that ETC achieves a 2.65x acceleration over FLUX
with negligible (-0.074 SSIM score) degradation of consistency.
中文标题/摘要
标题:ETC:基于误差感知趋势一致性的训练-free扩散模型加速
扩散模型在生成质量方面取得了显著进展,但仍然受到昂贵的迭代采样的限制。最近的训练-free方法通过重用模型输出来加速扩散过程。然而,这些方法忽略了去噪趋势,缺乏对模型特定容差的误差控制,导致多步重用时轨迹偏移,并加剧生成结果的一致性问题。为了解决这些问题,我们引入了误差感知趋势一致性(ETC)框架,该框架(1)引入了一致趋势预测器,利用扩散轨迹的平滑连续性,将历史去噪模式投影到稳定的方向,并逐步分布在多个近似步骤中,以实现加速而不偏离;(2)提出了一种针对模型的误差容差搜索机制,通过识别从易变语义规划到稳定质量细化的过渡点来推导纠正阈值。实验表明,ETC在FLUX上实现了2.65倍的加速,且一致性下降可忽略不计(-0.074 SSIM分数)。
Summary / 总结
ETC is a framework that addresses the issue of costly iterative sampling in diffusion models by introducing a consistent trend predictor and a model-specific error tolerance search mechanism. This approach keeps denoising trends smooth and consistent, yielding a 2.65x acceleration over FLUX with negligible degradation of consistency (-0.074 SSIM score).
ETC 是一个框架,通过预测一致的趋势并在多个步骤中分布它们来加速扩散模型,而不偏离轨迹。它还引入了模型特定的误差容限搜索机制,以保持一致性。实验表明,ETC 相比于 FLUX 实现了 2.65 倍的加速,并且一致性仅略有下降(-0.074 SSIM 分数)。
LagMemo: Language 3D Gaussian Splatting Memory for Multi-modal Open-vocabulary Multi-goal Visual Navigation
Authors: Haotian Zhou, Xiaole Wang, He Li, Fusheng Sun, Shengyu Guo, Guolei Qi, Jianghuan Xu, Huijing Zhao
First: 2025-10-28T06:42:21+00:00 · Latest: 2025-10-28T06:42:21+00:00
Abstract
Navigating to a designated goal using visual information is a fundamental
capability for intelligent robots. Most classical visual navigation methods are
restricted to single-goal, single-modality, and closed-set goal settings. To
address the practical demands of multi-modal, open-vocabulary goal queries and
multi-goal visual navigation, we propose LagMemo, a navigation system that
leverages a language 3D Gaussian Splatting memory. During exploration, LagMemo
constructs a unified 3D language memory. With incoming task goals, the system
queries the memory, predicts candidate goal locations, and integrates a local
perception-based verification mechanism to dynamically match and validate goals
during navigation. For fair and rigorous evaluation, we curate GOAT-Core, a
high-quality core split distilled from GOAT-Bench tailored to multi-modal
open-vocabulary multi-goal visual navigation. Experimental results show that
LagMemo's memory module enables effective multi-modal open-vocabulary goal
localization, and that LagMemo outperforms state-of-the-art methods in
multi-goal visual navigation. Project page:
https://weekgoodday.github.io/lagmemo
中文标题/摘要
标题:LagMemo:语言3D高斯斑点记忆多模态开放词汇多目标视觉导航
使用视觉信息导航至指定目标是智能机器人的一项基本能力。大多数经典的视觉导航方法仅限于单目标、单模态和封闭词汇的目标设置。为应对多模态、开放词汇的目标查询和多目标视觉导航的实际需求,我们提出LagMemo,一种利用语言3D高斯斑点记忆的导航系统。在探索过程中,LagMemo构建统一的3D语言记忆。接收到任务目标后,系统查询记忆,预测候选目标位置,并结合局部感知验证机制在导航过程中动态匹配和验证目标。为了公平和严格的评估,我们从GOAT-Bench中精心筛选出GOAT-Core,一个高质量的核心分割,专为多模态开放词汇多目标视觉导航设计。实验结果表明,LagMemo的记忆模块能够有效实现多模态开放词汇目标定位,并且在多目标视觉导航中优于现有最先进的方法。项目页面:https://weekgoodday.github.io/lagmemo
Summary / 总结
LagMemo is a navigation system that addresses the limitations of classical visual navigation methods by enabling multi-modal, open-vocabulary, and multi-goal visual navigation. It uses a language 3D Gaussian Splatting memory to construct a unified 3D language memory during exploration. The system queries this memory with incoming task goals, predicts candidate goal locations, and verifies them dynamically during navigation. Experimental results demonstrate that LagMemo's memory module effectively localizes multi-modal open-vocabulary goals and outperforms existing methods in multi-goal visual navigation.
LagMemo 是一种用于处理多模态和开放词汇多目标视觉导航的导航系统。它使用语言 3D 高斯斑点记忆在探索过程中构建统一的 3D 语言记忆。系统预测候选目标位置并使用局部感知机制进行验证。实验结果表明,LagMemo 能有效定位多模态开放词汇目标,并在多目标视觉导航中优于现有方法。
HistoLens: An Interactive XAI Toolkit for Verifying and Mitigating Flaws in Vision-Language Models for Histopathology
Authors: Sandeep Vissapragada, Vikrant Sahu, Gagan Raj Gupta, Vandita Singh
First: 2025-10-28T06:38:59+00:00 · Latest: 2025-10-28T06:38:59+00:00
Abstract
For doctors to truly trust artificial intelligence, it can't be a black box.
They need to understand its reasoning, almost as if they were consulting a
colleague. We created HistoLens to be that transparent, collaborative partner.
It allows a pathologist to simply ask a question in plain English about a
tissue slide--just as they would ask a trainee. Our system intelligently
translates this question into a precise query for its AI engine, which then
provides a clear, structured report. But it doesn't stop there. If a doctor
ever asks, "Why?", HistoLens can instantly provide a 'visual proof' for any
finding--a heatmap that points to the exact cells and regions the AI used for
its analysis. We've also ensured the AI focuses only on the patient's tissue,
just like a trained pathologist would, by teaching it to ignore distracting
background noise. The result is a workflow where the pathologist remains the
expert in charge, using a trustworthy AI assistant to verify their insights and
make faster, more confident diagnoses.
中文标题/摘要
标题:HistoLens:一种用于验证和缓解病理学视觉语言模型缺陷的交互式XAI工具包
为了医生真正信任人工智能,它不能是一个黑箱。
他们需要理解其推理过程,就像咨询同事一样。我们创建了HistoLens,使其成为一个透明且协作的伙伴。
它允许病理学家用简单的英语问题询问组织切片——就像他们询问实习生一样。我们的系统会智能地将这个问题转化为对AI引擎的精确查询,然后提供一个清晰的结构化报告。但这还不止于此。如果医生问“为什么?”,HistoLens可以立即提供任何发现的“视觉证明”——一个热力图,指向AI分析所用的精确细胞和区域。我们还确保AI只关注患者的组织,就像经过训练的病理学家一样,通过教会它忽略分散注意力的背景噪音。结果是一个工作流程,其中病理学家仍然是专家,使用可信赖的AI助手验证他们的见解并更快、更自信地做出诊断。
Summary / 总结
HistoLens is an interactive XAI toolkit designed to enhance the transparency and trustworthiness of vision-language models in histopathology. It translates pathologists' natural language questions into precise queries for the AI, returns clear structured reports, and backs each finding with a visual proof in the form of a heatmap. The AI is also taught to ignore background noise so that it focuses only on the patient's tissue, as a trained pathologist would. Together these features support pathologists in verifying their insights and making faster, more confident diagnoses.
HistoLens 是一个交互式 XAI 工具包,旨在增强病理学中视觉-语言模型的透明度和可信度。它将病理学家的自然语言问题转化为精确的查询,提供清晰的报告和发现的视觉证明。这确保 AI 只关注相关组织,类似于训练有素的病理学家的做法,并支持病理学家验证他们的见解并做出更有信心的诊断。
OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation
Authors: Agus Gunawan, Samuel Teodoro, Yun Chen, Soo Ye Kim, Jihyong Oh, Munchurl Kim
First: 2025-10-28T06:06:52+00:00 · Latest: 2025-10-28T06:06:52+00:00
Comments: The first two authors contributed equally to this work. The last two
authors are co-corresponding authors
Abstract
Recent advancements in diffusion-based text synthesis have demonstrated
significant performance in inserting and editing text within images via
inpainting. However, despite the potential of text inpainting methods, three
key limitations hinder their applicability to broader Text Image Manipulation
(TIM) tasks: (i) the inability to remove text, (ii) the lack of control over
the style of rendered text, and (iii) a tendency to generate duplicated
letters. To address these challenges, we propose OmniText, a training-free
generalist capable of performing a wide range of TIM tasks. Specifically, we
investigate two key properties of cross- and self-attention mechanisms to
enable text removal and to provide control over both text styles and content.
Our findings reveal that text removal can be achieved by applying
self-attention inversion, which mitigates the model's tendency to focus on
surrounding text, thus reducing text hallucinations. Additionally, we
redistribute cross-attention, as increasing the probability of certain text
tokens reduces text hallucination. For controllable inpainting, we introduce
novel loss functions in a latent optimization framework: a cross-attention
content loss to improve text rendering accuracy and a self-attention style loss
to facilitate style customization. Furthermore, we present OmniText-Bench, a
benchmark dataset for evaluating diverse TIM tasks. It includes input images,
target text with masks, and style references, covering diverse applications
such as text removal, rescaling, repositioning, and insertion and editing with
various styles. Our OmniText framework is the first generalist method capable
of performing diverse TIM tasks. It achieves state-of-the-art performance
across multiple tasks and metrics compared to other text inpainting methods and
is comparable with specialist methods.
中文标题/摘要
标题:OmniText:一种无需训练的通用文本-图像可控操控方法
基于扩散的文本合成最近的进展在通过修复技术在图像中插入和编辑文本方面展示了显著的性能。然而,尽管文本修复方法具有潜力,但三个关键限制阻碍了它们在更广泛的文本图像操控(TIM)任务中的应用:(i)无法删除文本,(ii)缺乏对渲染文本风格的控制,以及(iii)倾向于生成重复的字母。为了解决这些挑战,我们提出了OmniText,这是一种无需训练的通用方法,能够执行广泛的TIM任务。具体而言,我们研究了交叉注意力和自我注意力机制的两种关键属性,以实现文本删除并提供对文本风格和内容的控制。我们的研究发现,通过应用自我注意力反转可以实现文本删除,这可以减轻模型专注于周围文本的倾向,从而减少文本幻觉。此外,我们重新分配了交叉注意力,因为增加某些文本标记的概率可以减少文本幻觉。对于可控修复,我们引入了一种新的损失函数,用于潜在优化框架:交叉注意力内容损失以提高文本渲染准确性,以及自我注意力风格损失以促进风格定制。此外,我们提出了OmniText-Bench,这是一个用于评估各种TIM任务的基准数据集。它包括输入图像、带有掩码的目标文本和风格参考,涵盖了诸如文本删除、缩放、重新定位、插入和编辑以及各种风格等多种应用。我们的OmniText框架是第一个能够执行多种TIM任务的通用方法。它在多个任务和指标上实现了最先进的性能,与其它文本修复方法相比具有竞争力,并且与专门方法相当。
Bridging the gap to real-world language-grounded visual concept learning
Authors: Whie Jung, Semin Kim, Junee Kim, Seunghoon Hong
First: 2025-10-24T12:54:13+00:00 · Latest: 2025-10-28T05:32:23+00:00
Abstract
Human intelligence effortlessly interprets visual scenes along a rich
spectrum of semantic dimensions. However, existing approaches to
language-grounded visual concept learning are limited to a few predefined
primitive axes, such as color and shape, and are typically explored in
synthetic datasets. In this work, we propose a scalable framework that
adaptively identifies image-related concept axes and grounds visual concepts
along these axes in real-world scenes. Leveraging a pretrained vision-language
model and our universal prompting strategy, our framework identifies diverse
image-related axes without any prior knowledge. Our universal concept encoder
adaptively binds visual features to the discovered axes without introducing
additional model parameters for each concept. To ground visual concepts along
the discovered axes, we optimize a compositional anchoring objective, which
ensures that each axis can be independently manipulated without affecting
others. We demonstrate the effectiveness of our framework on subsets of
ImageNet, CelebA-HQ, and AFHQ, showcasing superior editing capabilities across
diverse real-world concepts that are too varied to be manually predefined. Our
method also exhibits strong compositional generalization, outperforming
existing visual concept learning and text-based editing methods. The code is
available at https://github.com/whieya/Language-grounded-VCL.
中文标题/摘要
标题:跨越到现实世界语言导向视觉概念学习的鸿沟
人类智能能够轻松地在丰富的语义维度上解释视觉场景。然而,现有的语言导向视觉概念学习方法仅限于少数预定义的基本轴,如颜色和形状,并且通常在合成数据集中进行探索。在本工作中,我们提出了一种可扩展的框架,该框架能够自适应地识别与图像相关的概念轴,并在现实世界的场景中将视觉概念沿这些轴进行定位。利用预训练的跨模态模型和我们的通用提示策略,我们的框架能够在没有任何先验知识的情况下识别出多样化的与图像相关的轴。我们的通用概念编码器能够自适应地将视觉特征绑定到发现的轴上,而无需为每个概念引入额外的模型参数。为了沿发现的轴定位视觉概念,我们优化了一个组合锚定目标,该目标确保每个轴可以独立操作而不影响其他轴。我们在ImageNet、CelebA-HQ和AFHQ的子集上展示了我们框架的有效性,展示了在多样化的现实世界概念上具有优越的编辑能力,这些概念过于多样化而无法手动预定义。我们的方法还表现出强大的组合泛化能力,优于现有的视觉概念学习和基于文本的编辑方法。代码可在https://github.com/whieya/Language-grounded-VCL/ 获取。
Summary / 总结
This work addresses the limitation of existing approaches in language-grounded visual concept learning by proposing a scalable framework that identifies diverse image-related concept axes in real-world scenes. The framework uses a pretrained vision-language model and a universal prompting strategy to discover these axes without prior knowledge. It also employs an adaptive concept encoder to bind visual features to these axes and optimizes a compositional anchoring objective to ensure independent manipulation of each axis. Experiments on ImageNet, CelebA-HQ, and AFHQ show superior editing capabilities and strong compositional generalization compared to existing methods.
该研究针对现有语言指导视觉概念学习方法的局限性,这些方法通常局限于预定义的基本轴和合成数据集。作者提出了一种可扩展的框架,使用预训练的视觉-语言模型和通用提示策略,在真实场景中识别多样化的图像相关概念轴。该框架展示了在各种现实世界概念上的优越编辑能力和强大的组合泛化能力,超越了现有方法。
CalFuse: Multi-Modal Continual Learning via Feature Calibration and Parameter Fusion
Authors: Juncen Guo, Siao Liu, Xiaoguang Zhu, Lianlong Sun, Liangyu Teng, Jingyi Wu, Di Li, Linxiao Gong, Weiwei Jiang, Wei Zhou, Liang Song
First: 2025-03-24T13:44:12+00:00 · Latest: 2025-10-28T05:22:48+00:00
Abstract
With the proliferation of multi-modal data in large-scale visual recognition
systems, enabling models to continuously acquire knowledge from evolving data
streams while preserving prior information has become increasingly critical.
Class-Continual Learning (CCL) addresses this challenge by incrementally
incorporating new class knowledge without revisiting historical data, making it
essential for real-world big data applications. While traditional CCL methods
rely solely on visual features, recent advances in Vision-Language Models
(VLMs) such as CLIP demonstrate significant potential for CCL by leveraging
pre-trained multi-modal knowledge. However, existing approaches face challenges
in mitigating catastrophic forgetting while maintaining the cross-modal
generalization capabilities of VLMs. To address these limitations, we propose
CalFuse, a framework that synergizes feature Calibration with parameter Fusion
to enable effective multi-modal knowledge integration in continual learning
scenarios. CalFuse introduces a dynamic feature calibration mechanism that
adaptively balances original CLIP visual representations with task-specific
features, preserving the model's intrinsic cross-modal generalization while
adapting to new classes. Concurrently, a QR decomposition-based parameter
fusion strategy progressively integrates newly acquired knowledge with
historical task parameters, maintaining equilibrium between learning new class
representations and retaining prior knowledge across sequential tasks.
Extensive experiments on benchmark datasets validate the effectiveness of our
approach in large-scale multi-modal continual learning settings, demonstrating
superior performance over state-of-the-art methods in both average accuracy and
final task retention.
中文标题/摘要
标题:CalFuse:通过特征校准和参数融合实现多模态连续学习
随着大规模视觉识别系统中多模态数据的增多,使模型能够从不断变化的数据流中持续获取新知识,同时保留先前信息变得越来越关键。连续类学习(CCL)通过逐步整合新类知识而不重新访问历史数据来应对这一挑战,使其成为现实世界大数据应用中的重要组成部分。虽然传统的CCL方法仅依赖于视觉特征,但最近的视觉语言模型(VLM)如CLIP展示了通过利用预训练的多模态知识进行CCL的巨大潜力。然而,现有方法在减轻灾难性遗忘的同时保持跨模态泛化能力方面面临挑战。为了解决这些限制,我们提出了CalFuse框架,该框架通过特征校准与参数融合的协同作用,使多模态知识在连续学习场景中的有效整合成为可能。CalFuse引入了一种动态特征校准机制,该机制能够自适应地平衡原始CLIP视觉表示与任务特定特征,从而在保留模型固有的跨模态泛化能力的同时适应新类。同时,基于QR分解的参数融合策略逐步整合新获取的知识与历史任务参数,保持学习新类表示与保留先前知识之间的平衡。在基准数据集上的广泛实验验证了我们在大规模多模态连续学习设置中的方法的有效性,展示了在平均准确率和最终任务保留方面优于现有最佳方法的性能。
Summary / 总结
CalFuse is a framework designed for multi-modal continual learning, which aims to enable models to continuously acquire new class knowledge without forgetting past information. It combines feature calibration and parameter fusion to balance the integration of new and old knowledge. Experiments show that CalFuse outperforms existing methods in terms of average accuracy and final task retention on benchmark datasets.
CalFuse 是一种结合特征校准和参数融合的多模态连续学习框架,旨在解决灾难性遗忘和保持跨模态泛化能力的问题。它动态平衡 CLIP 视觉表示与任务特定特征,并使用 QR 分解进行参数融合,从而在整合新知识的同时保留历史信息。实验表明,CalFuse 在基准数据集上的平均准确率和最终任务保留方面优于现有方法。
MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs
Authors: Hui Sun, Shiyin Lu, Huanyu Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Ming Li
First: 2025-01-06T09:55:55+00:00 · Latest: 2025-10-28T04:18:29+00:00
Comments: 26 pages, 14 figures
Abstract
Video large language models (Video-LLMs) have made significant progress in
understanding videos. However, processing multiple frames leads to lengthy
visual token sequences, which raises two challenges: the limited context
length cannot accommodate the entire video, and the inclusion of irrelevant
frames hinders visual perception. Hence, effective frame selection is crucial.
This paper emphasizes that frame selection should follow three key principles:
query relevance, list-wise diversity, and sequentiality. Existing methods, such
as uniform frame sampling and query-frame matching, do not capture all of these
principles. Thus, we propose Markov decision determinantal point process with
dynamic programming (MDP3) for frame selection, a training-free and
model-agnostic method that can be seamlessly integrated into existing
Video-LLMs. Our method first estimates frame similarities conditioned on the
query using a conditional Gaussian kernel within the reproducing kernel Hilbert
space (RKHS). We then apply the determinantal point process (DPP) to the
similarity matrix to capture both query relevance and list-wise diversity. To
incorporate sequentiality, we segment the video and apply DPP within each
segment, conditioned on the preceding segment's selection; allocating selection
sizes across segments is modeled as a Markov decision process (MDP).
Theoretically, MDP3 provides a \((1 - 1/e)\)-approximate solution to the
NP-hard list-wise frame selection problem with pseudo-polynomial time
complexity, demonstrating its efficiency. Empirically, MDP3 significantly
outperforms existing methods, verifying its effectiveness and robustness.
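The DPP step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the common quality-diversity kernel L = diag(r) S diag(r), where S is a Gaussian kernel between frame embeddings and r is each frame's Gaussian similarity to the query, and it uses plain greedy log-determinant maximization. The paper's conditional kernel, segment-wise conditioning, and MDP-based size allocation are omitted, and the function names `gaussian_kernel`, `build_dpp_kernel`, and `greedy_dpp` are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def build_dpp_kernel(frames, query, bandwidth=1.0):
    """Assumed kernel L = diag(r) S diag(r): S is frame-frame similarity,
    r is each frame's relevance to the query embedding."""
    S = gaussian_kernel(frames, frames, bandwidth)
    r = gaussian_kernel(frames, query[None, :], bandwidth)[:, 0]
    return r[:, None] * S * r[None, :]

def greedy_dpp(L, k):
    """Greedily pick up to k indices that maximise log det of the selected submatrix."""
    selected = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            val = logdet if sign > 0 else -np.inf
            if val > best_val:
                best, best_val = i, val
        if best is None:  # no candidate keeps the determinant positive
            break
        selected.append(best)
    return sorted(selected)  # return indices in temporal order

# Toy data: frames 0 and 1 are near-duplicates, frame 2 is distinct.
frames = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
query = np.array([0.7, 0.7])
print(greedy_dpp(build_dpp_kernel(frames, query), k=2))
```

In this toy setup the near-duplicate pair makes the determinant almost zero, so the greedy step keeps only one of the two duplicates alongside the distinct frame. That is the list-wise diversity effect a DPP is meant to provide.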
中文标题/摘要
标题:MDP3:一种无需训练的列表级帧选择方法用于视频大语言模型
视频大语言模型(Video-LLMs)在理解视频方面取得了显著进展。然而,处理多帧会导致视觉标记序列过长,带来诸如上下文长度有限无法容纳整个视频,以及包含无关帧影响视觉感知等挑战。因此,有效的帧选择至关重要。本文强调帧选择应遵循三个关键原则:查询相关性、列表级多样性以及顺序性。现有方法,如均匀帧采样和查询帧匹配,未能捕捉所有这些原则。因此,我们提出了马尔可夫决策确定性点过程与动态规划(MDP3)进行帧选择,这是一种无需训练且模型无关的方法,可以无缝集成到现有的Video-LLMs中。我们的方法首先使用条件高斯核在再生核希尔伯特空间(RKHS)中估计帧相似性,然后将确定性点过程(DPP)应用于相似性矩阵,以捕捉查询相关性和列表级多样性。为了引入顺序性,我们将视频分割,并在每个分割内应用DPP,条件是基于前一个分割的选择,将其建模为马尔可夫决策过程(MDP)以分配各分割的选择大小。理论上,MDP3为NP难的列表级帧选择问题提供了\((1 - 1/e)\)近似解,具有伪多项式时间复杂度,展示了其效率。实验上,MDP3显著优于现有方法,验证了其有效性和鲁棒性。
Summary / 总结
This paper addresses the challenge of effective frame selection in Video-LLMs by proposing MDP3, a training-free and model-agnostic method. MDP3 uses a conditional Gaussian kernel in RKHS to estimate frame similarities, then applies DPP to capture query relevance and list-wise diversity. Sequentiality is incorporated by segmenting the video and using an MDP to allocate selection sizes across segments. Experiments show that MDP3 outperforms existing methods in terms of effectiveness and robustness.
本文提出了一种无需训练且模型无关的方法 MDP3,以解决 Video-LLMs 中的帧选择问题。MDP3 使用条件高斯核在 RKHS 中估计帧相似性,然后应用 DPP 来捕捉查询相关性和列表级多样性。通过将视频分段并使用 MDP 在各段之间分配选择大小来引入顺序性。实验表明,MDP3 在有效性和鲁棒性方面优于现有方法。
Enhancing CLIP Robustness via Cross-Modality Alignment
Authors: Xingyu Zhu, Beier Zhu, Shuo Wang, Kesen Zhao, Hanwang Zhang
Venue: NeurIPS 2025 Spotlight
First: 2025-10-28T03:47:44+00:00 · Latest: 2025-10-28T03:47:44+00:00
Comments: NeurIPS 2025 Spotlight
Abstract
Vision-language models (VLMs) such as CLIP demonstrate strong generalization
in zero-shot classification but remain highly vulnerable to adversarial
perturbations. Existing methods primarily focus on adversarial fine-tuning or
prompt optimization and often overlook the gap in CLIP's encoded features:
the text and image features lie far apart from each other.
This misalignment is significantly amplified under adversarial perturbations,
leading to severe degradation in classification performance. To address this
problem, we propose Cross-modality Alignment, dubbed COLA, an optimal
transport-based framework that explicitly addresses adversarial misalignment by
restoring both global image-text alignment and local structural consistency in
the feature space. (1) COLA first projects adversarial image embeddings onto a
subspace spanned by class text features, effectively filtering out non-semantic
distortions while preserving discriminative information. (2) It then models
images and texts as discrete distributions over multiple augmented views and
refines their alignment via OT, with the subspace projection seamlessly
integrated into the cost computation. This design ensures stable cross-modal
alignment even under adversarial conditions. COLA is training-free and
compatible with existing fine-tuned models. Extensive evaluations across 14
zero-shot classification benchmarks demonstrate the effectiveness of COLA,
especially with an average improvement of 6.7% on ImageNet and its variants
under PGD adversarial attacks, while maintaining high accuracy on clean
samples.
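Step (1) amounts to an orthogonal projection, which can be sketched in a few lines. This is an illustrative assumption about the operation, not COLA's released code: `project_onto_text_subspace` is a hypothetical helper that uses a least-squares solve to project an image embedding onto the span of the class text features. The OT refinement of step (2) is not shown.

```python
import numpy as np

def project_onto_text_subspace(image_emb, text_feats):
    """Orthogonally project image_emb (d,) onto the span of the rows of text_feats (C, d)."""
    # Solve min_c || text_feats.T @ c - image_emb ||_2. lstsq also copes with
    # linearly dependent text features.
    coeffs, *_ = np.linalg.lstsq(text_feats.T, image_emb, rcond=None)
    return text_feats.T @ coeffs

# Toy example: two class text features spanning the x-y plane of a 3-D space.
text_feats = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
# The z-component stands in for a non-semantic adversarial distortion.
adv_image = np.array([0.5, 0.2, 0.9])

projected = project_onto_text_subspace(adv_image, text_feats)
residual = adv_image - projected  # the component removed by the projection
print(projected, residual)
```

The residual is orthogonal to every text feature, so whatever the projection discards carries no component along the class directions. This is one way to read the abstract's claim that the projection filters non-semantic distortions while preserving discriminative information.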
中文标题/摘要
标题:通过跨模态对齐提升CLIP的鲁棒性
视觉-语言模型(VLMs)如CLIP在零样本分类中表现出强大的泛化能力,但对对抗性扰动极为敏感。现有方法主要集中在对抗性微调或提示优化上;它们往往忽视了CLIP编码特征之间的差距,即文本和图像特征彼此相距甚远。这种错位在对抗性扰动下被显著放大,导致分类性能严重下降。为解决这一问题,我们提出了跨模态对齐(COLA),这是一种基于最优传输的框架,通过恢复全局图像-文本对齐和局部结构一致性来显式解决对抗性错位问题。(1)COLA首先将对抗性图像嵌入投影到由类别文本特征构成的子空间中,有效过滤掉非语义扭曲,同时保留判别信息。(2)然后将图像和文本建模为多个增强视图上的离散分布,并通过最优传输优化它们的对齐,子空间投影无缝集成到成本计算中。这种设计确保在对抗性条件下跨模态对齐的稳定性。COLA无需训练且与现有微调模型兼容。在14个零样本分类基准上的广泛评估证明了COLA的有效性,特别是在PGD对抗性攻击下,ImageNet及其变体的平均改进率为6.7%,同时在干净样本上保持高准确性。
Summary / 总结
The paper addresses the vulnerability of CLIP to adversarial perturbations by proposing COLA, a training-free Cross-modality Alignment framework. COLA first projects adversarial image embeddings onto the subspace spanned by class text features, filtering out non-semantic distortions while preserving discriminative information, and then uses optimal transport to restore global image-text alignment and local structural consistency. Evaluations across 14 zero-shot classification benchmarks show an average improvement of 6.7% on ImageNet and its variants under PGD attacks, while maintaining high accuracy on clean samples.
论文提出COLA,一种跨模态对齐框架,以解决CLIP对对抗扰动的脆弱性问题。COLA使用最优传输来全局和局部对齐文本和图像特征,通过将对抗性图像嵌入投影到由类别文本特征构成的子空间,并通过离散分布建模来优化它们的对齐。实验结果显示,在PGD攻击下,COLA在ImageNet及其变体上的平均改进幅度达到6.7%,同时在干净样本上保持高准确率。
HyPerNav: Hybrid Perception for Object-Oriented Navigation in Unknown Environment
Authors: Zecheng Yin, Hao Zhao, Zhen Li
First: 2025-10-27T01:43:56+00:00 · Latest: 2025-10-28T02:49:09+00:00
Comments: under review
Abstract
Object-oriented navigation (ObjNav) enables a robot to navigate to a target
object directly and autonomously in an unknown environment. Effective
perception during navigation in unknown environments is critical for autonomous
robots. While egocentric observations from RGB-D sensors provide abundant local
information, real-time top-down maps offer valuable global context for ObjNav.
Nevertheless, the majority of existing studies focus on a single source, seldom
integrating these two complementary perceptual modalities, despite the fact
that humans naturally attend to both. With the rapid advancement of
Vision-Language Models (VLMs), we propose Hybrid Perception Navigation
(HyPerNav), leveraging VLMs' strong reasoning and vision-language understanding
capabilities to jointly perceive both local and global information to enhance
the effectiveness and intelligence of navigation in unknown environments. In
both large-scale simulation evaluation and real-world validation, our method
achieves state-of-the-art performance against popular baselines. Benefiting
from the hybrid perception approach, our method captures richer cues and finds
objects more effectively by simultaneously leveraging information from
egocentric observations and the top-down map. Our ablation study further shows
that each of the two perception sources contributes to the navigation
performance.
中文标题/摘要
标题:HyPerNav:未知环境中的对象导向导航的混合感知
对象导向导航(ObjNav)使机器人能够在未知环境中直接自主地导航到目标对象。在未知环境中进行有效的导航感知对于自主机器人至关重要。虽然第一人称视角的RGB-D传感器提供了丰富的局部信息,实时的全局地图则为ObjNav提供了宝贵的整体上下文。然而,现有的大多数研究仅关注单一感知模态,很少将这两种互补的感知模态结合起来,尽管人类自然会同时关注两者。随着视觉语言模型(VLMs)的迅速发展,我们提出了混合感知导航(HyPerNav),利用VLMs的强大推理和视觉语言理解能力,同时感知局部和全局信息,以增强未知环境中导航的有效性和智能化。在大规模的仿真评估和现实世界验证中,我们的方法在流行的基线方法中达到了最先进的性能。得益于混合感知方法,我们的方法捕获了更丰富的线索,并更有效地找到了目标对象,同时利用第一人称观察和全局地图的信息理解。我们的消融研究进一步证明了混合感知对导航性能的贡献。
Summary / 总结
HyPerNav is proposed to enhance object-oriented navigation in unknown environments by integrating local RGB-D sensor observations and global top-down maps using Vision-Language Models. The method leverages the reasoning and vision-language understanding capabilities of VLMs to jointly perceive both local and global information, achieving state-of-the-art performance in both simulation and real-world evaluations. The ablation study demonstrates that both modalities contribute to improved navigation performance.
HyPerNav 是一种用于未知环境中的对象导向导航的混合感知方法,通过视觉语言模型结合局部 RGB-D 传感器观测与全局俯视地图。该方法通过同时感知局部和全局信息显著提高了导航的有效性和智能化。实验表明,HyPerNav 在仿真和真实世界中均优于现有基线,能够捕获更多线索并更有效地找到目标物体。