arXiv 论文速递

2026-02-04 03:54
Snapshot: 20260204_0354
World-Gymnast: Training Robots with Reinforcement Learning in a World Model
Authors: Ansh Kumar Sharma, Yixiang Sun, Ninghao Lu, Yunzhe Zhang, Jiarao Liu, Sherry Yang
First: 2026-02-02T18:44:45+00:00 · Latest: 2026-02-02T18:44:45+00:00
Comments: https://world-gymnast.github.io/
Abstract
Robot learning from interacting with the physical world is fundamentally bottlenecked by the cost of physical interaction. The two alternatives, supervised finetuning (SFT) from expert demonstrations and reinforcement learning (RL) in a software-based simulator, are limited by the amount of expert data available and the sim-to-real gap for manipulation. With the recent emergence of world models learned from real-world video-action data, we ask the question of whether training a policy in a world model can be more effective than supervised learning or software simulation in achieving better real-robot performance. We propose World-Gymnast, which performs RL finetuning of a vision-language-action (VLA) policy by rolling out the policy in an action-conditioned video world model and rewarding the rollouts with a vision-language model (VLM). On the Bridge robot setup, World-Gymnast outperforms SFT by as much as 18x and outperforms software simulator by as much as 2x. More importantly, World-Gymnast demonstrates intriguing capabilities of RL with a world model, including training on diverse language instructions and novel scenes from the world model, test-time training in a novel scene, and online iterative world model and policy improvement. Our results suggest learning a world model and training robot policies in the cloud could be the key to bridging the gap between robots that work in demonstrations and robots that can work in anyone's household.
中文标题/摘要
标题:世界体操手:使用世界模型中的强化学习训练机器人
机器人通过与物理世界交互学习受到物理交互成本的限制。监督微调(SFT)从专家演示和基于软件的模拟器中的强化学习(RL)的两种替代方案分别受限于可用的专家数据量和操作的模拟到现实的差距。随着世界模型从真实世界视频-动作数据中学习的最近出现,我们提出了一个问题:在世界模型中训练策略是否比监督学习或软件模拟更有效,以实现更好的真实机器人性能。我们提出了World-Gymnast,它通过在动作条件下的视频世界模型中展开策略,并用视觉-语言模型(VLM)奖励展开,对视觉-语言-动作(VLA)策略进行RL微调。在Bridge机器人设置中,World-Gymnast在SFT上的表现高出18倍,在软件模拟器上的表现高出2倍。更重要的是,World-Gymnast展示了使用世界模型进行RL的有趣能力,包括在多种语言指令和世界模型中的新场景上进行训练,在新场景中的测试时训练,以及在线迭代改进世界模型和策略。我们的结果表明,学习世界模型并在云端训练机器人策略可能是弥合演示中工作的机器人和任何家庭中工作的机器人之间的差距的关键。
Summary / 总结
World-Gymnast addresses the cost of physical robot interaction by finetuning a vision-language-action (VLA) policy with reinforcement learning (RL) inside an action-conditioned video world model learned from real-world video-action data, with rollouts rewarded by a vision-language model (VLM). On the Bridge robot setup, it outperforms supervised finetuning by as much as 18x and a software simulator by as much as 2x. The paper also reports training on diverse language instructions and novel scenes from the world model, test-time training in a novel scene, and online iterative improvement of both the world model and the policy. The authors suggest that learning a world model and training robot policies in the cloud could help close the gap between robots that work in demonstrations and robots that work in anyone's household.
World-Gymnast通过使用世界模型中的强化学习(RL)来克服物理交互成本高和模拟到现实的差距,解决了机器人训练的挑战。它在Bridge机器人设置上比监督微调高18倍,比软件模拟高2倍。关键发现包括政策能够从多种语言指令中学习,适应新的场景,并在线迭代改进世界模型和政策。
Certain Head, Uncertain Tail: Expert-Sample for Test-Time Scaling in Fine-Grained MoE
Authors: Yuanteng Chen, Peisong Wang, Nanxin Zeng, Yuantian Shao, Gang Li, Jing Liu, Jian Cheng
First: 2026-02-02T18:39:33+00:00 · Latest: 2026-02-02T18:39:33+00:00
Comments: 24 pages, 13 figures
Abstract
Test-time scaling improves LLM performance by generating multiple candidate solutions, yet token-level sampling requires temperature tuning that trades off diversity against stability. Fine-grained MoE, featuring hundreds of well-trained experts per layer and multi-expert activation per token, offers an unexplored alternative through its rich routing space. We empirically characterize fine-grained MoE routing and uncover an informative pattern: router scores exhibit a certain head of high-confidence experts followed by an uncertain tail of low-confidence candidates. While single-run greedy accuracy remains stable when fewer experts are activated, multi-sample pass@n degrades significantly, suggesting that the certain head governs core reasoning capability while the uncertain tail correlates with reasoning diversity. Motivated by these findings, we propose Expert-Sample, a training-free method that preserves high-confidence selections while injecting controlled stochasticity into the uncertain tail, enabling diverse generation without destabilizing outputs. Evaluated on multiple fine-grained MoE models across math, knowledge reasoning, and code tasks, Expert-Sample consistently improves pass@n and verification-based accuracy. On Qwen3-30B-A3B-Instruct evaluated on GPQA-Diamond with 32 parallel samples, pass@32 rises from 85.4% to 91.9%, and accuracy improves from 59.1% to 62.6% with Best-of-N verification.
中文标题/摘要
标题:确定头部,不确定尾部:测试时缩放在细粒度MoE中的专家样本
测试时缩放通过生成多个候选解决方案来提高LLM性能,但基于token的采样需要温度调节,这在多样性和稳定性之间进行权衡。细粒度MoE,每层包含数百个训练良好的专家,并且每个token具有多专家激活,提供了通过其丰富的路由空间未被探索的替代方案。我们实证表征了细粒度MoE路由,并发现了一个有信息的模式:路由器分数表现出高置信度专家的确定头部,随后是低置信度候选者的不确定尾部。当激活的专家较少时,单次运行贪婪准确率保持稳定,而多样本pass@n显著下降——这表明确定头部管理核心推理能力,而不确定尾部与推理多样性相关。受这些发现的启发,我们提出了一种无需训练的方法——专家样本,该方法保留高置信度选择的同时,向不确定尾部注入可控的随机性,从而实现多样生成而不破坏输出。在数学、知识推理和代码任务等多个细粒度MoE模型上进行评估,专家样本一致地提高了pass@n和基于验证的准确性。在Qwen3-30B-A3B-Instruct上,使用GPQA-Diamond进行32并行样本评估,pass@32从85.4%提高到91.9%,并且在Best-of-N验证下准确性从59.1%提高到62.6%。
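The routing idea in the Expert-Sample abstract (keep the certain head, add randomness only in the uncertain tail) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `expert_sample_route`, the fixed `head_size` split, and the `temperature` knob are all assumptions made for illustration, and the paper may define the head/tail boundary and the sampling distribution differently.

```python
import numpy as np

def expert_sample_route(router_logits, k, head_size, temperature=1.0, rng=None):
    """Pick k experts for one token (hypothetical sketch).

    The top `head_size` experts are always kept; the remaining
    k - head_size slots are sampled without replacement from the tail.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(router_logits, dtype=np.float64)
    if not 0 <= head_size <= k <= logits.size:
        raise ValueError("need 0 <= head_size <= k <= number of experts")

    order = np.argsort(-logits)      # experts sorted by descending router score
    head = order[:head_size]         # "certain head": deterministic
    tail = order[head_size:]         # "uncertain tail": sampling candidates

    n_sampled = k - head_size
    if n_sampled == 0:
        return np.sort(head)

    # Softmax over the tail logits only (temperature is an assumed knob).
    z = logits[tail] / temperature
    z -= z.max()
    probs = np.exp(z) / np.exp(z).sum()

    sampled = rng.choice(tail, size=n_sampled, replace=False, p=probs)
    return np.sort(np.concatenate([head, sampled]))
```

With `head_size == k` the function reduces to ordinary top-k routing, so the amount of injected stochasticity is controlled by how many of the k slots are left to the tail.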
ReasonEdit: Editing Vision-Language Models using Human Reasoning
Authors: Jiaxing Qiu, Kaihua Hou, Roxana Daneshjou, Ahmed Alaa, Thomas Hartvigsen
First: 2026-02-02T18:06:14+00:00 · Latest: 2026-02-02T18:06:14+00:00
Abstract
Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision-language models (VLMs), no existing editors tackle reasoning-heavy tasks, which typically require humans and models to reason about images. We therefore propose ReasonEdit, the first VLM editor to let users explain their reasoning during editing, introducing a new, practical model editing setup. ReasonEdit continuously stores human reasoning in a codebook, and retrieves only relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science. Across four VLMs on multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, ultimately showing that using human reasoning during editing greatly improves edit generalization.
中文标题/摘要
标题:ReasonEdit:使用人类推理编辑视觉-语言模型
模型编辑旨在纠正大型预训练模型中的错误,而不改变无关行为。虽然一些最近的工作已经编辑了视觉-语言模型(VLMs),但现有的编辑器没有解决需要人类和模型对图像进行推理的推理密集型任务。因此,我们提出了ReasonEdit,这是第一个允许用户在编辑过程中解释其推理的VLM编辑器,引入了一种新的、实用的模型编辑设置。ReasonEdit持续存储人类推理在代码书中,并使用一种新颖的拓扑平衡多模态嵌入方法,在推理时仅检索相关事实,该方法受到网络科学的启发。在四个VLMs上的多个基于推理的视觉问答数据集上,ReasonEdit实现了最先进的编辑性能,最终表明在编辑过程中使用人类推理大大提高了编辑的泛化能力。
Summary / 总结
ReasonEdit is designed to edit vision-language models by incorporating human reasoning, addressing the limitations of existing editors that do not handle reasoning-heavy tasks. It uses a codebook to store human reasoning and a topology-balanced multimodal embedding method for inference, achieving state-of-the-art performance across four VLMs on rationale-based visual question answering datasets. This approach significantly enhances the generalization of edits.
ReasonEdit 旨在通过融入人类推理来纠正视觉语言模型中的错误,特别适用于需要推理的任务。它使用代码本存储人类推理,并使用一种基于网络科学的拓扑平衡多模态嵌入方法来检索推理期间的相关信息。在多个数据集上,ReasonEdit 的表现优于现有方法,表明在编辑过程中集成人类推理可以显著提高模型的泛化能力。
CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos
Authors: Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang
Venue: ICLR 2026
First: 2025-05-24T07:01:31+00:00 · Latest: 2026-02-02T17:54:23+00:00
Comments: Accepted to ICLR 2026. Project page: https://danielshkao.github.io/cot-rvs.html. Code: https://github.com/DanielSHKao/CoT-RVS
Abstract
Reasoning Video Object Segmentation is a challenging task, aiming at generating a mask sequence from an input video given a complex and implicit text query. While existing works finetune Multimodal Large Language Models (MLLM) for the task, they still fail in video inputs given complex temporally-sensitive queries, indicating their lack of temporal and spatial integration in complex scenarios. In this paper, we propose CoT-RVS, a novel framework employing the zero-shot Chain-of-Thought (CoT) capability of MLLM to address these complex challenges by temporal-semantic reasoning: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses a corresponding keyframe for each object that can be observed effortlessly among all frames (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, which can be applied to Reasoning Video Instance Segmentation. Our framework's training-free feature further allows its extension to process online video streams, where the CoT is used at test time to update the object of interest when a better target starts to emerge and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.
中文标题/摘要
标题:CoT-RVS:零样本视频对象分割中的链式思维推理分割
推理视频对象分割是一个具有挑战性的任务,旨在根据复杂的隐含文本查询从输入视频中生成一个掩码序列。现有工作通过微调多模态大型语言模型(MLLM)来完成此任务,但在面对复杂的时间敏感查询时,它们仍然无法处理视频输入,这表明它们在复杂场景中缺乏时间和空间的整合。在本文中,我们提出了一种名为CoT-RVS的新框架,利用MLLM的零样本链式思维(CoT)能力通过时间语义推理来解决这些复杂挑战:CoT-RVS分析给定帧中可能与语言查询匹配的可见对象(语义),并在所有帧中选择一个可以轻松观察到的对应关键帧(时间)。值得注意的是,CoT-RVS框架无需训练,并且兼容闭源的MLLM,可以应用于推理视频实例分割。我们框架的无需训练特性还允许其扩展以处理在线视频流,在这种情况下,CoT在测试时用于更新目标对象,当出现更好的目标并变得可见时。我们在具有显式和隐含查询的视频对象分割上进行了广泛的实验。结果表明,CoT-RVS在两种情况下都显著优于先前的工作,定性和定量上均是如此。
Summary / 总结
The research addresses Reasoning Video Object Segmentation, i.e., generating mask sequences from videos given complex, implicit text queries, where finetuned MLLM-based methods still fail on temporally sensitive queries. The proposed training-free CoT-RVS framework uses the zero-shot Chain-of-Thought capability of Multimodal Large Language Models for temporal-semantic reasoning: it identifies visible objects that may match the query and selects, for each object, a keyframe in which it can be observed easily. Experiments show that CoT-RVS outperforms previous methods on both explicit and implicit queries, qualitatively and quantitatively.
CoT-RVS 是一种新颖的框架,用于解决复杂和时间敏感查询的零样本推理视频对象分割问题。它利用多模态大型语言模型(MLLM)的零样本推理(CoT)能力进行时间语义推理,识别匹配查询的关键帧。实验结果表明,CoT-RVS 在显式和隐式查询场景中均优于先前的方法,无论是定性还是定量方面。
LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
Authors: Zhenpeng Huang, Jiaqi Li, Zihan Jia, Xinhao Li, Desen Meng, Lingxue Song, Xi Chen, Liang Li, Limin Wang
Venue: NeurIPS 2025
First: 2026-02-02T17:03:37+00:00 · Latest: 2026-02-02T17:03:37+00:00
Comments: NeurIPS 2025
Abstract
We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model's scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model's preferences through multi-segment reasoning tasks. With only 16K synthetic examples and no costly human labels, LongVPO outperforms the state-of-the-art open-source models on multiple long-video benchmarks, while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.
中文标题/摘要
标题:LongVPO:从锚定线索到自我推理的长视频偏好优化
我们提出了LongVPO,这是一种新颖的两阶段直接偏好优化框架,使短上下文视觉语言模型能够在无需任何长视频注释的情况下,稳健地理解超长视频。在第一阶段,我们通过将问题锚定到单独的短片段,交替插入干扰项,并应用视觉相似性和问题特定性过滤来合成偏好三元组,以减轻位置偏差并确保明确的监督。我们还通过仅评估锚定片段来近似参考模型对长上下文的评分,从而减少计算开销。在第二阶段,我们使用递归的字幕生成流水线在长视频上生成场景级元数据,然后使用大型语言模型构建多段推理查询和不偏好响应,通过多段推理任务对模型的偏好进行对齐。仅使用16K合成示例且无需昂贵的人工标签,LongVPO在多个长视频基准测试中优于最先进的开源模型,同时保持强大的短视频性能(例如,在MVBench上),提供了一种可扩展的框架,用于高效理解长视频。
Summary / 总结
LongVPO is a two-stage Direct Preference Optimization framework that helps short-context vision-language models understand ultra-long videos without long-video annotations. In Stage 1, it synthesizes preference triples by anchoring questions to individual short clips, interleaving them with distractors, and filtering by visual similarity and question specificity to reduce positional bias. In Stage 2, it generates scene-level metadata with a recursive captioning pipeline and uses a large language model to create multi-segment reasoning queries and dispreferred responses. With only 16K synthetic examples, LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks while maintaining strong short-video performance.
LongVPO 是一种两阶段的直接偏好优化框架,帮助短上下文视觉-语言模型理解超长视频而无需长视频标注。在第一阶段,通过将问题锚定到短片段并过滤以减少位置偏差来合成偏好三元组。在第二阶段,使用递归字幕生成管道生成场景级元数据,并使用大型语言模型构建推理查询以对齐模型的偏好。LongVPO 在长视频基准测试中表现出色,同时保持短视频性能,展示了高效长视频理解的可扩展方法。
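LongVPO is described as a Direct Preference Optimization (DPO) framework, so a short sketch of the standard per-example DPO loss may help readers unfamiliar with it. This is the generic DPO objective, not anything specific to LongVPO: the function name and the default `beta=0.1` are assumptions, and the reference log-probabilities are simply taken as inputs (the abstract says the paper approximates them by scoring only the anchor clip, which is not modeled here).

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) response pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. beta is an assumed default.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and reference agree on both responses the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the preferred response relative to the reference.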
No time to train! Training-Free Reference-Based Instance Segmentation
Authors: Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley
First: 2025-07-03T16:59:01+00:00 · Latest: 2026-02-02T16:47:36+00:00
Comments: Preprint
Abstract
The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic, segmentation paradigm and yet still requires manual visual-prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction; (2) representation aggregation and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50) and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
中文标题/摘要
标题:没有时间训练!基于参考的实例分割
图像分割模型的历史性能一直受到大规模标注数据收集成本高的限制。Segment Anything Model (SAM) 通过一种可提示的、语义无关的分割范式缓解了这一原始问题,但仍需要手动视觉提示或复杂的领域特定提示生成规则来处理新图像。为了减少这种新的负担,我们的工作研究了仅提供少量参考图像时的对象分割任务。我们的关键洞察是利用基础模型学习到的强语义先验,在参考图像和目标图像之间识别对应的区域。我们发现对应关系能够自动生成实例级分割掩码以供下游任务使用,并通过一个无训练的多阶段方法实现我们的想法,该方法包括(1)记忆库构建;(2)表示聚合;(3)语义感知特征匹配。我们的实验显示在分割指标上取得了显著改进,达到了COCO FSOD(36.8% nAP)、PASCAL VOC 少样本(71.2% nAP50)的最佳性能,并在跨域少样本基准上优于现有无训练方法(22.4% nAP)。
Summary / 总结
This work addresses the challenge of image segmentation by leveraging a small set of reference images to automatically generate instance-level segmentation masks without the need for training. The method uses semantic priors from foundation models to identify correspondences between reference and target images, incorporating a multi-stage, training-free approach. Experiments show significant improvements in segmentation metrics, achieving state-of-the-art performance on COCO FSOD and PASCAL VOC Few-Shot benchmarks, and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark.
该研究通过利用少量参考图像而非手动视觉提示来解决图像分割问题。提出了一种无需训练的方法,利用基础模型中的强语义先验来识别参考图像和目标图像之间的对应区域,从而自动生成实例级别的分割掩码。该方法包括记忆库构建、表示聚合和语义感知特征匹配。实验结果显示在分割指标上取得了显著改进,在COCO FSOD和PASCAL VOC Few-Shot基准上达到了最先进的性能,并在跨域FSOD基准上优于现有无需训练的方法。
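The three stages named in the abstract (memory bank construction, representation aggregation, feature matching) can be illustrated with a toy prototype-matching sketch. It is only a generic illustration under strong assumptions, not the paper's pipeline: features are assumed to be precomputed patch embeddings from some foundation model, aggregation is a plain per-class mean, and the `threshold` value and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def build_prototypes(ref_features, ref_labels):
    """Memory bank + aggregation: average reference features per class."""
    ref_features = np.asarray(ref_features, dtype=np.float64)
    ref_labels = np.asarray(ref_labels)
    classes = np.unique(ref_labels)
    protos = np.stack([ref_features[ref_labels == c].mean(axis=0) for c in classes])
    return classes, l2_normalize(protos)

def match_patches(target_features, classes, prototypes, threshold=0.5):
    """Feature matching: label each target patch with its most similar
    class prototype, or -1 when no prototype is similar enough."""
    target = l2_normalize(np.asarray(target_features, dtype=np.float64))
    sims = target @ prototypes.T                 # cosine similarity matrix
    best = sims.argmax(axis=1)
    labels = np.where(sims.max(axis=1) >= threshold, classes[best], -1)
    return labels, sims
```

In a real system the per-patch labels would still have to be grouped into instance masks (for example by prompting a segmenter such as SAM with the matched regions); that step is outside this sketch.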
LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation
Authors: Bo Miao, Weijia Liu, Jun Luo, Lachlan Shinnick, Jian Liu, Thomas Hamilton-Smith, Yuhe Yang, Zijie Wu, Vanja Videnovic, Feras Dayoub, Anton van den Hengel
First: 2026-02-02T15:26:19+00:00 · Latest: 2026-02-02T15:26:19+00:00
Abstract
The relationships between objects and language are fundamental to meaningful communication between humans and AI, and to practically useful embodied intelligence. We introduce HieraNav, a multi-granularity, open-vocabulary goal navigation task where agents interpret natural language instructions to reach targets at four semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), a large-scale benchmark built on real-world 3D indoor scans with comprehensive human-verified annotations and tasks spanning these levels. LangMap provides region labels, discriminative region descriptions, discriminative instance descriptions covering 414 object categories, and over 18K navigation tasks. Each target features both concise and detailed descriptions, enabling evaluation across different instruction styles. LangMap achieves superior annotation quality, outperforming GOAT-Bench by 23.8% in discriminative accuracy using four times fewer words. Comprehensive evaluations of zero-shot and supervised models on LangMap reveal that richer context and memory improve success, while long-tailed, small, context-dependent, and distant goals, as well as multi-goal completion, remain challenging. HieraNav and LangMap establish a rigorous testbed for advancing language-driven embodied navigation. Project: https://bo-miao.github.io/LangMap
中文标题/摘要
标题:LangMap:层次化开放词汇目标导航基准
物体与语言之间的关系是人类与AI有意义交流的基础,也是实用的具身智能的关键。我们引入了HieraNav,这是一个多粒度、开放词汇的目标导航任务,其中代理根据自然语言指令在四个语义级别(场景、房间、区域和实例)上导航。为此,我们提出了语言作为地图(LangMap),这是一个基于真实世界3D室内扫描的大规模基准,包含全面的人工验证注释和任务,涵盖这些级别。LangMap提供了区域标签、区分性区域描述、涵盖414个物体类别的区分性实例描述,以及超过18000个导航任务。每个目标都具有简洁和详细的描述,使评估跨越不同的指令风格成为可能。LangMap在区分准确性上实现了更高的注释质量,使用四分之一的词数超越了GOAT-Bench 23.8%。LangMap上零样本和监督模型的全面评估表明,更丰富的上下文和记忆可以提高成功率,而长尾、小规模、上下文依赖和远距离目标,以及多目标完成仍然具有挑战性。HieraNav和LangMap为推进语言驱动的具身导航建立了严格的测试平台。项目:https://bo-miao.github.io/LangMap
Summary / 总结
LangMap is a benchmark for HieraNav, an open-vocabulary goal navigation task in which agents interpret natural language instructions to reach targets at four semantic levels: scene, room, region, and instance. It is built on real-world 3D indoor scans with human-verified annotations and provides region labels, discriminative region and instance descriptions covering 414 object categories, and over 18K navigation tasks. Its annotations outperform GOAT-Bench by 23.8% in discriminative accuracy while using four times fewer words. Evaluations of zero-shot and supervised models show that richer context and memory improve success, while long-tailed, small, context-dependent, and distant goals, as well as multi-goal completion, remain challenging.
LangMap 是一个开放词汇的目标导航基准,其中代理通过自然语言导航到场景、房间、区域和实例四个语义层次的目标。它使用带有真人验证注释和任务的真实世界 3D 室内扫描,提供区域和实例描述以及超过 18,000 个导航任务。LangMap 在区分准确性上优于 GOAT-Bench,并表明更丰富的上下文和记忆可以提高成功率,而长尾目标仍然具有挑战性。HieraNav 和 LangMap 为推进语言驱动的实体导航提供了一个严格的测试平台。
MLV-Edit: Towards Consistent and Highly Efficient Editing for Minute-Level Videos
Authors: Yangyi Cao, Yuanhang Li, Lan Chen, Qi Mao
First: 2026-02-02T14:07:00+00:00 · Latest: 2026-02-02T14:07:00+00:00
Abstract
We propose MLV-Edit, a training-free, flow-based framework that addresses the unique challenges of minute-level video editing. While existing techniques excel in short-form video manipulation, scaling them to long-duration videos remains challenging due to prohibitive computational overhead and the difficulty of maintaining global temporal consistency across thousands of frames. To address this, MLV-Edit employs a divide-and-conquer strategy for segment-wise editing, facilitated by two core modules: Velocity Blend rectifies motion inconsistencies at segment boundaries by aligning the flow fields of adjacent chunks, eliminating flickering and boundary artifacts commonly observed in fragmented video processing; and Attention Sink anchors local segment features to global reference frames, effectively suppressing cumulative structural drift. Extensive quantitative and qualitative experiments demonstrate that MLV-Edit consistently outperforms state-of-the-art methods in terms of temporal stability and semantic fidelity.
中文标题/摘要
标题:MLV-Edit:针对分钟级视频编辑的一致且高效的编辑方法
我们提出了一种无需训练、基于流的框架MLV-Edit,以应对分钟级视频编辑的独特挑战。尽管现有技术在短格式视频操作方面表现出色,但将它们扩展到长时视频仍然具有挑战性,因为计算开销巨大且难以在整个数千帧中保持全局时间一致性。为了解决这个问题,MLV-Edit 采用了一种分而治之的策略进行段落级编辑,由两个核心模块支持:Velocity Blend 通过对齐相邻块的流场来纠正段落边界处的运动不一致性,消除片段视频处理中常见的闪烁和边界伪影;Attention Sink 将局部段落特征锚定到全局参考帧,有效抑制累积结构漂移。大量定量和定性实验表明,MLV-Edit 在时间稳定性和语义保真度方面始终优于现有最先进的方法。
Summary / 总结
MLV-Edit is a training-free, flow-based framework designed for efficient editing of minute-level videos. It addresses the challenges of maintaining global temporal consistency and reducing computational overhead when editing long-duration videos. The framework uses a divide-and-conquer strategy with two core modules: Velocity Blend aligns flow fields to eliminate motion inconsistencies, and Attention Sink anchors local features to global frames to suppress structural drift. Experimental results show that MLV-Edit outperforms existing methods in terms of temporal stability and semantic fidelity.
MLV-Edit 是一个无需训练的框架,用于高效编辑分钟级视频。它采用分而治之的策略来解决保持时间一致性和减少计算开销的挑战。该框架包括 Velocity Blend 用于在段边界对齐流场,以及 Attention Sink 用于将局部特征锚定到全局参考帧。实验结果表明,MLV-Edit 在时间稳定性和语义保真度方面优于现有方法。
Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference
Authors: Zimeng Wu, Donghao Wang, Chaozhe Jin, Jiaxin Chen, Yunhong Wang
First: 2026-01-19T15:34:29+00:00 · Latest: 2026-02-02T13:59:16+00:00
Abstract
Long-context inference enhances the reasoning capability of Large Language Models (LLMs), but incurs significant computational overhead. Token-oriented methods, such as pruning and skipping, have shown great promise in reducing inference latency, yet still suffer from inherently insufficient structure optimization, outdated selection criteria, and redundancy interference, resulting in suboptimal speed-accuracy trade-off. To address these issues, we propose a novel training-free framework dubbed Self-Predictive Token Skipping (SPTS), for efficient long-context LLM inference. Specifically, motivated by probing the influence of target layers prior to skipping, we design two selective token skipping strategies for typical structures, including Partial Attention Probing (PAP) for multi-head attention and Low-rank Transformation Probing (LTP) for feed forward network. The former selects informative tokens via partial forward attention computation, while the latter constructs a low-rank proxy network to predict token transformations. In addition, a Multi-Stage Delayed Pruning (MSDP) strategy reallocates skipping budgets and progressively removes redundant tokens across layers. Extensive experiments display the effectiveness of our method, achieving up to 2.46$\times$ and 2.29$\times$ speedups for prefilling and end-to-end generation, respectively, while maintaining state-of-the-art accuracy. We will release the source code upon acceptance.
中文标题/摘要
标题:探查与跳过:自我预测性标记跳过以提高高效长上下文LLM推理
长上下文推理增强了大型语言模型(LLMs)的推理能力,但会带来显著的计算开销。标记导向的方法,如剪枝和跳过,已经在减少推理延迟方面显示出巨大的潜力,但仍受到固有的结构优化不足、过时的选择标准和冗余干扰的影响,导致速度-准确度权衡不佳。为了解决这些问题,我们提出了一种名为自我预测性标记跳过(SPTS)的无训练框架,以提高高效长上下文LLM推理。具体而言,受跳过前探查目标层影响的启发,我们为典型结构设计了两种选择性标记跳过策略,包括部分注意探查(PAP)用于多头注意和低秩变换探查(LTP)用于前馈网络。前者通过部分前向注意计算选择信息性标记,而后者构建一个低秩代理网络以预测标记变换。此外,多阶段延迟剪枝(MSDP)策略重新分配跳过预算,并逐层逐步移除冗余标记。大量实验显示了我们方法的有效性,分别在预填充和端到端生成中实现了高达2.46$\times$和2.29$\times$的加速,同时保持了最先进的准确率。在接收后我们将发布源代码。
Summary / 总结
This paper addresses the computational overhead of long-context inference in LLMs by proposing a training-free framework called Self-Predictive Token Skipping (SPTS). It introduces two selective token skipping strategies, Partial Attention Probing (PAP) and Low-rank Transformation Probing (LTP), to reduce inference latency without compromising accuracy. The method achieves up to 2.46$\times$ and 2.29$\times$ speedups for prefilling and end-to-end generation, respectively, while maintaining state-of-the-art accuracy.
本文提出了一种名为Self-Predictive Token Skipping (SPTS)的训练-free框架,以解决LLM长上下文推理的计算开销问题。该框架引入了两种选择性跳过策略,即Partial Attention Probing (PAP)和Low-rank Transformation Probing (LTP),以减少推理延迟。PAP通过部分注意力计算选择有信息量的token,而LTP构建一个低秩代理网络来预测token的变换。此外,还使用了Multi-Stage Delayed Pruning (MSDP)策略,逐步移除冗余token。实验表明,SPTS在预填充和端到端生成中分别实现了2.46$\times$和2.29$\times$的加速,同时保持了最先进的准确率。
Training-free score-based diffusion for parameter-dependent stochastic dynamical systems
Authors: Minglei Yang, Sicheng He
First: 2026-02-02T13:54:36+00:00 · Latest: 2026-02-02T13:54:36+00:00
Abstract
Simulating parameter-dependent stochastic differential equations (SDEs) presents significant computational challenges, as separate high-fidelity simulations are typically required for each parameter value of interest. Despite the success of machine learning methods in learning SDE dynamics, existing approaches either require expensive neural network training for score function estimation or lack the ability to handle continuous parameter dependence. We present a training-free conditional diffusion model framework for learning stochastic flow maps of parameter-dependent SDEs, where both drift and diffusion coefficients depend on physical parameters. The key technical innovation is a joint kernel-weighted Monte Carlo estimator that approximates the conditional score function using trajectory data sampled at discrete parameter values, enabling interpolation across both state space and the continuous parameter domain. Once trained, the resulting generative model produces sample trajectories for any parameter value within the training range without retraining, significantly accelerating parameter studies, uncertainty quantification, and real-time filtering applications. The performance of the proposed approach is demonstrated via three numerical examples of increasing complexity, showing accurate approximation of conditional distributions across varying parameter values.
中文标题/摘要
标题:参数依赖随机动力系统无训练评分扩散方法
模拟参数依赖随机微分方程(SDEs)通常需要为每个感兴趣的参数值进行单独的高保真模拟,这带来了重大的计算挑战。尽管机器学习方法在学习SDE动力学方面取得了成功,但现有方法要么需要昂贵的神经网络训练来估计评分函数,要么无法处理连续参数依赖性。我们提出了一种无训练条件扩散模型框架,用于学习参数依赖SDE的随机流图,其中漂移和扩散系数都依赖于物理参数。关键技术创新是一种联合核加权蒙特卡洛估计器,该估计器使用在离散参数值处采样的轨迹数据来近似条件评分函数,从而能够在状态空间和连续参数域之间进行插值。一旦训练完成,生成的模型可以在训练范围内的任何参数值生成样本轨迹,无需重新训练,从而显著加速参数研究、不确定性量化和实时滤波应用。通过三个复杂度递增的数值示例,展示了所提出方法的性能,展示了在不同参数值下对条件分布的准确近似。
Summary / 总结
The paper addresses the computational challenges of simulating parameter-dependent stochastic differential equations (SDEs) by proposing a training-free conditional diffusion model. This model learns the stochastic flow maps of SDEs where both drift and diffusion coefficients depend on physical parameters. The key method involves a joint kernel-weighted Monte Carlo estimator to approximate the conditional score function using trajectory data sampled at discrete parameter values. The model can generate sample trajectories for any parameter value within the training range without retraining, significantly accelerating parameter studies and real-time filtering applications. Experimental results show accurate approximation of conditional distributions across varying parameter values.
该论文通过提出一种无需训练的条件扩散模型来解决参数依赖的随机微分方程(SDE)的模拟计算难题。该方法使用联合核加权蒙特卡洛估计器来近似条件分数函数,无需昂贵的神经网络训练即可学习SDE动力学。主要发现表明,所提出的方法可以在训练范围内为任何参数值生成样本轨迹,显著加速参数研究和实时滤波应用。
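To make the phrase "kernel-weighted Monte Carlo estimator of the conditional score" more concrete, here is one plausible form, offered as a sketch rather than the paper's estimator. It assumes a forward noising process z = alpha * x + sigma * eps, treats the stored trajectory samples as an empirical distribution, and multiplies the usual Gaussian weights by a Gaussian kernel in the conditioning variable (for example the physical parameter). The function name, the Gaussian kernel, and the `bandwidth` parameter are all assumptions.

```python
import numpy as np

def kernel_weighted_score(z, cond, samples, sample_conds, alpha, sigma, bandwidth):
    """Estimate grad_z log p_t(z | cond) from stored samples (hypothetical sketch).

    samples:      (N, d) clean samples x_i
    sample_conds: (N, p) conditioning values (e.g. parameters) for each x_i
    """
    z = np.asarray(z, dtype=np.float64)
    cond = np.asarray(cond, dtype=np.float64)
    samples = np.asarray(samples, dtype=np.float64)
    sample_conds = np.asarray(sample_conds, dtype=np.float64)

    diff = alpha * samples - z                                   # (N, d)
    # Log-weight from the Gaussian noising kernel N(z; alpha * x_i, sigma^2 I).
    log_w = -np.sum(diff ** 2, axis=1) / (2.0 * sigma ** 2)
    # Log-weight from the kernel in condition space (assumed Gaussian).
    log_w += -np.sum((sample_conds - cond) ** 2, axis=1) / (2.0 * bandwidth ** 2)

    log_w -= log_w.max()                                         # stabilize exp
    w = np.exp(log_w)
    w /= w.sum()
    # Weighted average of the per-sample scores (alpha * x_i - z) / sigma^2.
    return (w[:, None] * diff).sum(axis=0) / sigma ** 2
```

Because the weights are computed directly from data, no network is trained; interpolation across parameter values comes entirely from the kernel term, so the bandwidth governs how far information is shared between neighboring parameter values.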
U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding
Authors: Anjie Le, Henan Liu, Yue Wang, Zhenyu Liu, Rongkun Zhu, Taohan Weng, Jinze Yu, Boyang Wang, Yalun Wu, Kaiwen Yan, Quanlin Sun, Meirui Jiang, Jialun Pei, Siya Liu, Haoyun Zheng, Zhoujun Li, Alison Noble, Jacques Souquet, Xiaoqing Guo, Manxi Lin, Hongcheng Guo
First: 2025-05-23T11:48:48+00:00 · Latest: 2026-02-02T13:10:09+00:00
Abstract
Ultrasound is a widely used imaging modality critical to global healthcare, yet its interpretation remains challenging because image quality varies across operators, noise levels, and anatomical structures. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 23 state-of-the-art LVLMs, both open- and closed-source, general-purpose and medical-specific. Our results reveal strong performance on image-level classification, but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.
中文标题/摘要
标题:U2-BENCH:在超声理解方面评估大型视觉-语言模型
超声是一种广泛使用的成像技术,在全球医疗保健中至关重要,但由于操作者、噪声和解剖结构的不同,其解释仍然具有挑战性。尽管大型视觉-语言模型(LVLMs)在自然和医学领域展示了令人印象深刻的多模态能力,但它们在超声方面的表现尚未得到充分探索。我们介绍了U2-BENCH,这是首个全面评估LVLMs在超声理解方面的基准,涵盖分类、检测、回归和文本生成任务。U2-BENCH汇集了15个解剖区域的7,241个病例,并定义了8个临床启发的任务,如诊断、视图识别、病灶定位、临床价值评估和报告生成,覆盖了50个超声应用场景。我们评估了23个最先进的LVLMs,包括开源和闭源、通用和医学专用模型。我们的结果显示了在图像级分类上的强大表现,但在空间推理和临床语言生成方面仍存在持续挑战。U2-BENCH为医学超声成像这一独特多模态领域中的LVLM研究提供了一个严格的统一测试平台。
Summary / 总结
U2-BENCH is the first benchmark to evaluate large vision-language models (LVLMs) on ultrasound understanding tasks, including classification, detection, regression, and text generation. It includes 7,241 cases from 15 anatomical regions and 50 application scenarios, defining 8 clinically inspired tasks. The study evaluates 23 state-of-the-art LVLMs and finds strong performance in image-level classification but challenges in spatial reasoning and clinical language generation. U2-BENCH provides a rigorous testbed for LVLM research in medical ultrasound imaging.
论文介绍了U2-BENCH,这是一个用于评估大型视觉-语言模型(LVLM)在超声理解上的基准,涵盖了图像分类、检测、回归和文本生成任务。它在15个解剖区域和50个应用场景下的7,241个超声案例上评估了23种最先进的LVLM,结果显示在图像级分类上表现出色,但在空间推理和临床语言生成方面存在挑战。U2-BENCH为推进医学超声成像领域的LVLM研究提供了一个严格的测试平台。
See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers
Authors: Ding Xia, Xinyue Gui, Mark Colley, Fan Gao, Zhongyi Zhou, Dongyuan Li, Renhe Jiang, Takeo Igarashi
First: 2026-02-02T13:03:48+00:00 · Latest: 2026-02-02T13:03:48+00:00
Comments: Under Review
Abstract
Automated vehicles lack natural communication channels with other road users, making external Human-Machine Interfaces (eHMIs) essential for conveying intent and maintaining trust in shared environments. However, most eHMI studies rely on developer-crafted message-action pairs, which are difficult to adapt to diverse and dynamic traffic contexts. A promising alternative is to use Large Language Models (LLMs) as action designers that generate context-conditioned eHMI actions, yet such designers lack perceptual verification and typically depend on fixed prompts or costly human-annotated feedback for improvement. We present See2Refine, a human-free, closed-loop framework that uses vision-language model (VLM) perceptual evaluation as automated visual feedback to improve an LLM-based eHMI action designer. Given a driving context and a candidate eHMI action, the VLM evaluates the perceived appropriateness of the action, and this feedback is used to iteratively revise the designer's outputs, enabling systematic refinement without human supervision. We evaluate our framework across three eHMI modalities (lightbar, eyes, and arm) and multiple LLM model sizes. Across settings, our framework consistently outperforms prompt-only LLM designers and manually specified baselines in both VLM-based metrics and human-subject evaluations. Results further indicate that the improvements generalize across modalities and that VLM evaluations are well aligned with human preferences, supporting the robustness and effectiveness of See2Refine for scalable action design.
中文标题/摘要
标题:See2Refine:视觉-语言反馈提高基于LLM的eHMI动作设计师
自动驾驶车辆缺乏与其他道路使用者的自然通信渠道,因此在共享环境中传达意图并维持信任需要外部人机界面(eHMIs)。然而,大多数eHMI研究依赖于开发人员设计的消息-动作对,难以适应多变和动态的交通环境。一种有前景的替代方案是使用大型语言模型(LLMs)作为动作设计师,生成基于上下文的eHMI动作,但这些设计师缺乏感知验证,通常依赖于固定提示或昂贵的人工标注反馈进行改进。我们提出See2Refine,这是一种无需人工的闭环框架,使用视觉-语言模型(VLM)的感知评估作为自动视觉反馈,以改进基于LLM的eHMI动作设计师。给定驾驶环境和候选eHMI动作,VLM评估动作的感知适宜性,此反馈用于迭代修订设计师的输出,从而在无需人类监督的情况下实现系统性改进。我们跨三种eHMI模态(灯条、眼睛和手臂)和多个LLM模型大小评估了该框架。在所有设置中,我们的框架在基于VLM的指标和人类受试者评估中均优于仅提示的LLM设计师和手动指定的基线。结果进一步表明,改进在不同模态间具有泛化性,且VLM评估与人类偏好高度一致,支持See2Refine在可扩展动作设计中的稳健性和有效性。
Summary / 总结
See2Refine is a framework that uses vision-language models to provide automated visual feedback for improving an LLM-based eHMI action designer. It iteratively refines eHMI actions without human supervision, evaluating the perceived appropriateness of actions and using this feedback to revise outputs. Across different eHMI modalities and LLM sizes, See2Refine outperforms prompt-only LLM designers and manual baselines in both VLM-based metrics and human evaluations, indicating its robustness and effectiveness for scalable action design.
研究旨在通过使用视觉语言模型(VLM)提供自动视觉反馈来逐步改进基于大型语言模型(LLM)的外部人机界面(eHMI)动作设计。方法是使用VLM评估候选eHMI动作,并利用反馈来修订LLM的输出。研究结果表明,该框架在不同eHMI模态和LLM模型大小的情况下,优于仅使用提示的LLM设计师和手动指定的基线,在VLM基线指标和人类受试者评估中均表现出色,证明了其在可扩展动作设计中的稳健性和有效性。
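The closed loop described in the See2Refine entry above (design, evaluate, revise) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `refine_action`, `toy_designer`, `toy_evaluator`, the round limit, and the score threshold are all hypothetical stand-ins for an LLM designer and a VLM evaluator.

```python
from typing import Callable, Optional, Tuple


def refine_action(
    context: str,
    designer: Callable[[str, Optional[str]], str],
    evaluator: Callable[[str, str], Tuple[float, str]],
    max_rounds: int = 3,
    threshold: float = 0.8,
) -> Tuple[str, float]:
    """Iteratively revise a candidate eHMI action using evaluator feedback."""
    feedback: Optional[str] = None
    best_action, best_score = "", float("-inf")
    for _ in range(max_rounds):
        # The designer sees the driving context plus the latest critique (if any).
        action = designer(context, feedback)
        # The evaluator returns a scalar appropriateness score and a text critique.
        score, feedback = evaluator(context, action)
        if score > best_score:
            best_action, best_score = action, score
        if score >= threshold:
            break  # good enough, stop refining
    return best_action, best_score


# Toy stand-ins (assumptions): a real system would call an LLM and a VLM here.
def toy_designer(context: str, feedback: Optional[str]) -> str:
    return "lightbar: pulse green" if feedback else "lightbar: solid red"


def toy_evaluator(context: str, action: str) -> Tuple[float, str]:
    if "green" in action:
        return 0.9, "clear yielding signal"
    return 0.2, "red may read as a warning; signal yielding instead"


result = refine_action("pedestrian waiting at crosswalk", toy_designer, toy_evaluator)
```

Keeping the best-scoring action rather than the last one guards against a revision that scores worse than its predecessor.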
SciTextures: Collecting and Connecting Visual Patterns, Models, and Code Across Science and Art
Authors: Sagi Eppel, Alona Strugatski
First: 2025-11-03T18:22:11+00:00 · Latest: 2026-02-02T13:00:09+00:00
Abstract
The ability to connect visual patterns with the processes that form them represents one of the deepest forms of visual understanding. Textures of clouds and waves, the growth of cities and forests, or the formation of materials and landscapes are all examples of patterns emerging from underlying mechanisms. We present the SciTextures dataset, a large-scale collection of textures and visual patterns from all domains of science, tech, and art, along with the models and code that generate these images. Covering over 1,270 different models and 100,000 images of patterns and textures from physics, chemistry, biology, sociology, technology, mathematics, and art, this dataset offers a way to explore the deep connection between the visual patterns that shape our world and the mechanisms that produce them. The dataset was built through an agentic AI pipeline that autonomously collects, implements, and standardizes scientific and generative models. This AI pipeline is also used to autonomously invent and implement novel methods for generating visual patterns and textures. SciTextures enables systematic evaluation of the ability of vision language models (VLMs) to link visual patterns to the models and code that generate them, and to identify different patterns that emerge from the same underlying process. We also test VLMs' ability to infer and recreate the mechanisms behind visual patterns by providing a natural image of a real-world phenomenon and asking the AI to identify and code a model of the process that formed it, then run this code to generate a simulated image that is compared to the reference image. These benchmarks reveal that VLMs can understand and simulate physical systems beyond visual patterns at multiple levels of abstraction. The dataset and code are available at: https://zenodo.org/records/17485502
中文标题/摘要
标题:SciTextures:收集和连接科学与艺术中的视觉模式、模型和代码
将视觉模式与其形成过程联系起来的能力代表了最深层次的视觉理解之一。云朵和波浪的纹理、城市和森林的增长、材料和景观的形成都是从底层机制中涌现出来的模式的例子。我们介绍了SciTextures数据集,这是一个涵盖科学、技术和艺术所有领域的大型纹理和视觉模式集合,以及生成这些图像的模型和代码。该数据集包括超过1,270个不同模型和来自物理学、化学、生物学、社会学、技术、数学和艺术领域的100,000张模式和纹理图像,提供了一种探索塑造我们世界的视觉模式与其产生机制之间深层联系的方法。该数据集通过自主收集、实施和标准化科学与生成模型的代理AI管道构建。该AI管道还用于自主发明和实施生成视觉模式和纹理的新方法。SciTextures使系统评估视觉语言模型(VLM)将视觉模式与其生成的模型和代码联系起来的能力成为可能,并识别出源自相同底层过程的不同模式。我们还通过提供真实世界现象的自然图像并要求AI识别和编码形成该图像的过程模型,然后运行该代码生成与参考图像进行比较的模拟图像,来测试VLM推断和重现视觉模式背后机制的能力。这些基准揭示了VLM能够在多个抽象层次上理解并模拟物理系统,而不仅仅是视觉模式。数据集和代码可在以下链接获取:https://zenodo.org/records/17485502
Summary / 总结
The SciTextures dataset collects and connects visual patterns, models, and code from science and art, covering over 1,270 models and 100,000 images. It enables the evaluation of vision language models in linking visual patterns to their generating models and code. The dataset also tests VLMs' ability to infer and recreate the mechanisms behind visual patterns by comparing simulated images to real-world phenomena. Benchmarks show that VLMs can understand and simulate physical systems at various levels of abstraction.
SciTextures 数据集旨在将视觉模式与其背后的机制联系起来,涵盖多个科学和艺术领域,包含超过1,270个模型和100,000张图像,通过自主收集和实现这些模型的AI管道生成。关键发现表明,视觉语言模型能够有效地将视觉模式与其生成的模型和代码联系起来,并能够推断和重现这些模式背后的机制,展示了它们在多个抽象层次上理解和模拟物理系统的能力。
SIDiffAgent: Self-Improving Diffusion Agent
Authors: Shivank Garg, Ayush Singh, Gaurav Kumar Nayak
First: 2026-02-02T12:53:21+00:00 · Latest: 2026-02-02T12:53:21+00:00
Abstract
Text-to-image diffusion models have revolutionized generative AI, enabling high-quality and photorealistic image synthesis. However, their practical deployment remains hindered by several limitations: sensitivity to prompt phrasing, ambiguity in semantic interpretation (e.g., "mouse" as an animal vs. a computer peripheral), artifacts such as distorted anatomy, and the need for carefully engineered input prompts. Existing methods often require additional training and offer limited controllability, restricting their adaptability in real-world applications. We introduce Self-Improving Diffusion Agent (SIDiffAgent), a training-free agentic framework that leverages the Qwen family of models (Qwen-VL, Qwen-Image, Qwen-Edit, Qwen-Embedding) to address these challenges. SIDiffAgent autonomously manages prompt engineering, detects and corrects poor generations, and performs fine-grained artifact removal, yielding more reliable and consistent outputs. It further incorporates iterative self-improvement by storing a memory of previous experiences in a database. This database of past experiences is then used to inject prompt-based guidance at each stage of the agentic pipeline. SIDiffAgent achieved an average VQA score of 0.884 on GenAIBench, significantly outperforming open-source, proprietary models and agentic methods. We will publicly release our code upon acceptance.
中文标题/摘要
标题:SIDiffAgent: 自我提升扩散代理
文本到图像的扩散模型已经革新了生成式AI,使其能够生成高质量和逼真的图像。然而,它们的实际部署仍受到一些限制:提示措辞敏感性、语义解释的模糊性(例如,“鼠标”作为动物还是计算机配件)、图像中的失真(如解剖结构扭曲)以及需要精心设计的输入提示。现有方法通常需要额外的训练,并且提供的可控性有限,限制了它们在实际应用中的适应性。我们引入了自我提升扩散代理(SIDiffAgent),这是一种无需训练的代理框架,利用Qwen家族模型(Qwen-VL、Qwen-Image、Qwen-Edit、Qwen-Embedding)来解决这些挑战。SIDiffAgent自主管理提示工程,检测并纠正不良生成,并执行精细的去噪操作,从而产生更可靠和一致的输出。此外,它还通过在数据库中存储以往经验来实现迭代自我提升。这些以往经验的数据库在代理管道的每个阶段被用来注入基于提示的指导。我们的模型在GenAIBench上获得了平均VQA得分为0.884,显著优于开源、专有模型和代理方法。在被接受后,我们将公开发布我们的代码。
Summary / 总结
The research aims to address the limitations of text-to-image diffusion models, such as sensitivity to prompts and generation artifacts, by introducing SIDiffAgent. SIDiffAgent uses the Qwen family of models to autonomously manage prompt engineering, detect and correct poor generations, and perform fine-grained artifact removal, leading to more reliable and consistent outputs. The model achieved an average VQA score of 0.884 on GenAIBench, outperforming other open-source and proprietary models and agentic methods.
研究旨在通过引入SIDiffAgent解决文本到图像扩散模型的限制,如对提示的敏感性和生成的瑕疵。该框架自主管理提示工程,检测并纠正不良生成,并通过过去经验的数据库实现迭代自我改进。SIDiffAgent在GenAIBench上的平均VQA得分为0.884,优于其他模型和方法。
MLLMEraser: Achieving Test-Time Unlearning in Multimodal Large Language Models through Activation Steering
Authors: Chenlu Ding, Jiancan Wu, Leheng Sheng, Fan Zhang, Yancheng Yuan, Xiang Wang, Xiangnan He
First: 2025-10-05T14:20:17+00:00 · Latest: 2026-02-02T12:41:26+00:00
Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities across vision-language tasks, yet their large-scale deployment raises pressing concerns about memorized private data, outdated knowledge, and harmful content. Existing unlearning approaches for MLLMs typically adapt training-based strategies such as gradient ascent or preference optimization, but these methods are computationally expensive, irreversible, and often distort retained knowledge. In this work, we propose MLLMEraser, an input-aware, training-free framework for test-time unlearning. Our approach leverages activation steering to enable dynamic knowledge erasure without parameter updates. Specifically, we construct a multimodal erasure direction by contrasting adversarially perturbed, knowledge-recall image-text pairs with knowledge-erasure counterparts, capturing both textual and visual discrepancies. To prevent unnecessary interference, we further design an input-aware steering mechanism that adaptively determines when and how the erasure direction should be applied, preserving utility on retained knowledge while enforcing forgetting on designated content. Experiments on LLaVA-1.5 and Qwen-2.5-VL demonstrate that MLLMEraser consistently outperforms state-of-the-art MLLM unlearning baselines, achieving stronger forgetting performance with lower computational cost and minimal utility degradation.
中文标题/摘要
标题:MLLMEraser:通过激活导向实现多模态大型语言模型的测试时遗忘
多模态大型语言模型(MLLMs)在视觉语言任务中展现了卓越的能力,但其大规模部署引发了关于存储的私人数据、过时的知识和有害内容的严重关切。现有的MLLM遗忘方法通常采用基于训练的策略,如梯度上升或偏好优化,但这些方法计算成本高、不可逆且往往会导致保留知识的扭曲。在本文中,我们提出MLLMEraser,这是一种输入感知、无需训练的测试时遗忘框架。我们的方法利用激活导向来实现无需参数更新的知识动态擦除。具体而言,我们通过对比对抗扰动的知识回忆图像-文本对与知识擦除的对应物来构建多模态擦除方向,捕捉文本和视觉的差异。为了防止不必要的干扰,我们进一步设计了一种输入感知的导向机制,该机制能够适应性地确定何时以及如何应用擦除方向,从而在保留有用知识的同时强制遗忘指定的内容。在LLaVA-1.5和Qwen-2.5-VL上的实验表明,MLLMEraser在遗忘性能、计算成本和保留知识的实用性损失方面均优于最先进的MLLM遗忘基准。
Summary / 总结
MLLMEraser is a framework for test-time unlearning in MLLMs using activation steering. It constructs an erasure direction by contrasting adversarially perturbed image-text pairs with knowledge-erasure counterparts, and applies an input-aware steering mechanism to selectively erase designated content without parameter updates. Experiments show MLLMEraser outperforms existing baselines with better forgetting performance, lower computational cost, and minimal utility degradation.
MLLMEraser 是一种无需训练的框架,用于在多模态大型语言模型(MLLMs)中进行测试时的知识擦除,通过激活导向来擦除知识而不更新参数。它通过对比对抗扰动的图像-文本对与知识擦除的对应体来构建擦除方向,并采用输入感知的导向机制以选择性地擦除指定内容。实验表明,MLLMEraser 在遗忘性能、计算效率和功能保持方面优于现有方法。
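To make the activation-steering idea in the MLLMEraser entry above concrete, here is a minimal sketch under stated assumptions. It uses a plain mean-difference direction and a cosine-similarity gate; the paper's actual construction of the erasure direction and its input-aware mechanism may differ, and `erasure_direction`, `steer`, `forget_probe`, `strength`, and `gate_threshold` are hypothetical names chosen for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def erasure_direction(recall_acts, erase_acts) -> Vector:
    """Assumed form: mean(erasure activations) - mean(recall activations)."""
    mu_recall = mean_vector(recall_acts)
    mu_erase = mean_vector(erase_acts)
    return [e - r for e, r in zip(mu_erase, mu_recall)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


def steer(hidden, direction, forget_probe, strength=1.0, gate_threshold=0.5) -> Vector:
    """Shift the hidden state along the direction only if the input looks
    like designated forget content; otherwise leave it untouched."""
    if cosine(hidden, forget_probe) < gate_threshold:
        return list(hidden)  # retained knowledge: no intervention
    return [h + strength * d for h, d in zip(hidden, direction)]


recall = [[1.0, 0.0], [1.0, 0.0]]  # toy activations when the fact is recalled
erase = [[0.0, 1.0], [0.0, 1.0]]   # toy activations for the erased counterpart
direction = erasure_direction(recall, erase)
probe = mean_vector(recall)
steered = steer([1.0, 0.0], direction, probe)
untouched = steer([0.0, -1.0], direction, probe)
```

No model weights change in this scheme: the intervention is applied to activations at inference time, which is why it can be switched off again.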
Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models
Authors: Cristian Sbrolli, Matteo Matteucci, Toshihiko Yamasaki
First: 2026-02-02T12:39:39+00:00 · Latest: 2026-02-02T12:39:39+00:00
Abstract
Modern Vision-Language Models (VLMs) exhibit a critical flaw in compositional reasoning, often confusing "a red cube and a blue sphere" with "a blue cube and a red sphere". Disentangling the visual and linguistic roots of these failures is a fundamental challenge for robust evaluation. To enable fine-grained, controllable analysis, we introduce Auto-Comp, a fully automated and synthetic pipeline for generating scalable benchmarks. Its controllable nature is key to dissecting and isolating different reasoning skills. Auto-Comp generates paired images from Minimal (e.g., "a monitor to the left of a bicycle on a white background") and LLM-generated Contextual captions (e.g., "In a brightly lit photography studio, a monitor is positioned to the left of a bicycle"), allowing a controlled A/B test to disentangle core binding ability from visio-linguistic complexity. Our evaluation of 20 VLMs on novel benchmarks for color binding and spatial relations reveals universal compositional failures in both CLIP and SigLIP model families. Crucially, our novel "Confusion Benchmark" reveals a deeper flaw beyond simple attribute swaps: models are highly susceptible to low-entropy distractors (e.g., repeated objects or colors), demonstrating that their compositional failures extend beyond known bag-of-words limitations. We also uncover a surprising trade-off: visio-linguistic context, which provides global scene cues, aids spatial reasoning but simultaneously hinders local attribute binding by introducing visual clutter. We release the Auto-Comp pipeline to facilitate future benchmark creation, alongside all our generated benchmarks (https://huggingface.co/AutoComp).
中文标题/摘要
标题:Auto-Comp:一种可扩展的对比视觉-语言模型组合探查自动化流水线
现代视觉-语言模型(VLMs)在组合推理方面存在一个关键缺陷,经常将“一个红色立方体和一个蓝色球体”误认为是“一个蓝色立方体和一个红色球体”。解开这些失败的视觉和语言根源是实现稳健评估的基本挑战。为了实现精细、可控的分析,我们引入了Auto-Comp,这是一种完全自动化和合成的生成可扩展基准的流水线。其可控性是分解和隔离不同推理技能的关键。Auto-Comp 从最小描述(例如,“一个显示器在自行车左侧的白色背景上”)和LLM生成的上下文描述(例如,“在一个明亮的摄影棚里,一个显示器被放置在自行车左侧”)生成配对图像,允许进行受控的A/B测试,以分离核心绑定能力与视觉-语言复杂性。我们在20个VLMs上对颜色绑定和空间关系的新基准进行评估,揭示了CLIP和SigLIP模型家族中普遍存在组合推理失败。最关键的是,我们新颖的“混淆基准”揭示了比简单的属性交换更深层次的缺陷:模型对低熵干扰(例如,重复的对象或颜色)高度敏感,表明它们的组合推理失败超出了已知的词袋限制。我们发现了一个令人惊讶的权衡:视觉-语言上下文,它提供了全局场景线索,有助于空间推理,但同时通过引入视觉杂乱,阻碍了局部属性绑定。我们发布了Auto-Comp流水线,以促进未来基准的创建,同时提供了所有生成的基准(https://huggingface.co/AutoComp)。
Summary / 总结
Auto-Comp is an automated pipeline designed to evaluate the compositional reasoning abilities of Vision-Language Models (VLMs) by generating scalable benchmarks. It uses minimal descriptions and LLM-generated contextual captions to isolate and test different reasoning skills. The evaluation of 20 VLMs on novel benchmarks for color binding and spatial relations revealed universal compositional failures, particularly in models' susceptibility to low-entropy distractors, indicating deeper issues beyond simple attribute swaps. Additionally, it was found that visio-linguistic context aids spatial reasoning but hinders local attribute binding due to visual clutter.
Auto-Comp 是一个自动化管道,用于通过生成可扩展的基准来评估视觉语言模型(VLMs)的组合推理能力。它使用最小描述和由LLM生成的上下文描述来隔离并测试不同的推理技能。对20个VLMs在颜色绑定和空间关系新型基准上的评估揭示了普遍的组合推理失败,特别是模型对低熵干扰物的高度敏感性,表明问题远超简单的词袋限制。此外,还发现视觉语言上下文有助于空间推理但会因视觉杂乱而妨碍局部属性绑定。
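The attribute-swap probe quoted in the Auto-Comp entry above ("a red cube and a blue sphere" vs. "a blue cube and a red sphere") can be generated mechanically. The sketch below is an illustration only, not the Auto-Comp pipeline itself; `color_binding_pairs` is a hypothetical helper, and it assumes every color word takes the article "a".

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def color_binding_pairs(
    colors: Sequence[str], objects: Sequence[str]
) -> List[Tuple[str, str]]:
    """Build (caption, hard_negative) pairs that use the same words but
    swap which color is bound to which object."""
    pairs = []
    for c1, c2 in permutations(colors, 2):
        for o1, o2 in permutations(objects, 2):
            caption = f"a {c1} {o1} and a {c2} {o2}"
            swapped = f"a {c2} {o1} and a {c1} {o2}"  # colors exchanged
            pairs.append((caption, swapped))
    return pairs


pairs = color_binding_pairs(["red", "blue"], ["cube", "sphere"])
```

Because each pair shares an identical bag of words, a model that scores the caption and its negative equally is ignoring the binding between attribute and object.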
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
Authors: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Yejin Bang, Allen Bolourchi, Yann LeCun, Pascale Fung
First: 2025-12-11T18:59:22+00:00 · Latest: 2026-02-02T12:38:20+00:00
Abstract
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite having only 1.6B parameters.
中文标题/摘要
标题:VL-JEPA:联合嵌入预测架构的跨模态模型
我们介绍了基于联合嵌入预测架构(JEPA)的跨模态模型VL-JEPA。与经典视觉语言模型(VLM)逐个生成标记不同,VL-JEPA预测目标文本的连续嵌入。通过在抽象表示空间中学习,该模型专注于与任务相关的语义,同时抽象掉表面语言的变异性。在严格控制的比较中,与使用相同视觉编码器和训练数据的标准标记空间VLM训练相比,VL-JEPA在参数量减少50%的情况下实现了更强的性能。在推理时,仅在需要时调用轻量级文本解码器将VL-JEPA预测的嵌入转换为文本。我们展示了VL-JEPA原生支持选择性解码,将解码操作减少2.85倍,同时保持与非自适应均匀解码相似的性能。除了生成之外,VL-JEPA的嵌入空间自然支持开放词汇分类、文本到视频检索和区分型VQA,无需任何架构修改。在八个视频分类数据集和八个视频检索数据集上,VL-JEPA的平均性能超过了CLIP、SigLIP2和感知编码器。同时,尽管只有1.6B参数,该模型在四个VQA数据集(GQA、TallyQA、POPE和POPEv2)上的性能与经典VLM(InstructBLIP、QwenVL)相当。
Summary / 总结
VL-JEPA is a vision-language model that uses a Joint Embedding Predictive Architecture to predict continuous embeddings of target texts instead of generating tokens autoregressively. This approach leads to better performance with fewer parameters and supports selective decoding, reducing the number of decoding operations by 2.85x. VL-JEPA outperforms several models on video classification and retrieval tasks and achieves comparable performance on VQA tasks with significantly fewer parameters.
VL-JEPA 是一种使用联合嵌入预测架构来预测目标文本连续嵌入的视觉-语言模型,专注于任务相关的语义。它比标准的基于标记空间的 VLM 用更少的参数(少 50%)实现了更好的性能,并支持将解码操作减少 2.85 倍的可选解码。VL-JEPA 在视频分类、视频检索和 VQA 任务中表现出色,尽管参数量仅为 1.6B,但仍能达到与更大模型相当的性能。
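One way to picture the selective decoding mentioned in the VL-JEPA entry above is to invoke the text decoder only when the predicted embedding has moved far enough from the last decoded one. The criterion below (cosine distance against a fixed threshold) is an assumption for illustration; the abstract does not specify the rule VL-JEPA uses, and `select_steps_to_decode` is a hypothetical name.

```python
import math
from typing import List, Optional, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0  # treat a zero vector as maximally different
    return 1.0 - dot / (na * nb)


def select_steps_to_decode(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.1
) -> List[int]:
    """Return the time steps at which the text decoder would be called."""
    decoded: List[int] = []
    last: Optional[Sequence[float]] = None
    for step, emb in enumerate(embeddings):
        # Decode the first step, then only when the embedding drifts
        # more than `threshold` away from the last decoded embedding.
        if last is None or cosine_distance(emb, last) > threshold:
            decoded.append(step)
            last = emb
    return decoded


stream = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
steps = select_steps_to_decode(stream)
```

In this toy stream only two of four steps trigger decoding, which is the kind of saving the abstract reports in aggregate (a 2.85x reduction in decoding operations).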
Rethinking Genomic Modeling Through Optical Character Recognition
Authors: Hongxin Xiang, Pengsen Ma, Yunkang Cao, Di Yu, Haowen Chen, Xinyu Yang, Xiangxiang Zeng
First: 2026-02-02T12:12:00+00:00 · Latest: 2026-02-02T12:12:00+00:00
Abstract
Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision-language model with a visual DNA encoder and a document decoder, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion), thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly 20x fewer effective tokens, and surpasses models with up to 985x more activated parameters while tuning only 256k trainable parameters.
中文标题/摘要
标题:通过光学字符识别重新思考基因组建模
最近的基因组基础模型大多采用大型语言模型架构,将DNA视为一维标记序列。然而,这种全面的顺序阅读与稀疏和不连续的基因组语义结构不匹配,导致在低信息背景上浪费计算资源,并阻碍了对长上下文的理解驱动压缩。在此,我们提出了OpticalDNA,这是一种基于视觉的框架,将基因组建模重新定义为光学字符识别(OCR)风格的文档理解。OpticalDNA将DNA渲染为结构化的视觉布局,并训练一个具备OCR能力的视觉-语言模型,其中编码器使用一个“视觉DNA编码器”和一个“文档解码器”来生成紧凑且可重构的视觉标记,以实现高保真压缩。基于此表示,OpticalDNA定义了基于提示的基因组核心原语-读取、区域定位、子序列检索和掩码片段完成的目标,从而学习布局感知的DNA表示,即使在减少有效标记预算的情况下也能保留细粒度的基因组信息。在各种基因组基准测试中,OpticalDNA 一贯优于最近的基线;在多达450k碱基的序列上,它以几乎少20倍的有效标记实现了最佳的整体性能,并且在调优仅256k可训练参数的情况下,超过了具有多达985倍更多激活参数的模型。
Summary / 总结
The study aims to improve genomic modeling by addressing the limitations of sequential reading in large language models. It introduces OpticalDNA, a vision-based framework that reframes genomic data as OCR-style document understanding. By rendering DNA into structured visual layouts and training a vision-language model with a visual DNA encoder and a document decoder, OpticalDNA achieves better performance on genomic benchmarks, using nearly 20 times fewer effective tokens compared to recent baselines and outperforming models with up to 985 times more parameters while tuning only 256k trainable parameters.
本文提出了一种基于视觉的框架OpticalDNA,通过将基因组数据重新框定为OCR风格的文档理解来解决当前基因组建模的局限性。该框架将DNA序列渲染为结构化的视觉布局,并训练一个视觉-语言模型,包括视觉DNA编码器和文档解码器。该模型在各种基因组基准测试中表现出色,使用比最近基线模型少近20倍的有效令牌,并且在调优仅256k可训练参数的情况下,超过了具有多达985倍更多激活参数的模型。
ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning
Authors: Gongli Xi, Kun Wang, Zeming Gao, Huahui Yi, Haolang Lu, Ye Tian, Wendong Wang
First: 2026-02-02T12:03:56+00:00 · Latest: 2026-02-02T12:03:56+00:00
Comments: 20 pages, 7 figures
Abstract
Large multimodal reasoning models solve challenging visual problems via explicit long-chain inference: they gather visual clues from images and decode clues into textual tokens. Yet this capability also increases hallucinations, where the model generates content that is not supported by the input image or the question. To understand this failure mode, we identify reasoning drift: during clue gathering, the model over-focuses on question-irrelevant entities, diluting focus on task-relevant cues and gradually decoupling the reasoning trace from visual grounding. As a consequence, many inference-time localization or intervention methods developed for non-reasoning models fail to pinpoint the true clues in reasoning settings. Motivated by these insights, we introduce ClueRecall, a metric for assessing visual clue retrieval, and present ClueTracer, a training-free, parameter-free, and architecture-agnostic plugin for hallucination suppression. ClueTracer starts from the question and traces how key clues propagate along the model's reasoning pathway (question → outputs → visual tokens), thereby localizing task-relevant patches while suppressing spurious attention to irrelevant regions. Remarkably, without any additional training, ClueTracer improves all reasoning architectures (including R1-OneVision, Ocean-R1, MM-Eureka, etc.) by 1.21x on reasoning benchmarks. When transferred to non-reasoning settings, it yields a 1.14x gain.
中文标题/摘要
标题:ClueTracer:面向多模态推理中免训练幻觉抑制的问题到视觉线索追踪
大型多模态推理模型通过显式的长链推理解决复杂的视觉问题:它们从图像中收集视觉线索并将其解码为文本标记。然而,这种能力也会增加幻觉,即模型生成的内容得不到输入图像或问题的支持。为了理解这种失败模式,我们识别了推理漂移:在收集线索时,模型过度关注与问题无关的实体,分散了对任务相关线索的关注,从而逐渐使推理轨迹与视觉定位脱钩。因此,许多为非推理模型开发的推理时的定位或干预方法在推理环境中无法准确指出真正的线索。基于这些见解,我们引入了ClueRecall,一种评估视觉线索检索的度量标准,并提出了ClueTracer,一种无需训练、无需参数且与架构无关的插件,用于幻觉抑制。ClueTracer 从问题出发,追踪关键线索如何在模型的推理路径中传播(问题 → 输出 → 视觉标记),从而定位任务相关区域,抑制对无关区域的错误关注。值得注意的是,无需额外训练,ClueTracer 在所有推理架构(包括 R1-OneVision、Ocean-R1、MM-Eureka 等)上将推理基准性能提高了 1.21 倍。当转移到非推理环境时,它带来了 1.14 倍的提升。
Summary / 总结
The paper attributes hallucinations in multimodal reasoning models to reasoning drift, where the model over-focuses on question-irrelevant entities. It introduces ClueRecall, a metric for assessing visual clue retrieval, and ClueTracer, a training-free plugin that traces key clues from the question through the outputs to the visual tokens, suppressing attention to irrelevant regions. ClueTracer improves reasoning architectures by 1.21x on reasoning benchmarks and yields a 1.14x gain in non-reasoning settings.
ClueTracer 是一个无需额外训练和参数的插件,通过追踪模型推理路径中从问题到关键视觉线索的过程来解决多模态推理模型中的幻觉问题。它在推理基准测试中将推理架构的性能提高了1.21倍,在非推理设置中也提高了1.14倍。
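The question → outputs → visual tokens pathway in the ClueTracer entry above can be illustrated with a toy two-hop attention trace. This is a rough sketch under strong assumptions, not the published algorithm: it assumes attention weights are available as plain matrices, and `trace_clues`, `top_outputs`, and `top_patches` are hypothetical names.

```python
from typing import List, Sequence


def trace_clues(
    out_to_question: Sequence[Sequence[float]],
    out_to_visual: Sequence[Sequence[float]],
    top_outputs: int = 2,
    top_patches: int = 2,
) -> List[int]:
    """Two-hop trace: pick the output tokens most tied to the question,
    then rank visual patches by the attention those tokens give them.

    out_to_question[i][j]: attention of output token i to question token j
    out_to_visual[i][p]:   attention of output token i to visual patch p
    """
    # Hop 1 (question -> outputs): total question attention per output token.
    question_mass = [sum(row) for row in out_to_question]
    key_outputs = sorted(
        range(len(question_mass)), key=lambda i: question_mass[i], reverse=True
    )[:top_outputs]

    # Hop 2 (outputs -> visual tokens): aggregate patch attention of key tokens.
    n_patches = len(out_to_visual[0])
    patch_scores = [
        sum(out_to_visual[i][p] for i in key_outputs) for p in range(n_patches)
    ]
    ranked = sorted(range(n_patches), key=lambda p: patch_scores[p], reverse=True)
    return ranked[:top_patches]


out_q = [[0.1, 0.1], [0.4, 0.5], [0.3, 0.3]]
out_v = [[0.9, 0.0, 0.1], [0.1, 0.7, 0.2], [0.0, 0.6, 0.4]]
clue_patches = trace_clues(out_q, out_v)
```

In the toy data, patch 0 gets heavy attention from output token 0, but that token is barely tied to the question, so the patch is filtered out; this mirrors the drift toward question-irrelevant entities that the abstract describes.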
Enhancing Multi-Image Understanding through Delimiter Token Scaling
Authors: Minyoung Lee, Yeji Park, Dongjun Hwang, Yejin Kim, Seong Joon Oh, Junsuk Choe
Venue: ICLR 2026
First: 2026-02-02T11:38:01+00:00 · Latest: 2026-02-02T11:38:01+00:00
Comments: Accepted at ICLR 2026
Abstract
Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major reason is the cross-image information leakage, where the model struggles to distinguish information across different images. Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage. To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens. This enhances the model's ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions. Consequently, the model is better able to distinguish between images and reason over them more accurately. Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2. We further evaluate our method on text-only tasks that require clear distinction. The method improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews, and WCEP-10. Notably, our method requires no additional training or inference cost.
中文标题/摘要
标题:通过分隔符标记缩放提升多图像理解
大型视觉-语言模型(LVLMs)在单图像任务中表现出色,但在提供多个图像作为输入时,其性能会下降。主要原因在于跨图像信息泄露,模型难以区分不同图像中的信息。现有的LVLMs已经使用分隔符标记来标记每个图像的开始和结束,但我们的分析表明,这些标记未能有效阻止跨图像信息泄露。为了提高其有效性,我们提出了一种方法,通过缩放分隔符标记的隐藏状态来增强模型保留图像特定信息的能力,从而加强图像内交互并限制不必要的跨图像交互。因此,模型能够更好地区分图像并在它们之间进行更准确的推理。实验显示,该方法在Mantis、MuirBench、MIRB和QBench2等多图像基准测试中提高了性能。我们进一步在需要清晰区分的纯文本任务上评估了该方法,该方法在多文档和多表格理解基准测试TQABench、MultiNews和WCEP-10中提高了性能。值得注意的是,该方法无需额外的训练或推理成本。
Summary / 总结
This paper addresses the issue of cross-image information leakage in large vision-language models (LVLMs) when processing multiple images. The authors propose scaling the hidden states of delimiter tokens to better distinguish between images. Experiments show improvements in multi-image understanding benchmarks and text-only tasks requiring clear distinction between documents or tables.
论文解决了大型视觉-语言模型在处理多张图片时出现的跨图片信息泄露问题。提出了一种缩放分隔标记隐藏状态的方法,以增强模型保留每张图片特定信息的能力,并更好地区分图片。该方法在多图片和纯文本基准测试中表现出色,且无需额外的训练或推理成本。
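The core operation in the delimiter-scaling entry above, multiplying the hidden states of image delimiter tokens by a constant, is simple enough to sketch directly. The token ids, the scale factor, and the choice of where in the network to apply it are assumptions here; `scale_delimiter_states` is a hypothetical helper, not the authors' code.

```python
from typing import List, Sequence, Set


def scale_delimiter_states(
    hidden_states: Sequence[Sequence[float]],
    token_ids: Sequence[int],
    delimiter_ids: Set[int],
    scale: float = 2.0,
) -> List[List[float]]:
    """Return a copy of the hidden states in which positions holding a
    delimiter token (e.g. image start/end markers) are scaled."""
    return [
        [scale * x for x in h] if t in delimiter_ids else list(h)
        for h, t in zip(hidden_states, token_ids)
    ]


# Toy sequence: <img_start>=100, one image token (id 7), <img_end>=101.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
tokens = [100, 7, 101]
scaled = scale_delimiter_states(hidden, tokens, {100, 101}, scale=2.0)
```

The operation is a single elementwise multiply on a handful of positions and involves no learned parameters, which is consistent with the abstract's claim of no additional training or inference cost.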
Evolving from Tool User to Creator via Training-Free Experience Reuse in Multimodal Reasoning
Authors: Xintian Shen, Jiawei Chen, Lihao Zheng, Hao Ma, Tao Wei, Kun Zhan
First: 2026-02-02T11:37:45+00:00 · Latest: 2026-02-02T11:37:45+00:00
Abstract
Existing Tool-Integrated Reasoning (TIR) models have effectively extended the question-answering capabilities of LLMs by incorporating external tools. However, real-world scenarios present numerous open-ended problems where fixed tools often fail to meet task requirements. Furthermore, the lack of self-optimization mechanisms means that erroneous tool outputs can mislead the LLM's responses. Additionally, the construction of existing tools entails significant manual effort, which consequently constrains their applicability. Recognizing that the reasoning traces of LLMs encapsulate implicit problem-solving capabilities, we propose UCT, a novel training-free framework that transforms agents from tool users to tool creators. This approach harvests reasoning experiences and distills them into reusable assets, enabling adaptive tool creation and self-updating during the inference process. We also introduce a memory consolidation mechanism to maintain the tool library, ensuring high reusability of retained experiential memory for subsequent reasoning tasks. This novel automated tool construction paradigm continuously improves tool quality during reasoning, allowing the overall agent system to progress without additional training. Extensive experiments demonstrate that our method serves as a novel paradigm for enhancing the capabilities of TIR models. In particular, performance gains of +20.86% and +23.04% on benchmarks across multi-domain mathematical and scientific reasoning tasks validate the self-evolving capability of the agent.
中文标题/摘要
标题:通过免训练的经验重用从工具使用者演变为工具创造者的多模态推理
现有的工具集成推理(TIR)模型通过引入外部工具有效扩展了LLM的问答能力。然而,现实场景中存在许多开放性问题,固定工具往往无法满足任务需求。此外,缺乏自我优化机制意味着错误的工具输出可能会误导LLM的回答。另外,现有工具的构建需要大量的手动努力,这限制了它们的应用范围。鉴于LLM的推理轨迹中蕴含了隐含的问题解决能力,我们提出了一种名为UCT的新颖的免训练框架,该框架能够将代理从工具使用者转变为工具创造者。该方法通过收集和提炼推理经验,将其转化为可重用的资产,使代理在推理过程中能够实现工具的自适应创建和自我更新。我们还引入了一种记忆巩固机制来维护工具库,确保保留的经验记忆在后续推理任务中具有高重用性。这种新颖的自动化工具构建范式在推理过程中不断改进工具质量,使整个代理系统能够在无需额外训练的情况下进步。广泛的实验表明,我们的方法为增强TIR模型的能力提供了一种新颖的范式。特别是,在多领域数学和科学推理任务基准测试中实现的显著性能提升+20.86%和+23.04%,验证了代理的自我进化能力。
Summary / 总结
The paper addresses the limitations of existing Tool-Integrated Reasoning (TIR) models, which rely on fixed tools that often fail to meet task requirements and can be misled by erroneous tool outputs. To overcome these issues, the authors propose UCT, a training-free framework that enables agents to evolve from tool users to creators. This framework harvests reasoning experiences and converts them into reusable assets, allowing for adaptive tool creation and self-updating during inference. The method also includes a memory consolidation mechanism to maintain a high-quality tool library. Experiments show that this approach significantly improves performance on multi-domain mathematical and scientific reasoning tasks, with gains of +20.86% and +23.04% on benchmarks.
论文针对现有工具集成推理(TIR)模型依赖固定工具,往往无法满足任务需求且可能引入错误的问题,提出了一个无需训练的框架UCT,使代理能够从工具使用者转变为工具创造者。UCT通过收集推理经验并提炼成可重用的资产,实现推理过程中的自适应工具创建和自我更新。实验表明,该方法在多领域推理任务中显著提升了性能,特别是在基准测试中取得了+20.86%和+23.04%的提升。
Zero-Shot Off-Policy Learning
Authors: Arip Asadulaev, Maksim Bobrin, Salem Lahlou, Dmitry Dylov, Fakhri Karray, Martin Takac
First: 2026-02-02T11:06:31+00:00 · Latest: 2026-02-02T11:06:31+00:00
Abstract
Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents significant challenges, primarily due to the inherent distributional shift and value function overestimation bias. These issues become even more noticeable in zero-shot reinforcement learning, where an agent trained on reward-free data must adapt to new tasks at test time without additional training. In this work, we address the off-policy problem in a zero-shot setting by discovering a theoretical connection between successor measures and stationary density ratios. Using this insight, our algorithm can infer optimal importance sampling ratios, effectively performing a stationary distribution correction with an optimal policy for any task on the fly. We benchmark our method on motion tracking tasks with SMPL Humanoid, continuous control on ExoRL, and long-horizon tasks on OGBench. Our technique seamlessly integrates into forward-backward representation frameworks and enables fast adaptation to new tasks in a training-free regime. More broadly, this work bridges off-policy learning and zero-shot adaptation, offering benefits to both research areas.
中文标题/摘要
标题:零样本离策略学习
离策略学习方法旨在直接从固定的历史交互数据集中推导出最优策略。这一目标面临重大挑战,主要由于固有的分布偏移和价值函数的高估偏差。这些问题在零样本强化学习中尤为明显,即一个在无奖励数据上训练的智能体必须在测试时适应新的任务,而无需额外的训练。在本文中,我们通过发现后续措施与稳态密度比之间的理论联系,解决了零样本设置下的离策略问题。利用这一洞察,我们的算法可以推断出最优的重要性采样比,有效地对任何任务进行稳态分布校正,以获得最优策略。我们在SMPL Humanoid的运动跟踪任务、ExoRL的连续控制任务以及OGBench的长时序任务上对我们的方法进行了基准测试。我们的技术无缝地集成到前向-后向表示框架中,并在无需训练的情况下实现对新任务的快速适应。更广泛地说,本文将离策略学习和零样本适应联系起来,为两个研究领域都带来了好处。
Summary / 总结
The research aims to address the challenges of off-policy learning in a zero-shot setting, where an agent trained on reward-free data must adapt to new tasks without additional training. The method leverages the theoretical connection between successor measures and stationary density ratios to infer optimal importance sampling ratios, allowing for a stationary distribution correction with an optimal policy for any task. Key experimental findings show that the proposed technique effectively adapts to new tasks in motion tracking, continuous control, and long-horizon tasks, enabling fast adaptation in a training-free regime.
研究解决了零样本设置下的离策学习问题,即在无需额外训练的情况下,代理必须适应新任务。通过利用后继度量与稳定密度比之间的联系,所提出的方法推断出最优的重要性采样比,从而实现对新任务的快速适应。在运动跟踪、连续控制和长期任务中的实验表明,该方法能够进行稳定分布校正,并在无需训练的情况下实现快速适应。
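The stationary distribution correction mentioned in the zero-shot off-policy entry above rests on a standard identity: with a density ratio w(s, a) = d_pi(s, a) / d_D(s, a), the expected reward under the target policy equals the w-weighted expected reward under the dataset distribution. The sketch below shows only that reweighting step with the ratios given as inputs; how the paper infers them from successor measures is not reproduced, and `corrected_value` is a hypothetical name.

```python
from typing import Sequence


def corrected_value(
    rewards: Sequence[float], ratios: Sequence[float], normalize: bool = True
) -> float:
    """Estimate the target policy's average reward from dataset samples.

    rewards[i]: reward of the i-th dataset transition
    ratios[i]:  density ratio d_pi / d_D for that transition (assumed given)
    """
    weighted = sum(w * r for w, r in zip(ratios, rewards))
    # Self-normalizing divides by the sum of weights, which trades a small
    # bias for lower variance; the plain estimator divides by the sample count.
    denom = sum(ratios) if normalize else len(rewards)
    return weighted / denom


rewards = [1.0, 0.0, 2.0]
ratios = [2.0, 1.0, 1.0]
normalized_estimate = corrected_value(rewards, ratios)
plain_estimate = corrected_value(rewards, ratios, normalize=False)
```

Because the reward enters only at this final step, the same ratios-plus-rewards computation can be repeated for a new task at test time without retraining, which matches the zero-shot setting the abstract describes.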
Beyond Open Vocabulary: Multimodal Prompting for Object Detection in Remote Sensing Images
Authors: Shuai Yang, Ziyue Huang, Jiaxin Chen, Qingjie Liu, Yunhong Wang
First: 2026-02-02T11:03:01+00:00 · Latest: 2026-02-02T11:03:01+00:00
Abstract
Open-vocabulary object detection in remote sensing commonly relies on text-only prompting to specify target categories, implicitly assuming that inference-time category queries can be reliably grounded through pretraining-induced text-visual alignment. In practice, this assumption often breaks down in remote sensing scenarios due to task- and application-specific category semantics, resulting in unstable category specification under open-vocabulary settings. To address this limitation, we propose RS-MPOD, a multimodal open-vocabulary detection framework that reformulates category specification beyond text-only prompting by incorporating instance-grounded visual prompts, textual prompts, and their multimodal integration. RS-MPOD introduces a visual prompt encoder to extract appearance-based category cues from exemplar instances, enabling text-free category specification, and a multimodal fusion module to integrate visual and textual information when both modalities are available. Extensive experiments on standard, cross-dataset, and fine-grained remote sensing benchmarks show that visual prompting yields more reliable category specification under semantic ambiguity and distribution shifts, while multimodal prompting provides a flexible alternative that remains competitive when textual semantics are well aligned.
中文标题/摘要
标题:超越开放词汇:遥感图像目标检测的多模态提示
遥感中的开放词汇目标检测通常依赖于仅文本的提示来指定目标类别,隐含假设推理时的类别查询可以通过预训练诱导的文本-视觉对齐可靠地进行接地。实际上,由于任务和应用场景特定的类别语义,这一假设在遥感场景中经常失效,导致在开放词汇设置下类别指定不稳定。为解决这一局限,我们提出了一种多模态开放词汇检测框架RS-MPOD,通过结合实例接地的视觉提示、文本提示及其多模态集成,重新定义类别指定,超越仅文本的提示。RS-MPOD引入了视觉提示编码器,从示例实例中提取基于外观的类别线索,实现无文本的类别指定,并引入了多模态融合模块,在两种模态都可用时整合视觉和文本信息。在标准、跨数据集和细粒度遥感基准上的广泛实验表明,在语义模糊和分布偏移情况下,视觉提示提供了更可靠的类别指定,而多模态提示则提供了一种灵活的替代方案,在文本语义对齐良好时仍具有竞争力。
Summary / 总结
The paper addresses the instability of category specification in open-vocabulary object detection for remote sensing images due to task-specific semantics. It proposes RS-MPOD, a multimodal open-vocabulary detection framework that uses both visual and textual prompts for category specification, improving reliability under semantic ambiguity and distribution shifts. Experiments show that visual prompting is more reliable, while multimodal prompting remains competitive when textual semantics are well aligned.
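The paper proposes RS-MPOD, a multimodal prompting framework that addresses the limitations of open-vocabulary object detection in remote sensing. The framework combines visual and textual prompts to specify categories more reliably under semantic ambiguity and distribution shifts. Experiments show that visual prompting gives more stable category specification in these conditions, while multimodal prompting remains competitive when textual semantics are well aligned.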
VLM-Guided Experience Replay
Authors: Elad Sharony, Tom Jurgenson, Orr Krupnik, Dotan Di Castro, Shie Mannor
First: 2026-02-02T10:19:59+00:00 · Latest: 2026-02-02T10:19:59+00:00
Abstract
Recent advances in Large Language Models (LLMs) and Vision-Language Models (VLMs) have enabled powerful semantic and multimodal reasoning capabilities, creating new opportunities to enhance sample efficiency, high-level planning, and interpretability in reinforcement learning (RL). While prior work has integrated LLMs and VLMs into various components of RL, the replay buffer, a core component for storing and reusing experiences, remains unexplored. We propose addressing this gap by leveraging VLMs to guide the prioritization of experiences in the replay buffer. Our key idea is to use a frozen, pre-trained VLM (requiring no fine-tuning) as an automated evaluator to identify and prioritize promising sub-trajectories from the agent's experiences. Across scenarios, including game-playing and robotics, spanning both discrete and continuous domains, agents trained with our proposed prioritization method achieve 11-52% higher average success rates and improve sample efficiency by 19-45% compared to previous approaches. https://esharony.me/projects/vlm-rb/
中文标题/摘要
标题:VLM 引导的经验重放
近年来,大型语言模型(LLMs)和视觉-语言模型(VLMs)的发展赋予了强大的语义和多模态推理能力,为增强强化学习(RL)中的样本效率、高级规划和可解释性提供了新的机会。尽管先前的工作将LLMs和VLMs整合到RL的各个组件中,但用于存储和重用经验的核心组件——重放缓冲区,仍处于未被探索的状态。我们提出通过利用VLMs来引导重放缓冲区中经验的优先级排序来填补这一空白。我们的核心思想是使用一个冻结的、预训练的VLM(无需微调)作为自动评估器,以识别并优先处理代理经验中的有前途的子轨迹。在包括游戏和机器人技术在内的多种场景中,涵盖离散和连续领域,使用我们提出的方法进行优先级排序训练的代理平均成功率提高了11-52%,样本效率提高了19-45%,优于先前的方法。https://esharony.me/projects/vlm-rb/
Summary / 总结
This paper uses Vision-Language Models (VLMs) to guide the prioritization of experiences in the replay buffer for reinforcement learning (RL). A frozen, pre-trained VLM acts as an automated evaluator that identifies and prioritizes promising sub-trajectories from the agent's experiences, with no fine-tuning required. Across game-playing and robotics scenarios spanning discrete and continuous domains, agents trained with this prioritization achieve 11-52% higher average success rates and 19-45% better sample efficiency than previous approaches.
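The paper proposes using a Vision-Language Model (VLM) to guide the prioritization of experiences in the replay buffer for reinforcement learning (RL). A pre-trained VLM evaluates and prioritizes sub-trajectories, which raises average success rates by 11-52% and sample efficiency by 19-45% across game-playing and robotics scenarios in both discrete and continuous domains.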
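To make the prioritization idea concrete, here is a minimal sketch of score-weighted replay sampling. It is not the authors' implementation: the function name `sample_prioritized`, the `alpha` and `eps` parameters, and the example scores are all assumptions for illustration, and the VLM evaluator is replaced by a hard-coded list of scores in [0, 1].

```python
import random

def sample_prioritized(scores, batch_size, alpha=1.0, eps=1e-3, rng=None):
    """Sample sub-trajectory indices with probability proportional to (score + eps) ** alpha.

    `scores` are assumed to be non-negative evaluator scores, one per stored sub-trajectory.
    `eps` keeps every sub-trajectory reachable; `alpha` = 0 recovers uniform sampling.
    """
    rng = rng or random.Random()
    weights = [(s + eps) ** alpha for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=batch_size)

# Hypothetical evaluator scores for four stored sub-trajectories.
vlm_scores = [0.05, 0.90, 0.10, 0.60]
batch = sample_prioritized(vlm_scores, batch_size=1000, rng=random.Random(0))
counts = [batch.count(i) for i in range(len(vlm_scores))]
```

In this sketch the sub-trajectory scored 0.90 is replayed far more often than the one scored 0.05, while the low-scoring ones are still sampled occasionally.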
Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models
Authors: Jiaqi Liu, Lang Sun, Ronghao Fu, Bo Yang
First: 2025-09-26T11:34:42+00:00 · Latest: 2026-02-02T10:01:02+00:00
Abstract
Vision-Language Models (VLMs) in remote sensing often fail at complex analytical tasks, a limitation stemming from their end-to-end training paradigm that bypasses crucial reasoning steps and leads to unverifiable outputs. To address this limitation, we introduce the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT), a framework that models remote sensing analysis as a verifiable, multi-step process. We instill this analytical process through a two-stage alignment strategy, leveraging Geo-CoT380k, the first large-scale dataset of structured Geo-CoT rationales. This strategy first employs supervised fine-tuning (SFT) to instill the foundational cognitive architecture, then leverages Group Reward Policy Optimization (GRPO) to refine the model's reasoning policy towards factual correctness. The resulting model, RSThinker, outputs both a final answer and its justifying, verifiable analytical trace. This capability yields dominant performance, significantly outperforming state-of-the-art models across a comprehensive range of tasks. The public release of our Geo-CoT380k dataset and RSThinker model upon publication serves as a concrete pathway from opaque perception towards structured, verifiable reasoning for Earth Observation.
中文标题/摘要
标题:遥感中的忠实推理走向:基于感知的地理空间链式思考框架
遥感中的视觉-语言模型(VLMs)在复杂分析任务中常常表现不佳,这一局限性源于它们的端到端训练范式,该范式绕过了关键的推理步骤,导致不可验证的输出。为解决这一局限性,我们引入了基于感知的地理空间链式思考(Geo-CoT)框架,该框架将遥感分析建模为一个可验证的多步骤过程。我们通过两阶段对齐策略灌输这一分析过程,利用Geo-CoT380k,这是首个大规模结构化Geo-CoT推理数据集。该策略首先采用监督微调(SFT)来灌输基础的认知架构,然后利用组奖励策略优化(GRPO)来细化模型的推理策略,使其更接近事实正确性。由此产生的模型RSThinker不仅输出最终答案,还输出其验证性的分析过程。这种能力在一系列任务中表现出了显著的优势,大幅超越了最先进的模型。我们将在发表时公开Geo-CoT380k数据集和RSThinker模型,这为从不透明的感知向结构化、可验证的推理提供了明确的路径,适用于地球观测。
Summary / 总结
The research aims to strengthen the analytical capabilities of Vision-Language Models (VLMs) in remote sensing, whose end-to-end training bypasses intermediate reasoning steps and yields unverifiable outputs. It introduces the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT) framework, which models remote sensing analysis as a verifiable, multi-step process. A two-stage alignment strategy trains on the Geo-CoT380k dataset: supervised fine-tuning (SFT) instills the reasoning structure, then Group Reward Policy Optimization (GRPO) refines the reasoning policy toward factual correctness. The resulting RSThinker model outputs both a final answer and a justifying analytical trace, and the authors report that it significantly outperforms state-of-the-art models across a range of remote sensing tasks.
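The study aims to strengthen the reasoning ability of Vision-Language Models (VLMs) by addressing their limitations on complex analytical tasks in remote sensing. It introduces the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT) framework, which models remote sensing analysis as a verifiable, multi-step process. This is achieved with a two-stage alignment strategy: supervised fine-tuning establishes the foundational cognitive architecture, and GRPO then refines the model's reasoning policy. The resulting RSThinker model outputs both a final answer and a verifiable analytical trace, and it significantly outperforms existing models across remote sensing tasks.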
AVERY: Adaptive VLM Split Computing through Embodied Self-Awareness for Efficient Disaster Response Systems
Authors: Rajat Bhattacharjya, Sing-Yao Wu, Hyunwoo Oh, Chaewon Nam, Suyeon Koo, Mohsen Imani, Elaheh Bozorgzadeh, Nikil Dutt
First: 2025-11-22T18:42:04+00:00 · Latest: 2026-02-02T09:53:38+00:00
Comments: 8 pages, 5 figures. Paper is currently under review. Authors' version posted for personal use and not for redistribution
Abstract
Unmanned Aerial Vehicles (UAVs) in disaster response require complex, queryable intelligence that on-board CNNs cannot provide. While Vision-Language Models (VLMs) offer this semantic reasoning, their high resource demands make on-device deployment infeasible, and naive cloud offloading fails under the low-bandwidth networks common in disaster zones. We present AVERY, a framework that enables VLM deployment through adaptive split computing. We advance the split computing paradigm beyond traditional depth-wise partitioning by introducing a functional, cognitive-inspired dual-stream split that separates the VLM into a high-frequency, low-resolution "context stream" for real-time awareness and a low-frequency, high-fidelity "insight stream" for deep analysis. A lightweight, self-aware on-board controller manages this architecture, monitoring network conditions and operator intent to dynamically select from pre-trained compression models, navigating the fundamental accuracy-throughput trade-off. Evaluated using the VLM LISA-7B across an edge-cloud scenario under fluctuating network conditions, AVERY consistently outperforms static configurations, achieving 11.2% higher accuracy than raw image compression and 93.98% lower energy consumption compared to full-edge execution, thereby enhancing mission efficiency and enabling real-time, queryable intelligence on resource-constrained platforms in dynamic environments.
中文标题/摘要
标题:AVERY:通过具身自我意识实现自适应VLM拆分计算以提高灾害响应系统的效率
在灾害响应中,无人驾驶航空器(UAV)需要复杂的可查询智能,而机载CNN无法提供。尽管视觉语言模型(VLM)提供了这种语义推理,但其高资源需求使其在设备上部署不可行,而简单的云卸载在灾害区域常见的低带宽网络下也行不通。我们提出了AVERY框架,通过自适应拆分计算实现VLM部署。我们超越了传统的深度拆分,引入了一种功能性的、认知启发式的双流拆分,将VLM拆分为一个高频率、低分辨率的“上下文流”用于实时意识和一个低频率、高保真度的“洞察流”用于深入分析。一个轻量级的、自我意识的机载控制器管理这一架构,监控网络条件和操作员意图,动态选择预训练的压缩模型,从而在基本的准确率-吞吐量权衡中导航。在波动的网络条件下,使用VLM LISA-7B在边缘-云场景下评估,AVERY始终优于静态配置,准确率比原始图像压缩高11.2%,能量消耗比全边缘执行低93.98%,从而提高任务效率,并在资源受限的平台上实现实时、可查询的智能。
Summary / 总结
AVERY is a framework that enables the deployment of Vision-Language Models (VLMs) through adaptive split computing for disaster response systems. It introduces a functional, cognitive-inspired dual-stream split that separates the VLM into a high-frequency, low-resolution context stream for real-time awareness and a low-frequency, high-fidelity insight stream for deep analysis. A lightweight on-board controller manages this architecture by dynamically selecting from pre-trained compression models based on network conditions and operator intent. Evaluated with LISA-7B in an edge-cloud scenario under fluctuating network conditions, AVERY outperforms static configurations, achieving 11.2% higher accuracy than raw image compression and 93.98% lower energy consumption than full-edge execution.
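AVERY is a framework that deploys Vision-Language Models (VLMs) through adaptive split computing, addressing the high resource demands that make on-device VLM deployment infeasible for UAVs in disaster response. It splits the VLM into a context stream for real-time awareness and an insight stream for deep analysis, and it monitors network conditions and operator intent to balance accuracy against energy use. The evaluation shows that AVERY outperforms static configurations, with higher accuracy and lower energy consumption.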
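The controller's selection step can be pictured with a small sketch. This is an illustrative assumption, not AVERY's actual policy: the `CompressionProfile` type, the profile names and numbers, and the rule "pick the most accurate profile that fits the link at the required frame rate" are all hypothetical, and operator intent is reduced to a single `min_fps` value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CompressionProfile:
    name: str
    payload_kbit: float  # data sent per frame (hypothetical value)
    accuracy: float      # offline-estimated task accuracy (hypothetical value)

def select_profile(profiles, bandwidth_kbps, min_fps):
    """Pick the most accurate profile whose payload fits the bandwidth at min_fps."""
    feasible = [p for p in profiles if p.payload_kbit * min_fps <= bandwidth_kbps]
    if not feasible:
        # Nothing fits the link: fall back to the smallest payload.
        return min(profiles, key=lambda p: p.payload_kbit)
    return max(feasible, key=lambda p: p.accuracy)

PROFILES = [
    CompressionProfile("high_fidelity", 800.0, 0.90),
    CompressionProfile("balanced", 300.0, 0.84),
    CompressionProfile("aggressive", 80.0, 0.71),
]

# Same operator requirement (2 frames per second) under a good and a poor link.
good_link = select_profile(PROFILES, bandwidth_kbps=2000.0, min_fps=2.0)
poor_link = select_profile(PROFILES, bandwidth_kbps=200.0, min_fps=2.0)
```

With these made-up numbers the controller chooses `high_fidelity` on the good link and drops to `aggressive` on the poor one, which is the accuracy-throughput trade-off the abstract describes.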
VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration
Authors: Hanxun Yu, Wentong Li, Xuan Qu, Song Wang, Junbo Chen, Jianke Zhu
First: 2026-01-30T07:45:48+00:00 · Latest: 2026-02-02T09:21:10+00:00
Comments: ICLR2026, Code Link: https://github.com/hanxunyu/VisionTrim
Abstract
Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the performance superiority of our VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.
中文标题/摘要
标题:VisionTrim:统一的训练无损视觉标记压缩以加速MLLM
多模态大型语言模型(MLLMs)因视觉标记过多而面临高计算成本问题,特别是在高分辨率和基于视频的场景中。现有标记减少方法通常专注于孤立的管道组件,往往忽视文本对齐,导致性能下降。本文提出VisionTrim,这是一种统一的训练无损MLLM加速框架,结合了两个有效的即插即用模块:1)主导视觉标记选择(DVTS)模块,通过全局-局部视角保留关键视觉标记;2)文本引导视觉补充(TGVC)模块,通过文本线索指导上下文感知标记合并。在多种图像和视频多模态基准上的广泛实验表明,VisionTrim的性能优越,推动了实际MLLM在真实世界应用中的部署。代码可在:https://github.com/hanxunyu/VisionTrim 获取。
Summary / 总结
VisionTrim is a unified framework for training-free acceleration of multimodal large language models (MLLMs) by reducing visual tokens. It includes the Dominant Vision Token Selection (DVTS) module, which selects essential visual tokens, and the Text-Guided Vision Complement (TGVC) module, which merges tokens contextually guided by text. Experiments show that VisionTrim outperforms existing methods across various image and video benchmarks, enhancing practical MLLM deployment.
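VisionTrim is a unified framework for training-free acceleration of multimodal large language models (MLLMs), built from two modules: DVTS, which selects essential visual tokens, and TGVC, which performs context-aware token merging guided by textual cues. Experiments show that VisionTrim outperforms existing methods on a range of benchmarks, making MLLM deployment in real-world applications more practical.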
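As a rough picture of what visual token reduction means, the sketch below keeps only the highest-scoring tokens. It is generic score-based top-k pruning, not the DVTS or TGVC modules, whose details the abstract does not give; the function `keep_top_tokens`, the `keep_ratio` parameter, and the importance scores are assumptions made for illustration.

```python
def keep_top_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Indices of the n_keep largest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    # Restore positional order so the language model sees tokens in sequence.
    return [tokens[i] for i in sorted(top)]

# Hypothetical visual tokens and per-token importance scores.
visual_tokens = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"]
importance = [0.02, 0.30, 0.01, 0.25, 0.03, 0.04, 0.20, 0.15]
kept = keep_top_tokens(visual_tokens, importance, keep_ratio=0.5)
```

Halving the token count in this way shortens the sequence the language model has to process. A text-guided merging step, as TGVC is described, would fold information from the dropped tokens back in rather than discarding it.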
Efficient Cross-Country Data Acquisition Strategy for ADAS via Street-View Imagery
Authors: Yin Wu, Daniel Slieter, Carl Esselborn, Ahmed Abouelazm, Tsung Yuan Tseng, J. Marius Zöllner
First: 2026-02-02T09:09:07+00:00 · Latest: 2026-02-02T09:09:07+00:00
Abstract
Deploying ADAS and ADS across countries remains challenging due to differences in legislation, traffic infrastructure, and visual conventions, which introduce domain shifts that degrade perception performance. Traditional cross-country data collection relies on extensive on-road driving, making it costly and inefficient to identify representative locations. To address this, we propose a street-view-guided data acquisition strategy that leverages publicly available imagery to identify places of interest (POI). Two POI scoring methods are introduced: a KNN-based feature distance approach using a vision foundation model, and a visual-attribution approach using a vision-language model. To enable repeatable evaluation, we adopt a collect-detect protocol and construct a co-located dataset by pairing the Zenseact Open Dataset with Mapillary street-view images. Experiments on traffic sign detection, a task particularly sensitive to cross-country variations in sign appearance, show that our approach achieves performance comparable to random sampling while using only half of the target-domain data. We further provide cost estimations for full-country analysis, demonstrating that large-scale street-view processing remains economically feasible. These results highlight the potential of street-view-guided data acquisition for efficient and cost-effective cross-country model adaptation.
中文标题/摘要
标题:利用街景图像的高效跨国数据采集策略以实现ADAS
在不同国家部署ADAS和ADS仍然具有挑战性,因为各国的法律法规、交通基础设施和视觉习惯存在差异,这会导致领域偏移,从而降低感知性能。传统的跨国数据采集依赖于广泛的路面驾驶,这使得识别代表性地点的成本高昂且效率低下。为了解决这个问题,我们提出了一种街景引导的数据采集策略,利用公开的图像来识别兴趣点(POI)。介绍了两种POI评分方法:基于KNN的特征距离方法,使用视觉基础模型;以及使用视觉-语言模型的视觉归因方法。为了实现可重复评估,我们采用了收集-检测协议,并通过将Zenseact开源数据集与Mapillary街景图像配对来构建一个共定位数据集。在交通标志检测任务中,该任务对标志外观的跨国差异特别敏感,实验结果显示,我们的方法在使用目标领域数据量仅为一半的情况下,性能与随机采样相当。我们还提供了全国家分析的成本估算,证明大规模街景处理在经济上仍然是可行的。这些结果突显了街景引导的数据采集在高效和低成本跨国模型适应方面的潜力。
Summary / 总结
The paper addresses the challenges of deploying ADAS and ADS across countries due to domain shifts. It proposes a street-view-guided data acquisition strategy using publicly available imagery to identify places of interest (POI). Two scoring methods are introduced: a KNN-based feature distance approach and a visual-attribution approach. Experiments show that this approach achieves comparable performance to random sampling while using only half of the target-domain data, making it a cost-effective solution for cross-country model adaptation.
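The paper proposes a street-view-based data acquisition strategy to address the challenges of deploying ADAS and ADS in different countries. It uses publicly available street-view imagery to identify places of interest and scores them with two methods: a KNN-based feature distance approach and a visual-attribution approach. On traffic sign detection, the approach matches the performance of random sampling while using only half of the target-domain data, which makes it more cost-effective and efficient.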
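One plausible reading of a KNN-based feature distance score is sketched below: a candidate street-view image scores high when its embedding is far from its nearest neighbors in the data already collected. The abstract does not specify the exact formulation, so the function `knn_distance_score`, the choice of mean Euclidean distance, and the toy 2-D features are assumptions. Real features would come from a vision foundation model.

```python
import math

def knn_distance_score(candidate, reference_features, k=3):
    """Mean Euclidean distance from `candidate` to its k nearest reference features.

    Assumes `reference_features` is non-empty and all vectors share one dimension.
    """
    dists = sorted(math.dist(candidate, ref) for ref in reference_features)
    k = min(k, len(dists))
    return sum(dists[:k]) / k

# Toy 2-D "embeddings" standing in for source-domain image features.
source_domain = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
candidates = {"familiar_scene": (0.05, 0.05), "unusual_scene": (2.0, 2.0)}

scores = {name: knn_distance_score(f, source_domain) for name, f in candidates.items()}
# Locations are ranked so that the most dissimilar ones are visited first.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Under this reading, locations whose imagery looks least like the existing data are prioritized for on-road collection.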
Object-Centric Representation Learning for Enhanced 3D Scene Graph Prediction
Authors: KunHo Heo, GiHyun Kim, SuYeon Kim, MyeongAh Cho
Venue: NeurIPS 2025
First: 2025-10-06T11:33:09+00:00 · Latest: 2026-02-02T08:57:36+00:00
Comments: Accepted by NeurIPS 2025. Code: https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes
Abstract
3D Semantic Scene Graph Prediction aims to detect objects and their semantic relationships in 3D scenes, and has emerged as a crucial technology for robotics and AR/VR applications. While previous research has addressed dataset limitations and explored various approaches including Open-Vocabulary settings, they frequently fail to optimize the representational capacity of object and relationship features, showing excessive reliance on Graph Neural Networks despite insufficient discriminative capability. In this work, we demonstrate through extensive analysis that the quality of object features plays a critical role in determining overall scene graph accuracy. To address this challenge, we design a highly discriminative object feature encoder and employ a contrastive pretraining strategy that decouples object representation learning from the scene graph prediction. This design not only enhances object classification accuracy but also yields direct improvements in relationship prediction. Notably, when plugging in our pretrained encoder into existing frameworks, we observe substantial performance improvements across all evaluation metrics. Additionally, whereas existing approaches have not fully exploited the integration of relationship information, we effectively combine both geometric and semantic features to achieve superior relationship prediction. Comprehensive experiments on the 3DSSG dataset demonstrate that our approach significantly outperforms previous state-of-the-art methods. Our code is publicly available at https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes.
中文标题/摘要
标题:面向对象的表示学习以增强3D场景图预测
3D语义场景图预测旨在检测3D场景中的对象及其语义关系,并已成为机器人技术和AR/VR应用中的关键技术。尽管先前的研究解决了数据集限制并探索了各种方法,包括开放式词汇设置,但它们经常未能优化对象和关系特征的表示能力,过度依赖图神经网络,尽管其区分能力不足。在本工作中,我们通过广泛分析表明,对象特征的质量对整体场景图准确性起着关键作用。为了解决这一挑战,我们设计了一种高度区分的对象特征编码器,并采用对比预训练策略,将对象表示学习与场景图预测分离。这一设计不仅提高了对象分类准确性,还直接提高了关系预测。值得注意的是,当将我们的预训练编码器插入现有框架时,我们观察到所有评估指标上都取得了显著性能提升。此外,与现有方法未能充分利用关系信息的整合不同,我们有效结合了几何和语义特征,实现了更优的关系预测。在3DSSG数据集上的全面实验表明,我们的方法显著优于先前的最先进方法。我们的代码可在https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes公开获取。
Summary / 总结
This research aims to improve 3D semantic scene graph prediction by focusing on the quality of object features. The authors propose a discriminative object feature encoder and a contrastive pretraining strategy that decouples object representation learning from scene graph prediction. This approach enhances both object classification and relationship prediction, leading to significant performance improvements over previous methods on the 3DSSG dataset.
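The paper aims to improve object and relationship feature representations for 3D semantic scene graph prediction. It proposes a discriminative object feature encoder and a contrastive pretraining strategy that improve both object and relationship prediction. Plugging the pretrained encoder into existing frameworks improves all evaluation metrics, and comprehensive experiments on the 3DSSG dataset show that the approach significantly outperforms previous methods.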
FlowBypass: Rectified Flow Trajectory Bypass for Training-Free Image Editing
Authors: Menglin Han, Zhangkai Ni
First: 2026-02-02T08:37:00+00:00 · Latest: 2026-02-02T08:37:00+00:00
Abstract
Training-free image editing has attracted increasing attention for its efficiency and independence from training data. However, existing approaches predominantly rely on inversion-reconstruction trajectories, which impose an inherent trade-off: longer trajectories accumulate errors and compromise fidelity, while shorter ones fail to ensure sufficient alignment with the edit prompt. Previous attempts to address this issue typically employ backbone-specific feature manipulations, limiting general applicability. To address these challenges, we propose FlowBypass, a novel and analytical framework grounded in Rectified Flow that constructs a bypass directly connecting inversion and reconstruction trajectories, thereby mitigating error accumulation without relying on feature manipulations. We provide a formal derivation of two trajectories, from which we obtain an approximate bypass formulation and its numerical solution, enabling seamless trajectory transitions. Extensive experiments demonstrate that FlowBypass consistently outperforms state-of-the-art image editing methods, achieving stronger prompt alignment while preserving high-fidelity details in irrelevant regions.
中文标题/摘要
标题:FlowBypass:基于修正流的训练-free 图像编辑轨迹旁路
训练-free 图像编辑因其高效性和独立于训练数据而越来越受到关注。然而,现有方法主要依赖于反转重构轨迹,这导致了一个固有的权衡:较长的轨迹会累积误差并损害保真度,而较短的轨迹则无法确保与编辑提示的充分对齐。之前尝试解决这一问题的方法通常采用特定骨干网络的特征操作,限制了其通用性。为了解决这些挑战,我们提出了一种基于修正流的新颖且分析性的框架FlowBypass,该框架直接构建了连接反转和重构轨迹的旁路,从而在不依赖特征操作的情况下减轻误差累积。我们提供了两个轨迹的正式推导,从中获得了一个近似旁路公式及其数值解,使轨迹过渡变得无缝。广泛的实验表明,FlowBypass 一致地优于最先进的图像编辑方法,在保持无关区域高保真细节的同时实现了更强的提示对齐。
Summary / 总结
FlowBypass is a training-free image editing method that addresses the trade-off between trajectory length and image fidelity by using Rectified Flow to create a direct bypass between inversion and reconstruction trajectories. This approach avoids the need for backbone-specific feature manipulations, leading to better prompt alignment and preservation of high-fidelity details in irrelevant regions compared to existing methods.
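FlowBypass is a training-free image editing method that addresses the trade-off between trajectory length and image fidelity by using Rectified Flow to connect the inversion and reconstruction trajectories directly. It avoids feature manipulations and consistently outperforms existing methods in prompt alignment and detail fidelity.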