SafeBimanual: Diffusion-based Trajectory Optimization for Safe Bimanual Manipulation
Authors: Haoyuan Deng, Wenkai Guo, Qianzhun Wang, Zhenyu Wu, Ziwei Wang
First: 2025-08-25T17:59:02+00:00 · Latest: 2025-08-25T17:59:02+00:00
Comments: Project website is at: https://denghaoyuan123.github.io/SafeBimanip/
Abstract
Bimanual manipulation has been widely applied in household services and
manufacturing, where it enables the completion of complex tasks that require
coordination. Recent diffusion-based policy learning approaches have achieved
promising performance in modeling action distributions for bimanual
manipulation. However, they ignore the physical safety constraints of bimanual
manipulation, which leads to dangerous behaviors that damage robots and
objects. To this end, we propose a test-time trajectory optimization framework
named SafeBimanual for any pre-trained diffusion-based bimanual manipulation
policies, which imposes the safety constraints on bimanual actions to avoid
dangerous robot behaviors with improved success rate. Specifically, we design
diverse cost functions for safety constraints in different dual-arm cooperation
patterns including avoidance of tearing objects and collision between arms and
objects, which optimizes the manipulator trajectories with guided sampling of
diffusion denoising process. Moreover, we employ a vision-language model (VLM)
to schedule the cost functions by specifying keypoints and corresponding
pairwise relationship, so that the optimal safety constraint is dynamically
generated in the entire bimanual manipulation process. SafeBimanual
demonstrates superiority on 8 simulated tasks in RoboTwin with a 13.7% increase
in success rate and an 18.8% reduction in unsafe interactions over
state-of-the-art diffusion-based methods. Extensive experiments on 4 real-world
tasks further verify its practical value by improving the success rate by
32.5%.
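
The idea of steering denoised actions with a safety cost can be illustrated with a minimal sketch. It is not the authors' implementation: the hinge-style collision cost, the `margin` and `step_size` values, and all function names are assumptions made for illustration, and the diffusion denoiser itself is omitted. Only the gradient correction that would be applied to a sampled pair of end-effector positions is shown.

```python
import math

def collision_cost(left, right, margin=0.10):
    """Squared hinge penalty when the two end effectors are closer than `margin`.

    Hypothetical cost for illustration; the abstract does not give the exact form.
    """
    d = math.dist(left, right)
    return max(0.0, margin - d) ** 2

def collision_cost_grad(left, right, margin=0.10):
    """Analytic gradient of collision_cost with respect to both positions."""
    d = math.dist(left, right)
    if d >= margin or d == 0.0:
        zero = [0.0] * len(left)
        return zero, list(zero)
    # d(cost)/d(left) = -2 * (margin - d) * (left - right) / d
    scale = -2.0 * (margin - d) / d
    g_left = [scale * (l - r) for l, r in zip(left, right)]
    g_right = [-g for g in g_left]
    return g_left, g_right

def guided_step(left, right, step_size=0.5, margin=0.10):
    """One gradient-descent correction on the cost (assumed guidance rule)."""
    g_left, g_right = collision_cost_grad(left, right, margin)
    new_left = [l - step_size * g for l, g in zip(left, g_left)]
    new_right = [r - step_size * g for r, g in zip(right, g_right)]
    return new_left, new_right

left, right = [0.0, 0.0, 0.0], [0.04, 0.0, 0.0]
new_left, new_right = guided_step(left, right)
print(collision_cost(left, right), collision_cost(new_left, new_right))
```

In a guided sampler, a correction of this kind would be interleaved with the denoising steps, so that the final trajectory stays close to the learned action distribution while lowering the cost.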
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Authors: Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian
First: 2025-08-25T17:57:49+00:00 · Latest: 2025-08-25T17:57:49+00:00
Comments: Project page: https://project.ironieser.cc/mmtok
Abstract
Vision-Language Models (VLMs) demonstrate impressive performance in
understanding visual content with language instruction by converting visual
input to vision tokens. However, redundancy in vision tokens degrades the
inference efficiency of VLMs. While many algorithms have been
proposed to reduce the number of vision tokens, most of them apply only
unimodal information (i.e., vision/text) for pruning and ignore the inherent
multimodal property of vision-language tasks. Moreover, a generic
criterion that can be applied to different modalities is lacking. To mitigate this
limitation, in this work, we propose to leverage both vision and text tokens to
select informative vision tokens by the criterion of coverage. We first
formulate the subset selection problem as a maximum coverage problem.
Afterward, a subset of vision tokens is optimized to cover the text tokens and
the original set of vision tokens, simultaneously. Finally, a VLM agent can be
adopted to further improve the quality of text tokens for guiding vision
pruning. The proposed method MMTok is extensively evaluated on benchmark
datasets with different VLMs. The comparison illustrates that vision and text
information are complementary, and combining multimodal information can surpass
the unimodal baseline with a clear margin. Moreover, under the maximum coverage
criterion on the POPE dataset, our method achieves a 1.87x speedup while
maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Furthermore,
with only four vision tokens, it still preserves 87.7% of the original
performance on LLaVA-1.5-7B. These results highlight the effectiveness of
coverage in token selection.
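
Selecting tokens by coverage can be sketched with the standard greedy heuristic for maximum coverage. The objective below (sum over targets of the best similarity achieved by any selected token) and the toy similarity matrix are assumptions for illustration, not the exact formulation used by MMTok. In this sketch the targets would stand for the text tokens together with the original vision tokens.

```python
def greedy_coverage(sim, k):
    """Greedily pick k candidate tokens that maximize a coverage score.

    sim[i][j] is the (non-negative) similarity between candidate vision
    token i and target j. The coverage of a set S is assumed to be
    sum_j max_{i in S} sim[i][j].
    """
    n_targets = len(sim[0])
    best = [0.0] * n_targets      # best similarity reached so far per target
    selected = []
    for _ in range(min(k, len(sim))):
        best_gain, best_i = -1.0, None
        for i, row in enumerate(sim):
            if i in selected:
                continue
            # Marginal gain of adding candidate i to the current selection.
            gain = sum(max(s - b, 0.0) for s, b in zip(row, best))
            if gain > best_gain:
                best_gain, best_i = gain, i
        selected.append(best_i)
        best = [max(b, s) for b, s in zip(best, sim[best_i])]
    return selected

# Toy example: three candidate vision tokens, three targets.
sim = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
print(greedy_coverage(sim, 2))
```

The second pick favors the token that covers a target the first pick missed, which is the behavior a coverage criterion is meant to encourage.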
SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
Authors: Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson
First: 2025-08-25T16:33:07+00:00 · Latest: 2025-08-25T16:33:07+00:00
Comments: COLM 2025
Abstract
Evaluating whether vision-language models (VLMs) reason consistently across
representations is challenging because modality comparisons are typically
confounded by task differences and asymmetric information. We introduce SEAM, a
benchmark that pairs semantically equivalent inputs across four domains that
have existing standardized textual and visual notations. By employing distinct
notation systems across modalities, in contrast to OCR-based image-text
pairing, SEAM provides a rigorous comparative assessment of the
textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21
contemporary models, we observe systematic modality imbalance: vision
frequently lags language in overall performance, despite the problems
containing semantically equivalent information, and cross-modal agreement is
relatively low. Our error analysis reveals two main drivers: textual perception
failures from tokenization in domain notation and visual perception failures
that induce hallucinations. We also show that our results are largely robust to
visual transformations. SEAM establishes a controlled, semantically equivalent
setting for measuring and improving modality-agnostic reasoning.
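
One simple way to quantify cross-modal agreement is the fraction of items on which a model gives the same answer for the textual and the visual version of a problem. The abstract does not state how SEAM computes agreement, so the definition and the function name below are assumptions.

```python
def cross_modal_agreement(text_answers, image_answers):
    """Fraction of paired items answered identically in both modalities.

    Assumed definition for illustration; answers are compared by equality.
    """
    if len(text_answers) != len(image_answers):
        raise ValueError("answer lists must be paired item by item")
    if not text_answers:
        return 0.0
    matches = sum(t == v for t, v in zip(text_answers, image_answers))
    return matches / len(text_answers)

# Toy example: the model agrees with itself on two of four items.
print(cross_modal_agreement(["A", "B", "C", "D"], ["A", "C", "C", "B"]))
```

Note that agreement is independent of correctness: two modalities can agree on a wrong answer, so it is usually reported alongside per-modality accuracy.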
Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance
Authors: Xiangxiang Wang, Xuanyu Wang, YiJia Luo, Yongbin Yu, Manping Fan, Jingtao Zhang, Liyong Ren
First: 2025-08-25T16:32:32+00:00 · Latest: 2025-08-25T16:32:32+00:00
Comments: 28 pages, 9 figures
Abstract
This study proposes the dual technological innovation framework, including a
cross-modal differentiated quantization framework for vision-language models
(VLMs) and a scene-aware vectorized memory multi-agent system for visually
impaired assistance. The modular framework was developed implementing
differentiated processing strategies, effectively reducing memory requirements
from 38GB to 16GB while maintaining model performance. The multi-agent
architecture combines scene classification, vectorized memory, and multimodal
interaction, enabling persistent storage and efficient retrieval of scene
memories. Through perception-memory-reasoning workflows, the system provides
environmental information beyond the current view using historical memories.
Experiments show the quantized 19B-parameter model only experiences a 2.05%
performance drop on MMBench and maintains 63.7 accuracy on OCR-VQA (original:
64.9), outperforming smaller models with equivalent memory requirements like
the Molmo-7B series. The system maintains response latency between 2.83-3.52
seconds from scene analysis to initial speech output, substantially faster
than non-streaming methods. This research advances computational efficiency
and assistive technology, offering visually impaired users comprehensive
real-time assistance in scene perception, text recognition, and navigation.
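
The reported memory figures can be sanity-checked with weights-only arithmetic. The sketch below assumes that the 38GB baseline corresponds to 16-bit weights and that 1GB means 10^9 bytes. Neither assumption is stated in the abstract, and activations and caches are ignored.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Weights-only memory footprint in GB (10^9 bytes); an approximation."""
    return n_params * bits_per_param / 8 / 1e9

def average_bits(n_params, memory_gb):
    """Average bits per parameter implied by a weights-only budget."""
    return memory_gb * 1e9 * 8 / n_params

n_params = 19e9  # the 19B-parameter model mentioned in the abstract
print(weight_memory_gb(n_params, 16))   # 16-bit baseline
print(average_bits(n_params, 16))       # bits per weight implied by 16GB
```

Under these assumptions the 16-bit baseline comes to 38GB, and a 16GB budget implies roughly 6.7 bits per parameter on average, which is consistent with a scheme that quantizes different components to different precisions rather than uniformly to 4 or 8 bits.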
Controllable Hybrid Captioner for Improved Long-form Video Understanding
Authors: Kuleen Sasse, Efsun Sarioglu Kayi, Arun Reddy
First: 2025-07-22T22:09:00+00:00 · Latest: 2025-08-25T16:17:48+00:00
Abstract
Video data, especially long-form video, is extremely dense and
high-dimensional. Text-based summaries of video content offer a way to
represent query-relevant content in a much more compact manner than raw video.
In addition, textual representations are easily ingested by state-of-the-art
large language models (LLMs), which enable reasoning over video content to
answer complex natural language queries. To solve this issue, we rely on the
progressive construction of a text-based memory by a video captioner operating
on shorter chunks of the video, where spatio-temporal modeling is
computationally feasible. We explore ways to improve the quality of the
activity log comprised solely of short video captions. Because the video
captions tend to be focused on human actions, and questions may pertain to
other information in the scene, we seek to enrich the memory with static scene
descriptions using Vision Language Models (VLMs). Our video understanding
system relies on the LaViLa video captioner in combination with an LLM to answer
questions about videos. We first explored different ways of partitioning the
video into meaningful segments such that the textual descriptions more
accurately reflect the structure of the video content. Furthermore, we
incorporated static scene descriptions into the captioning pipeline using LLaVA
VLM, resulting in a more detailed and complete caption log and expanding the
space of questions that are answerable from the textual memory. Finally, we
have successfully fine-tuned the LaViLa video captioner to produce both action
and scene captions, significantly improving the efficiency of the captioning
pipeline compared to using separate captioning models for the two tasks. Our
model, controllable hybrid captioner, can alternate between different types of
captions according to special input tokens that signal scene changes detected
in the video.
Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
Authors: Zesen Lyu, Dandan Zhang, Wei Ye, Fangdi Li, Zhihang Jiang, Yao Yang
Venue: EMNLP 2025
First: 2025-05-27T05:17:41+00:00 · Latest: 2025-08-25T15:55:22+00:00
Comments: Accepted by EMNLP 2025 Main conference
Abstract
Spatial reasoning is a core component of human cognition, enabling
individuals to perceive, comprehend, and interact with the physical world. It
relies on a nuanced understanding of spatial structures and inter-object
relationships, serving as the foundation for complex reasoning and
decision-making. To investigate whether current vision-language models (VLMs)
exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark
consisting of 1,100 carefully curated real-world images with high spatial
complexity. Based on this dataset, we design five tasks to rigorously evaluate
VLMs' spatial perception, structural understanding, and reasoning capabilities,
while deliberately minimizing reliance on domain-specific knowledge to better
isolate and assess the general spatial reasoning capability. We conduct a
comprehensive evaluation across 24 state-of-the-art VLMs. The results show that
even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy
and performs particularly poorly on the Order Generation task, with only 30.00%
accuracy, far below the performance exceeding 90% achieved by human
participants. This persistent gap underscores the need for continued progress,
positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for
advancing spatial reasoning research in VLMs. Our project page is at
https://zesen01.github.io/jigsaw-puzzles.
Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA
Authors: Karishma Thakrar, Shreyas Basavatia, Akshay Daftardar
First: 2025-07-07T22:31:56+00:00 · Latest: 2025-08-25T14:55:38+00:00
Abstract
Dermatological care via telemedicine often lacks the rich context of
in-person visits. Clinicians must make diagnoses based on a handful of images
and brief descriptions, without the benefit of physical exams, second opinions,
or reference materials. While many medical AI systems attempt to bridge these
gaps with domain-specific fine-tuning, this work hypothesized that mimicking
clinical reasoning processes could offer a more effective path forward. This
study tested seven vision-language models on medical visual question answering
across six configurations: baseline models, fine-tuned variants, and both
augmented with either reasoning layers that combine multiple model
perspectives, analogous to peer consultation, or retrieval-augmented generation
that incorporates medical literature at inference time, serving a role similar
to reference-checking. While fine-tuning degraded performance in four of seven
models with an average 30% decrease, baseline models collapsed on test data.
Clinical-inspired architectures, meanwhile, achieved up to 70% accuracy,
maintaining performance on unseen data while generating explainable,
literature-grounded outputs critical for clinical adoption. These findings
demonstrate that medical AI succeeds by reconstructing the collaborative and
evidence-based practices fundamental to clinical diagnosis.
Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images
Authors: Kaiyu Li, Xiangyong Cao, Ruixun Liu, Shihong Wang, Zixuan Jiang, Zhi Wang, Deyu Meng
First: 2025-08-25T14:22:57+00:00 · Latest: 2025-08-25T14:22:57+00:00
Comments: All codes and models will be released at https://github.com/earth-insights/SegEarth-OV-2
Abstract
Semantic segmentation of remote sensing (RS) images is pivotal for
comprehensive Earth observation, but the demand for interpreting new object
categories, coupled with the high expense of manual annotation, poses
significant challenges. Although open-vocabulary semantic segmentation (OVSS)
offers a promising solution, existing frameworks designed for natural images
are insufficient for the unique complexities of RS data. They struggle with
vast scale variations and fine-grained details, and their adaptation often
relies on extensive, costly annotations. To address this critical gap, this
paper introduces SegEarth-OV, the first framework for annotation-free
open-vocabulary segmentation of RS images. Specifically, we propose SimFeatUp,
a universal upsampler that robustly restores high-resolution spatial details
from coarse features, correcting distorted target shapes without any
task-specific post-training. We also present a simple yet effective Global Bias
Alleviation operation to subtract the inherent global context from patch
features, significantly enhancing local semantic fidelity. These components
empower SegEarth-OV to effectively harness the rich semantics of pre-trained
VLMs, making OVSS possible in optical RS contexts. Furthermore, to extend the
framework's universality to other challenging RS modalities like SAR images,
where large-scale VLMs are unavailable and expensive to create, we introduce
AlignEarth, which is a distillation-based strategy and can efficiently transfer
semantic knowledge from an optical VLM encoder to an SAR encoder, bypassing the
need to build SAR foundation models from scratch and enabling universal OVSS
across diverse sensor types. Extensive experiments on both optical and SAR
datasets validate that SegEarth-OV can achieve dramatic improvements over the
SOTA methods, establishing a robust foundation for annotation-free and
open-world Earth observation.
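
The Global Bias Alleviation operation is described as subtracting the global context from patch features. A minimal sketch of such a subtraction follows. The choice of global vector (for example, a class-token embedding), the `strength` factor, and the function name are assumptions, since the abstract does not give the exact formula.

```python
def alleviate_global_bias(patch_features, global_feature, strength=0.3):
    """Subtract a scaled global context vector from every patch feature.

    patch_features: list of per-patch feature vectors (lists of floats).
    global_feature: one vector of the same dimension (assumed to summarize
    the whole image). `strength` is a hypothetical scaling factor.
    """
    return [
        [p - strength * g for p, g in zip(patch, global_feature)]
        for patch in patch_features
    ]

patches = [[1.0, 2.0], [3.0, 4.0]]
global_vec = [1.0, 2.0]
print(alleviate_global_bias(patches, global_vec, strength=0.5))
```

The intent is that whatever every patch shares with the image-level representation is reduced, leaving features that discriminate better between local regions.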
ArgusCogito: Chain-of-Thought for Cross-Modal Synergy and Omnidirectional Reasoning in Camouflaged Object Segmentation
Authors: Jianwen Tan, Huiyao Zhang, Rui Xiong, Han Zhou, Hongfei Wang, Ye Li
First: 2025-08-25T14:08:17+00:00 · Latest: 2025-08-25T14:08:17+00:00
Abstract
Camouflaged Object Segmentation (COS) poses a significant challenge due to
the intrinsic high similarity between targets and backgrounds, demanding models
capable of profound holistic understanding beyond superficial cues. Prevailing
methods, often limited by shallow feature representation, inadequate reasoning
mechanisms, and weak cross-modal integration, struggle to achieve this depth of
cognition, resulting in prevalent issues like incomplete target separation and
imprecise segmentation. Inspired by the perceptual strategy of the Hundred-eyed
Giant -- emphasizing holistic observation, omnidirectional focus, and intensive
scrutiny -- we introduce ArgusCogito, a novel zero-shot, chain-of-thought
framework underpinned by cross-modal synergy and omnidirectional reasoning
within Vision-Language Models (VLMs). ArgusCogito orchestrates three
cognitively-inspired stages: (1) Conjecture: Constructs a strong cognitive
prior through global reasoning with cross-modal fusion (RGB, depth, semantic
maps), enabling holistic scene understanding and enhanced target-background
disambiguation. (2) Focus: Performs omnidirectional, attention-driven scanning
and focused reasoning, guided by semantic priors from Conjecture, enabling
precise target localization and region-of-interest refinement. (3) Sculpting:
Progressively sculpts high-fidelity segmentation masks by integrating
cross-modal information and iteratively generating dense positive/negative
point prompts within focused regions, emulating Argus' intensive scrutiny.
Extensive evaluations on four challenging COS benchmarks and three Medical
Image Segmentation (MIS) benchmarks demonstrate that ArgusCogito achieves
state-of-the-art (SOTA) performance, validating the framework's exceptional
efficacy, superior generalization capability, and robustness.
Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP
Authors: Ke Ma, Jun Long, Hongxiao Fei, Liujie Hua, Yiran Qian, Zhen Dai, Yueyi Luo
Venue: ICASSP 2026
First: 2025-08-11T10:03:45+00:00 · Latest: 2025-08-25T13:57:47+00:00
Comments: 4 pages, 1 reference, 3 figures, ICASSP 2026
Abstract
Pre-trained Vision-Language Models (VLMs) face a significant adaptation gap
when applied to Zero-Shot Anomaly Detection (ZSAD), stemming from their lack of
local inductive biases for dense prediction and their reliance on inflexible
feature fusion paradigms. We address these limitations through an Architectural
Co-Design framework that jointly refines feature representation and cross-modal
fusion. Our method proposes a parameter-efficient Convolutional Low-Rank
Adaptation (Conv-LoRA) adapter to inject local inductive biases for
fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that
leverages visual context to adaptively modulate text prompts, enabling a
powerful bidirectional fusion. Extensive experiments on diverse industrial and
medical benchmarks demonstrate superior accuracy and robustness, validating
that this synergistic co-design is critical for robustly adapting foundation
models to dense perception tasks.
PerPilot: Personalizing VLM-based Mobile Agents via Memory and Exploration
Authors: Xin Wang, Zhiyao Cui, Hao Li, Ya Zeng, Chenxu Wang, Ruiqi Song, Yihang Chen, Kun Shao, Qiaosheng Zhang, Jinzhuo Liu, Siyue Ren, Shuyue Hu, Zhen Wang
First: 2025-08-25T13:57:02+00:00 · Latest: 2025-08-25T13:57:02+00:00
Abstract
Vision language model (VLM)-based mobile agents show great potential for
assisting users in performing instruction-driven tasks. However, these agents
typically struggle with personalized instructions -- those containing
ambiguous, user-specific context -- a challenge that has been largely
overlooked in previous research. In this paper, we define personalized
instructions and introduce PerInstruct, a novel human-annotated dataset
covering diverse personalized instructions across various mobile scenarios.
Furthermore, given the limited personalization capabilities of existing mobile
agents, we propose PerPilot, a plug-and-play framework powered by large
language models (LLMs) that enables mobile agents to autonomously perceive,
understand, and execute personalized user instructions. PerPilot identifies
personalized elements and autonomously completes instructions via two
complementary approaches: memory-based retrieval and reasoning-based
exploration. Experimental results demonstrate that PerPilot effectively handles
personalized tasks with minimal user intervention and progressively improves
its performance with continued use, underscoring the importance of
personalization-aware reasoning for next-generation mobile agents. The dataset
and code are available at: https://github.com/xinwang-nwpu/PerPilot
See What You Need: Query-Aware Visual Intelligence through Reasoning-Perception Loops
Authors: Zixuan Dong, Baoyun Peng, Yufei Wang, Lin Liu, Xinxin Dong, Yunlong Cao, Xiaodong Wang
First: 2025-08-25T12:00:12+00:00 · Latest: 2025-08-25T12:00:12+00:00
Comments: 14 pages, 6 figures
Abstract
Human video comprehension demonstrates dynamic coordination between reasoning
and visual attention, adaptively focusing on query-relevant details. However,
current long-form video question answering systems employ rigid pipelines that
decouple reasoning from perception, leading to either information loss through
premature visual abstraction or computational inefficiency through exhaustive
processing. The core limitation lies in the inability to adapt visual
extraction to specific reasoning requirements: different queries demand
fundamentally different visual evidence from the same video content. In this
work, we present CAVIA, a training-free framework that revolutionizes video
understanding through reasoning-perception coordination. Unlike conventional
approaches where visual processing operates independently of reasoning, CAVIA
creates a closed-loop system where reasoning continuously guides visual
extraction based on identified information gaps. CAVIA introduces three
innovations: (1) hierarchical reasoning-guided localization to precise frames;
(2) cross-modal semantic bridging for targeted extraction; (3)
confidence-driven iterative synthesis. CAVIA achieves state-of-the-art
performance on challenging benchmarks: EgoSchema (65.7%, +5.3%), NExT-QA
(76.1%, +2.6%), and IntentQA (73.8%, +6.9%), demonstrating that dynamic
reasoning-perception coordination provides a scalable paradigm for video
understanding.
3D Feature Distillation with Object-Centric Priors
Authors: Georgios Tziafas, Yucheng Xu, Zhibin Li, Hamidreza Kasaei
First: 2024-06-26T20:16:49+00:00 · Latest: 2025-08-25T11:04:46+00:00
Abstract
Grounding natural language to the physical world is a ubiquitous topic with a
wide range of applications in computer vision and robotics. Recently, 2D
vision-language models such as CLIP have been widely popularized, due to their
impressive capabilities for open-vocabulary grounding in 2D images. Recent
works aim to elevate 2D CLIP features to 3D via feature distillation, but
either learn neural fields that are scene-specific and hence lack
generalization, or focus on indoor room scan data that require access to
multiple camera views, which is not practical in robot manipulation scenarios.
Additionally, related methods typically fuse features at pixel-level and assume
that all camera views are equally informative. In this work, we show that this
approach leads to sub-optimal 3D features, both in terms of grounding accuracy,
as well as segmentation crispness. To alleviate this, we propose a multi-view
feature fusion strategy that employs object-centric priors to eliminate
uninformative views based on semantic information, and fuse features at
object-level via instance segmentation masks. To distill our object-centric 3D
features, we generate a large-scale synthetic multi-view dataset of cluttered
tabletop scenes, spawning 15k scenes from over 3300 unique object instances,
which we make publicly available. We show that our method reconstructs 3D CLIP
features with improved grounding capacity and spatial consistency, while doing
so from single-view RGB-D, thus departing from the assumption of multiple
camera views at test time. Finally, we show that our approach can generalize to
novel tabletop domains and be re-purposed for 3D instance segmentation without
fine-tuning, and demonstrate its utility for language-guided robotic grasping
in clutter.
AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering
Authors: Kang Zeng, Guojin Zhong, Jintao Cheng, Jin Yuan, Zhiyong Li
First: 2025-08-25T10:10:46+00:00 · Latest: 2025-08-25T10:10:46+00:00
Comments: 14 pages, 5 figures
Abstract
The advancement of Multimodal Large Language Models (MLLMs) has driven
significant progress in Visual Question Answering (VQA), evolving from Single
to Multi Image VQA (MVQA). However, the increased number of images in MVQA
inevitably introduces substantial visual redundancy that is irrelevant to
question answering, negatively impacting both accuracy and efficiency. Existing
methods that address this issue lack flexibility in controlling the number
of compressed visual tokens and tend to produce discrete visual fragments,
which hinder MLLMs' ability to comprehend images holistically. In this paper,
we propose a straightforward yet universal Adaptive Visual Anchoring strategy,
which can be seamlessly integrated into existing MLLMs, offering significant
accuracy improvements through adaptive compression. Meanwhile, to balance the
results derived from both global and compressed visual input, we further
introduce a novel collaborative decoding mechanism, enabling optimal
performance. Extensive experiments validate the effectiveness of our method,
demonstrating consistent performance improvements across various MLLMs. The
code will be publicly available.
Alternating Training-based Label Smoothing Enhances Prompt Generalization
Authors: Yang Chen, Yanbin Wei, Ke Jin, Yi Kong, James Kwok, Yu Zhang
First: 2025-08-25T09:54:37+00:00 · Latest: 2025-08-25T09:54:37+00:00
Abstract
Recent advances in pre-trained vision-language models have demonstrated
remarkable zero-shot generalization capabilities. To further enhance these
models' adaptability to various downstream tasks, prompt tuning has emerged as
a parameter-efficient fine-tuning method. However, despite its efficiency, the
generalization ability of prompts remains limited. In contrast, label smoothing
(LS) has been widely recognized as an effective regularization technique that
prevents models from becoming over-confident and improves their generalization.
This inspires us to explore the integration of LS with prompt tuning. However,
we have observed that the vanilla LS even weakens the generalization ability of
prompt tuning. To address this issue, we propose the Alternating Training-based
Label Smoothing (ATLaS) method, which alternately trains with standard one-hot
labels and soft labels generated by LS to supervise the prompt tuning.
Moreover, we introduce two types of efficient offline soft labels, including
Class-wise Soft Labels (CSL) and Instance-wise Soft Labels (ISL), to provide
inter-class or instance-class relationships for prompt tuning. The theoretical
properties of the proposed ATLaS method are analyzed. Extensive experiments
demonstrate that the proposed ATLaS method, combined with CSL and ISL,
consistently enhances the generalization performance of prompt tuning.
Moreover, the proposed ATLaS method exhibits high compatibility with prevalent
prompt tuning methods, enabling seamless integration into existing methods.
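
The alternation between hard and soft supervision can be sketched as follows. The smoothing formula is the standard uniform label smoothing. The per-epoch alternation schedule, the `eps` value, and the function names are assumptions, and the offline CSL and ISL soft labels are not modeled here.

```python
def one_hot(label, num_classes):
    """Standard one-hot target vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def smoothed(label, num_classes, eps=0.1):
    """Uniform label smoothing: (1 - eps) * one_hot + eps / num_classes."""
    return [(1.0 - eps) * v + eps / num_classes
            for v in one_hot(label, num_classes)]

def target_for_epoch(label, num_classes, epoch, eps=0.1):
    """Alternate supervision: even epochs use one-hot targets, odd epochs
    use smoothed targets (an assumed schedule, for illustration only)."""
    if epoch % 2 == 0:
        return one_hot(label, num_classes)
    return smoothed(label, num_classes, eps)

for epoch in range(2):
    print(epoch, target_for_epoch(1, 4, epoch))
```

Either target can then be fed to a cross-entropy loss that accepts probability vectors, so the alternation needs no change to the prompt-tuning method itself.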
PoRe: Position-Reweighted Visual Token Pruning for Vision Language Models
Authors: Kai Zhao, Wubang Yuan, Alex Lingyu Hung, Dan Zeng
First: 2025-08-25T08:56:32+00:00 · Latest: 2025-08-25T08:56:32+00:00
Abstract
Vision-Language Models (VLMs) typically process a significantly larger number
of visual tokens compared to text tokens due to the inherent redundancy in
visual signals. Visual token pruning is a promising direction to reduce the
computational cost of VLMs by eliminating redundant visual tokens. The
text-visual attention score is a widely adopted criterion for visual token
pruning as it reflects the relevance of visual tokens to the text input.
However, many sequence models exhibit a recency bias, where tokens appearing
later in the sequence exert a disproportionately large influence on the model's
output. In VLMs, this bias manifests as inflated attention scores for tokens
corresponding to the lower regions of the image, leading to suboptimal pruning
that disproportionately retains tokens from the image bottom. In this paper, we
present an extremely simple yet effective approach to alleviate the recency
bias in visual token pruning. We propose a straightforward reweighting
mechanism that adjusts the attention scores of visual tokens according to their
spatial positions in the image. Our method, termed Position-reweighted Visual
Token Pruning, is a plug-and-play solution that can be seamlessly incorporated
into existing visual token pruning frameworks without any changes to the model
architecture or extra training. Extensive experiments on LVLMs demonstrate that
our method improves the performance of visual token pruning with minimal
computational overhead.
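
A position-based reweighting of attention scores before pruning can be sketched as below. The abstract does not specify the weighting function, so the linear discount, the `strength` parameter, and the assumption that tokens are indexed in raster order (top of the image first) are all hypothetical.

```python
def reweight_scores(scores, strength=1.0):
    """Discount attention scores of later tokens to offset recency bias.

    Token i (raster order assumed) is divided by 1 + strength * i / (n - 1),
    a hypothetical linear weight that grows toward the image bottom.
    """
    n = len(scores)
    if n < 2:
        return list(scores)
    return [s / (1.0 + strength * i / (n - 1)) for i, s in enumerate(scores)]

def top_k_indices(scores, k):
    """Indices of the k highest-scoring tokens, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

raw = [0.20, 0.22, 0.24, 0.30]   # scores inflated toward later tokens
print(top_k_indices(raw, 2))                    # pruning on raw scores
print(top_k_indices(reweight_scores(raw), 2))   # pruning after reweighting
```

Because only the scores change, a step like this can sit in front of an existing attention-based pruning rule without touching the model or requiring training.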
Scaling Capability in Token Space: An Analysis of Large Vision Language Model
Authors: Tenghui Li, Guoxu Zhou, Xuyang Zhao, Qibin Zhao
First: 2024-12-24T12:20:24+00:00 · Latest: 2025-08-25T08:35:47+00:00
Abstract
Large language models have demonstrated predictable scaling behaviors with
respect to model parameters and training data. This study investigates whether
a similar scaling relationship exists for vision-language models with respect to
the number of vision tokens. A mathematical framework is developed to
characterize a relationship between vision token number and the expected
divergence of distance between vision-referencing sequences. The theoretical
analysis reveals two distinct scaling regimes: sublinear scaling for fewer
vision tokens and linear scaling for more vision tokens. This aligns with model
performance relationships of the form \(S(n) \approx c / n^{\alpha(n)}\), where
the scaling exponent relates to the correlation structure between vision token
representations. Empirical validations across multiple vision-language
benchmarks show that model performance matches the prediction from scaling
relationship. The findings contribute to understanding vision token scaling in
transformers through a theoretical framework that complements empirical
observations.
Summary / 总结
Large language models have demonstrated predictable scaling behaviors with respect to model parameters and training data.
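As a rough illustration of the relationship \(S(n) \approx c / n^{\alpha(n)}\): taking logarithms gives \(\log S = \log c - \alpha \log n\), so an exponent can be estimated from two measurements if \(\alpha\) is treated as constant between them. That constancy is an assumption of this sketch, the function name is hypothetical, and the abstract does not specify what \(S\) measures.

```python
import math


def local_exponent(n1, s1, n2, s2):
    """Estimate alpha in S(n) = c / n**alpha from two (n, S) measurements.

    Assumes alpha is constant between n1 and n2 (a simplification; the
    abstract writes the exponent as alpha(n)).
    """
    if n1 == n2:
        raise ValueError("n1 and n2 must differ")
    return -(math.log(s2) - math.log(s1)) / (math.log(n2) - math.log(n1))


# Synthetic check with c = 2 and alpha = 0.5: S(4) = 1.0 and S(16) = 0.5.
alpha = local_exponent(4, 1.0, 16, 0.5)
```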
DSADF: Thinking Fast and Slow for Decision Making
Authors: Zhihao Dou, Dongfei Cui, Jun Yan, Weida Wang, Benteng Chen, Haoming Wang, Zeke Xie, Shufei Zhang
First: 2025-05-13T02:58:04+00:00 · Latest: 2025-08-25T08:03:43+00:00
Abstract
Although Reinforcement Learning (RL) agents are effective in well-defined
environments, they often struggle to generalize their learned policies to
dynamic settings due to their reliance on trial-and-error interactions. Recent
work has explored applying Large Language Models (LLMs) or Vision Language
Models (VLMs) to boost the generalization of RL agents through policy
optimization guidance or prior knowledge. However, these approaches often lack
seamless coordination between the RL agent and the foundation model, leading to
unreasonable decision-making in unfamiliar environments and efficiency
bottlenecks. Making full use of the inferential capabilities of foundation
models and the rapid response capabilities of RL agents and enhancing the
interaction between the two to form a dual system is still a lingering
scientific question. To address this problem, we draw inspiration from
Kahneman's theory of fast thinking (System 1) and slow thinking (System 2),
demonstrating that balancing intuition and deep reasoning can achieve nimble
decision-making in a complex world. In this study, we propose a Dual-System
Adaptive Decision Framework (DSADF), integrating two complementary modules:
System 1, comprising an RL agent and a memory space for fast and intuitive
decision making, and System 2, driven by a VLM for deep and analytical
reasoning. DSADF facilitates efficient and adaptive decision-making by
combining the strengths of both systems. Empirical studies in the video game
environments Crafter and Housekeep demonstrate the effectiveness of our
proposed method, showing significant improvements in decision abilities for
both unseen and known tasks.
Summary / 总结
Although Reinforcement Learning (RL) agents are effective in well-defined environments, they often struggle to generalize their learned policies to dynamic settings due to their reliance on trial-and-error interactions.
SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling
Authors: Fanjiang Ye, Zepeng Zhao, Yi Mu, Jucheng Shen, Renjie Li, Kaijian Wang, Desen Sun, Saurabh Agarwal, Myungjin Lee, Triston Cao, Aditya Akella, Arvind Krishnamurthy, T. S. Eugene Ng, Zhengzhong Tu, Yuke Wang
First: 2025-08-25T07:49:17+00:00 · Latest: 2025-08-25T07:49:17+00:00
Abstract
Diffusion models have recently achieved remarkable success in generative
tasks (e.g., image and video generation), and the demand for high-quality
content (e.g., 2K/4K videos) is rapidly increasing across various domains.
However, generating ultra-high-resolution videos on existing
standard-resolution (e.g., 720p) platforms remains challenging due to the
excessive re-training requirements and prohibitively high computational and
memory costs. To this end, we introduce SuperGen, an efficient tile-based
framework for ultra-high-resolution video generation. SuperGen features a novel
training-free algorithmic innovation with tiling to successfully support a wide
range of resolutions without additional training efforts while significantly
reducing both memory footprint and computational complexity. Moreover, SuperGen
incorporates a tile-tailored, adaptive, region-aware caching strategy that
accelerates video generation by exploiting redundancy across denoising steps
and spatial regions. SuperGen also integrates cache-guided,
communication-minimized tile parallelism for enhanced throughput and minimized
latency. Evaluations demonstrate that SuperGen harvests the maximum performance
gains while achieving high output quality across various benchmarks.
Summary / 总结
Diffusion models have recently achieved remarkable success in generative tasks (e.g., image and video generation), and the demand for high-quality content (e.g., 2K/4K videos) is rapidly increasing across various domains.
Instant Preference Alignment for Text-to-Image Diffusion Models
Authors: Yang Li, Songlin Yang, Xiaoxuan Han, Wei Wang, Jing Dong, Yueming Lyu, Ziyu Xue
First: 2025-08-25T06:51:15+00:00 · Latest: 2025-08-25T06:51:15+00:00
Comments: 17 figures
Abstract
Text-to-image (T2I) generation has greatly enhanced creative expression, yet
achieving preference-aligned generation in a real-time and training-free manner
remains challenging. Previous methods often rely on static, pre-collected
preferences or fine-tuning, limiting adaptability to evolving and nuanced user
intents. In this paper, we highlight the need for instant preference-aligned
T2I generation and propose a training-free framework grounded in multimodal
large language model (MLLM) priors. Our framework decouples the task into two
components: preference understanding and preference-guided generation. For
preference understanding, we leverage MLLMs to automatically extract global
preference signals from a reference image and enrich a given prompt using
structured instruction design. Our approach supports broader and more
fine-grained coverage of user preferences than existing methods. For
preference-guided generation, we integrate global keyword-based control and
local region-aware cross-attention modulation to steer the diffusion model
without additional training, enabling precise alignment across both global
attributes and local elements. The entire framework supports multi-round
interactive refinement, facilitating real-time and context-aware image
generation. Extensive experiments on the Viper dataset and our collected
benchmark demonstrate that our method outperforms prior approaches in both
quantitative metrics and human evaluations, and opens up new possibilities for
dialog-based generation and MLLM-diffusion integration.
Summary / 总结
Text-to-image (T2I) generation has greatly enhanced creative expression, yet achieving preference-aligned generation in a real-time and training-free manner remains challenging.
F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model
Authors: Hanbo Bi, Zhiqiang Yuan, Zexi Jia, Jiapei Zhang, Chongyang Li, Peixiang Luo, Ying Deng, Xiaoyue Duan, Jinchao Zhang
First: 2025-08-25T06:42:47+00:00 · Latest: 2025-08-25T06:42:47+00:00
Abstract
Traditional dialogue retrieval aims to select the most appropriate utterance
or image from recent dialogue history. However, such methods often fail to meet users'
actual needs for revisiting semantically coherent content scattered across
long-form conversations. To fill this gap, we define the Fine-grained Fragment
Retrieval (FFR) task, requiring models to locate query-relevant fragments,
comprising both utterances and images, from multimodal long-form dialogues. As
a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue
retrieval dataset to date, averaging 25.45 turns per dialogue, with each
naturally spanning three distinct topics. To evaluate generalization in
real-world scenarios, we curate and annotate a WeChat-based test set comprising
real-world multimodal dialogues with an average of 75.38 turns. Building on
these resources, we explore existing generation-based Vision-Language Models
(VLMs) on FFR and observe that they often retrieve incoherent utterance-image
fragments. While optimized for generating responses from visual-textual inputs,
these models lack explicit supervision to ensure semantic coherence within
retrieved fragments. To this end, we propose F2RVLM, a generative retrieval
model trained in a two-stage paradigm: (1) supervised fine-tuning to inject
fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning
with multi-objective rewards promoting semantic precision, relevance, and
contextual coherence. To handle varying intra-fragment complexity, from locally
dense to sparsely distributed, we introduce difficulty-aware curriculum
sampling that ranks training instances by model-predicted difficulty and
gradually exposes the model to harder samples. This boosts reasoning ability in
long, multi-turn contexts. F2RVLM outperforms popular VLMs in both in-domain
and real-domain settings, demonstrating superior retrieval performance.
Summary / 总结
Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history.
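The difficulty-aware curriculum sampling mentioned in this abstract can be sketched as follows. The abstract says only that instances are ranked by model-predicted difficulty and that harder samples are introduced gradually, so the linear staging schedule and the name `curriculum_pool` are assumptions.

```python
import math


def curriculum_pool(samples, difficulty, stage, num_stages):
    """Return the training pool exposed at a given curriculum stage.

    samples: list of training instances.
    difficulty: callable mapping an instance to a predicted difficulty score.
    stage: zero-based stage index; later stages admit harder samples.
    The linear growth of the pool is a hypothetical schedule.
    """
    if not 0 <= stage < num_stages:
        raise ValueError("stage must be in [0, num_stages)")
    ranked = sorted(samples, key=difficulty)  # easiest first
    cutoff = math.ceil(len(ranked) * (stage + 1) / num_stages)
    return ranked[:cutoff]


data = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.3)]
first_stage = curriculum_pool(data, lambda s: s[1], stage=0, num_stages=2)
last_stage = curriculum_pool(data, lambda s: s[1], stage=1, num_stages=2)
```

Here the first stage exposes only the two easiest instances and the last stage exposes all four.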
Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing
Authors: Yogesh Kumar
First: 2025-08-25T05:51:21+00:00 · Latest: 2025-08-25T05:51:21+00:00
Abstract
Vision Language Models (VLMs) struggle with long-form videos due to the
quadratic complexity of attention mechanisms. We propose Language-Guided
Temporal Token Pruning (LGTTP), which leverages temporal cues from queries to
adaptively prune video tokens, preserving contextual continuity while reducing
computational overhead. Unlike uniform pruning or keyframe selection, LGTTP
retains higher token density in temporally relevant segments. Our
model-agnostic framework integrates with TimeChat and LLaVA-Video, achieving a
65% reduction in computation while preserving 97-99% of the original
performance. On QVHighlights, LGTTP improves HIT@1 by +9.5%, and on
Charades-STA, it retains 99.6% of R@1. It excels on queries with explicit
temporal markers and remains effective across general video understanding
tasks.
Summary / 总结
Vision Language Models (VLMs) struggle with long-form videos due to the quadratic complexity of attention mechanisms.
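One way to picture "higher token density in temporally relevant segments" is a budget allocation over video segments. The sketch below assumes per-segment relevance scores are already available (for example, derived from temporal cues in the query); the proportional rule, the per-segment `floor`, and the function name are illustrative assumptions rather than the LGTTP procedure itself.

```python
def allocate_tokens(relevance, budget, floor=1):
    """Split a token budget across segments in proportion to relevance.

    relevance: non-negative score per temporal segment (assumed given).
    budget: total number of tokens to keep.
    floor: minimum tokens per segment, so no segment is dropped entirely
           (a hypothetical way to preserve contextual continuity).
    """
    n = len(relevance)
    if budget < floor * n:
        raise ValueError("budget too small for the per-segment floor")
    spare = budget - floor * n
    total = sum(relevance)
    if total <= 0:
        shares = [spare / n] * n  # no signal: fall back to uniform pruning
    else:
        shares = [spare * r / total for r in relevance]
    alloc = [floor + int(s) for s in shares]
    # Hand out tokens lost to rounding, largest fractional share first.
    leftover = budget - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


tokens_per_segment = allocate_tokens([1, 7, 2], budget=13)
```

With these inputs the middle segment, which has the highest relevance, keeps 8 of the 13 tokens while the other two keep 2 and 3.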
Hierarchical Vision-Language Learning for Medical Out-of-Distribution Detection
Authors: Runhe Lai, Xinhua Lu, Kanghao Chen, Qichao Chen, Wei-Shi Zheng, Ruixuan Wang
First: 2025-08-25T04:55:27+00:00 · Latest: 2025-08-25T04:55:27+00:00
Comments: 10 pages, 2 figures, Accepted by MICCAI2025
Abstract
In trustworthy medical diagnosis systems, integrating out-of-distribution
(OOD) detection aims to identify unknown diseases in samples, thereby
mitigating the risk of misdiagnosis. In this study, we propose a novel OOD
detection framework based on vision-language models (VLMs), which integrates
hierarchical visual information to cope with challenging unknown diseases that
resemble known diseases. Specifically, a cross-scale visual fusion strategy is
proposed to couple visual embeddings from multiple scales. This enriches the
detailed representation of medical images and thus improves the discrimination
of unknown diseases. Moreover, a cross-scale hard pseudo-OOD sample generation
strategy is proposed to benefit OOD detection maximally. Experimental
evaluations on three public medical datasets support that the proposed
framework achieves superior OOD detection performance compared to existing
methods. The source code is available at https://openi.pcl.ac.cn/OpenMedIA/HVL.
Summary / 总结
In trustworthy medical diagnosis systems, integrating out-of-distribution (OOD) detection aims to identify unknown diseases in samples, thereby mitigating the risk of misdiagnosis.
Dynamic Embedding of Hierarchical Visual Features for Efficient Vision-Language Fine-Tuning
Authors: Xinyu Wei, Guoli Yang, Jialu Zhou, Mingyue Yang, Leqian Li, Kedi Zhang, Chunping Qiu
First: 2025-08-25T03:57:46+00:00 · Latest: 2025-08-25T03:57:46+00:00
Abstract
Large Vision-Language Models (LVLMs) commonly follow a paradigm that projects
visual features and then concatenates them with text tokens to form a unified
sequence input for Large Language Models (LLMs). However, this paradigm leads
to a significant increase in the length of the input sequence, resulting in
substantial computational overhead. Existing methods attempt to fuse visual
information into the intermediate layers of LLMs, which alleviates the sequence
length issue but often neglect the hierarchical semantic representations within
the model and the fine-grained visual information available in the shallower
visual encoding layers. To address this limitation, we propose DEHVF, an
efficient vision-language fine-tuning method based on dynamic embedding and
fusion of hierarchical visual features. Its core lies in leveraging the
inherent hierarchical representation characteristics of visual encoders and
language models. Through a lightweight hierarchical visual fuser, it
dynamically selects and fuses hierarchical features corresponding to semantic
granularity based on the internal representations of each layer in LLMs. The
fused layer-related visual features are then projected and aligned before being
directly embedded into the Feed-Forward Network (FFN) of the corresponding
layer in LLMs. This approach not only avoids sequence expansion but also
dynamically fuses multi-layer visual information. By fine-tuning only a small
number of parameters, DEHVF achieves precise alignment and complementarity of
cross-modal information at the same semantic granularity. We conducted
experiments across various VL benchmarks, including visual question answering
on ScienceQA and image captioning on COCO Captions. The results demonstrate
that DEHVF achieves higher accuracy than existing parameter-efficient
fine-tuning (PEFT) baselines while maintaining efficient training and
inference.
Summary / 总结
Large Vision-Language Models (LVLMs) commonly follow a paradigm that projects visual features and then concatenates them with text tokens to form a unified sequence input for Large Language Models (LLMs).
OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
Authors: Huanpeng Chu, Wei Wu, Guanyu Fen, Yutao Zhang
Venue: ICCV 2025
First: 2025-08-22T08:36:58+00:00 · Latest: 2025-08-25T03:07:02+00:00
Comments: Accepted by ICCV 2025
Abstract
Diffusion models have emerged as a powerful paradigm for generative tasks
such as image synthesis and video generation, with Transformer architectures
further enhancing performance. However, the high computational cost of
diffusion Transformers, stemming from a large number of sampling steps and
complex per-step computations, presents significant challenges for real-time
deployment. In this paper, we introduce OmniCache, a training-free acceleration
method that exploits the global redundancy inherent in the denoising process.
Unlike existing methods that determine caching strategies based on inter-step
similarities and tend to prioritize reusing later sampling steps, our approach
originates from the sampling perspective of DiT models. We systematically
analyze the model's sampling trajectories and strategically distribute cache
reuse across the entire sampling process. This global perspective enables more
effective utilization of cached computations throughout the diffusion
trajectory, rather than concentrating reuse within limited segments of the
sampling procedure. In addition, during cache reuse, we dynamically estimate
the corresponding noise and filter it out to reduce its impact on the sampling
direction. Extensive experiments demonstrate that our approach accelerates the
sampling process while maintaining competitive generative quality, offering a
promising and practical solution for efficient deployment of diffusion-based
generative models.
Summary / 总结
Diffusion models have emerged as a powerful paradigm for generative tasks such as image synthesis and video generation, with Transformer architectures further enhancing performance.
T*: Re-thinking Temporal Search for Long-Form Video Understanding
Authors: Jinhui Ye, Zihan Wang, Haosen Sun, Keshigeyan Chandrasegaran, Zane Durante, Cristobal Eyzaguirre, Yonatan Bisk, Juan Carlos Niebles, Ehsan Adeli, Li Fei-Fei, Jiajun Wu, Manling Li
Venue: CVPR 2025 long
First: 2025-04-03T04:03:10+00:00 · Latest: 2025-08-25T02:57:46+00:00
Comments: Accepted by CVPR 2025; A real-world long video needle-in-haystack
benchmark; long-video QA with human ref frames
Abstract
Efficiently understanding long-form videos remains a significant challenge in
computer vision. In this work, we revisit temporal search paradigms for
long-form video understanding and address a fundamental issue pertaining to all
state-of-the-art (SOTA) long-context vision-language models (VLMs). Our
contributions are twofold: First, we frame temporal search as a Long Video
Haystack problem: finding a minimal set of relevant frames (e.g., one to five)
from tens of thousands based on specific queries. Upon this formulation, we
introduce LV-Haystack, the first dataset with 480 hours of videos, 15,092
human-annotated instances for both training and evaluation aiming to improve
temporal search quality and efficiency. Results on LV-Haystack highlight a
significant research gap in temporal search capabilities, with current SOTA
search methods only achieving 2.1% temporal F1 score on the Longvideobench
subset. Next, inspired by visual search in images, we propose a lightweight
temporal search framework, T*, that reframes costly temporal search as spatial
search. T* leverages powerful visual localization techniques commonly used in
images and introduces an adaptive zooming-in mechanism that operates across
both temporal and spatial dimensions. Extensive experiments show that
integrating T* with existing methods significantly improves SOTA long-form
video understanding. Under an inference budget of 32 frames, T* improves
GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-OV-72B's
performance from 56.5% to 62.4% on the Longvideobench XL subset. Our code,
benchmark, and models are provided in the Supplementary material.
Summary / 总结
Efficiently understanding long-form videos remains a significant challenge in computer vision.
TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints
Authors: Vinh-Thuan Ly, Hoang M. Truong, Xuan-Huong Nguyen
Venue: ICCV
First: 2025-08-25T01:36:22+00:00 · Latest: 2025-08-25T01:36:22+00:00
Comments: Accepted for presentation at the IEEE/CVF International Conference on
Computer Vision (ICCV) Workshops, 2025
Abstract
Reasoning about fine-grained spatial relationships in warehouse-scale
environments poses a significant challenge for existing vision-language models
(VLMs), which often struggle to comprehend 3D layouts, object arrangements, and
multimodal cues in real-world industrial settings. In this paper, we present
TinyGiantVLM, a lightweight and modular two-stage framework designed for
physical spatial reasoning, distinguishing itself from traditional geographic
reasoning in complex logistics scenes. Our approach encodes both global and
region-level features from RGB and depth modalities using pretrained visual
backbones. To effectively handle the complexity of high-modality inputs and
diverse question types, we incorporate a Mixture-of-Experts (MoE) fusion
module, which dynamically combines spatial representations to support
downstream reasoning tasks and improve convergence. Training is conducted in a
two-phase strategy: the first phase focuses on generating free-form answers to
enhance spatial reasoning ability, while the second phase uses normalized
answers for evaluation. Evaluated on Track 3 of the AI City Challenge 2025, our
64M-parameter base model achieved 5th place on the leaderboard with a score of
66.8861, demonstrating strong performance in bridging visual perception and
spatial understanding in industrial environments. We further present an
80M-parameter variant with expanded MoE capacity, which demonstrates improved
performance on spatial reasoning tasks.
Summary / 总结
Reasoning about fine-grained spatial relationships in warehouse-scale environments poses a significant challenge for existing vision-language models (VLMs), which often struggle to comprehend 3D layouts, object arrangements, and multimodal cues in real-world industrial settings.
HERO: Hierarchical Extrapolation and Refresh for Efficient World Models
Authors: Quanjian Song, Xinyu Wang, Donghao Zhou, Jingyu Lin, Cunjian Chen, Yue Ma, Xiu Li
First: 2025-08-25T01:22:15+00:00 · Latest: 2025-08-25T01:22:15+00:00
Comments: 12 pages in total
Abstract
Generation-driven world models create immersive virtual environments but
suffer slow inference due to the iterative nature of diffusion models. While
recent advances have improved diffusion model efficiency, directly applying
these techniques to world models introduces limitations such as quality
degradation. In this paper, we present HERO, a training-free hierarchical
acceleration framework tailored for efficient world models. Owing to the
multi-modal nature of world models, we identify a feature coupling phenomenon,
wherein shallow layers exhibit high temporal variability, while deeper layers
yield more stable feature representations. Motivated by this, HERO adopts
hierarchical strategies to accelerate inference: (i) In shallow layers, a
patch-wise refresh mechanism efficiently selects tokens for recomputation. With
patch-wise sampling and frequency-aware tracking, it avoids extra metric
computation and remains compatible with FlashAttention. (ii) In deeper layers, a
linear extrapolation scheme directly estimates intermediate features. This
completely bypasses the computations in attention modules and feed-forward
networks. Our experiments show that HERO achieves a 1.73$\times$ speedup with
minimal quality degradation, significantly outperforming existing diffusion
acceleration methods.
Summary / 总结
Generation-driven world models create immersive virtual environments but suffer slow inference due to the iterative nature of diffusion models.
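The linear extrapolation used for deeper layers can be illustrated with a first-order sketch: the next feature vector is estimated from the two most recent ones instead of being recomputed. The abstract does not give the exact formula, so the two-point form, the `step` parameter, and the function name are assumptions.

```python
def extrapolate_features(prev, curr, step=1.0):
    """Linearly extrapolate the next feature vector from the last two.

    prev, curr: feature vectors (lists of floats) from two earlier denoising
    steps. step: how far ahead to extrapolate, in units of the gap between
    prev and curr (hypothetical parameter).
    """
    if len(prev) != len(curr):
        raise ValueError("feature vectors must have the same length")
    return [c + step * (c - p) for p, c in zip(prev, curr)]


estimated = extrapolate_features([1.0, 2.0], [2.0, 2.5])
```

The estimate costs one subtraction and one addition per element, which is why such a scheme can skip the attention and feed-forward computation for the extrapolated step.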
MetaGen: A DSL, Database, and Benchmark for VLM-Assisted Metamaterial Generation
Authors: Liane Makatura, Benjamin Jones, Siyuan Bian, Wojciech Matusik
First: 2025-08-25T00:36:07+00:00 · Latest: 2025-08-25T00:36:07+00:00
Abstract
Metamaterials are micro-architected structures whose geometry imparts highly
tunable, often counter-intuitive, bulk properties. Yet their design is difficult
because of geometric complexity and a non-trivial mapping from architecture to
behaviour. We address these challenges with three complementary contributions.
(i) MetaDSL: a compact, semantically rich domain-specific language that
captures diverse metamaterial designs in a form that is both human-readable and
machine-parsable. (ii) MetaDB: a curated repository of more than 150,000
parameterized MetaDSL programs together with their
derivatives: three-dimensional geometry, multi-view renderings, and simulated
elastic properties. (iii) MetaBench: benchmark suites that test three core
capabilities of vision-language metamaterial assistants: structure
reconstruction, property-driven inverse design, and performance prediction. We
establish baselines by fine-tuning state-of-the-art vision-language models and
deploy an omni-model within an interactive, CAD-like interface. Case studies
show that our framework provides a strong first step toward integrated design
and understanding of structure-representation-property relationships.
Summary / 总结
Metamaterials are micro-architected structures whose geometry imparts highly tunable, often counter-intuitive, bulk properties.
RT-Cache: Training-Free Retrieval for Real-Time Manipulation
Authors: Owen Kwon, Abraham George, Alison Bartsch, Amir Barati Farimani
First: 2025-05-14T00:41:44+00:00 · Latest: 2025-08-25T00:15:27+00:00
Comments: 8 pages, 6 figures. 2025 IEEE-RAS 24th International Conference on
Humanoid Robots
Abstract
Real robots are expected to repeat the same behavior in new environments with
very little new data, yet modern controllers either incur heavy per-step
inference or require deployment-time fine-tuning. We propose RT-Cache, a
training-free retrieval-as-control pipeline that caches diverse image-action
trajectories in a unified vector memory and, at test time, embeds the current
frame to retrieve and replay multi-step snippets, replacing per-step model
calls. A hierarchical search keeps lookups sub-second at million scale,
shifting cost from compute to storage and enabling real-time control on modest
GPUs. Across real-robot tasks and large open logs, RT-Cache achieves higher
success and lower completion time than strong retrieval baselines
(approximately 2x higher success and ~30% faster in our settings), and a
single-episode anchoring study shows immediate adaptation to a more complex,
contact-rich task without fine-tuning. RT-Cache turns experience into an
append-only memory, offering a simple, scalable path to few-shot deployment
today and a foundation for multimodal keys and optional integration with
high-level policies. Project page: https://rt-cache.github.io/.
Summary / 总结
Real robots are expected to repeat the same behavior in new environments with very little new data, yet modern controllers either incur heavy per-step inference or require deployment-time fine-tuning.
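The retrieval-as-control loop described in this abstract can be sketched as an append-only memory keyed by frame embeddings. This sketch swaps in a brute-force cosine-similarity scan for the hierarchical search the abstract mentions, and assumes embeddings are produced elsewhere; the class and method names are hypothetical.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


class TrajectoryCache:
    """Append-only memory mapping frame embeddings to multi-step action snippets."""

    def __init__(self):
        self._keys = []
        self._snippets = []

    def append(self, embedding, actions):
        """Store one (embedding, action snippet) pair; nothing is ever overwritten."""
        self._keys.append(list(embedding))
        self._snippets.append(list(actions))

    def retrieve(self, query):
        """Return the snippet whose key is most similar to the query embedding.

        Brute-force scan for clarity; not the hierarchical search from the paper.
        """
        if not self._keys:
            raise LookupError("cache is empty")
        best = max(range(len(self._keys)), key=lambda i: _cosine(query, self._keys[i]))
        return self._snippets[best]


cache = TrajectoryCache()
cache.append([1.0, 0.0], ["left", "left"])
cache.append([0.0, 1.0], ["up"])
replay = cache.retrieve([0.9, 0.1])
```

The retrieved snippet is replayed for several steps before the next lookup, which is how per-step model calls are replaced by memory reads.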