Flow Matching-Based Generative Modeling for Efficient and Scalable Data Assimilation
Authors: Taos Transue, Bohan Chen, So Takao, Bao Wang
First: 2025-08-18T19:00:45+00:00 · Latest: 2025-08-22T15:54:49+00:00
Comments: correcting authorship footnote, reformatting figures
Abstract
Data assimilation (DA) is the problem of sequentially estimating the state of
a dynamical system from noisy observations. Recent advances in generative
modeling have inspired new approaches to DA in high-dimensional nonlinear
settings, especially the ensemble score filter (EnSF). However, these approaches
incur a significant computational burden due to slow sampling. In this paper, we
introduce a new filtering framework based on flow matching (FM) -- called the
ensemble flow filter (EnFF) -- to accelerate sampling and enable flexible
design of probability paths. EnFF -- a training-free DA approach -- integrates
MC estimators for the marginal FM vector field (VF) and a localized guidance to
assimilate observations. EnFF has faster sampling and more flexibility in VF
design compared to existing generative modeling for DA. Theoretically, we show
that EnFF encompasses classical filtering methods such as the bootstrap
particle filter and the ensemble Kalman filter as special cases. Experiments on
high-dimensional filtering benchmarks demonstrate improved cost-accuracy
tradeoffs and the ability to leverage larger ensembles than prior methods. Our
results highlight the promise of FM as a scalable filtering tool that enables
the use of large ensembles in high-dimensional applications.
Modular Embedding Recomposition for Incremental Learning
Authors: Aniello Panariello, Emanuele Frascaroli, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara
First: 2025-08-22T15:25:40+00:00 · Latest: 2025-08-22T15:25:40+00:00
Comments: Accepted to the 36th British Machine Vision Conference (BMVC 2025),
Sheffield, UK
Abstract
The advent of pre-trained Vision-Language Models (VLMs) has significantly
transformed Continual Learning (CL), mainly due to their zero-shot
classification abilities. Such proficiency makes VLMs well-suited for
real-world applications, enabling robust performance on novel unseen classes
without requiring adaptation. However, fine-tuning remains essential when
downstream tasks deviate significantly from the pre-training domain. Prior CL
approaches primarily focus on preserving the zero-shot capabilities of VLMs
during incremental fine-tuning on a downstream task. We take a step further by
devising an approach that transforms preservation into enhancement of the
zero-shot capabilities of VLMs. Our approach, named MoDular Embedding
Recomposition (MoDER), introduces a modular framework that trains multiple
textual experts, each specialized in a single seen class, and stores them in a
foundational hub. At inference time, for each unseen class, we query the hub
and compose the retrieved experts to synthesize a refined prototype that
improves classification. We show the effectiveness of our method across two
popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total
of 14 datasets. The codebase is available at
https://github.com/aimagelab/mammoth.
PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark
Authors: Adil Bahaj, Mounir Ghogho
First: 2025-08-22T14:50:55+00:00 · Latest: 2025-08-22T14:50:55+00:00
Abstract
Large language models (LLMs) and vision-augmented LLMs (VLMs) have
significantly advanced medical informatics, diagnostics, and decision support.
However, these models exhibit systematic biases, particularly age bias,
compromising their reliability and equity. This is evident in their poorer
performance on pediatric-focused text and visual question-answering tasks. This
bias reflects a broader imbalance in medical research, where pediatric studies
receive less funding and representation despite the significant disease burden
in children. To address these issues, a new comprehensive multi-modal pediatric
question-answering benchmark, PediatricsMQA, has been introduced. It consists
of 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric
topics across seven developmental stages (prenatal to adolescent) and 2,067
vision-based MCQs using 634 pediatric images from 67 imaging modalities and 256
anatomical regions. The dataset was developed using a hybrid manual-automatic
pipeline, incorporating peer-reviewed pediatric literature, validated question
banks, existing benchmarks, and existing QA resources. Evaluating
state-of-the-art open models, we find dramatic performance drops in younger
cohorts, highlighting the need for age-aware methods to ensure equitable AI
support in pediatric care.
CAMA: Enhancing Multimodal In-Context Learning with Context-Aware Modulated Attention
Authors: Yanshu Li, Jianjiang Yang, Ziteng Yang, Bozheng Li, Hongyang He, Zhengtao Yao, Ligong Han, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, Ruixiang Tang
First: 2025-05-21T04:25:23+00:00 · Latest: 2025-08-22T14:44:22+00:00
Comments: 14 pages, 8 figures, 5 tables
Abstract
Multimodal in-context learning (ICL) is emerging as a key capability that
enables large vision-language models (LVLMs) to adapt to novel tasks without
parameter updates, expanding their utility across various real-world
applications. However, ICL remains unstable, even with well-matched in-context
demonstrations (ICDs), suggesting that LVLMs struggle to fully utilize the
provided context. While existing efforts focus on prompt engineering or
post-hoc logit calibration, we instead investigate the underlying attention
dynamics to overcome LVLMs' inherent limitations. We identify two critical
deficits in their self-attention that impair effective ICL. To bridge the gap,
we propose Context-Aware Modulated Attention (CAMA), a plug-and-play
and training-free method that dynamically modulates an LVLM's attention logits
based on the input in-context sequence. CAMA employs a two-stage attention
modulation to address both identified deficits, enhancing the focus on
semantically significant tokens, particularly visual ones. Across four LVLMs
and seven benchmarks, CAMA consistently outperforms vanilla models and
baselines, demonstrating great effectiveness and generalization. It can also
activate the desired effects of prompt engineering methods and remains robust
under diverse sequence configurations. Thus, CAMA paves the way for deeper
explorations of attention dynamics to advance multimodal reasoning.
Do What? Teaching Vision-Language-Action Models to Reject the Impossible
Authors: Wen-Han Hsieh, Elvis Hsieh, Dantong Niu, Trevor Darrell, Roei Herzig, David M. Chan
First: 2025-08-22T10:54:33+00:00 · Latest: 2025-08-22T10:54:33+00:00
Comments: 9 pages, 2 figures, 1 table
Abstract
Recently, Vision-Language-Action (VLA) models have demonstrated strong
performance on a range of robotic tasks. These models rely on multimodal
inputs, with language instructions playing a crucial role -- not only in
predicting actions, but also in robustly interpreting user intent, even when
the requests are impossible to fulfill. In this work, we investigate how VLAs
can recognize, interpret, and respond to false-premise instructions: natural
language commands that reference objects or conditions absent from the
environment. We propose Instruct-Verify-and-Act (IVA), a unified framework that
(i) detects when an instruction cannot be executed due to a false premise, (ii)
engages in language-based clarification or correction, and (iii) grounds
plausible alternatives in perception and action. Towards this end, we construct
a large-scale instruction tuning setup with structured language prompts and
train a VLA model capable of handling both accurate and erroneous requests. Our
approach leverages a contextually augmented, semi-synthetic dataset containing
paired positive and false-premise instructions, enabling robust detection and
natural language correction. Our experiments show that IVA improves
false-premise detection accuracy by 97.56% over baselines, while increasing
successful responses in false-premise scenarios by 50.78%.
Structuring GUI Elements through Vision Language Models: Towards Action Space Generation
Authors: Yi Xu, Yesheng Zhang, Jiajia Liu, Jingdong Chen
First: 2025-08-22T10:14:15+00:00 · Latest: 2025-08-22T10:14:15+00:00
Comments: 10 pages, V0
Abstract
Multimodal large language models (MLLMs) have emerged as pivotal tools in
enhancing human-computer interaction. In this paper we focus on the application
of MLLMs in the field of graphical user interface (GUI) elements structuring,
where they assist in processing user instructions based on screen contents.
Despite the promise of MLLMs, their performance in precisely generating UI
element coordinates, a critical aspect of GUI understanding, is hindered by the
nature of next-token prediction training. This challenge arises from the
semantic void surrounding numerical UI coordinates in language representation
spaces, necessitating a substantial and diverse dataset to bolster visual
module capabilities. To address these limitations, we introduce an
IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our
approach involves a novel pipeline for IoU-based coordinate sampling to augment
the training data, which considers the proximity to ground truth coordinates.
This data augmentation strategy is then employed to fine-tune MLLMs under the
IAML paradigm, which is designed to mitigate the exposure bias problem inherent
in traditional maximum likelihood estimation. Through extensive experiments, we
demonstrate the superior performance of our IAML training approach over
traditional training paradigms.
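As a rough illustration of IoU-based coordinate sampling, the sketch below jitters a ground-truth box and keeps only candidates whose IoU with the ground truth stays above a cutoff. The abstract does not specify the sampling procedure, so the jitter scheme, the cutoff, and the names `iou` and `sample_augmented_boxes` are all assumptions made for illustration, not the authors' pipeline.

```python
import random

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_augmented_boxes(gt, n, max_shift, min_iou, rng):
    """Jitter the ground-truth box; keep candidates with IoU >= min_iou.

    Hypothetical scheme: each coordinate is shifted by a uniform integer
    offset. Returns (box, iou) pairs, so the IoU can act as a sample weight.
    """
    out, attempts = [], 0
    while len(out) < n and attempts < 1000 * n:
        attempts += 1
        cand = tuple(c + rng.randint(-max_shift, max_shift) for c in gt)
        if cand[2] <= cand[0] or cand[3] <= cand[1]:
            continue  # skip degenerate boxes
        score = iou(gt, cand)
        if score >= min_iou:
            out.append((cand, score))
    return out

gt_box = (100, 200, 180, 240)
samples = sample_augmented_boxes(gt_box, n=5, max_shift=8, min_iou=0.7,
                                 rng=random.Random(0))
```

Each returned pair carries its IoU, which a training loop could use to weight near-miss coordinates by their proximity to the ground truth.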
Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
Authors: Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli
First: 2025-02-12T12:50:15+00:00 · Latest: 2025-08-22T09:24:39+00:00
Comments: 11 pages, 11 figures + Appendix. work under submission
Abstract
We present Top-Theta (Top-$\theta$) Attention, a training-free method for
sparsifying transformer attention during inference. Our key insight is that
static, per-head thresholds can be calibrated to retain the desired constant
number of significant elements per attention row. This approach enables
content-based sparsity without retraining, and it remains robust across data
domains. We further introduce compensation techniques to preserve accuracy
under aggressive sparsification, establishing attention thresholding as a
practical and principled alternative to top-k attention. We provide extensive
evaluation on natural language processing tasks, showing that Top-$\theta$
achieves 3-10x reduction in V-cache usage and up to 10x fewer attention
elements during inference while degrading no more than 1% in accuracy.
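The thresholding idea can be sketched in a few lines: calibrate one static threshold per head so that roughly k scores per attention row exceed it, then apply that threshold at inference instead of running a top-k selection. The calibration rule below (averaging the k-th largest score over calibration rows) and the function names are assumptions for illustration; the paper's compensation techniques are not modeled here.

```python
import math

def calibrate_threshold(calib_rows, k):
    """Static threshold such that about k scores per row exceed it.

    Assumed rule: average the k-th largest score over the calibration rows.
    """
    kth_values = [sorted(row, reverse=True)[k - 1] for row in calib_rows]
    return sum(kth_values) / len(kth_values)

def thresholded_softmax(row, theta):
    """Softmax over scores >= theta; every other position gets weight 0."""
    kept = [i for i, s in enumerate(row) if s >= theta]
    if not kept:  # nothing passed the threshold: keep the single largest score
        kept = [max(range(len(row)), key=lambda i: row[i])]
    top = max(row[i] for i in kept)
    exps = {i: math.exp(row[i] - top) for i in kept}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(row))]

# Calibrate on two toy attention rows of one head, then sparsify a new row.
calib = [[2.0, 0.1, 1.5, -1.0], [1.8, 1.6, 0.0, -0.5]]
theta = calibrate_threshold(calib, k=2)
weights = thresholded_softmax([1.9, 0.2, 1.7, -2.0], theta)
```

Unlike top-k, the comparison against `theta` needs no per-row sort at inference; the number of kept elements varies by row but matches k on average over the calibration data.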
HPSv3: Towards Wide-Spectrum Human Preference Score
Authors: Yuhang Ma, Yunhao Shui, Xiaoshi Wu, Keqiang Sun, Hongsheng Li
First: 2025-08-05T17:17:13+00:00 · Latest: 2025-08-22T08:53:37+00:00
Comments: ICCV2025
Abstract
Evaluating text-to-image generation models requires alignment with human
perception, yet existing human-centric metrics are constrained by limited data
coverage, suboptimal feature extraction, and inefficient loss functions. To
address these challenges, we introduce Human Preference Score v3 (HPSv3). (1)
We release HPDv3, the first wide-spectrum human preference dataset integrating
1.08M text-image pairs and 1.17M annotated pairwise comparisons from
state-of-the-art generative models and low to high-quality real-world images.
(2) We introduce a VLM-based preference model trained using an
uncertainty-aware ranking loss for fine-grained ranking. In addition, we propose
Chain-of-Human-Preference (CoHP), an iterative image refinement method that
enhances quality without extra data, using HPSv3 to select the best image at
each step. Extensive experiments demonstrate that HPSv3 serves as a robust
metric for wide-spectrum image evaluation, and CoHP offers an efficient and
human-aligned approach to improve image generation quality. The code and
dataset are available at the HPSv3 Homepage.
OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
Authors: Huanpeng Chu, Wei Wu, Guanyu Fen, Yutao Zhang
Venue: ICCV 2025
First: 2025-08-22T08:36:58+00:00 · Latest: 2025-08-22T08:36:58+00:00
Comments: Accepted by ICCV 2025
Abstract
Diffusion models have emerged as a powerful paradigm for generative tasks
such as image synthesis and video generation, with Transformer architectures
further enhancing performance. However, the high computational cost of
diffusion Transformers -- stemming from a large number of sampling steps and
complex per-step computations -- presents significant challenges for real-time
deployment. In this paper, we introduce OmniCache, a training-free acceleration
method that exploits the global redundancy inherent in the denoising process.
Unlike existing methods that determine caching strategies based on inter-step
similarities and tend to prioritize reusing later sampling steps, our approach
originates from the sampling perspective of DiT models. We systematically
analyze the model's sampling trajectories and strategically distribute cache
reuse across the entire sampling process. This global perspective enables more
effective utilization of cached computations throughout the diffusion
trajectory, rather than concentrating reuse within limited segments of the
sampling procedure. In addition, during cache reuse, we dynamically estimate the
corresponding noise and filter it out to reduce its impact on the sampling
direction. Extensive experiments demonstrate that our approach accelerates the
sampling process while maintaining competitive generative quality, offering a
promising and practical solution for efficient deployment of diffusion-based
generative models.
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Authors: Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li
Venue: EMNLP 2025
First: 2025-08-22T08:23:09+00:00 · Latest: 2025-08-22T08:23:09+00:00
Comments: Accepted at EMNLP 2025
Abstract
Video large language models (Vid-LLMs) have shown strong capabilities in
understanding video content. However, their reliance on dense video token
representations introduces substantial memory and computational overhead in
both prefilling and decoding. To mitigate the information loss of recent video
token reduction methods and accelerate the decoding stage of Vid-LLMs
losslessly, we introduce SpecVLM, a training-free speculative decoding (SD)
framework tailored for Vid-LLMs that incorporates staged video token pruning.
Building on our novel finding that the draft model's speculation exhibits low
sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens,
enabling efficient speculation without sacrificing accuracy. To achieve this,
it performs a two-stage pruning process: Stage I selects highly informative
tokens guided by attention signals from the verifier (target model), while
Stage II prunes remaining redundant ones in a spatially uniform manner.
Extensive experiments on four video understanding benchmarks demonstrate the
effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$
decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for
Qwen2.5-VL-32B.
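The two-stage pruning described above might look roughly like the following sketch. Stage I keeps the tokens with the highest verifier attention; Stage II thins the remainder at an even stride, which is only a 1-D stand-in for spatially uniform pruning. The function name, the budget parameters, and the stride rule are assumptions for illustration, not the SpecVLM implementation.

```python
def two_stage_prune(attn_scores, keep_top, keep_uniform):
    """Return the sorted indices of the video tokens to keep.

    Stage I: keep the `keep_top` tokens with the highest verifier attention.
    Stage II: from the remaining tokens, keep `keep_uniform` at an even stride.
    """
    n = len(attn_scores)
    by_score = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    stage1 = set(by_score[:keep_top])
    rest = [i for i in range(n) if i not in stage1]
    if keep_uniform <= 0 or not rest:
        stage2 = []
    else:
        keep_uniform = min(keep_uniform, len(rest))
        stride = len(rest) / keep_uniform  # >= 1, so picked positions are distinct
        stage2 = [rest[int(j * stride)] for j in range(keep_uniform)]
    return sorted(stage1 | set(stage2))

# Ten toy video tokens with verifier attention scores; keep 4 of them.
scores = [0.01, 0.30, 0.02, 0.03, 0.25, 0.01, 0.02, 0.04, 0.02, 0.30]
kept = two_stage_prune(scores, keep_top=2, keep_uniform=2)
```

In this toy run the draft model would see 4 of 10 tokens; the abstract reports pruning up to 90% of video tokens in practice.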
RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution
Authors: Haodong He, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu, Gui-Song Xia
First: 2025-08-22T07:28:34+00:00 · Latest: 2025-08-22T07:28:34+00:00
Abstract
The rich textual information of large vision-language models (VLMs) combined
with the powerful generative prior of pre-trained text-to-image (T2I) diffusion
models has achieved impressive performance in single-image super-resolution
(SISR). However, existing methods still face significant challenges in
generating clear and accurate regional details, particularly in scenarios
involving multiple objects. This challenge primarily stems from a lack of
fine-grained regional descriptions and the models' insufficient ability to
capture complex prompts. To address these limitations, we propose a Regional
Attention Guided Super-Resolution (RAGSR) method that explicitly extracts
localized fine-grained information and effectively encodes it through a novel
regional attention mechanism, enabling both enhanced detail and overall
visually coherent SR results. Specifically, RAGSR localizes object regions in
an image and assigns a fine-grained caption to each region; these are formatted
as region-text pairs that serve as textual priors for T2I models. A regional guided
attention is then leveraged to ensure that each region-text pair is properly
considered in the attention process while preventing unwanted interactions
between unrelated region-text pairs. By leveraging this attention mechanism,
our approach offers finer control over the integration of text and image
information, thereby effectively overcoming limitations faced by traditional
SISR techniques. Experimental results on benchmark datasets demonstrate that
our approach exhibits superior performance in generating perceptually authentic
visual details while maintaining contextual consistency compared to existing
approaches.
Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection
Authors: Pi-Wei Chen, Jerry Chun-Wei Lin, Wei-Han Chen, Jia Ji, Zih-Ching Chen, Feng-Hao Yeh, Chao-Chun Chen
First: 2025-08-22T07:26:56+00:00 · Latest: 2025-08-22T07:26:56+00:00
Abstract
Pre-trained Vision-Language Models (VLMs) have recently shown promise in
detecting anomalies. However, previous approaches are fundamentally limited by
their reliance on human-designed prompts and the lack of accessible anomaly
samples, leading to significant gaps in context-specific anomaly understanding.
In this paper, we propose Adaptive Prompt Tuning
with semantic alignment for anomaly detection (APT), a groundbreaking
prior-knowledge-free, few-shot framework that overcomes the limitations of traditional
prompt-based approaches. APT uses self-generated anomaly samples with noise
perturbations to train learnable prompts that capture context-dependent
anomalies in different scenarios. To prevent overfitting to synthetic noise, we
propose a Self-Optimizing Meta-prompt Guiding Scheme (SMGS) that iteratively
aligns the prompts with general anomaly semantics while incorporating diverse
synthetic anomalies. Our system not only advances pixel-wise anomaly detection,
but also achieves state-of-the-art performance on multiple benchmark datasets
without requiring prior knowledge for prompt crafting, establishing a robust
and versatile solution for real-world anomaly detection.
Take That for Me: Multimodal Exophora Resolution with Interactive Questioning for Ambiguous Out-of-View Instructions
Authors: Akira Oyama, Shoichi Hasegawa, Akira Taniguchi, Yoshinobu Hagiwara, Tadahiro Taniguchi
First: 2025-08-22T07:09:06+00:00 · Latest: 2025-08-22T07:09:06+00:00
Comments: See website at https://emergentsystemlabstudent.github.io/MIEL/.
Accepted at IEEE RO-MAN 2025
Abstract
Daily life support robots must interpret ambiguous verbal instructions
involving demonstratives such as "Bring me that cup," even when objects or
users are out of the robot's view. Existing approaches to exophora resolution
primarily rely on visual data and thus fail in real-world scenarios where the
object or user is not visible. We propose Multimodal Interactive Exophora
resolution with user Localization (MIEL), which is a multimodal exophora
resolution framework leveraging sound source localization (SSL), semantic
mapping, visual-language models (VLMs), and interactive questioning with
GPT-4o. Our approach first constructs a semantic map of the environment and
estimates candidate objects from a linguistic query with the user's skeletal
data. SSL is utilized to orient the robot toward users who are initially
outside its visual field, enabling accurate identification of user gestures and
pointing directions. When ambiguities remain, the robot proactively interacts
with the user, employing GPT-4o to formulate clarifying questions. Experiments
in a real-world environment showed results that were approximately 1.3 times
better when the user was visible to the robot and 2.0 times better when the
user was not visible to the robot, compared to the methods without SSL and
interactive questioning. The project website is
https://emergentsystemlabstudent.github.io/MIEL/.
AutoSketch: VLM-assisted Style-Aware Vector Sketch Completion
Authors: Hsiao-Yuan Chin, I-Chao Shen, Yi-Ting Chiu, Ariel Shamir, Bing-Yu Chen
First: 2025-02-07T23:57:22+00:00 · Latest: 2025-08-22T06:58:44+00:00
Comments: 11 pages, Hsiao-Yuan Chin and I-Chao Shen contributed equally to the
paper
Abstract
The ability to automatically complete a partial sketch that depicts a complex
scene, e.g., "a woman chatting with a man in the park", is very useful.
However, existing sketch generation methods create sketches from scratch; they
do not complete a partial sketch in the style of the original. To address this
challenge, we introduce AutoSketch, a style-aware vector sketch completion
method that accommodates diverse sketch styles. Our key observation is that the
style descriptions of a sketch in natural language preserve the style during
automatic sketch completion. Thus, we use a pretrained vision-language model
(VLM) to describe the styles of the partial sketches in natural language and
replicate these styles using newly generated strokes. We initially optimize the
strokes to match an input prompt augmented by style descriptions extracted from
the VLM. Such descriptions allow the method to establish a diffusion prior in
close alignment with that of the partial sketch. Next, we utilize the VLM to
generate an executable style adjustment code that adjusts the strokes to
conform to the desired style. We compare our method with existing methods
across various sketch styles and prompts, perform extensive ablation studies
and qualitative and quantitative evaluations, and demonstrate that AutoSketch
can support various sketch scenarios.
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
Authors: Yixuan Wang, Haoyu Qiao, Lujun Li, Qingfu Zhu, Wanxiang Che
First: 2025-08-22T06:55:45+00:00 · Latest: 2025-08-22T06:55:45+00:00
Comments: 10 pages, 5 figures
Abstract
Large Language Models (LLMs) confront significant memory challenges due to
the escalating KV cache with increasing sequence length. As a crucial
technique, existing cross-layer KV cache sharing methods either necessitate
modified model architectures with subsequent pre-training or incur significant
performance degradation at high compression rates. To mitigate these
challenges, we propose CommonKV, a training-free method for cross-layer KV
cache compression through adjacent parameter sharing. Inspired by the high
similarity observed in cross-layer hidden states, we utilize Singular Value
Decomposition (SVD) to achieve weight sharing across adjacent parameters,
resulting in a more easily mergeable latent KV cache. Furthermore, we also
introduce an adaptive budget allocation strategy. It dynamically assigns
compression budgets based on cosine similarity, ensuring that dissimilar caches
are not over-compressed. Experiments across multiple backbone models and
benchmarks including LongBench and Ruler demonstrate that the proposed method
consistently outperforms existing low-rank and cross-layer approaches at
various compression ratios. Moreover, we find that the benefits of CommonKV are
orthogonal to other quantization and eviction methods. By integrating these
approaches, we can ultimately achieve a 98% compression ratio without
significant performance loss.
中文标题/摘要
标题:CommonKV:通过跨层参数共享压缩键值缓存
大型语言模型(LLMs)因序列长度增加导致键值(KV)缓存急剧增长而面临重大内存挑战。现有跨层KV缓存共享技术虽关键,但需修改模型架构并重新预训练,或在高压缩率下导致性能显著下降。为应对这些挑战,我们提出CommonKV——一种通过相邻参数共享实现跨层KV缓存压缩的无训练方法。受跨层隐藏状态高度相似性启发,我们采用奇异值分解(SVD)实现相邻参数间的权重共享,从而生成更易融合的潜在KV缓存。此外,我们还引入自适应预算分配策略,基于余弦相似度动态分配压缩预算,确保不同缓存不被过度压缩。在多个骨干模型及LongBench、Ruler等基准测试上的实验表明,该方法在各种压缩比下持续优于现有低秩和跨层方法。值得注意的是,CommonKV的优势与其他量化和驱逐方法正交,通过整合这些技术,我们最终实现了98%的压缩率且无显著性能损失。
Summary / 总结
Large Language Models (LLMs) confront significant memory challenges due to the escalating KV cache with increasing sequence length.
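The similarity-driven budget allocation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the allocation rule, so everything here is an assumption for illustration: the function names `cosine_similarity` and `allocate_budgets`, the idea of expressing a budget as an integer (for example, a retained rank) per group of adjacent layers, and the choice to make each group's share proportional to its dissimilarity (1 minus cosine similarity).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def allocate_budgets(similarities, total_budget, floor=1):
    """Split an integer budget across layer groups (hypothetical rule).

    Groups whose caches are LESS similar receive a LARGER share, so they are
    compressed less. Every group keeps at least `floor` units.
    """
    n = len(similarities)
    spare = total_budget - floor * n
    if spare < 0:
        raise ValueError("total_budget is too small for the requested floor")

    # Dissimilarity weights, clamped at zero to absorb floating-point noise.
    weights = [max(0.0, 1.0 - s) for s in similarities]
    if sum(weights) == 0.0:
        weights = [1.0] * n  # all groups identical: split evenly
    norm = sum(weights)

    budgets = [floor + int(spare * w / norm) for w in weights]

    # Integer truncation can leave a few units unassigned; give them to the
    # most dissimilar groups first.
    leftover = max(0, total_budget - sum(budgets))
    by_dissimilarity = sorted(range(n), key=lambda i: weights[i], reverse=True)
    for i in by_dissimilarity[:leftover]:
        budgets[i] += 1
    return budgets
```

With similarities of 0.9, 0.5, and 0.1 and a total budget of 30, the least similar group ends up with the largest share and the budgets still sum to 30. The actual strategy in CommonKV may differ in both the weighting and the unit of budget.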
Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Authors: Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong
First: 2025-08-21T13:42:49+00:00 · Latest: 2025-08-22T05:07:51+00:00
Abstract
Test-time adaptation (TTA) enhances zero-shot robustness under
distribution shifts by leveraging unlabeled test data during inference. Despite
notable advances, several challenges still limit its broader applicability.
First, most methods rely on backpropagation or iterative optimization, which
limits scalability and hinders real-time deployment. Second, they lack explicit
modeling of class-conditional feature distributions. This modeling is crucial
for producing reliable decision boundaries and calibrated predictions, but it
remains underexplored due to the lack of both source data and supervision at
test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and
backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian
probabilistic inference task by modeling class-conditional likelihoods using
gradually updated class means and a shared covariance matrix. This enables
closed-form, training-free inference. To correct potential likelihood bias, we
introduce lightweight regularization guided by CLIP priors and a historical
knowledge bank. ADAPT requires no source data, no gradient updates, and no full
access to target data, supporting both online and transductive settings.
Extensive experiments across diverse benchmarks demonstrate that our method
achieves state-of-the-art performance under a wide range of distribution shifts
with superior scalability and robustness.
中文标题/摘要
标题:基于概率高斯对齐的无反向传播测试时适应方法
测试时适应(TTA)通过利用推理过程中的未标记测试数据,增强分布偏移下的零样本鲁棒性。尽管取得显著进展,其广泛应用仍受限于若干挑战:多数方法依赖反向传播或迭代优化,影响可扩展性并阻碍实时部署;同时缺乏对类条件特征分布的显式建模,而这种建模对产生可靠决策边界和校准预测至关重要。本文提出ADAPT——一种先进的分布感知且无需反向传播的测试时适应方法。通过使用逐步更新的类均值和共享协方差矩阵建模类条件似然,将TTA重构为高斯概率推断任务,实现闭式、无训练推断。为修正潜在似然偏差,引入由CLIP先验和历史知识库指导的轻量正则化。ADAPT无需源数据、梯度更新或完全访问目标数据,支持在线与转导设置。多基准测试表明,该方法在广泛分布偏移下以卓越的可扩展性和鲁棒性实现了最先进性能。
Summary / 总结
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference.
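The closed-form inference described in the abstract rests on a standard property of Gaussian class-conditional models with a shared covariance matrix. As a sketch, assume class means \mu_k, a shared covariance \Sigma, and class priors \pi_k (these symbols are introduced here for illustration and are not taken from the paper). The quadratic term in x is the same for every class and cancels, leaving a softmax over linear scores:

```latex
\begin{align}
p(y = k \mid x)
  &= \frac{\pi_k \, \mathcal{N}(x; \mu_k, \Sigma)}
          {\sum_j \pi_j \, \mathcal{N}(x; \mu_j, \Sigma)}
   = \frac{\exp\!\left(w_k^\top x + b_k\right)}
          {\sum_j \exp\!\left(w_j^\top x + b_j\right)}, \\
w_k &= \Sigma^{-1} \mu_k, \qquad
b_k = -\tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k .
\end{align}
```

Under these assumptions, updating the class means and the shared covariance changes only w_k and b_k, so no gradient step is needed. The regularization with CLIP priors and the knowledge bank mentioned in the abstract is not covered by this sketch.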
Prompting with Sign Parameters for Low-resource Sign Language Instruction Generation
Authors: Md Tariquzzaman, Md Farhan Ishmam, Saiyma Sittul Muna, Md Kamrul Hasan, Hasan Mahmud
Venue: ICCV 2025
First: 2025-08-22T04:11:28+00:00 · Latest: 2025-08-22T04:11:28+00:00
Comments: CV4A11y@ICCV 2025
Abstract
Sign Language (SL) enables two-way communication for the deaf and
hard-of-hearing community, yet many sign languages remain under-resourced in
the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step
textual instructions that enable non-SL users to imitate and learn SL gestures,
promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG
dataset, used to evaluate Vision Language Models (VLMs) (i) on under-resourced
SLIG tasks, and (ii) on long-tail visual concepts, as Bengali SL is unlikely to
appear in the VLM pre-training data. To enhance zero-shot performance, we
introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL
parameters, like hand shape, motion, and orientation, directly into the textual
prompts. Subsuming standard sign parameters into the prompt makes the
instructions more structured and reproducible than free-form natural text from
vanilla prompting. We envision that our work will promote inclusivity and
advancement in SL learning systems for the under-resourced communities.
中文标题/摘要
标题:利用手语参数提示生成低资源手语指令
手语(SL)为聋哑及听力障碍群体提供了双向交流方式,然而许多手语在人工智能领域仍属资源匮乏。手语指令生成(SLIG)通过生成逐步文本指令,使非手语使用者能够模仿和学习手语手势,促进双向互动。我们推出首个孟加拉手语SLIG数据集BdSLIG,用于评估视觉语言模型(VLMs)在(i)资源匮乏的SLIG任务和(ii)长尾视觉概念上的表现,因为孟加拉手语不太可能出现在VLM预训练数据中。为提升零样本性能,我们提出手语参数融合(SPI)提示法,将标准手语参数(如手型、动作和方向)直接整合到文本提示中。将标准手语参数融入提示可使指令比传统自由文本生成更具结构性和可复现性。我们期望这项工作能促进资源匮乏群体在手语学习系统中的包容性与技术进步。
Summary / 总结
Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space.
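One way to picture the prompting idea in the abstract is a template that names each sign parameter explicitly. The sketch below is purely illustrative: the function name `build_spi_prompt`, the prompt wording, and the default parameter list are assumptions, and the paper's actual prompts are not given in the abstract.

```python
def build_spi_prompt(sign_label, parameters=("hand shape", "motion", "orientation")):
    """Build a text prompt that asks a VLM to cover each sign parameter.

    `sign_label` is the word or phrase being signed. The wording below is a
    hypothetical template, not the prompt used by the authors.
    """
    lines = [
        f"The image shows the sign for '{sign_label}'.",
        "Write step-by-step instructions so that a non-signer can reproduce it.",
        "Address each of the following sign parameters explicitly:",
    ]
    # One bullet per parameter keeps the requested structure visible to the model.
    lines.extend(f"- {name}" for name in parameters)
    lines.append("Number the steps and keep each step to one sentence.")
    return "\n".join(lines)
```

Calling `build_spi_prompt("water")` returns a multi-line prompt listing hand shape, motion, and orientation. A vanilla baseline would correspond to dropping the parameter lines and leaving only the free-form request.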
MMAPG: A Training-Free Framework for Multimodal Multi-hop Question Answering via Adaptive Planning Graphs
Authors: Yiheng Hu, Xiaoyang Wang, Qing Liu, Xiwei Xu, Qian Fu, Wenjie Zhang, Liming Zhu
First: 2025-08-22T02:57:52+00:00 · Latest: 2025-08-22T02:57:52+00:00
Abstract
Multimodal multi-hop question answering requires integrating information from
diverse sources, such as images and texts, to derive answers. Existing methods
typically rely on sequential retrieval and reasoning, where each step builds on
the previous output. However, this single-path paradigm makes them vulnerable
to errors due to misleading intermediate steps. Moreover, developing multimodal
models can be computationally expensive, often requiring extensive training. To
address these limitations, we propose a training-free framework guided by an
Adaptive Planning Graph, which consists of planning, retrieval and reasoning
modules. The planning module analyzes the current state of the Adaptive
Planning Graph, determines the next action and where to expand the graph, which
enables dynamic and flexible exploration of reasoning paths. To handle
retrieval of text to unspecified target modalities, we devise modality-specific
strategies that dynamically adapt to distinct data types. Our approach
preserves the characteristics of multimodal information without costly
task-specific training, enabling seamless integration with up-to-date models.
Experiments on MultimodalQA and WebQA show that our approach
matches or outperforms existing models that rely on training.
中文标题/摘要
标题:MMAPG:基于自适应规划图的多模态多跳问答无训练框架
多模态多跳问答需要整合图像与文本等多源信息以推导答案。现有方法通常依赖顺序检索与推理,每一步都基于前序输出构建,但这种单一路径范式易因误导性中间步骤而产生错误。此外,开发多模态模型计算成本高昂,常需大量训练。为此,我们提出一种由自适应规划图引导的无训练框架,包含规划、检索与推理模块。规划模块分析自适应规划图当前状态,决定下一步行动及图的扩展方向,实现动态灵活的逻辑路径探索。针对文本到未指定目标模态的检索,我们设计了能动态适配不同数据类型的模态特定策略。该方法无需昂贵的任务特定训练即可保持多模态信息特性,并能与最新模型无缝集成。在MultimodalQA和WebQA上的实验表明,我们的方法达到或超越了依赖训练的现有模型性能。
Summary / 总结
Multimodal multi-hop question answering requires integrating information from diverse sources, such as images and texts, to derive answers.
Glo-VLMs: Leveraging Vision-Language Models for Fine-Grained Diseased Glomerulus Classification
Authors: Zhenhao Guo, Rachit Saluja, Tianyuan Yao, Quan Liu, Yuankai Huo, Benjamin Liechty, David J. Pisapia, Kenji Ikemura, Mert R. Sabuncu, Yihe Yang, Ruining Deng
First: 2025-08-21T21:05:44+00:00 · Latest: 2025-08-21T21:05:44+00:00
Abstract
Vision-language models (VLMs) have shown considerable potential in digital
pathology, yet their effectiveness remains limited for fine-grained,
disease-specific classification tasks such as distinguishing between glomerular
subtypes. The subtle morphological variations among these subtypes, combined
with the difficulty of aligning visual patterns with precise clinical
terminology, make automated diagnosis in renal pathology particularly
challenging. In this work, we explore how large pretrained VLMs can be
effectively adapted to perform fine-grained glomerular classification, even in
scenarios where only a small number of labeled examples are available. We
introduce Glo-VLMs, a systematic framework designed to explore the
adaptation of VLMs to fine-grained glomerular classification in
data-constrained settings. Our approach leverages curated pathology images
alongside clinical text prompts to facilitate joint image-text representation
learning for nuanced renal pathology subtypes. By assessing various VLM
architectures and adaptation strategies under a few-shot learning paradigm, we
explore how both the choice of method and the amount of labeled data impact
model performance in clinically relevant scenarios. To ensure a fair
comparison, we evaluate all models using standardized multi-class metrics,
aiming to clarify the practical requirements and potential of large pretrained
models for specialized clinical research applications. As a result, fine-tuning
the VLMs achieved 0.7416 accuracy, 0.9045 macro-AUC, and 0.5277 F1-score with
only 8 shots per class, demonstrating that even with highly limited
supervision, foundation models can be effectively adapted for fine-grained
medical image classification.
中文标题/摘要
标题:Glo-VLMs:利用视觉语言模型进行细粒度病变肾小球分类
视觉语言模型(VLMs)在数字病理学中展现出巨大潜力,但在区分肾小球亚型等细粒度疾病特异性分类任务中效果仍有限。这些亚型间细微的形态学差异,加上视觉模式与精准临床术语对齐的困难,使得肾脏病理学的自动诊断尤为挑战。本研究探索了如何有效适配大型预训练VLMs执行细粒度肾小球分类,即使在仅有少量标注样本的情况下。我们提出Glo-VLMs系统框架,通过整合病理图像与临床文本提示,实现针对细微肾脏病理亚型的联合图文表征学习。在少样本学习范式下评估多种VLM架构与适配策略,分析方法选择与标注数据量对临床场景模型性能的影响。采用标准化多类指标评估显示,仅每类8个样本进行微调即可达到0.7416准确率、0.9045宏观AUC和0.5277 F1分数,证明基础模型即使在高限监督下也能有效适配细粒度医学图像分类。
Summary / 总结
Vision-language models (VLMs) have shown considerable potential in digital pathology, yet their effectiveness remains limited for fine-grained, disease-specific classification tasks such as distinguishing between glomerular subtypes.
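For readers unfamiliar with the multi-class metrics reported in the abstract, the sketch below computes accuracy and an F1-score from predicted labels. It assumes macro averaging (an unweighted mean of per-class F1), which the abstract does not state explicitly for F1; the function names are illustrative, and macro-AUC is omitted because it needs per-class scores rather than hard labels.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because macro averaging weights every class equally, a model can reach fairly high accuracy while its macro F1 stays low if it does poorly on rare classes, which is one possible reading of the gap between the reported accuracy and F1.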