arXiv 论文速递

2025-08-25 15:56
Snapshot: 20250825_1556
Flow Matching-Based Generative Modeling for Efficient and Scalable Data Assimilation
Authors: Taos Transue, Bohan Chen, So Takao, Bao Wang
First: 2025-08-18T19:00:45+00:00 · Latest: 2025-08-22T15:54:49+00:00
Comments: correcting authorship footnote, reformatting figures
Abstract
Data assimilation (DA) is the problem of sequentially estimating the state of a dynamical system from noisy observations. Recent advances in generative modeling have inspired new approaches to DA in high-dimensional nonlinear settings, especially the ensemble score filter (EnSF). However, these come at a significant computational burden due to slow sampling. In this paper, we introduce a new filtering framework based on flow matching (FM) -- called the ensemble flow filter (EnFF) -- to accelerate sampling and enable flexible design of probability paths. EnFF -- a training-free DA approach -- integrates MC estimators for the marginal FM vector field (VF) and a localized guidance to assimilate observations. EnFF has faster sampling and more flexibility in VF design compared to existing generative modeling for DA. Theoretically, we show that EnFF encompasses classical filtering methods such as the bootstrap particle filter and the ensemble Kalman filter as special cases. Experiments on high-dimensional filtering benchmarks demonstrate improved cost-accuracy tradeoffs and the ability to leverage larger ensembles than prior methods. Our results highlight the promise of FM as a scalable tool for filtering in high-dimensional applications that enable the use of large ensembles.
Summary / 总结
Data assimilation (DA) is the problem of sequentially estimating the state of a dynamical system from noisy observations.
Modular Embedding Recomposition for Incremental Learning
Authors: Aniello Panariello, Emanuele Frascaroli, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara
First: 2025-08-22T15:25:40+00:00 · Latest: 2025-08-22T15:25:40+00:00
Comments: Accepted to the 36th British Machine Vision Conference (BMVC 2025), Sheffield, UK
Abstract
The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.
Summary / 总结
The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities.
CAMA: Enhancing Multimodal In-Context Learning with Context-Aware Modulated Attention
Authors: Yanshu Li, Jianjiang Yang, Ziteng Yang, Bozheng Li, Hongyang He, Zhengtao Yao, Ligong Han, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, Ruixiang Tang
First: 2025-05-21T04:25:23+00:00 · Latest: 2025-08-22T14:44:22+00:00
Comments: 14 pages, 8 figures, 5 tables
Abstract
Multimodal in-context learning (ICL) is emerging as a key capability that enables large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, expanding their utility across various real-world applications. However, ICL remains unstable, even with well-matched in-context demonstrations (ICDs), suggesting that LVLMs struggle to fully utilize the provided context. While existing efforts focus on prompt engineering or post-hoc logit calibration, we instead investigate the underlying attention dynamics to overcome LVLMs' inherent limitations. We identify two critical deficits in their self-attention that impair effective ICL. To bridge the gap, we propose \textbf{Context-Aware Modulated Attention} (CAMA), a plug-and-play and training-free method that dynamically modulates LVLM's attention logits based on the input in-context sequence. CAMA employs a two-stage attention modulation to address both identified deficits, enhancing the focus on semantically significant tokens, particularly visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, demonstrating great effectiveness and generalization. It can also activate the desired effects of prompt engineering methods and remains robust under diverse sequence configurations. Thus, CAMA paves the way for deeper explorations of attention dynamics to advance multimodal reasoning.
Summary / 总结
Multimodal in-context learning (ICL) is emerging as a key capability that enables large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, expanding their utility across various real-world applications.
Structuring GUI Elements through Vision Language Models: Towards Action Space Generation
Authors: Yi Xu, Yesheng Zhang, jiajia Liu, Jingdong Chen
First: 2025-08-22T10:14:15+00:00 · Latest: 2025-08-22T10:14:15+00:00
Comments: 10pageV0
Abstract
Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction. In this paper we focus on the application of MLLMs in the field of graphical user interface (GUI) elements structuring, where they assist in processing user instructions based on screen contents. Despite the promise of MLLMs, their performance in precisely generating UI element coordinates, a critical aspect of GUI understanding, is hindered by the nature of next-token prediction training. This challenge arises from the semantic void surrounding numerical UI coordinates in language representation spaces, necessitating a substantial and diverse dataset to bolster visual module capabilities. To address these limitations, we introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our approach involves a novel pipeline for IoU-based coordinate sampling to augment the training data, which considers the proximity to ground truth coordinates. This data augmentation strategy is then employed to fine-tune MLLMs under the IAML paradigm, which is designed to mitigate the exposure bias problem inherent in traditional maximum likelihood estimation. Through extensive experiments, we demonstrate the superior performance of our IAML training approach over traditional training paradigms.
Summary / 总结
Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction.
Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
Authors: Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli
First: 2025-02-12T12:50:15+00:00 · Latest: 2025-08-22T09:24:39+00:00
Comments: 11 pages, 11 figures + Appendix. work under submission
Abstract
We present Top-Theta (Top-$\theta$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide extensive evaluation on natural language processing tasks, showing that Top-$\theta$ achieves 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading no more than 1% in accuracy.
Summary / 总结
We present Top-Theta (Top-$\theta$) Attention, a training-free method for sparsifying transformer attention during inference.
OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
Authors: Huanpeng Chu, Wei Wu, Guanyu Fen, Yutao Zhang
Venue: ICCV 2025
First: 2025-08-22T08:36:58+00:00 · Latest: 2025-08-22T08:36:58+00:00
Comments: Accepted by ICCV 2025
Abstract
Diffusion models have emerged as a powerful paradigm for generative tasks such as image synthesis and video generation, with Transformer architectures further enhancing performance. However, the high computational cost of diffusion Transformers-stemming from a large number of sampling steps and complex per-step computations-presents significant challenges for real-time deployment. In this paper, we introduce OmniCache, a training-free acceleration method that exploits the global redundancy inherent in the denoising process. Unlike existing methods that determine caching strategies based on inter-step similarities and tend to prioritize reusing later sampling steps, our approach originates from the sampling perspective of DIT models. We systematically analyze the model's sampling trajectories and strategically distribute cache reuse across the entire sampling process. This global perspective enables more effective utilization of cached computations throughout the diffusion trajectory, rather than concentrating reuse within limited segments of the sampling procedure.In addition, during cache reuse, we dynamically estimate the corresponding noise and filter it out to reduce its impact on the sampling direction.Extensive experiments demonstrate that our approach accelerates the sampling process while maintaining competitive generative quality, offering a promising and practical solution for efficient deployment of diffusion-based generative models.
Summary / 总结
Diffusion models have emerged as a powerful paradigm for generative tasks such as image synthesis and video generation, with Transformer architectures further enhancing performance.
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Authors: Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li
Venue: EMNLP 2025
First: 2025-08-22T08:23:09+00:00 · Latest: 2025-08-22T08:23:09+00:00
Comments: Accepted at EMNLP 2025
Abstract
Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model's speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens, enabling efficient speculation without sacrificing accuracy. To achieve this, it performs a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B.
Summary / 总结
Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content.
RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution
Authors: Haodong He, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu, Gui-Song Xia
First: 2025-08-22T07:28:34+00:00 · Latest: 2025-08-22T07:28:34+00:00
Abstract
The rich textual information of large vision-language models (VLMs) combined with the powerful generative prior of pre-trained text-to-image (T2I) diffusion models has achieved impressive performance in single-image super-resolution (SISR). However, existing methods still face significant challenges in generating clear and accurate regional details, particularly in scenarios involving multiple objects. This challenge primarily stems from a lack of fine-grained regional descriptions and the models' insufficient ability to capture complex prompts. To address these limitations, we propose a Regional Attention Guided Super-Resolution (RAGSR) method that explicitly extracts localized fine-grained information and effectively encodes it through a novel regional attention mechanism, enabling both enhanced detail and overall visually coherent SR results. Specifically, RAGSR localizes object regions in an image and assigns fine-grained caption to each region, which are formatted as region-text pairs as textual priors for T2I models. A regional guided attention is then leveraged to ensure that each region-text pair is properly considered in the attention process while preventing unwanted interactions between unrelated region-text pairs. By leveraging this attention mechanism, our approach offers finer control over the integration of text and image information, thereby effectively overcoming limitations faced by traditional SISR techniques. Experimental results on benchmark datasets demonstrate that our approach exhibits superior performance in generating perceptually authentic visual details while maintaining contextual consistency compared to existing approaches.
Summary / 总结
The rich textual information of large vision-language models (VLMs) combined with the powerful generative prior of pre-trained text-to-image (T2I) diffusion models has achieved impressive performance in single-image super-resolution (SISR).
Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection
Authors: Pi-Wei Chen, Jerry Chun-Wei Lin, Wei-Han Chen, Jia Ji, Zih-Ching Chen, Feng-Hao Yeh, Chao-Chun Chen
First: 2025-08-22T07:26:56+00:00 · Latest: 2025-08-22T07:26:56+00:00
Abstract
Pre-trained Vision-Language Models (VLMs) have recently shown promise in detecting anomalies. However, previous approaches are fundamentally limited by their reliance on human-designed prompts and the lack of accessible anomaly samples, leading to significant gaps in context-specific anomaly understanding. In this paper, we propose \textbf{A}daptive \textbf{P}rompt \textbf{T}uning with semantic alignment for anomaly detection (APT), a groundbreaking prior knowledge-free, few-shot framework and overcomes the limitations of traditional prompt-based approaches. APT uses self-generated anomaly samples with noise perturbations to train learnable prompts that capture context-dependent anomalies in different scenarios. To prevent overfitting to synthetic noise, we propose a Self-Optimizing Meta-prompt Guiding Scheme (SMGS) that iteratively aligns the prompts with general anomaly semantics while incorporating diverse synthetic anomaly. Our system not only advances pixel-wise anomaly detection, but also achieves state-of-the-art performance on multiple benchmark datasets without requiring prior knowledge for prompt crafting, establishing a robust and versatile solution for real-world anomaly detection.
Summary / 总结
Pre-trained Vision-Language Models (VLMs) have recently shown promise in detecting anomalies.
AutoSketch: VLM-assisted Style-Aware Vector Sketch Completion
Authors: Hsiao-Yuan Chin, I-Chao Shen, Yi-Ting Chiu, Ariel Shamir, Bing-Yu Chen
First: 2025-02-07T23:57:22+00:00 · Latest: 2025-08-22T06:58:44+00:00
Comments: 11 pages, Hsiao-Yuan Chin and I-Chao Shen contributed equally to the paper
Abstract
The ability to automatically complete a partial sketch that depicts a complex scene, e.g., "a woman chatting with a man in the park", is very useful. However, existing sketch generation methods create sketches from scratch; they do not complete a partial sketch in the style of the original. To address this challenge, we introduce AutoSketch, a styleaware vector sketch completion method that accommodates diverse sketch styles. Our key observation is that the style descriptions of a sketch in natural language preserve the style during automatic sketch completion. Thus, we use a pretrained vision-language model (VLM) to describe the styles of the partial sketches in natural language and replicate these styles using newly generated strokes. We initially optimize the strokes to match an input prompt augmented by style descriptions extracted from the VLM. Such descriptions allow the method to establish a diffusion prior in close alignment with that of the partial sketch. Next, we utilize the VLM to generate an executable style adjustment code that adjusts the strokes to conform to the desired style. We compare our method with existing methods across various sketch styles and prompts, performed extensive ablation studies and qualitative and quantitative evaluations, and demonstrate that AutoSketch can support various sketch scenarios.
Summary / 总结
The ability to automatically complete a partial sketch that depicts a complex scene, e.g., "a woman chatting with a man in the park", is very useful.
History