Flow Matching-Based Generative Modeling for Efficient and Scalable Data Assimilation
Authors: Taos Transue, Bohan Chen, So Takao, Bao Wang
First: 2025-08-18T19:00:45+00:00 · Latest: 2025-08-22T15:54:49+00:00
Comments: correcting authorship footnote, reformatting figures
Abstract
Data assimilation (DA) is the problem of sequentially estimating the state of
a dynamical system from noisy observations. Recent advances in generative
modeling have inspired new approaches to DA in high-dimensional nonlinear
settings, especially the ensemble score filter (EnSF). However, these approaches carry a
significant computational burden due to slow sampling. In this paper, we
introduce a new filtering framework based on flow matching (FM) -- called the
ensemble flow filter (EnFF) -- to accelerate sampling and enable flexible
design of probability paths. EnFF -- a training-free DA approach -- integrates
Monte Carlo (MC) estimators for the marginal FM vector field (VF) and a localized guidance to
assimilate observations. EnFF has faster sampling and more flexibility in VF
design compared to existing generative modeling for DA. Theoretically, we show
that EnFF encompasses classical filtering methods such as the bootstrap
particle filter and the ensemble Kalman filter as special cases. Experiments on
high-dimensional filtering benchmarks demonstrate improved cost-accuracy
tradeoffs and the ability to leverage larger ensembles than prior methods. Our
results highlight the promise of FM as a scalable tool that enables the use of
large ensembles for filtering in high-dimensional applications.
Summary
Data assimilation (DA) is the problem of sequentially estimating the state of a dynamical system from noisy observations.
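To make the phrase "MC estimators for the marginal FM vector field" concrete, the following is a minimal sketch, not the authors' implementation. It assumes one standard choice of probability path (the Gaussian conditional optimal-transport path with a small terminal width `sigma_min`), treats the ensemble members as samples of the target distribution, and uses a self-normalized weighted average of conditional vector fields. The function names, `sigma_min`, and the Euler sampler are illustrative assumptions; the localized guidance that assimilates observations is not reproduced here.

```python
import math

def marginal_vf(x, t, ensemble, sigma_min=1e-3):
    """Self-normalized MC estimate of the marginal FM vector field at (x, t).

    Assumed path: p_t(x | x1) = N(t * x1, s^2 I) with s = 1 - (1 - sigma_min) * t.
    """
    s = 1.0 - (1.0 - sigma_min) * t
    log_w, cond = [], []
    for x1 in ensemble:
        sq = sum((xi - t * x1i) ** 2 for xi, x1i in zip(x, x1))
        # Log-density of x under p_t(. | x1), up to a constant shared by all members.
        log_w.append(-0.5 * sq / (s * s))
        # Conditional vector field of the assumed path.
        cond.append([(x1i - (1.0 - sigma_min) * xi) / s for xi, x1i in zip(x, x1)])
    m = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    return [sum(wi * ci[k] for wi, ci in zip(w, cond)) / z for k in range(len(x))]

def sample(x0, ensemble, steps=50):
    """Integrate dx/dt = marginal_vf(x, t) from t = 0 to t = 1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for n in range(steps):
        v = marginal_vf(x, n * dt, ensemble)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because no network is trained, the cost per step in this sketch is one pass over the ensemble, which is the sense in which such an estimator is training-free.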
Modular Embedding Recomposition for Incremental Learning
Authors: Aniello Panariello, Emanuele Frascaroli, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara
First: 2025-08-22T15:25:40+00:00 · Latest: 2025-08-22T15:25:40+00:00
Comments: Accepted to the 36th British Machine Vision Conference (BMVC 2025),
Sheffield, UK
Abstract
The advent of pre-trained Vision-Language Models (VLMs) has significantly
transformed Continual Learning (CL), mainly due to their zero-shot
classification abilities. Such proficiency makes VLMs well-suited for
real-world applications, enabling robust performance on novel unseen classes
without requiring adaptation. However, fine-tuning remains essential when
downstream tasks deviate significantly from the pre-training domain. Prior CL
approaches primarily focus on preserving the zero-shot capabilities of VLMs
during incremental fine-tuning on a downstream task. We take a step further by
devising an approach that transforms preservation into enhancement of the
zero-shot capabilities of VLMs. Our approach, named MoDular Embedding
Recomposition (MoDER), introduces a modular framework that trains multiple
textual experts, each specialized in a single seen class, and stores them in a
foundational hub. At inference time, for each unseen class, we query the hub
and compose the retrieved experts to synthesize a refined prototype that
improves classification. We show the effectiveness of our method across two
popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total
of 14 datasets. The codebase is available at
https://github.com/aimagelab/mammoth.
Summary
The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities.
PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark
Authors: Adil Bahaj, Mounir Ghogho
First: 2025-08-22T14:50:55+00:00 · Latest: 2025-08-22T14:50:55+00:00
Abstract
Large language models (LLMs) and vision-augmented LLMs (VLMs) have
significantly advanced medical informatics, diagnostics, and decision support.
However, these models exhibit systematic biases, particularly age bias,
compromising their reliability and equity. This is evident in their poorer
performance on pediatric-focused text and visual question-answering tasks. This
bias reflects a broader imbalance in medical research, where pediatric studies
receive less funding and representation despite the significant disease burden
in children. To address these issues, a new comprehensive multi-modal pediatric
question-answering benchmark, PediatricsMQA, has been introduced. It consists
of 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric
topics across seven developmental stages (prenatal to adolescent) and 2,067
vision-based MCQs using 634 pediatric images from 67 imaging modalities and 256
anatomical regions. The dataset was developed using a hybrid manual-automatic
pipeline, incorporating peer-reviewed pediatric literature, validated question
banks, existing benchmarks, and existing QA resources. Evaluating
state-of-the-art open models, we find dramatic performance drops in younger
cohorts, highlighting the need for age-aware methods to ensure equitable AI
support in pediatric care.
Summary
Large language models (LLMs) and vision-augmented LLMs (VLMs) have significantly advanced medical informatics, diagnostics, and decision support.
CAMA: Enhancing Multimodal In-Context Learning with Context-Aware Modulated Attention
Authors: Yanshu Li, Jianjiang Yang, Ziteng Yang, Bozheng Li, Hongyang He, Zhengtao Yao, Ligong Han, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, Ruixiang Tang
First: 2025-05-21T04:25:23+00:00 · Latest: 2025-08-22T14:44:22+00:00
Comments: 14 pages, 8 figures, 5 tables
Abstract
Multimodal in-context learning (ICL) is emerging as a key capability that
enables large vision-language models (LVLMs) to adapt to novel tasks without
parameter updates, expanding their utility across various real-world
applications. However, ICL remains unstable, even with well-matched in-context
demonstrations (ICDs), suggesting that LVLMs struggle to fully utilize the
provided context. While existing efforts focus on prompt engineering or
post-hoc logit calibration, we instead investigate the underlying attention
dynamics to overcome LVLMs' inherent limitations. We identify two critical
deficits in their self-attention that impair effective ICL. To bridge the gap,
we propose Context-Aware Modulated Attention (CAMA), a plug-and-play
and training-free method that dynamically modulates an LVLM's attention logits
based on the input in-context sequence. CAMA employs a two-stage attention
modulation to address both identified deficits, enhancing the focus on
semantically significant tokens, particularly visual ones. Across four LVLMs
and seven benchmarks, CAMA consistently outperforms vanilla models and
baselines, demonstrating great effectiveness and generalization. It can also
activate the desired effects of prompt engineering methods and remains robust
under diverse sequence configurations. Thus, CAMA paves the way for deeper
explorations of attention dynamics to advance multimodal reasoning.
Summary
Multimodal in-context learning (ICL) is emerging as a key capability that enables large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, expanding their utility across various real-world applications.
Retrieval Enhanced Feedback via In-context Neural Error-book
Authors: Jongyeop Hyun, Bumsoo Kim
Venue: EMNLP 2025
First: 2025-08-22T11:50:04+00:00 · Latest: 2025-08-22T11:50:04+00:00
Comments: Accepted at EMNLP 2025 main conference
Abstract
Recent advancements in Large Language Models (LLMs) have significantly
improved reasoning capabilities, with in-context learning (ICL) emerging as a
key technique for adaptation without retraining. While previous works have
focused on leveraging correct examples, recent research highlights the
importance of learning from errors to enhance performance. However, existing
methods lack a structured framework for analyzing and mitigating errors,
particularly in Multimodal Large Language Models (MLLMs), where integrating
visual and textual inputs adds complexity. To address this issue, we propose
REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a
teacher-student framework that systematically structures errors and provides
targeted feedback. REFINE introduces three systematic queries to construct
structured feedback -- Feed-Target, Feed-Check, and Feed-Path -- to enhance
multimodal reasoning by prioritizing relevant visual information, diagnosing
critical failure points, and formulating corrective actions. Unlike prior
approaches that rely on redundant retrievals, REFINE optimizes structured
feedback retrieval, improving inference efficiency, token usage, and
scalability. Our results demonstrate substantial speedup, reduced computational
costs, and successful generalization, highlighting REFINE's potential for
enhancing multimodal reasoning.
Summary
Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining.
Structuring GUI Elements through Vision Language Models: Towards Action Space Generation
Authors: Yi Xu, Yesheng Zhang, Jiajia Liu, Jingdong Chen
First: 2025-08-22T10:14:15+00:00 · Latest: 2025-08-22T10:14:15+00:00
Comments: 10 pages, V0
Abstract
Multimodal large language models (MLLMs) have emerged as pivotal tools in
enhancing human-computer interaction. In this paper we focus on the application
of MLLMs in the field of graphical user interface (GUI) elements structuring,
where they assist in processing user instructions based on screen contents.
Despite the promise of MLLMs, their performance in precisely generating UI
element coordinates, a critical aspect of GUI understanding, is hindered by the
nature of next-token prediction training. This challenge arises from the
semantic void surrounding numerical UI coordinates in language representation
spaces, necessitating a substantial and diverse dataset to bolster visual
module capabilities. To address these limitations, we introduce an
IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our
approach involves a novel pipeline for IoU-based coordinate sampling to augment
the training data, which considers the proximity to ground truth coordinates.
This data augmentation strategy is then employed to fine-tune MLLMs under the
IAML paradigm, which is designed to mitigate the exposure bias problem inherent
in traditional maximum likelihood estimation. Through extensive experiments, we
demonstrate the superior performance of our IAML training approach over
traditional training paradigms.
Summary
Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction.
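The IoU-based coordinate sampling described in the abstract above can be illustrated with a short sketch. This is an assumption-laden illustration rather than the paper's pipeline: the jitter scheme, the `min_iou` cutoff, and the helper names `iou` and `sample_near_boxes` are hypothetical, and how the sampled boxes enter the IAML objective is not shown.

```python
import random

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_near_boxes(gt, n, min_iou=0.7, jitter=0.1, rng=None, max_tries=10000):
    """Rejection-sample up to n jittered boxes whose IoU with gt is at least min_iou.

    Returns a list of (box, iou_score) pairs; the score reflects proximity to gt.
    """
    rng = rng or random.Random(0)
    w, h = gt[2] - gt[0], gt[3] - gt[1]
    out = []
    for _ in range(max_tries):
        if len(out) == n:
            break
        # Perturb each edge independently by a fraction of the box size.
        cand = (
            gt[0] + rng.uniform(-jitter, jitter) * w,
            gt[1] + rng.uniform(-jitter, jitter) * h,
            gt[2] + rng.uniform(-jitter, jitter) * w,
            gt[3] + rng.uniform(-jitter, jitter) * h,
        )
        if cand[2] <= cand[0] or cand[3] <= cand[1]:
            continue  # skip degenerate boxes
        score = iou(gt, cand)
        if score >= min_iou:
            out.append((cand, score))
    return out
```

One plausible use, again an assumption, is to treat the IoU score as a per-sample weight so that candidates closer to the ground truth contribute more during fine-tuning.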
Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
Authors: Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli
First: 2025-02-12T12:50:15+00:00 · Latest: 2025-08-22T09:24:39+00:00
Comments: 11 pages, 11 figures + Appendix. Work under submission
Abstract
We present Top-Theta (Top-$\theta$) Attention, a training-free method for
sparsifying transformer attention during inference. Our key insight is that
static, per-head thresholds can be calibrated to retain the desired constant
number of significant elements per attention row. This approach enables
content-based sparsity without retraining, and it remains robust across data
domains. We further introduce compensation techniques to preserve accuracy
under aggressive sparsification, establishing attention thresholding as a
practical and principled alternative to top-k attention. We provide extensive
evaluation on natural language processing tasks, showing that Top-$\theta$
achieves 3-10x reduction in V-cache usage and up to 10x fewer attention
elements during inference while degrading no more than 1% in accuracy.
Summary
We present Top-Theta (Top-$\theta$) Attention, a training-free method for sparsifying transformer attention during inference.
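The core idea in the abstract above, replacing a per-row top-k selection with a static per-head threshold calibrated to keep roughly k elements per row, can be sketched as follows. This is a simplified illustration under stated assumptions: the calibration rule (averaging the k-th largest logit over calibration rows) and the renormalized softmax are illustrative choices, the function names are hypothetical, and the paper's compensation techniques are not reproduced.

```python
import math

def calibrate_threshold(calib_rows, k):
    """Pick one static threshold for a head: the mean k-th largest logit over calibration rows."""
    kth_values = [sorted(row, reverse=True)[k - 1] for row in calib_rows]
    return sum(kth_values) / len(kth_values)

def thresholded_softmax(row, theta):
    """Softmax restricted to logits >= theta; all other positions get zero weight."""
    m = max(row)
    if m < theta:
        theta = m  # fall back to the largest logit so the row is never empty
    exps = [math.exp(x - m) if x >= theta else 0.0 for x in row]
    total = sum(exps)
    return [e / total for e in exps]
```

Unlike top-k, the inference path in this sketch needs only an elementwise comparison against a fixed value, with no per-row sort or partial sort; the number of surviving elements then varies from row to row around the calibrated target.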
HPSv3: Towards Wide-Spectrum Human Preference Score
Authors: Yuhang Ma, Yunhao Shui, Xiaoshi Wu, Keqiang Sun, Hongsheng Li
First: 2025-08-05T17:17:13+00:00 · Latest: 2025-08-22T08:53:37+00:00
Comments: ICCV 2025
Abstract
Evaluating text-to-image generation models requires alignment with human
perception, yet existing human-centric metrics are constrained by limited data
coverage, suboptimal feature extraction, and inefficient loss functions. To
address these challenges, we introduce Human Preference Score v3 (HPSv3). (1)
We release HPDv3, the first wide-spectrum human preference dataset integrating
1.08M text-image pairs and 1.17M annotated pairwise comparisons from
state-of-the-art generative models and low to high-quality real-world images.
(2) We introduce a VLM-based preference model trained using an
uncertainty-aware ranking loss for fine-grained ranking. In addition, we propose
Chain-of-Human-Preference (CoHP), an iterative image refinement method that
enhances quality without extra data, using HPSv3 to select the best image at
each step. Extensive experiments demonstrate that HPSv3 serves as a robust
metric for wide-spectrum image evaluation, and CoHP offers an efficient and
human-aligned approach to improve image generation quality. The code and
dataset are available at the HPSv3 Homepage.
Summary
Evaluating text-to-image generation models requires alignment with human perception, yet existing human-centric metrics are constrained by limited data coverage, suboptimal feature extraction, and inefficient loss functions.
OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
Authors: Huanpeng Chu, Wei Wu, Guanyu Fen, Yutao Zhang
Venue: ICCV 2025
First: 2025-08-22T08:36:58+00:00 · Latest: 2025-08-22T08:36:58+00:00
Comments: Accepted by ICCV 2025
Abstract
Diffusion models have emerged as a powerful paradigm for generative tasks
such as image synthesis and video generation, with Transformer architectures
further enhancing performance. However, the high computational cost of
diffusion Transformers, which stems from a large number of sampling steps and
complex per-step computations, presents significant challenges for real-time
deployment. In this paper, we introduce OmniCache, a training-free acceleration
method that exploits the global redundancy inherent in the denoising process.
Unlike existing methods that determine caching strategies based on inter-step
similarities and tend to prioritize reusing later sampling steps, our approach
originates from the sampling perspective of DiT models. We systematically
analyze the model's sampling trajectories and strategically distribute cache
reuse across the entire sampling process. This global perspective enables more
effective utilization of cached computations throughout the diffusion
trajectory, rather than concentrating reuse within limited segments of the
sampling procedure. In addition, during cache reuse, we dynamically estimate the
corresponding noise and filter it out to reduce its impact on the sampling
direction. Extensive experiments demonstrate that our approach accelerates the
sampling process while maintaining competitive generative quality, offering a
promising and practical solution for efficient deployment of diffusion-based
generative models.
Summary
Diffusion models have emerged as a powerful paradigm for generative tasks such as image synthesis and video generation, with Transformer architectures further enhancing performance.
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Authors: Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li
Venue: EMNLP 2025
First: 2025-08-22T08:23:09+00:00 · Latest: 2025-08-22T08:23:09+00:00
Comments: Accepted at EMNLP 2025
Abstract
Video large language models (Vid-LLMs) have shown strong capabilities in
understanding video content. However, their reliance on dense video token
representations introduces substantial memory and computational overhead in
both prefilling and decoding. To mitigate the information loss of recent video
token reduction methods and accelerate the decoding stage of Vid-LLMs
losslessly, we introduce SpecVLM, a training-free speculative decoding (SD)
framework tailored for Vid-LLMs that incorporates staged video token pruning.
Building on our novel finding that the draft model's speculation exhibits low
sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens,
enabling efficient speculation without sacrificing accuracy. To achieve this,
it performs a two-stage pruning process: Stage I selects highly informative
tokens guided by attention signals from the verifier (target model), while
Stage II prunes remaining redundant ones in a spatially uniform manner.
Extensive experiments on four video understanding benchmarks demonstrate the
effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$
decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for
Qwen2.5-VL-32B.
Summary
Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content.
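The two-stage pruning described in the abstract above can be sketched in a few lines. This is an illustration under assumptions, not the SpecVLM implementation: `attn_scores` stands in for per-token attention signals from the verifier, `stage1_share` (the fraction of the budget given to Stage I) is a hypothetical parameter, and "spatially uniform" is simplified to an even stride over the flattened token order.

```python
def two_stage_prune(attn_scores, keep_ratio=0.1, stage1_share=0.5):
    """Return sorted indices of the video tokens kept after two-stage pruning.

    Stage I keeps the tokens with the highest verifier attention.
    Stage II fills the rest of the budget with an even stride over the leftovers.
    """
    n = len(attn_scores)
    budget = max(1, round(n * keep_ratio))
    k1 = min(budget, max(1, round(budget * stage1_share)))

    # Stage I: top-k1 tokens by attention score.
    ranked = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    stage1 = set(ranked[:k1])

    # Stage II: uniform stride over the remaining tokens, in original order.
    remaining = [i for i in range(n) if i not in stage1]
    k2 = budget - k1
    stage2 = []
    if k2 > 0 and remaining:
        step = len(remaining) / k2
        stage2 = [remaining[int(j * step)] for j in range(k2)]

    return sorted(stage1 | set(stage2))
```

With the default `keep_ratio=0.1`, the sketch drops 90% of the tokens, matching the upper pruning rate quoted in the abstract; in a speculative decoding setup the pruned set would feed only the draft model, while the verifier still sees the full input.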