arXiv Paper Digest

2026-05-07 04:54
Latest digest
IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models
Authors: Parsa Madinei, Srijita Karmakar, Russell Cohen Hoffing, Felix Gervitz, Miguel P. Eckstein
First: 2026-02-18T02:06:24+00:00 · Latest: 2026-05-05T17:39:15+00:00
Abstract
We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real-time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in Large VLMs, more than doubling the accuracy of responses on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs, showing consistent improvements when gaze data is incorporated in ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset to use eye movement data for disambiguated VQA, a novel real-time interactive protocol, and an evaluation suite.
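Chinese title / abstract (translated to English)
Title: IRIS: Resolving Intent in Open-Ended VQA via Inference-Time Eye Movements
We introduce IRIS (Intent Resolution via Inference-time Saccades), a new training-free method that uses real-time eye-tracking data to resolve ambiguity in open-ended VQA. Through a comprehensive user study covering 500 unique image-question pairs, we show that the fixations closest to the moment participants begin asking their question aloud are the most informative for disambiguation in large VLMs, more than doubling response accuracy on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate the method on state-of-the-art VLMs and find consistent improvements when gaze data is included for ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset for disambiguated VQA with eye movement data, a new real-time interactive protocol, and an evaluation suite.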
Summary / 总结
IRIS is a training-free approach that uses eye-tracking data in real time to resolve ambiguity in open-ended VQA. Through a user study with 500 unique image-question pairs, the authors show that the fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in large VLMs: accuracy on ambiguous questions rises from 35.2% to 77.2%, while performance on unambiguous questions is maintained. The approach is evaluated across state-of-the-art VLMs and yields consistent improvements on ambiguous image-question pairs regardless of architectural differences. The authors also release a benchmark dataset for disambiguated VQA with eye movement data, a real-time interactive protocol, and an evaluation suite.
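As a rough illustration of the idea that fixations near speech onset carry the most intent information, the sketch below picks the k fixations closest in time to the moment the user starts speaking and turns them into a text hint. The `Fixation` record, the function names, the choice of k, and the prompt wording are all hypothetical; the abstract does not describe how IRIS actually passes gaze to the VLM.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    # Hypothetical record layout; the paper's data format is not given here.
    x: float        # gaze x-coordinate in image pixels
    y: float        # gaze y-coordinate in image pixels
    start_s: float  # fixation start time in seconds
    end_s: float    # fixation end time in seconds


def fixations_near_onset(fixations, speech_onset_s, k=3):
    """Return the k fixations whose midpoints are closest in time to speech onset."""
    def distance(f):
        midpoint = (f.start_s + f.end_s) / 2.0
        return abs(midpoint - speech_onset_s)
    return sorted(fixations, key=distance)[:k]


def gaze_hint(fixations):
    """Format selected fixations as a plain-text hint (assumed prompt format)."""
    points = ", ".join(f"({f.x:.0f}, {f.y:.0f})" for f in fixations)
    return f"The user was looking at these image coordinates: {points}."
```

The hint string could be appended to an ambiguous question before it is sent to a VLM; whether text coordinates, image crops, or overlays work best is not something this sketch settles.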
Speculative Coupled Decoding for Training-Free Lossless Acceleration of Autoregressive Visual Generation
Authors: Junhyuk So, Hyunho Kook, Chaeyeon Jang, Eunhyeok Park
Venue: ICML 2026
First: 2025-10-28T09:26:27+00:00 · Latest: 2026-05-05T17:15:37+00:00
Comments: Accepted to ICML 2026
Abstract
Autoregressive (AR) modeling has recently emerged as a promising new paradigm in visual generation, but its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of steps to produce a single sample. While several Speculative Decoding (SD)-based methods have been proposed to solve this problem by generating multiple tokens in a single forward step, they suffer from limited speedup, degraded quality, or require the training of a draft model. To solve these problems, we propose a new training-free, lossless SD framework, Speculative Coupled Decoding (SCD), by extending the recently proposed Speculative Jacobi Decoding (SJD). While SJD shows strong potential for accelerating AR generation by combining Jacobi iteration and SD, we found that its acceptance rate is still significantly limited due to the instability arising from the independent sampling process used during draft token generation. To overcome this, we introduce an information-theoretic approach, Coupling, which stabilizes the drafting trajectory of SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, significantly enhancing the acceptance rate while preserving its lossless property. Remarkably, this method requires only a single-line modification to the existing algorithm with almost zero overhead, yet achieves substantial performance gains, delivering up to a 4.2x speedup in image generation and 13.6x speedup in video generation compared to standard AR decoding, without any degradation or the need for additional training. The source code is available at https://github.com/junhyukso/SCD
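To make "maximizing the probability of sampling identical draft tokens across consecutive iterations" concrete, here is a minimal sketch of the textbook maximal coupling between two categorical distributions. It assumes the previous draft token was drawn from `p_prev` and returns a token distributed exactly as `q_new` that matches the old token as often as possible. The function name and arguments are illustrative, and this is not claimed to be the exact rule used in SCD; the authors' repository linked above is the reference.

```python
import numpy as np


def maximal_coupling_sample(prev_token, p_prev, q_new, rng):
    """Sample a token from q_new that equals prev_token as often as possible.

    prev_token is assumed to have been drawn from p_prev. The returned token
    is distributed exactly as q_new, so the marginal is preserved.
    """
    p_prev = np.asarray(p_prev, dtype=float)
    q_new = np.asarray(q_new, dtype=float)
    # Keep the old token with probability min(1, q(x) / p(x)).
    if rng.random() * p_prev[prev_token] < q_new[prev_token]:
        return prev_token
    # Otherwise draw from the part of q_new that p_prev does not cover.
    residual = np.maximum(q_new - p_prev, 0.0)
    return int(rng.choice(len(q_new), p=residual / residual.sum()))
```

Under this coupling the two tokens agree with probability equal to the sum over tokens of min(p, q), which is the largest value any joint sampler with these marginals can reach.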
Physically Guided Visual Mass Estimation from a Single RGB Image
Authors: Sungjae Lee, Junhan Jeong, Yeonjoo Hong, Kwang In Kim
Venue: IJCAI 2026
First: 2026-01-28T06:53:36+00:00 · Latest: 2026-05-05T16:41:11+00:00
Comments: Accepted to IJCAI 2026 (Main Track)
Abstract
Estimating object mass from visual input is challenging because mass depends jointly on geometric volume and material-dependent density, neither of which is directly observable from RGB appearance. Consequently, mass prediction from pixels is ill-posed and therefore benefits from physically meaningful representations to constrain the space of plausible solutions. We propose a physically structured framework for single-image mass estimation that addresses this ambiguity by aligning visual cues with the physical factors governing mass. From a single RGB image, we recover object-centric three-dimensional geometry via monocular depth estimation to inform volume and extract coarse material semantics using a vision-language model to guide density-related reasoning. These geometry, semantic, and appearance representations are fused through an instance-adaptive gating mechanism, and two physically guided latent factors (volume- and density-related) are predicted through separate regression heads under mass-only supervision. Experiments on image2mass and ABO-500 show that the proposed method consistently outperforms state-of-the-art methods.
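The physical relation behind the two latent factors is mass = volume x density, so in log space the factors simply add. The sketch below shows that combination and a loss that needs only the true mass as a label. The function names, the units, and the squared log-error loss are assumptions made for illustration; the abstract does not specify the paper's parameterization or loss.

```python
import math


def mass_from_factors(log_volume_m3, log_density_kg_m3):
    """Combine two latent factors into a mass estimate in kilograms.

    Uses m = V * rho, i.e. log m = log V + log rho.
    """
    return math.exp(log_volume_m3 + log_density_kg_m3)


def log_mass_loss(log_volume_m3, log_density_kg_m3, true_mass_kg):
    """Squared error in log space; only the true mass is needed as a label."""
    predicted_log_mass = log_volume_m3 + log_density_kg_m3
    return (predicted_log_mass - math.log(true_mass_kg)) ** 2
```

Because only the sum of the two log factors is supervised, the loss alone cannot tell a large, light object from a small, dense one of the same mass, which is why additional geometric and material cues are needed to constrain each factor.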
StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning
Authors: Xiaowen Sun, Matthias Kerzel, Mengdi Li, Xufeng Zhao, Paul Striker, Stefan Wermter
First: 2026-05-05T16:19:02+00:00 · Latest: 2026-05-05T16:19:02+00:00
Abstract
Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject to a fundamental limitation inherent in large language models (LLMs): they struggle with numerical reasoning, particularly in object detection and object-state localization. To explore numerical reasoning as a regression task in VLMs, we propose a novel training strategy to adapt VLMs for object detection and object-state localization. This approach leverages box decoder outputs to compute an Auxiliary Regression Loss (ARL) during fine-tuning, while preserving standard sequence prediction at inference. We leverage this training strategy to develop StateVLM (State-aware Vision-Language Model), a novel model designed to perceive and learn fine-grained object representations, including precise localization of objects and their states, as well as graspable regions. Due to the lack of a benchmark for object-state affordance reasoning, we introduce an open-source benchmark, Object State Affordance Reasoning (OSAR), which contains 1,172 scenes with 7,746 individual objects and corresponding bounding boxes. Comparative experiments on adapted benchmarks (RefCOCO, RefCOCO+, and RefCOCOg) demonstrate that ARL improves model performance by an average of 1.6% compared to models without ARL. Experiments on the OSAR benchmark further support this finding, showing that StateVLM with ARL achieves an average of 5.2% higher performance than models without ARL. In particular, ARL is also important for the complex task of affordance reasoning in OSAR, where it enhances the consistency of model outputs.
DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models
Authors: Qichao Wang, Yunhong Lu, Hengyuan Cao, Junyi Zhang, Min Zhang
First: 2026-05-05T15:37:50+00:00 · Latest: 2026-05-05T15:37:50+00:00
Comments: Accepted by CVPR2026
Abstract
Dataset distillation enables efficient training by distilling the information of large-scale datasets into significantly smaller synthetic datasets. Diffusion based paradigms have emerged in recent years, offering novel perspectives for dataset distillation. However, they typically necessitate additional fine-tuning stages, and effective guidance mechanisms remain underexplored. To address these limitations, we rethink diffusion based dataset distillation and propose a Dual Matching Guided Diffusion (DMGD) framework, centered on efficient training-free guidance. We first establish Semantic Matching via conditional likelihood optimization, eliminating the need for auxiliary classifiers. Furthermore, we propose a dynamic guidance mechanism that enhances the diversity of synthetic data while maintaining semantic alignment. Simultaneously, we introduce an optimal transport (OT) based Distribution Matching approach to further align with the target distribution structure. To ensure efficiency, we develop two enhanced strategies for diffusion based framework: Distribution Approximate Matching and Greedy Progressive Matching. These strategies enable effective distribution matching guidance with minimal computational overhead. Experimental results on ImageNet-Woof, ImageNet-Nette, and ImageNet-1K demonstrate that our training-free approach achieves significant improvements, outperforming state-of-the-art (SOTA) methods requiring additional fine-tuning by average accuracy gains of 2.1%, 5.4%, and 2.4%, respectively.
Summary / 总结
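The research aims to improve dataset distillation in diffusion models by proposing a training-free approach called DMGD. DMGD uses semantic matching and distribution matching to guide the generation of synthetic datasets, avoiding the need for fine-tuning and auxiliary classifiers. On ImageNet-Woof, ImageNet-Nette, and ImageNet-1K, it outperforms state-of-the-art methods that require additional fine-tuning by average accuracy gains of 2.1%, 5.4%, and 2.4%, respectively.
The research aims to improve diffusion-based dataset distillation for efficient training by addressing the need for additional fine-tuning stages and the underexplored guidance mechanisms. It proposes the Dual Matching Guided Diffusion (DMGD) framework, which comprises semantic matching via conditional likelihood optimization and a dynamic guidance mechanism that enhances the diversity of synthetic data. It also introduces an optimal-transport-based distribution matching method to further align with the structure of the target distribution. On ImageNet-Woof, ImageNet-Nette, and ImageNet-1K, the method achieves significant improvements over state-of-the-art methods, with average accuracy gains of 2.1%, 5.4%, and 2.4%, respectively.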
Quantifying the human visual exposome with vision language models
Authors: Christian Rominger, Andreas R. Schwerdtfeger, Malay Gaherwar Singh, Dimitri Khudyakow, Elizabeth A. M. Michels, Fabian Wolf, Jakob Nikolas Kather, Magdalena Katharina Wekenborg
First: 2026-05-05T15:25:00+00:00 · Latest: 2026-05-05T15:25:00+00:00
Abstract
The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self reports, failing to capture the first person visual context of daily life. We addressed this gap by coupling ecological momentary assessment with vision language models (VLMs) to quantify the semantic richness of human visual experience. Across 2674 participant generated photographs, VLM derived estimates of greenness robustly predicted momentary affect and chronic stress, consistent with established benchmarks. We then developed a semi autonomous large language model (LLM) based pipeline that mined over seven million scientific publications to extract nearly 1000 environmental features empirically linked to mental health. When applied to real world imagery, up to 33 percent of VLM extracted context ratings significantly correlated with affect and stress. These findings establish a scalable objective paradigm for visual exposomics, enabling high throughput decoding of how the visible world is associated with mental health.
RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models
Authors: Hao Wu, Yuqi Li, Yuan Gao, Fan Xu, Fan Zhang, Kun Wang, Penghao Zhao, Qiufeng Wang, Yizhou Zhao, Weiyan Wang, Yingli Tian, Xian Wu, Xiaomeng Huang
First: 2026-05-05T14:49:00+00:00 · Latest: 2026-05-05T14:49:00+00:00
Abstract
Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long-horizon autoregressive prediction. We present RoboAlign-R1, a framework that combines reward-aligned post-training with stabilized long-horizon inference for robot video world models. We construct RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs collected from four robot data sources, and train a multimodal teacher judge, RoboAlign-Judge, to provide fine-grained six-dimensional evaluation of generated videos. We then distill the teacher into a lightweight student reward model for efficient reinforcement-learning-based post-training. To reduce long-horizon rollout drift, we further introduce Sliding Window Re-encoding (SWR), a training-free inference strategy that periodically refreshes the generation context. Under our in-domain evaluation protocol, RoboAlign-R1 improves the aggregate six-dimension score by 10.1% over the strongest baseline, including gains of 7.5% on Manipulation Accuracy and 4.6% on Instruction Following; these ranking improvements are further supported by an external VLM-based cross-check and a blinded human study. Meanwhile, SWR improves long-horizon prediction quality with only about 1% additional latency, yielding a 2.8% gain in SSIM and a 9.8% reduction in LPIPS. Together, these results show that reward-aligned post-training and stabilized long-horizon decoding improve task consistency, physical realism, and long-horizon prediction quality in robot video world models.
Summary / 总结
RoboAlign-R1 introduces a framework that combines reward-aligned post-training with stabilized long-horizon inference to improve robot video world models on instruction following, manipulation success, physical plausibility, and long-horizon prediction quality. It builds a benchmark, RobotWorldBench, of 10,000 annotated video-instruction pairs from four robot data sources, and trains a multimodal teacher judge, RoboAlign-Judge, that scores generated videos along six dimensions and is distilled into a lightweight student reward model for reinforcement-learning-based post-training. Under the in-domain evaluation protocol, RoboAlign-R1 improves the aggregate six-dimension score by 10.1% over the strongest baseline, including gains of 7.5% on Manipulation Accuracy and 4.6% on Instruction Following; the ranking improvements are supported by an external VLM-based cross-check and a blinded human study. To reduce long-horizon rollout drift, Sliding Window Re-encoding (SWR) periodically refreshes the generation context, adding only about 1% latency while yielding a 2.8% gain in SSIM and a 9.8% reduction in LPIPS.
Training-Free Probabilistic Time-Series Forecasting with Conformal Seasonal Pools
Authors: Valery Manokhin
First: 2026-05-05T14:16:35+00:00 · Latest: 2026-05-05T14:16:35+00:00
Abstract
We propose Conformal Seasonal Pools (CSP), a training-free probabilistic time-series forecaster that mixes same-season empirical draws with signed residual draws around a seasonal naive forecast. In an audited rolling-origin benchmark on the six time-series datasets where DeepNPTS was originally evaluated (electricity, exchange_rate, solar_energy, taxi, traffic, wikipedia), CSP-Adaptive significantly outperforms DeepNPTS on every metric we report -- CRPS (per-window paired Wilcoxon $p \approx 4 \times 10^{-10}$), normalized mean quantile loss ($p \approx 7 \times 10^{-10}$), and empirical 95% coverage ($p \approx 8 \times 10^{-45}$, mean 0.89 vs 0.66) -- while running over 500x faster on CPU. Coverage is the most decision-critical of these: a 0.95 nominal interval that contains the truth in only ~66% of cases fails the basic calibration desideratum and would not survive deployment in safety- or decision-critical settings. The failure mode is also more severe than aggregate coverage suggests: in the worst 10% of windows, DeepNPTS's prediction interval covers none of the H forecast horizons -- the entire multi-step trajectory misses the truth at every step simultaneously. This poses serious risk in safety- and decision-critical applications such as healthcare, finance, energy operations, and autonomous systems, where prediction intervals that systematically miss the truth across the entire planning horizon translate directly into misclassified patients, regulatory capital failures, grid imbalances, and safety-case violations. CSP achieves all of this with no learned parameters and no training. We argue training-free conformal samplers should be mandatory baselines when evaluating learned non-parametric forecasters.
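One possible reading of "mixes same-season empirical draws with signed residual draws around a seasonal naive forecast" is sketched below, together with the empirical coverage check that the abstract emphasizes. The function names, the `mix` probability, the pooling of residuals across all phases, and the sample count are assumptions; the abstract does not give the exact CSP or CSP-Adaptive procedure, so this should not be read as the author's implementation.

```python
import numpy as np


def seasonal_pool_samples(y, season, horizon, n_samples=200, mix=0.5, rng=None):
    """Draw forecast samples by mixing two training-free pools per step.

    mix is the (assumed) probability of using the same-season empirical pool;
    otherwise a signed seasonal-naive residual is added to the naive forecast.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    n = len(y)
    residuals = y[season:] - y[:-season]      # signed seasonal-naive errors
    samples = np.empty((n_samples, horizon))
    for h in range(horizon):
        phase = (n + h) % season
        pool = y[phase::season]               # past values at the same phase
        naive = y[n - season + (h % season)]  # seasonal naive forecast
        from_pool = rng.random(n_samples) < mix
        pool_draws = rng.choice(pool, size=n_samples)
        resid_draws = naive + rng.choice(residuals, size=n_samples)
        samples[:, h] = np.where(from_pool, pool_draws, resid_draws)
    return samples


def empirical_coverage(samples, truth, level=0.95):
    """Fraction of horizon steps whose central interval contains the truth."""
    lo, hi = np.quantile(samples, [(1 - level) / 2, (1 + level) / 2], axis=0)
    truth = np.asarray(truth, dtype=float)
    return float(np.mean((truth >= lo) & (truth <= hi)))
```

A nominal 95% interval is well calibrated when `empirical_coverage` averaged over many rolling-origin windows stays close to 0.95; the abstract reports means of 0.89 for CSP-Adaptive and 0.66 for DeepNPTS on that measure.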
Aura-CAPTCHA: A Reinforcement Learning and GAN-Enhanced Multi-Modal CAPTCHA System
Authors: Joydeep Chandra, Prabal Manhas, Ramanjot Kaur, Rashi Sahay
First: 2025-08-20T18:00:08+00:00 · Latest: 2026-05-05T13:50:11+00:00
Abstract
We present Aura-CAPTCHA, a multi-modal verification system that integrates Generative Adversarial Networks (GANs), Reinforcement Learning (RL), and behavioral analysis to create adaptive challenges resistant to classical deep-learning attacks. Our system synthesizes unique visual stimuli via GAN-based generation alongside synchronized audio challenges, while an RL agent adjusts difficulty based on real-time user interaction patterns. A hybrid classifier combining heuristic rules and machine learning distinguishes human from bot interactions. We position Aura-CAPTCHA relative to well-established baselines (text-based schemes, Google reCAPTCHA v2, audio alternatives, and modern invisible risk-analysis systems) and evaluate it against documented state-of-the-art attacks, including convolutional-neural-network solvers, object-detection pipelines (YOLO), and recent agentic vision-language models. Experimental results indicate that Aura-CAPTCHA improves human success rates and lowers classical bypass rates compared to static challenge-based baselines, although, like all explicit-challenge systems, it remains vulnerable to emerging large-model agents. We discuss these limitations transparently and outline future directions toward cognitive-gap-based defenses.
Summary / 总结
Aura-CAPTCHA integrates GANs, RL, and behavioral analysis to generate unique visual and audio challenges. The system adapts difficulty to real-time user interaction patterns and uses a hybrid classifier to distinguish human from bot interactions. Experimental results indicate that Aura-CAPTCHA improves human success rates and lowers classical bypass rates compared to static challenge-based baselines, while remaining vulnerable to emerging large-model agents.
Aura-CAPTCHA is a multi-modal verification system that combines GANs, RL, and behavioral analysis; it generates unique visual and audio stimuli and adjusts difficulty according to user interaction patterns. Experimental results indicate that Aura-CAPTCHA improves human success rates and lowers classical bypass rates relative to static challenge-based systems, although it remains vulnerable to emerging large-model agents.
Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks
Authors: JuneHyoung Kwon, MiHyeon Kim, Eunju Lee, JungMin Yun, Byeonggeuk Lim, YoungBin Kim
Venue: ACL 2026
First: 2026-05-05T13:42:31+00:00 · Latest: 2026-05-05T13:42:31+00:00
Comments: Accepted to Findings of ACL 2026
Abstract
While Large Vision-Language Models (LVLMs) offer powerful capabilities, they pose privacy risks by unintentionally memorizing sensitive personal information. Current unlearning benchmarks attempt to mitigate this using fictitious identities but overlook a critical stage 1 failure: models fail to effectively memorize target information initially, rendering subsequent unlearning evaluations unreliable. Diagnosing under-memorization and the multi-hop curse as root causes, we introduce ReMem, a Reliable Multi-hop and Multi-image Memorization Benchmark. ReMem ensures robust foundational learning through principled data scaling, reasoning-aware QA pairs, and diverse visual contexts. Additionally, we propose a novel Exposure metric to quantify the depth of information erasure from the model's internal probability distribution. Extensive experiments demonstrate that ReMem provides a rigorous and trustworthy framework for diagnosing both learning and unlearning behaviors in LVLMs.
Summary / 总结
The research addresses the privacy risks associated with Large Vision-Language Models (LVLMs) by revisiting foundational learning failures in unlearning benchmarks. It introduces ReMem, a benchmark that ensures robust foundational learning through principled data scaling and reasoning-aware QA pairs. Key findings show that ReMem provides a rigorous framework for diagnosing both learning and unlearning behaviors in LVLMs, addressing under-memorization and the multi-hop curse as root causes. Additionally, a novel Exposure metric is proposed to quantify the depth of information erasure from the model's internal probability distribution.
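The paper aims to address privacy risks by revisiting foundational learning failures in unlearning benchmarks for Large Vision-Language Models (LVLMs). It introduces the ReMem benchmark, which ensures robust memorization through principled data scaling and reasoning-aware QA pairs, and proposes a new Exposure metric to measure the depth of information erasure. Experiments show that ReMem provides a rigorous framework for diagnosing learning and unlearning behaviors in LVLMs.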
SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?
Authors: Jongmin Shin, Ka Young Kim, Eunki Cho, Seong Tae Kim, Namkee Oh
First: 2026-05-03T14:47:03+00:00 · Latest: 2026-05-05T13:32:54+00:00
Abstract
Purpose: Vision-language models (VLMs) have shown promising performance in surgical visual question answering (VQA). However, existing surgical VQA datasets often contain linguistic shortcuts, where question phrasing implicitly constrains the answer space. It remains unclear whether reported performance reflects visual understanding or reliance on such linguistic shortcuts. Methods: We introduce SurgCheck, a diagnostic benchmark for quantifying linguistic shortcut reliance in surgical VQA. SurgCheck employs a paired-question design in which each surgical frame is associated with an original question containing entity names and a less-biased counterpart that removes these names while preserving identical visual content and ground-truth answers. The resulting performance gap provides a diagnostic signal of shortcut reliance. To ensure that the less-biased question remains well-defined even without entity names, four grounding cues are incorporated: bounding box, arrow, spatial position, and periphrasis. We evaluate both general-purpose and surgical-specific VLMs under zero-shot and fine-tuned settings on SurgCheck. To evaluate open-ended zero-shot responses, we introduce an LLM-as-a-judge evaluation protocol. Results: Using SurgCheck, we observe consistent performance degradation on less-biased questions across five VLMs, despite identical visual inputs. Text-only ablation reveals minimal performance drops for action and target prediction, indicating that action and target prediction is largely driven by linguistic shortcuts rather than visual reasoning. Conclusion: SurgCheck provides a controlled diagnostic framework that exposes failure modes masked by linguistic bias in existing surgical VQA benchmarks. Our findings demonstrate that strong benchmark performance does not necessarily imply faithful visual understanding, underscoring the need for bias-aware evaluation in surgical VQA.
Summary / 总结
SurgCheck is a diagnostic benchmark designed to assess whether vision-language models rely on linguistic shortcuts in surgical visual question answering. By comparing original questions with less-biased counterparts that remove entity names but maintain identical visual content, SurgCheck reveals a consistent performance drop across five models, indicating that their performance is heavily influenced by linguistic shortcuts rather than visual understanding. This suggests that strong performance in surgical VQA benchmarks may not reflect true visual reasoning capabilities.
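SurgCheck is a diagnostic benchmark designed to assess whether vision-language models rely on linguistic shortcuts in surgical visual question answering. By comparing original questions with less-biased questions that remove entity names but keep the same visual content, SurgCheck finds a consistent performance drop on the less-biased questions across five models, indicating that their performance is driven mainly by linguistic shortcuts rather than visual reasoning. This shows that strong performance on surgical VQA benchmarks does not necessarily reflect genuine visual understanding.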
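The diagnostic signal in SurgCheck's paired-question design is the accuracy gap between the original and the less-biased question for the same frames. A minimal sketch of that computation follows; the function names are hypothetical, and exact-match scoring is a simplifying assumption, since the paper evaluates open-ended zero-shot answers with an LLM-as-a-judge protocol.

```python
def accuracy(predictions, answers):
    """Exact-match accuracy after trimming whitespace and lowercasing."""
    pairs = list(zip(predictions, answers))
    correct = sum(p.strip().lower() == a.strip().lower() for p, a in pairs)
    return correct / len(pairs)


def shortcut_gap(original_preds, less_biased_preds, answers):
    """Accuracy on original questions minus accuracy on less-biased ones.

    Both prediction lists refer to the same frames and share one answer list,
    mirroring the paired-question design. A larger gap suggests heavier
    reliance on question wording rather than on the image.
    """
    return accuracy(original_preds, answers) - accuracy(less_biased_preds, answers)
```

A gap near zero would mean the model answers equally well once entity names are removed; a large positive gap is the shortcut-reliance pattern the summary describes.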
FluxFlow: Conservative Flow-Matching for Astronomical Image Super-Resolution
Authors: Shuhong Liu, Xining Ge, Ziteng Cui, Liuzhuozheng Li, Gengjia Chang, Jun Liu, Ziying Gu, Dong Li, Xuangeng Chu, Lin Gu, Tatsuya Harada
First: 2026-05-05T13:32:12+00:00 · Latest: 2026-05-05T13:32:12+00:00
Abstract
Ground-to-space astronomical super-resolution requires recovering space-quality images from ground-based observations that are simultaneously limited by pixel sampling resolution and atmospheric seeing, which imposes a stochastic, spatially varying PSF that cannot be resolved through upsampling alone. Existing methods rely on synthetic training pairs that fail to capture real atmospheric statistics and are prone to either over-smoothed reconstructions or hallucination sources with no physical counterpart in the observed sky. We propose FluxFlow, a conservative pixel-space flow-matching framework that incorporates observation uncertainty and source-region importance weights during training, and a training-free Wiener-regularized test-time correction to suppress hallucination sources while preserving recovered detail. We further construct the DESI--HST Dataset, the large-scale real-world benchmark comprising 19,500 real co-registered ground-to-space image pairs with real atmospheric PSF variation. Experiments demonstrate that FluxFlow consistently outperforms existing baseline methods in both photometric and scientific accuracy.
Summary / 总结
FluxFlow is a conservative pixel-space flow-matching framework designed for astronomical image super-resolution, addressing the challenges posed by ground-based observations limited by pixel sampling and atmospheric effects. It incorporates observation uncertainty and source-region importance weights during training, and uses a test-time correction to suppress hallucination sources. Experiments show that FluxFlow outperforms existing methods in both photometric and scientific accuracy on the DESI--HST Dataset, which includes 19,500 real co-registered ground-to-space image pairs with real atmospheric PSF variation.
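FluxFlow is a conservative pixel-space flow-matching framework for ground-to-space astronomical image super-resolution, addressing the challenges posed by atmospheric seeing and pixel sampling resolution. The framework accounts for observation uncertainty and source-region importance weights during training, and applies a Wiener-regularized correction at test time to suppress hallucination sources. Experiments show that FluxFlow outperforms existing methods in both photometric and scientific accuracy on the DESI--HST Dataset, which contains 19,500 real ground-to-space image pairs with real atmospheric PSF variation.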
Tempered Guided Diffusion
Authors: Andreas Makris, Paul Fearnhead, Chris Nemeth
First: 2026-05-05T13:00:15+00:00 · Latest: 2026-05-05T13:00:15+00:00
Abstract
Training-free conditional diffusion provides a flexible alternative to task-specific conditional model training, but existing samplers often allocate computation inefficiently: independent guided trajectories can vary widely in quality, and additional function evaluations along a single trajectory may not recover from poor early decisions. We propose Tempered Guided Diffusion (TGD), an annealed sequential Monte Carlo framework for training-free conditional sampling with diffusion priors. TGD targets tempered posterior distributions over the clean signal, using noisy diffusion states only as auxiliary variables for proposing reconstructions and propagating particles. Particles are reweighted by incremental likelihood ratios, resampled, and propagated across noise levels, concentrating computation on trajectories plausible under both the prior and observation. Under idealized exact-reconstruction assumptions, full TGD yields a consistent particle approximation to the posterior as the number of particles grows. For expensive reconstruction tasks, Accelerated TGD (A-TGD) retains early particle exploration but prunes to a single high-likelihood trajectory partway through sampling. Experiments on a controlled two-dimensional inverse problem and image inverse problems show improved posterior approximation and favorable wall-clock speed-quality tradeoffs over independent multi-trajectory baselines.
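Chinese title / abstract (translated to English)
Title: Tempered Guided Diffusion
Training-free conditional diffusion offers a flexible alternative to task-specific conditional model training, but existing samplers often allocate computation inefficiently: independent guided trajectories can differ widely in quality, and extra function evaluations along a single trajectory may not recover from poor early decisions. We propose Tempered Guided Diffusion (TGD), an annealed sequential Monte Carlo framework for training-free conditional sampling with diffusion priors. TGD targets tempered posterior distributions over the clean signal and uses noisy diffusion states only as auxiliary variables for proposing reconstructions and propagating particles. Particles are reweighted by incremental likelihood ratios, resampled, and propagated across noise levels, concentrating computation on trajectories that are plausible under both the prior and the observation. Under idealized exact-reconstruction assumptions, full TGD yields a consistent particle approximation to the posterior as the number of particles grows. For expensive reconstruction tasks, Accelerated TGD (A-TGD) keeps the early particle exploration but prunes to a single high-likelihood trajectory partway through sampling. Experiments on a controlled two-dimensional inverse problem and on image inverse problems show that TGD gives a better posterior approximation and more favorable wall-clock speed-quality tradeoffs than independent multi-trajectory baselines.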
Summary / 总结
Tempered Guided Diffusion (TGD) is proposed to address the inefficiency of existing samplers in conditional diffusion models, which often result in varying quality of guided trajectories and difficulty in recovering from poor early decisions. TGD uses an annealed sequential Monte Carlo framework to target tempered posterior distributions, reweighting and resampling particles to concentrate computation on plausible trajectories. Experiments show that TGD improves posterior approximation and offers favorable speed-quality tradeoffs compared to independent multi-trajectory baselines, especially for expensive reconstruction tasks.
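Tempered Guided Diffusion (TGD) is proposed to address the inefficiency of existing samplers for conditional diffusion models, which leads to guided trajectories of uneven quality that struggle to recover from poor early decisions. TGD uses an annealed sequential Monte Carlo framework to target tempered posterior distributions, reweighting and resampling particles to concentrate computation on the more plausible trajectories. Experiments show that TGD improves posterior approximation and offers favorable time-quality tradeoffs on expensive reconstruction tasks compared with independent multi-trajectory baselines.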
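The reweight, resample, propagate loop described for TGD is the standard structure of annealed sequential Monte Carlo. The sketch below runs that loop on a one-dimensional toy problem with a standard normal prior and a user-supplied log-likelihood, using a random-walk Metropolis move as the propagation step. It works directly on the clean signal and leaves out diffusion states, reconstructions, and noise levels entirely, so it illustrates only the tempering idea, not TGD itself; the function name and all defaults are assumptions.

```python
import numpy as np


def tempered_smc(log_lik, n_particles=2000, n_temps=20, step=0.5, rng=None):
    """Annealed SMC for a 1D toy: standard normal prior times likelihood**t.

    The temperature t rises from 0 (prior) to 1 (posterior) in n_temps steps.
    """
    rng = np.random.default_rng() if rng is None else rng
    temps = np.linspace(0.0, 1.0, n_temps + 1)
    x = rng.standard_normal(n_particles)      # particles from the prior

    def log_target(z, t):
        return -0.5 * z**2 + t * log_lik(z)   # unnormalized tempered posterior

    for t_prev, t in zip(temps[:-1], temps[1:]):
        # Reweight by the incremental likelihood ratio likelihood**(t - t_prev).
        log_w = (t - t_prev) * log_lik(x)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # Resample so that computation concentrates on plausible particles.
        x = x[rng.choice(n_particles, size=n_particles, p=w)]
        # Propagate with one random-walk Metropolis move at temperature t.
        proposal = x + step * rng.standard_normal(n_particles)
        log_accept = log_target(proposal, t) - log_target(x, t)
        accept = np.log(rng.random(n_particles)) < log_accept
        x = np.where(accept, proposal, x)
    return x
```

For example, with an observation of 2.0 under unit Gaussian noise, `tempered_smc(lambda z: -0.5 * (2.0 - z) ** 2)` should return particles approximating the exact posterior, a normal distribution with mean 1 and variance 0.5.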
FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers
Authors: Timon Homberger, Finn Lukas Busch, Jesús Gerardo Ortega Peimbert, Quantao Yang, Olov Andersson
First: 2026-05-05T12:08:16+00:00 · Latest: 2026-05-05T12:08:16+00:00
Comments: This work has been submitted to the IEEE for possible publication
Abstract
Open-vocabulary semantic mapping enables robots to spatially ground previously unseen concepts without requiring predefined class sets. Current training-free methods commonly rely on multi-view fusion of semantic embeddings into a 3D map, either at the instance-level via segmenting views and encoding image crops of segments, or by projecting image patch embeddings directly into a dense semantic map. The latter approach sidesteps segmentation and 2D-to-3D instance association by operating on full uncropped image frames, but existing methods remain limited in scalability. We present FUS3DMaps, an online dual-layer semantic mapping method that jointly maintains both dense and instance-level open-vocabulary layers within a shared voxel map. This design enables further voxel-level semantic fusion of the layer embeddings, combining the complementary strengths of both semantic mapping approaches. We find that our proposed semantic cross-layer fusion approach improves the quality of both the instance-level and dense layers, while also enabling a scalable and highly accurate instance-level map where the dense layer and cross-layer fusion are restricted to a spatial sliding window. Experiments on established 3D semantic segmentation benchmarks as well as a selection of large-scale scenes show that FUS3DMaps achieves accurate open-vocabulary semantic mapping at multi-story building scales. Additional material and code will be made available: https://githanonymous.github.io/FUS3DMaps/.
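Chinese title / abstract (translated to English)
Title: FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers
Open-vocabulary semantic mapping lets robots spatially ground previously unseen concepts without predefined class sets. Current training-free methods usually rely on multi-view fusion of semantic embeddings into a 3D map, either at the instance level by segmenting views and encoding image crops of the segments, or by projecting image patch embeddings directly into a dense semantic map. The latter approach sidesteps segmentation and 2D-to-3D instance association by operating on full uncropped image frames, but existing methods remain limited in scalability. We present FUS3DMaps, an online dual-layer semantic mapping method that jointly maintains dense and instance-level open-vocabulary layers within a shared voxel map. This design allows further voxel-level semantic fusion of the layer embeddings, combining the complementary strengths of the two semantic mapping approaches. We find that the proposed semantic cross-layer fusion improves the quality of both the instance-level and dense layers, and also enables a scalable and highly accurate instance-level map in which the dense layer and cross-layer fusion are restricted to a spatial sliding window. Experiments on established 3D semantic segmentation benchmarks and on a selection of large-scale scenes show that FUS3DMaps achieves accurate open-vocabulary semantic mapping at multi-story building scales. Additional material and code will be made available: https://githanonymous.github.io/FUS3DMaps/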
Summary / 总结
FUS3DMaps is an online dual-layer semantic mapping method that combines dense and instance-level layers within a shared voxel map to improve scalability and accuracy. It uses semantic cross-layer fusion to enhance both instance-level and dense layers, enabling highly accurate open-vocabulary semantic mapping at large scales. Experiments on 3D semantic segmentation benchmarks and large-scale scenes demonstrate its effectiveness.
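FUS3DMaps is an online dual-layer semantic mapping method that combines a dense layer and an instance-level layer in a shared voxel map to improve scalability and accuracy. It uses semantic cross-layer fusion to enhance both layers and, by restricting the dense layer and cross-layer fusion to a spatial sliding window, produces a scalable and accurate instance-level map. Experiments show that FUS3DMaps achieves accurate open-vocabulary semantic mapping in large-scale scenes such as multi-story buildings.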
Deep Interest Mining with Cross-Modal Alignment for SemanticID Generation in Generative Recommendation
Authors: Yangchen Zeng, Jinze Wang
First: 2026-03-03T13:36:22+00:00 · Latest: 2026-05-05T11:18:30+00:00
Abstract
Generative Recommendation (GR) has demonstrated remarkable performance in next-token prediction paradigms, which relies on Semantic IDs (SIDs) to compress trillion-scale data into learnable vocabulary sequences. However, existing methods suffer from three critical limitations: (1) Information Degradation: the two-stage compression pipeline causes semantic loss and information degradation, with no posterior mechanism to distinguish high-quality from low-quality SIDs; (2) Semantic Degradation: cascaded quantization discards key semantic information from original multimodal features, as the embedding generation and quantization stages are not jointly optimized toward a unified objective; (3) Modality Distortion: quantizers fail to properly align text and image modalities, causing feature misalignment even when upstream networks have aligned them. To address these challenges, we propose a novel framework integrating three key innovations: Deep Contextual Interest Mining (DCIM), Cross-Modal Semantic Alignment (CMSA), and Quality-Aware Reinforcement Mechanism (QARM). First, we leverage Vision-Language Models (VLMs) to align non-textual modalities into a unified text-based semantic space, mitigating modality distortion. Second, we introduce a deep interest mining mechanism that captures high-level semantic information implicitly present in advertising contexts, encouraging SIDs to preserve critical contextual information through reconstruction-based supervision. Third, we employ a reinforcement learning framework with quality-aware rewards to encourage semantically rich SIDs while suppressing low-quality ones in the posterior stage. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art SID generation methods, achieving superior performance on multiple benchmarks. Ablation studies further validate the effectiveness of each proposed component
Summary / 总结
This paper addresses the limitations of existing Semantic ID (SID) generation methods in Generative Recommendation (GR) by proposing a novel framework that integrates Deep Contextual Interest Mining (DCIM), Cross-Modal Semantic Alignment (CMSA), and Quality-Aware Reinforcement Mechanism (QARM). The method leverages Vision-Language Models to align non-textual modalities into a unified semantic space, introduces a deep interest mining mechanism to capture high-level semantic information, and uses a reinforcement learning framework with quality-aware rewards to generate semantically rich SIDs. Experiments show that this approach outperforms state-of-the-art methods on multiple benchmarks.
本文提出了一种新的框架,通过集成深度上下文兴趣挖掘(DCIM)、跨模态语义对齐(CMSA)和质量感知强化机制(QARM),解决了现有语义ID(SID)生成方法在生成推荐(GR)中的局限性。该方法利用视觉语言模型将非文本模态统一到一个语义空间中,引入了一种深度兴趣挖掘机制来捕捉高阶语义信息,并使用带有质量感知奖励的强化学习框架来生成语义丰富的SID。实验表明,该方法在多个基准测试中优于最先进的方法。
Closed-Loop Vision-Language Planning for Multi-Agent Coordination
Authors: Zhiyuan Li, Wenshuai Zhao, Joni Pajarinen
First: 2025-02-14T13:23:18+00:00 · Latest: 2026-05-05T11:18:23+00:00
Abstract
Cooperative multi-agent reinforcement learning (MARL) struggles with sample efficiency, interpretability, and generalization. While Large Language Models (LLMs) offer powerful planning capabilities, their application has been hampered by a reliance on text-only inputs and a failure to handle the non-Markovian, partially observable nature of multi-agent tasks. We introduce COMPASS, a multi-agent framework that overcomes these limitations by integrating Vision-Language Models (VLMs) for decentralized, closed-loop decision-making. COMPASS dynamically generates and refines interpretable, code-based strategies stored in a skill library that is bootstrapped from expert demonstrations. To ensure robust coordination, it propagates entity information through a structured multi-hop communication protocol, allowing teams to build a coherent understanding from partial observations. Evaluated on the challenging SMACv2 benchmark, COMPASS significantly outperforms state-of-the-art MARL baselines. Notably, in the symmetric Protoss 5v5 task, COMPASS achieved a 57% win rate, a 30 percentage point advantage over QMIX (27%). Project page can be found at https://stellar-entremet-1720bb.netlify.app/.
中文标题/摘要
标题:闭环视觉-语言规划在多智能体协调中的应用
合作多智能体强化学习(MARL)在样本效率、可解释性和泛化方面存在挑战。虽然大型语言模型(LLMs)提供了强大的规划能力,但其应用受限于仅依赖文本输入以及无法处理多智能体任务的非马尔可夫性和部分可观测性。我们提出了COMPASS,这是一种通过集成视觉-语言模型(VLMs)实现去中心化闭环决策的多智能体框架,从而克服了这些限制。COMPASS动态生成并优化可解释的、基于代码的策略,并将其存储在从专家演示中启动的技能库中。为了确保协调性,它通过结构化的多跳通信协议传播实体信息,使团队能够从部分观察中构建一致的理解。在具有挑战性的SMACv2基准测试中,COMPASS显著优于最先进的MARL基线。值得注意的是,在对称的Protoss 5v5任务中,COMPASS的胜率达到了57%,比QMIX(27%)高出30个百分点。项目页面可访问 https://stellar-entremet-1720bb.netlify.app/。
Summary / 总结
The research addresses the challenges of sample efficiency, interpretability, and generalization in cooperative multi-agent reinforcement learning (MARL) by introducing COMPASS, a framework that integrates Vision-Language Models (VLMs) for decentralized, closed-loop decision-making. COMPASS dynamically generates and refines interpretable, code-based strategies from expert demonstrations and ensures robust coordination through structured multi-hop communication. On the SMACv2 benchmark, COMPASS outperforms state-of-the-art MARL baselines, achieving a 57% win rate in the symmetric Protoss 5v5 task, a 30 percentage point advantage over QMIX (27%).
COMPASS 是一个多智能体框架,通过集成视觉语言模型实现去中心化的闭环决策,以解决合作多智能体强化学习中的样本效率、可解释性和泛化能力问题。它从专家演示中动态生成和优化可解释的代码策略,并通过结构化的多跳通信协议确保团队之间的稳健协调。在 SMACv2 基准测试中,COMPASS 显著优于最先进的多智能体强化学习基线,其在对称的Protoss 5v5任务中的胜率达到了57%,比QMIX(27%)高出30个百分点。
The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection
Authors: Yazhe Wan, Changjae Oh
First: 2026-05-05T11:14:30+00:00 · Latest: 2026-05-05T11:14:30+00:00
Comments: 15 pages; 4 figures; Accepted to ICPR 2026; Code is available at https://github.com/QM-IPAlab/DAT
Abstract
Open-vocabulary object detection aims to recognize objects from an open set of categories, which leverages vision-language models (VLMs) pre-trained on large-scale image-text data. The cooperative paradigm combines an object detector with a VLM to achieve zero-shot recognition of novel objects. However, VLMs pre-trained on full images often struggle to capture local object details, limiting their effectiveness when applied to region-level detection. We present Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning approach to improve VLMs for cooperative model-based object detection. Given a cooperative model consisting of a closed-set detector and a VLM, we first construct a region-aware pseudo-labeled dataset using a pre-trained closed-set object detector, in which regions corresponding to novel objects may be present but remain unlabeled or mislabeled. We then fine-tune the visual backbone of the VLM in a decoupled manner, which enhances local feature alignment while preserving global semantic knowledge via weight interpolation. DAT is a plug-and-play module that requires no inference overhead and fine-tunes less than 0.8M parameters. Experiments on the COCO and LVIS datasets show that DAT consistently improves detection performance on both novel and known categories, establishing a new state of the art in cooperative open-vocabulary detection.
中文标题/摘要
标题:检测器自我学习:轻量级自我监督适应在开放词汇对象检测中的应用
开放词汇对象检测旨在从一个开放类别集合中识别对象,这利用了在大规模图像-文本数据上预训练的视觉-语言模型(VLMs)。合作范式将对象检测器与VLM结合,以实现对新对象的零样本识别。然而,预训练在完整图像上的VLM往往难以捕捉局部对象细节,限制了其在区域级检测中的效果。我们提出了解耦适应训练(DAT),这是一种自我监督的微调方法,以提高VLMs在合作模型基础对象检测中的效果。给定一个合作模型由一个封闭集检测器和一个VLM组成,我们首先使用预训练的封闭集对象检测器构建一个区域感知的伪标签数据集,在其中可能包含新对象的区域但未被标注或错误标注。然后,我们以解耦的方式微调VLM的视觉骨干,这增强了局部特征对齐,同时通过权重插值保留全局语义知识。DAT是一个即插即用模块,不需要推理开销,并微调不到0.8M参数。在COCO和LVIS数据集上的实验表明,DAT在新类别和已知类别上一致地提高了检测性能,建立了合作开放词汇检测的新状态。
Summary / 总结
The paper addresses the challenge of open-vocabulary object detection by proposing Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning approach. DAT enhances a vision-language model (VLM) for cooperative object detection with a closed-set detector. By constructing a region-aware pseudo-labeled dataset and fine-tuning the VLM’s visual backbone in a decoupled manner, DAT improves local feature alignment while preserving global semantic knowledge. Experiments on COCO and LVIS datasets demonstrate consistent performance improvements for both novel and known categories, setting a new state-of-the-art in cooperative open-vocabulary detection.
论文提出了一种自监督微调方法Decoupled Adaptivity Training (DAT),以解决开放词汇对象检测的挑战。DAT通过构建区域感知的伪标签数据集,并以解耦方式微调VLM的视觉骨干,增强局部特征对齐同时保留全局语义知识。在COCO和LVIS数据集上的实验表明,DAT在新类别和已知类别上均表现出一致的性能提升,建立了合作开放词汇检测的新状态。
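The weight interpolation mentioned in the DAT abstract can be pictured with a minimal sketch. The abstract does not state the formula, so plain linear interpolation between pre-trained and fine-tuned weights is an assumption here, and the names `interpolate_weights` and `alpha` are hypothetical rather than taken from the DAT code.

```python
from typing import Dict, List


def interpolate_weights(
    pretrained: Dict[str, List[float]],
    finetuned: Dict[str, List[float]],
    alpha: float,
) -> Dict[str, List[float]]:
    """Blend two weight sets: (1 - alpha) * pretrained + alpha * finetuned.

    Assumption: linear interpolation per parameter; alpha = 0 keeps the
    pre-trained (global) weights, alpha = 1 keeps the fine-tuned (local) ones.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    merged: Dict[str, List[float]] = {}
    for name, base in pretrained.items():
        tuned = finetuned[name]
        # Element-wise blend of the two versions of the same parameter.
        merged[name] = [(1.0 - alpha) * b + alpha * t for b, t in zip(base, tuned)]
    return merged
```

Under this assumption, a small `alpha` stays close to the original VLM and a larger one moves toward the region-adapted backbone; the merged weights replace the originals, which is consistent with the claim of no inference overhead.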
VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
Authors: Tanush Yadav, Mohammadreza Salehi, Jae Sung Park, Vivek Ramanujan, Hannaneh Hajishirzi, Yejin Choi, Ali Farhadi, Rohun Tripathi, Ranjay Krishna
Venue: CVPR 2026 Highlight
First: 2026-05-04T17:11:16+00:00 · Latest: 2026-05-05T09:59:53+00:00
Comments: CVPR 2026 Highlight. Website at https://tanu.sh/videonet
Abstract
Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B gets a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Still, Qwen achieves only 59.2% accuracy. Further relaxing the evaluation setup, we provide $k\in\{1,2,3\}$ in-context examples of the action. Some models excel in the few-shot setting, while others falter; Qwen improves $+7.0\%$, while Gemini declines $-4.8\%$. Notably, these gains fall short of the $+13.6\%$ improvement in non-expert humans when given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.
Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models
Authors: JuneHyoung Kwon, JungMin Yun, YoungBin Kim
First: 2026-05-05T09:18:22+00:00 · Latest: 2026-05-05T09:18:22+00:00
Comments: Accepted to LREC 2026
Abstract
Large Vision-Language Models (LVLMs), trained on web-scale data, risk memorizing and regenerating copyrighted visual content such as characters and logos, creating significant challenges. Machine unlearning offers a path to mitigate these risks by removing specific content post-training, but evaluating its effectiveness, especially in the complex multimodal setting of LVLMs, remains an open problem. Current evaluation methods often lack robustness or fail to capture the nuances of cross-modal concept erasure. To address this critical gap, we introduce the CoVUBench benchmark, the first framework specifically designed for evaluating copyright content unlearning in LVLMs. CoVUBench utilizes procedurally generated, legally safe synthetic data coupled with systematic visual variations spanning compositional changes and diverse domain manifestations to ensure realistic and robust evaluation of unlearning generalization. Our comprehensive multimodal evaluation protocol assesses both forgetting efficacy from the copyright holder perspective and the preservation of general model utility from the deployer viewpoint. By rigorously measuring this crucial trade-off, CoVUBench provides a standardized tool to advance the development of responsible and effective unlearning methods for LVLMs.
Summary / 总结
The research aims to address the risk of large vision-language models (LVLMs) memorizing and regenerating copyrighted content, which poses significant challenges. The study introduces CoVUBench, a benchmark for evaluating copyright unlearning in LVLMs, using procedurally generated, legally safe synthetic data with systematic visual variations. The evaluation protocol assesses both the effectiveness of forgetting from the copyright holder's perspective and the preservation of model utility from the deployer's viewpoint, providing a standardized tool for advancing responsible unlearning methods.
研究旨在解决大型视觉语言模型(LVLMs)记忆和再生版权内容所带来的挑战。研究引入了CoVUBench基准,使用合法安全的合成数据和系统性的视觉变化。评估协议从版权持有者的角度评估遗忘效果,同时从部署者的角度评估模型功能的保留,提供了一个标准化工具来促进负责任和有效的LVLMs遗忘方法的发展。
AVA: Attentive VLM Agent for Mastering StarCraft II
Authors: Weiyu Ma, Yuqian Fu, Zecheng Zhang, Bernard Ghanem, Guohao Li
First: 2025-03-07T12:54:25+00:00 · Latest: 2026-05-05T08:57:22+00:00
Abstract
We introduce AVACraft, a multimodal StarCraft II benchmark supporting both Multi-Agent Reinforcement Learning (MARL) and Vision-Language Model (VLM) paradigms. Unlike SMAC-family environments that rely on abstract state representations and exclude VLMs, AVACraft provides RGB visuals, natural language observations, and structured state information, enabling systematic comparison between training-based and zero-shot methods across 21 scenarios spanning micromanagement, coordination, and strategic planning. We establish comprehensive baselines: six MARL algorithms (IQL, QMIX, QTRAN, VDN, MAPPO, IPPO) with Swin-Transformer backbones trained for 5M steps, and multiple VLMs including proprietary (GPT-4o) and open-source (Qwen3-VL) models. Results reveal complementary strengths: MARL peaks at 19.3% win rate after 5M steps, while VLMs achieve 75-90% zero-shot with human-aligned decisions, exposing trade-offs between training efficiency, performance ceilings, interpretability, and deployment cost. Code: https://github.com/camel-ai/VLM-Play-StarCraft2.
Summary / 总结
The paper introduces AVACraft, a multimodal StarCraft II benchmark for comparing training-based and zero-shot methods. It includes RGB visuals, natural language observations, and structured state information, allowing for a comprehensive comparison across 21 scenarios. The study establishes baselines with six MARL algorithms and multiple VLMs, showing that MARL achieves a 19.3% win rate after training, while VLMs perform at 75-90% zero-shot with human-aligned decisions, highlighting the trade-offs between training efficiency, performance, interpretability, and deployment cost. Code: https://github.com/camel-ai/VLM-Play-StarCraft2.
研究引入了AVACraft,这是一个支持多智能体强化学习(MARL)和视觉语言模型(VLM)的多模态StarCraft II基准,提供了RGB视觉、自然语言观察和结构化状态信息,允许在21个涵盖微观管理、协调和战略规划的场景中系统比较训练方法和零样本方法。研究建立了六种MARL算法和多种VLM的全面基线,结果显示MARL算法在5M步后达到19.3%的胜率,而VLM在零样本情况下达到75-90%的人类对齐决策,揭示了训练效率、性能上限、可解释性和部署成本之间的权衡。代码: https://github.com/camel-ai/VLM-Play-StarCraft2.
MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models
Authors: Kangkang Wang, Qinting Jiang, Wanping Zhang, Bowen Ren, Shengzhao Wen
First: 2026-05-05T08:20:48+00:00 · Latest: 2026-05-05T08:20:48+00:00
Abstract
Multidimensional human understanding is essential for real-world applications such as film analysis and virtual digital humans, yet current LVLM benchmarks largely focus on single-task settings and lack fine-grained, human-centric evaluation. In this work, we introduce MHPR, a comprehensive benchmark for joint perception-reasoning over human-centric scenes spanning individual, multi-person, and human-object interaction dimensions. MHPR comprises a multi-level data design, consisting of Captioned Raw Data (C-RD), Supervised Fine-Tuning Data (SFT-D), Reinforcement Learning Data (RL-D), and Test Data (T-D), together with an automated caption/VQA generation pipeline (ACVG) that performs category-wise attribute decomposition, attribute-specific rewriting, and multi-model voting to ensure high-quality, scalable annotations. We evaluate state-of-the-art vision-language models on fine-grained attributes (appearance, clothing, pose, parts) and high-level semantics (social relations, action semantics, spatial relations, intent and functionality). Our findings show that: 1) format-aligned SFT data substantially improves instruction following and stability; 2) challenge-focused RL data derived from bad-case analysis further enhances perception and reasoning on difficult instances; and 3) training Qwen2.5-VL-7B with MHPR yields significant gains, achieving near-parity with considerably larger models. We release ACVG and MHPR to facilitate reproducible, extensible research on human-centric perception and reasoning.
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection
Authors: Chih-Chung Liu, Zhiwei Lin, Yongtao Wang
First: 2026-05-05T07:44:03+00:00 · Latest: 2026-05-05T07:44:03+00:00
Abstract
Open-world object detection aims to localize and recognize objects beyond a fixed closed-set label space. It is commonly divided into two categories, i.e., open-vocabulary detection, which assumes a predefined category list at test time, and open-ended detection, which requires generating candidate categories during inference. Existing methods rely primarily on coarse textual semantics and parametric knowledge, which often provide insufficient visual evidence for fine-grained appearance variation, rare categories, and cluttered scenes. In this paper, we propose VL-SAM-v3, a unified framework that augments open-world detection with retrieval-grounded external visual memory. Specifically, once candidate categories are available, VL-SAM-v3 retrieves relevant visual prototypes from a non-parametric memory bank and transforms them into two complementary visual priors, i.e., sparse priors for instance-level spatial anchoring and dense priors for class-aware local context. These priors are integrated with the original detection prompts via Memory-Guided Prompt Refinement, enabling a shared retrieval-and-refinement mechanism that supports open-vocabulary and open-ended inference. Extensive zero-shot experiments on LVIS show that VL-SAM-v3 consistently improves detection performance under both open-vocabulary and open-ended inference, with particularly strong gains on rare categories. Moreover, experiments with a stronger open-vocabulary detector (i.e., SAM3) validate the generality of the proposed retrieval-and-refinement mechanism.
中文标题/摘要
标题:VL-SAM-v3:面向开放世界物体检测的记忆引导视觉先验
开放世界物体检测旨在定位和识别超出固定封闭标签空间的物体。它通常分为两类:开放词汇检测,假设测试时有一个预定义的类别列表;开放生成检测,需要在推理过程中生成候选类别。现有方法主要依赖粗略的文本语义和参数知识,这通常不足以为细粒度外观变化、稀有类别和杂乱场景提供足够的视觉证据。在本文中,我们提出了一种名为VL-SAM-v3的统一框架,该框架通过检索导向的外部视觉记忆增强了开放世界检测。具体而言,一旦候选类别可用,VL-SAM-v3将从非参数记忆库中检索相关视觉原型,并将其转换为两个互补的视觉先验,即实例级空间锚定的稀疏先验和类感知局部上下文的密集先验。这些先验通过记忆引导提示精炼与原始检测提示集成,实现了一种共享检索和精炼机制,支持开放词汇和开放生成推理。在LVIS上的大量零样本实验表明,VL-SAM-v3在开放词汇和开放生成推理下均能提高检测性能,特别是在稀有类别上表现出显著的提升。此外,使用更强的开放词汇检测器(即SAM3)的实验验证了所提出的检索和精炼机制的普适性。
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
Authors: Shule Lu, Yujing Wang, Hainan Zhang, Xiaoshan Yang, Hongwei Zheng, Yongxin Tong, Changsheng Xu, Zhiming Zheng
First: 2026-05-05T07:02:50+00:00 · Latest: 2026-05-05T07:02:50+00:00
Abstract
Vision-Language Models (VLMs) have broad potential in privacy-sensitive domains such as healthcare and finance, yet strict data-sharing constraints render centralized training infeasible. Federated Learning mitigates this issue by enabling decentralized training, but practical deployments face challenges due to client heterogeneity in computational resources, application requirements, and model architectures. Under extreme model and data heterogeneity, replacing parameter aggregation with preference-based collaboration offers a more suitable interface, as it eliminates the need for direct parameter or data exchange. Motivated by this, we propose MoR, a federated alignment framework that combines GRPO with Mixture-of-Rewards for heterogeneous VLMs. In MoR, each client locally trains a reward model from local preference annotations, capturing specific evaluation signals without exposing raw data. To combine these heterogeneous supervision signals, MoR introduces a Mixture-of-Rewards mechanism with learned routing, which adaptively fuses client reward models according to the input and alignment objective. The server then optimizes a base VLM using GRPO with a KL penalty to a reference model, enabling preference alignment without requiring client models to share architectures or parameters. Experiments on diverse public vision-language benchmarks demonstrate that MoR consistently outperforms federated alignment baselines in generalization and cross-client adaptability. Our approach provides a scalable solution for privacy-preserving alignment of heterogeneous VLMs under federated settings.
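The Mixture-of-Rewards idea in the MoR abstract, fusing client reward models through learned routing, can be pictured roughly as follows. This is only a sketch under stated assumptions: the abstract does not specify the router, so a softmax over per-client router logits is assumed, and `softmax`, `mixture_reward`, `client_rewards`, and `router_logits` are hypothetical names.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_reward(client_rewards: List[float], router_logits: List[float]) -> float:
    """Fuse per-client reward scores for one response into a single scalar.

    Assumption: the router emits one logit per client reward model for the
    current input, and the fused reward is the softmax-weighted sum.
    """
    if len(client_rewards) != len(router_logits):
        raise ValueError("one router logit is required per client reward")
    weights = softmax(router_logits)
    return sum(w * r for w, r in zip(weights, client_rewards))
```

In this picture only scalar reward scores cross the client boundary, which matches the abstract's point that neither raw data nor model parameters need to be exchanged.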
GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning
Authors: Yujun Li, Hongyuan Zhang, Yuan Yuan
First: 2026-05-05T06:23:20+00:00 · Latest: 2026-05-05T06:23:20+00:00
Abstract
Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. This raises the question of whether GRPO can also substantially improve the test-time adaptation (TTA) of vision-language models. In this paper, we propose Group Relative Policy Optimization for Test-Time Adaptation (GRPO-TTA), which adapts GRPO to the TTA setting by reformulating class-specific prompt prediction as a group-wise policy optimization problem. Specifically, we construct output groups by sampling top-K class candidates from CLIP similarity distributions, enabling probability-driven optimization without access to ground-truth labels. Moreover, we design reward functions tailored to test-time adaptation, including alignment rewards and dispersion rewards, to guide effective visual encoder tuning. Extensive experiments across diverse benchmarks demonstrate that GRPO-TTA consistently outperforms existing test-time adaptation methods, with notably larger performance gains under natural distribution shifts.
中文标题/摘要
标题:GRPO-TTA:通过GRPO驱动的强化学习实现视觉语言模型的测试时视觉调优
组相对策略优化(GRPO)最近在后训练大型语言模型和视觉语言模型中表现出强大的性能。它提出了一个问题,即GRPO是否也显著促进了视觉语言模型的测试时适应(TTA)。在本文中,我们提出了组相对策略优化用于测试时适应(GRPO-TTA),通过将类特定的提示预测重新表述为组间策略优化问题,将GRPO适应TTA设置。具体而言,我们通过从CLIP相似性分布中采样前K类候选构建输出组,使优化基于概率而不访问真实标签。此外,我们设计了针对测试时适应的奖励函数,包括对齐奖励和分散奖励,以指导有效的视觉编码器调优。广泛的实验表明,GRPO-TTA在各种基准上始终优于现有的测试时适应方法,在自然分布偏移下表现出显著更大的性能提升。
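The group-wise formulation in the GRPO-TTA abstract can be sketched in a few lines. The advantage below is the usual GRPO normalization (reward minus group mean, divided by group standard deviation). Everything else is an assumption for illustration: candidates are taken deterministically as the top-K similarities rather than sampled, the reward values are placeholders for the paper's alignment and dispersion rewards, and the function names are hypothetical.

```python
import statistics
from typing import List


def top_k_candidates(similarities: List[float], k: int) -> List[int]:
    """Return indices of the k classes with the highest similarity.

    Simplification: the abstract describes sampling from the CLIP similarity
    distribution; a deterministic top-k is used here for clarity.
    """
    order = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    return order[:k]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantages are computed relative to the group itself, no ground-truth label or learned value function is required, which is what makes the scheme usable at test time.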
Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling
Authors: Xinglin Wang, Jiayi Shi, Shaoxiong Feng, Peiwen Yuan, Yiwei Li, Yueqi Zhang, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li
First: 2026-01-29T13:18:36+00:00 · Latest: 2026-05-05T06:17:11+00:00
Comments: preprint
Abstract
Test-Time Scaling enhances the reasoning capabilities of Large Language Models by allocating additional inference compute to broaden the exploration of the solution space. However, existing search strategies typically treat rollouts as disposable samples, where valuable intermediate insights are effectively discarded after each trial. This wasted rollout-level experience leads to substantial computational redundancy, as models repeatedly re-derive discovered conclusions and revisit known dead ends across extensive attempts. To bridge this gap, we propose Recycling Search Experience (RSE), a self-guided, training-free strategy that turns test-time search from a series of isolated trials into a cumulative, experience-guided process. By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends. Theoretically, we provide an analysis that formalizes the efficiency gains of RSE over independent sampling in solving complex reasoning tasks. Empirically, extensive experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that RSE consistently outperforms strong baselines under comparable computational budgets, establishing a strong compute-efficiency frontier for test-time scaling.
Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework
Authors: Ke Liu, Jiwei Wei, Shuchang Zhou, Yutong Xiao, Ruikun Chai, Yitong Qin, Yuyang Zhou, Yang Yang
First: 2026-05-05T05:50:38+00:00 · Latest: 2026-05-05T05:50:38+00:00
Abstract
Supervised talking head forgery detection faces severe generalization challenges due to the continuous evolution of generators. By reducing reliance on generator-specific forgery patterns, self-supervised detectors offer stronger cross-generator robustness. However, existing research has mainly focused on building stronger detectors, while the discriminative capacity of trained detectors remains insufficiently exploited. In particular, for score-based self-supervised detectors, the limited discriminative ability on hard cases is often reflected in unreliable anomaly ordering, leaving room for further refinement. Motivated by this observation, we draw inspiration from the dual-system theory of human cognition and propose a Training-Free Dual-System (TFDS) framework to further exploit the latent discriminative capacity of existing score-based self-supervised detectors. TFDS treats anomaly-like scores as the basis of System-1, using lightweight threshold-based routing to partition samples into confident and uncertain subsets. System-2 then revisits only the uncertain subset, performing fine-grained evidence-guided reasoning to refine the relative ordering of ambiguous samples within the original score distribution. Extensive experiments demonstrate consistent improvements across datasets and perturbation settings, with the gains arising mainly from corrected ordering within the uncertain subset. These findings show that existing self-supervised talking head forgery detectors still contain underexploited discriminative cues that can be effectively unlocked through training-free dual-system reasoning.
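The System-1 routing step described in the TFDS abstract can be sketched as a simple partition of detector scores. The abstract only says the routing is threshold-based; the two-sided band `(low, high)` and the name `route_by_threshold` are assumptions made for this illustration.

```python
from typing import List, Tuple


def route_by_threshold(
    scores: List[float], low: float, high: float
) -> Tuple[List[int], List[int]]:
    """Split sample indices into (confident, uncertain) subsets.

    Assumption: scores strictly inside (low, high) are treated as ambiguous
    and handed to the slower System-2 stage; all others keep their
    System-1 score unchanged.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    confident: List[int] = []
    uncertain: List[int] = []
    for index, score in enumerate(scores):
        if low < score < high:
            uncertain.append(index)
        else:
            confident.append(index)
    return confident, uncertain
```

Only the `uncertain` indices would then be revisited, so the cost of the second stage scales with the width of the band rather than with the size of the dataset.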
GeoDecider: A Coarse-to-Fine Agentic Workflow for Explainable Lithology Classification
Authors: Jiahao Wang, Mingyue Cheng, Yitong Zhou, Qingyang Mao, Xiaoyu Tao, Qi Liu, Enhong Chen
First: 2026-05-05T05:42:51+00:00 · Latest: 2026-05-05T05:42:51+00:00
Abstract
Lithology classification aims to infer subsurface rock types from well-logging signals, supporting downstream applications like reservoir characterization. Despite substantial progress, most existing methods still treat lithology classification as a single-pass classification task. In contrast, practical experts incorporate geological principles, external knowledge, and tool-use capabilities to perform accurate classification. In this work, we propose GeoDecider, a coarse-to-fine agentic workflow that enables accurate and explainable lithology classification through training-free use of large language models (LLMs). GeoDecider reformulates lithology classification as an expert-like structured process and organizes it into a multi-stage workflow involving coarse-to-fine reasoning. Specifically, GeoDecider includes the following stages: (1) base classifier-guided coarse classification, which uses a pre-trained classifier to provide a rough reference for downstream tasks, thus reducing the overall cost of downstream reasoning, (2) tool-augmented reasoning, which utilizes several tools such as contextual analysis and neighbor retrieval to achieve finer and more precise classifications, (3) geological refinement, which post-processes the final results to enforce geological consistency. Experiments on four benchmarks show that GeoDecider outperforms representative baselines. Further analysis demonstrates that the proposed framework produces geologically interpretable predictions while achieving a better trade-off between classification performance and inference efficiency.
Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation
Authors: Siyou Lin, Zhou Xue, Hongwen Zhang, Liang An, Dongping Li, Shaohui Jiao, Yebin Liu
First: 2026-05-05T04:37:00+00:00 · Latest: 2026-05-05T04:37:00+00:00
Abstract
Recent trends in sparse-view 3D reconstruction have taken two different paths: feed-forward reconstruction that predicts pixel-aligned point maps without a complete geometry, and generative 3D reconstruction that generates complete geometry but often with poor input-alignment. We present Mix3R, a novel generative 3D reconstruction method which mixes feed-forward reconstruction and 3D generation into a single framework in an aligned manner. Mix3R generates a 3D shape in two stages: a sparse voxel generation stage and a textured geometry generation stage. Unlike pure generative methods, our first-stage generation jointly produces a coarse 3D structure (sparse voxels), per-view point maps and camera parameters aligned to that 3D structure. This is made possible by introducing a Mixture-of-Transformers architecture that inserts global self-attentions to a feed-forward reconstruction model and a 3D generative model, both pretrained on large-scale data. This design effectively retains the pretrained priors but enables better 2D-3D alignment. Based on the initial aligned generations of sparse 3D voxels and point maps, we compute an overlap-based attention bias that is directly added to another pretrained textured geometry generation model, enabling it to correctly place input textures onto generated shapes in a training-free manner. Our design brings mutual benefits to both feed-forward reconstruction and 3D generation: The feed-forward branch learns to ground its predictions to a generative 3D prior, and conversely, the 3D generation branch is conditioned on geometrically informative features from the feed-forward branch. As a result, our method produces 3D shapes with better input alignment compared with pure 3D generative methods, together with camera pose estimations more accurate than previous feed-forward reconstruction methods. Our project page is at https://jsnln.github.io/mix3r/
中文标题/摘要
标题:Mix3R: 结合前馈重建和生成的3D先验进行联合多视图对齐3D重建和姿态估计
稀疏视图3D重建的最新趋势采取了两条不同的路径:前馈重建预测像素对齐的点图,而无需完整的几何结构,以及生成的3D重建生成完整的几何结构,但输入对齐往往较差。我们提出了Mix3R,这是一种新颖的生成3D重建方法,将前馈重建和3D生成以对齐的方式结合到一个框架中。Mix3R 以两个阶段生成3D形状:稀疏体素生成阶段和纹理几何生成阶段。与纯粹的生成方法不同,我们第一阶段的生成同时生成粗略的3D结构(稀疏体素)、每视图点图和与该3D结构对齐的相机参数。这通过引入混合变换器架构实现,该架构将全局自注意力插入前馈重建模型和3D生成模型中,两者均在大规模数据上预训练。此设计有效地保留了预训练的先验知识,但能够更好地实现2D-3D对齐。基于初始对齐生成的稀疏3D体素和点图,我们计算基于重叠的注意力偏差,直接添加到另一个预训练的纹理几何生成模型中,使其能够在无需训练的情况下正确地将输入纹理放置在生成的形状上。我们的设计对前馈重建和3D生成都有益:前馈分支学习将其预测锚定到生成的3D先验,反之亦然,3D生成分支受到来自前馈分支的几何信息特征的条件。因此,与纯粹的3D生成方法相比,我们的方法生成的3D形状具有更好的输入对齐,并且相机姿态估计比之前的前馈重建方法更准确。我们的项目页面位于 https://jsnln.github.io/mix3r/
VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models
Authors: JF Bastien, Sam D'Amico
First: 2026-05-05T04:13:32+00:00 · Latest: 2026-05-05T04:13:32+00:00
Comments: 37 pages, 6 figures, 22 tables; code and artifacts available at https://github.com/jfbastien/VLMaxxing
Abstract
Video vision-language models (VLMs) keep paying for visual state the stream already told us was stable. The factory wall did not move, but most VLM pipelines still hand the model dense RGB frames or a fresh prefix again. We study that waste as training-free anti-recomputation: reuse state when validation says it survives, and buy fresh evidence when the scene, query, or cache topology requires it. The largest measured win is after ingest. On frozen Qwen2.5-VL-7B-Instruct-4bit, adaptive same-video follow-up reuse preserves paired choices and correctness on a 93-query VideoMME breadth setting while reducing follow-up latency by 14.90-35.92x. The first query is still cold; the win starts when later questions reuse the same video state. Stress tests bound the result: repeated-question schedules hold through 50 turns, while dense-answer-anchored prompt variation separates conservative fixed K=1 repair from faster aggressive policies that drift. Fresh-video pruning is smaller but real. C-VISION skips timed vision-tower work before the first answer is generated. On Gemma 4-E4B-4bit, the clean 32f short cell reaches 1.316x first-query speedup with no paired drift or parse failures on 20 items; Qwen shows the fidelity/speed boundary. Stage-share ceiling (C-CEILING) is the accounting guardrail: a component speedup becomes an end-to-end speedup only in proportion to the wall-clock share it accelerates, so C-VISION and after-ingest follow-up reuse do not multiply. Candidate C-STREAM remains a native-rate target, not a headline result here. The broader direction is VLM-native media that expose change, motion, uncertainty, object state, sensor time, and active tiles directly, so models do not have to rediscover the world from dense RGB every frame.
中文标题/摘要
标题:VLMaxxing通过FrameMogging训练无损反重计算视频视觉语言模型
视频视觉语言模型(VLMs)一直在为视觉状态支付费用,而这些状态已经在流中告诉我们是稳定的。工厂的墙壁没有移动,但大多数VLM流水线仍然会向模型提供密集的RGB帧或一个新的前缀。我们研究这种浪费作为训练无损反重计算:当验证表明状态存活时重用状态,并在场景、查询或缓存拓扑需要时购买新的证据。 最大的收益是在摄入后。在冻结的Qwen2.5-VL-7B-Instruct-4bit上,自适应同视频后续重用保留了93个查询的VideoMME广度设置中的配对选择和正确性,同时将后续延迟减少了14.90-35.92倍。第一个查询仍然是冷的;当后续问题重用相同的视频状态时,收益才开始显现。压力测试界定了结果:重复问题的时间表可以持续50轮,而密集答案锚定的提示变化将保守的固定K=1修复与更快的激进政策区分开来,后者会漂移。 新视频剪枝是真实的,但规模较小。C-VISION在第一个答案生成之前跳过定时视觉塔工作。在Gemma 4-E4B-4bit上,干净的32f短单元在20个项目中实现了1.316倍的首个查询加速,没有配对漂移或解析失败;Qwen展示了准确性和速度的边界。 阶段共享天花板(C-CEILING)是会计护栏:组件加速仅在其加速的墙钟时间份额范围内成为端到端加速,因此C-VISION和摄入后后续重用不会相乘。候选C-STREAM仍然是一个原生速率目标,而不是头条结果。更广泛的方向是VLM原生媒体,直接暴露变化、运动、不确定性、对象状态、传感器时间和活动瓷砖,使模型不必每帧都重新发现世界。
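The stage-share ceiling (C-CEILING) is described only in words in the abstract. The sketch below assumes it corresponds to the standard Amdahl's-law relation, with `share` the fraction of wall-clock time spent in the accelerated stage. The function name and the example values in the comments are illustrative, not figures from the paper.

```python
def end_to_end_speedup(share: float, component_speedup: float) -> float:
    """Amdahl-style end-to-end speedup.

    share: fraction of total wall-clock time spent in the accelerated stage.
    component_speedup: how much faster that stage alone becomes.
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must lie in [0, 1]")
    if component_speedup <= 0.0:
        raise ValueError("component_speedup must be positive")
    # The untouched part keeps its cost; the accelerated part shrinks.
    return 1.0 / ((1.0 - share) + share / component_speedup)


# Illustrative only: a stage taking half the wall-clock time, made 2x faster,
# yields well under 2x end to end; no stage speedup can exceed 1 / (1 - share).
half_share_gain = end_to_end_speedup(0.5, 2.0)
```

Under this reading, two speedups that apply to different stages, or to different queries as with C-VISION and follow-up reuse, cannot simply be multiplied together.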
What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models
Authors: Sen Nie, Jie Zhang, Zhongqi Wang, Zhaoyang Wei, Shiguang Shan, Xilin Chen
First: 2026-03-13T09:02:11+00:00 · Latest: 2026-05-05T03:22:58+00:00
Comments: Preliminary analyses should be evaluated under strictly adaptive attacks; some conclusions require further validation
Abstract
Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness. Our project page is available at https://summu77.github.io/R-Adapt.
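Chinese Title / Abstract (translated)
Title: What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models
Achieving adversarial robustness in vision-language models (VLMs) inevitably sacrifices accuracy on clean data, a long-standing challenge. In this work, we revisit this trade-off by asking a basic question: what makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we study how robustness mechanisms operate internally and how they interact with clean accuracy. Our analysis shows that robustness is not distributed uniformly across network depth. Instead, unexpectedly, it is concentrated mainly in the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to weaken both clean accuracy and robust generalization. Based on these insights, we propose the Adversarial Robustness Adaptation (R-Adapt) framework, which freezes all pre-trained weights and introduces a small number of insight-driven adaptations in the initial layers only. This design achieves an excellent balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible ways to equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks show state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness. Our project page is available at https://summu77.github.io/R-Adapt/.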
Summary / 总结
This study investigates the trade-off between adversarial robustness and clean accuracy in Vision-Language Models (VLMs) and finds that robustness is primarily localized in the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Motivated by these insights, it proposes R-Adapt, a framework that freezes the pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers, achieving an exceptional balance between robustness and accuracy. Extensive evaluations on 18 datasets and diverse tasks support the framework, which also generalizes to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness.