arXiv 论文速递

Snapshot: 20260522_0454

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

Authors: Basel Shbita, Pengyuan Li, Anna Lisa Gentile

First: 2026-05-20T17:58:24+00:00 · Latest: 2026-05-20T17:58:24+00:00

Abstract

Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence for correct resolution. WikiVQABench comprises a substantial collection of Wikipedia images with curated multiple-choice questions designed to benchmark knowledge-aware vision-language models (VLMs). Evaluation of fifteen VLMs (256M-90B parameters) reveals a wide performance range (24.7%-75.6% accuracy), demonstrating that the benchmark effectively discriminates model capabilities on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.

Summary / 总结

Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone.

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

Authors: Kaichen Zhou, Zeyang Bai, Xinhai Chang, Mengyu Wang, Paul Liang, Fangneng Zhan

First: 2026-05-20T17:55:16+00:00 · Latest: 2026-05-20T17:55:16+00:00

Comments: Multi-view 3D Generation, Streaming 3D Generation

Abs · PDF · Code1 · Code2 · Project1

Abstract

View-conditioned 3D generators such as SAM 3D, TRELLIS and Hunyuan3D produce high-quality object reconstructions from a single view, but real-world visual observation often arrives as long monocular streams. Naively applying these generators to each streaming frame independently leads to severe temporal inconsistency in the generated results. To address this problem, we propose Stream3D, the first training-free streaming mechanism that turns a frozen view-conditioned 3D generator into a streaming generator with constant cross-chunk memory. Stream3D achieves this by maintaining a compact evidential memory, which selectively caches the most informative historical frames based on a proposed evidence score mechanism. As the stream progresses, the memory dynamically updates to retain a fixed number of informative frames, preventing the memory footprint from growing linearly with sequence length. This also prevents degradation over long sequences and keeps the underlying generator completely unchanged without retraining, architectural modifications, or auxiliary losses. Evaluated on both realistic and synthetic streaming benchmarks, Stream3D outperforms latent-transport baselines, including KV-cache reuse and flow-based feature editing, across both photometric and geometric metrics. More details can be found at: https://anonymous-submission-20.github.io/streaming3D.github.io/.

Summary / 总结

StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

Authors: Guanlong Jiao, Chenyangguang Zhang, Jia Jun Cheng Xian, Zewei Zhang, Renjie Liao

First: 2026-05-20T17:52:10+00:00 · Latest: 2026-05-20T17:52:10+00:00

Comments: Project Page: https://dsl-lab.github.io/StreamGVE/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality yet satisfying editing results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with modern generative models than noise-to-data generation. To address this gap, we revisit video editing from a noise-to-data perspective and propose Streaming-Generation-based Video Editing (StreamGVE), which preserves few-step sampling while seamlessly injecting source-video conditions. Built on pre-trained streaming generation models, StreamGVE introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting to satisfy both sampling and conditioning requirements. We further propose source-oriented guidance to improve target-generation quality, and a visual prompting strategy to enhance editing flexibility and practicality. The method is effective, robust, and generalizable across different models. Extensive experiments on diverse video editing tasks show that StreamGVE consistently outperforms existing approaches, even in few-step settings with minimal time cost.

Summary / 总结

Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality yet satisfying editing results.

Mem-$π$: Adaptive Memory through Learning When and What to Generate

Authors: Xiaoqiang Wang, Chao Wang, Hadi Nekoei, Christopher Pal, Alexandre Lacoste, Spandana Gella, Bang Liu, Perouz Taslakian

First: 2026-05-20T17:51:05+00:00 · Latest: 2026-05-20T17:51:05+00:00

Comments: Work in progress

Abs · PDF · Code1 · Code2

Abstract

We present Mem-$π$, a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores. Existing memory-augmented agents typically rely on similarity-based retrieval from episodic memory banks or skill libraries, returning static entries that often misalign with the current context. In contrast, Mem-$π$ uses a dedicated language or vision-language model with its own parameters, separate from the downstream agent, to generate context-specific guidance for complex tasks. Conditioned on the current agent context, the model jointly decides when to produce guidance and what guidance to produce. We train it with a decision-content decoupled reinforcement learning (RL) objective, enabling it to abstain when generation would not help and otherwise produce concise, useful guidance. Across diverse agentic benchmarks spanning web navigation, terminal-based tool use, and text-based embodied interaction, Mem-$π$ consistently outperforms retrieval-based and prior RL-optimized memory baselines, achieving over 30% relative improvement on web navigation tasks.

Summary / 总结

We present Mem-$π$, a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores.

TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos

Authors: Yakun Yu, Ashley Wiens, Adrián Barahona-Ríos, Benedict Wilkins, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer

First: 2026-05-20T17:32:26+00:00 · Latest: 2026-05-20T17:32:26+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations, however, treat glitches as static visual anomalies, asking models to detect failures from a single frame. We argue that this framing misses a key distinction: some glitches are spatial and visible in an isolated frame, whereas others are temporal and become evident only through changes across ordered frames. A preliminary study confirms this gap, showing that temporal glitches are substantially harder for VLMs to detect than spatial ones. To enable systematic evaluation of this underexplored setting, we introduce TempGlitch, a controlled gameplay video benchmark for temporal glitch detection. TempGlitch covers five temporal glitch types with balanced per-category samples, together with paired glitch-free videos that enable reliable binary evaluation. We evaluate 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Our results show that current VLMs remain near chance on TempGlitch, often collapsing into either overly conservative behavior that misses most glitches or overly sensitive behavior that flags clean videos as glitchy. Moreover, denser frame sampling and larger model size do not reliably resolve these failures. TempGlitch provides a focused testbed for temporal reasoning, robust gameplay understanding, and automated glitch detection with VLMs. Code and data are available at the project website.

Summary / 总结

Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection.

Adaptive Signal Resuscitation: Channel-wise Post-Pruning Repair for Sparse Vision Networks

Authors: Qishi Zhan, Ziheng Chen, Minxuan Hu

First: 2026-05-20T17:19:01+00:00 · Latest: 2026-05-20T17:19:01+00:00

Abs · PDF · Code1 · Code2

Abstract

One-shot magnitude pruning can cause severe accuracy collapse in the high-sparsity regime, even when the pruning mask preserves the largest weights. We argue that this failure reflects a granularity mismatch in post-pruning repair. Under global magnitude pruning, nearly collapsed channels can coexist with channels that retain informative activation variance within the same layer. Existing layer-wise activation repair methods apply a single correction to the whole layer, and can therefore over-amplify damaged channels while trying to restore the layer-level signal. We propose Adaptive Signal Resuscitation (ASR), a training-free channel-wise repair method that matches the granularity of repair to the granularity of damage. ASR estimates a variance-matching correction for each output channel and stabilizes it with a data-driven shrinkage rule, suppressing unreliable corrections for channels with weak post-pruning signal while preserving corrections for healthier channels. Applied before BatchNorm recalibration, ASR requires only forward passes on a small calibration set and no retraining. Across three datasets, four convolutional architectures, and both unstructured and structured sparsity settings, ASR generally improves over layer-wise repair, with the clearest gains in high-sparsity regimes. On ResNet-50 at 90% sparsity, ASR recovers 55.6% top-1 accuracy on CIFAR-10, compared with 41.0% for layer-wise repair and 28.0% for BatchNorm-only recalibration. Ablations show that naive channel-wise variance matching is insufficient, and that shrinkage stabilizes post-pruning repair.

Summary / 总结

One-shot magnitude pruning can cause severe accuracy collapse in the high-sparsity regime, even when the pruning mask preserves the largest weights.

Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers

Authors: Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, Bastian Leibe

First: 2025-09-08T18:16:09+00:00 · Latest: 2026-05-20T17:17:47+00:00

Comments: Project page at https://vision.rwth-aachen.de/sparse-vggt

Abs · PDF · Code1 · Code2

Abstract

Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT, $π^3$ and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrix of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method accelerates inference by more than $3\times$ while maintaining comparable task performance. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate that our approach seamlessly integrates into existing global attention-based architectures such as VGGT, $π^3$ , and MapAnything, while substantially improving scalability to large image collections.

Summary / 总结

Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision.

AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing

Authors: Lauhitya Reddy, Trisha M. Kesar, Hyeokhyen Kwon

First: 2026-05-20T17:14:57+00:00 · Latest: 2026-05-20T17:14:57+00:00

Comments: 18 pages 3 figures, 2 tables

Abs · PDF · Code1 · Code2

Abstract

Motion capture is the gold standard for measuring human movement, but clinical use remains limited by cost, technical complexity, and privacy concerns. AIGaitor is a privacy-preserving, cloud-free motion analysis system that runs markerless monocular motion-capture pipelines and downstream deep-learning analysis entirely on a consumer smartphone using on-device neural accelerators. To motivate its design, we surveyed 74 rehabilitation clinicians: 92 percent said they would adopt an accurate, cost-effective, easy-to-use AI gait analysis tool, while 79.7 percent cited operating cost, 68.9 percent insufficient training, and 64.9 percent privacy concerns as leading barriers. We then optimized and benchmarked mobile iOS implementations of current monocular pipeline components, including 2D and 3D pose estimation, pose optimization, skeleton-based deep-learning analysis, and a vision-language model. A Time-Priority end-to-end on-device pipeline processes a 10 s 4K 60 fps video clip in 77 s on an iPhone 14, matching or beating the same pipeline on a high-end NVIDIA H200 cloud server when network transfer is included: 94 s at global mobile-average uplink and 66 s at developed-world Wi-Fi. Lightweight models such as ViTPose-s achieve real-time keypoint extraction, and skeleton-based action-recognition models provide sub-millisecond gait classification on the same clip. To our knowledge, AIGaitor is the first monocular system to demonstrate end-to-end on-device motion capture and downstream deep-learning analysis, supporting clinically applicable movement analysis that is low-cost, private, and accessible to smartphone users.

Summary / 总结

Motion capture is the gold standard for measuring human movement, but clinical use remains limited by cost, technical complexity, and privacy concerns.

Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control

Authors: Yoon Pyo Lee, Samrendra Roy, Jay Yoo, Kazuma Kobayashi, Sajedul Talukder, Seid Koric, Souvik Chakraborty, Syed Bahauddin Alam

First: 2025-12-29T08:26:27+00:00 · Latest: 2026-05-20T15:48:38+00:00

Abs · PDF · Code1 · Code2

Abstract

The prevailing paradigm in AI for physical systems (scaling general-purpose foundation models toward universal multimodal reasoning) confronts a fundamental barrier at the control interface. Recent benchmarks show that even frontier vision--language models achieve only 50--53% accuracy on basic quantitative physics tasks, behaving as approximate guessers that preserve semantic plausibility by violating physical constraints. This input unfaithfulness is not a scaling deficiency but a structural limitation: perception-centric architectures optimize parameter-space imitation, whereas safety-critical control demands outcome-space guarantees over executed actions. Here, we present a fundamentally different pathway "toward" domain-specific foundation models by introducing compact language models operating as Agentic Physical AI, in which policy optimization is driven by physics-based validation rather than perceptual inference. We train a 360-million-parameter model on synthetic nuclear reactor control scenarios, scaling the dataset from 10^3 to 10^5 examples. Scaling induces strong improvements in closed-loop reliability under nominal simulated conditions, with a steep but smooth gain at strict tolerances: small-scale systems exhibit high-variance imitation with severe tail excursions, while large-scale models undergo variance collapse (approximately 500times reduction), stabilizing execution-level behavior within the sampled distribution. Despite balanced exposure to four actuation families, the model autonomously rejects approximately 70\% of the training distribution, concentrating 95% of runtime execution on a single-bank strategy. This emergent policy distillation arises without reinforcement learning or reward engineering, driven solely by outcome-level success under physical execution.

Summary / 总结

The prevailing paradigm in AI for physical systems (scaling general-purpose foundation models toward universal multimodal reasoning) confronts a fundamental barrier at the control interface.

Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens

Authors: Meng Shen, Minghao Wu, Deepu Rajan

First: 2026-05-20T15:29:20+00:00 · Latest: 2026-05-20T15:29:20+00:00

Comments: 20 pages, 10 figures, 10 tables

Abs · PDF · Code1 · Code2

Abstract

Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice. We hypothesize that one possible origin of hallucination is the model's tendency to prioritize text generation over meaningful interaction with images. To explore this, we examine the generation process and categorize text tokens into three groups: image-positive, invariant, and negative, based on their visual dependence on input image tokens. Our analysis reveals that most generated tokens are minimally influenced by the image information. This suggests that during the model's training stage, more emphasis is placed on learning how to follow textual instructions, rather than extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination. Additionally, we remove a portion of the training data that potentially contains more hallucinations as a data filtering strategy. Both methods achieve a reduction in hallucination without compromising response length or introducing additional computational costs during inference. We validate our methods across three LVLM variants, demonstrating the effectiveness and general applicability.

Summary / 总结

Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice.

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

Authors: Benjamin Aubin, Gonzalo Iñaki Quintana, Onur Tasar, Sanjeev Sreetharan, Urszula Czerwinska, Damien Henry, Clément Chadebec

First: 2026-05-20T15:04:56+00:00 · Latest: 2026-05-20T15:04:56+00:00

Abs · PDF · Code1 · Code2

Abstract

Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinders open and reproducible research in the field. We introduce MONET, an open Apache 2.0 dataset of approx. 104.9M image--text pairs collected from 2.9B raw pairs across heterogeneous open sources through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions, and further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and reach competitive GenEval and DPG scores, demonstrating that our dataset lowers the barrier to large-scale, reproducible text-to-image research.

Summary / 总结

Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions.

STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval

Authors: Miaoge Li, Dongsheng Wang, Zening Sun, Jinsen Zhang, Wenhan Luo, Jingcai Guo

First: 2026-05-20T14:51:29+00:00 · Latest: 2026-05-20T14:51:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Training-free zero-shot composed image retrieval models are recently gaining increasing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploring the compositional ability behind the LLMs. Although efficient, we find that 1) the generated captions tend to introduce unexpected features from the reference image due to the semantic gap between the input image and text modification, where the image contains much more details than the text; 2) the point-to-point alignment during the retrieval stage fails to capture diverse compositions. To address these challenges, we introduce a novel Semantic Transition and Transportation in collaboration framework for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we aim to refine it through a transition vector in the embedding space and make it closer to the target image. Combining LLMs with user instruction, the refined caption concentrates more on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignment during the retrieval stage, we model the caption and image as discrete distributions and reformulate the retrieval task as a set-to-set alignment task. Finally, a bidirectional transportation distance is developed to consider fine-grained alignments across modalities and calculate the retrieval score. Extensive experiments demonstrate that our method can be general, effective, and beneficial for many CIR tasks.

Summary / 总结

Training-free zero-shot composed image retrieval models are recently gaining increasing research interest due to their generalizability and flexibility in unseen multimodal retrieval.

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

Authors: Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu

Venue: CVPR

First: 2026-05-19T13:40:26+00:00 · Latest: 2026-05-20T14:42:34+00:00

Comments: CVPR'26 (Workshop on Video Large Language Models). Project Page: https://joslefaure.github.io/assets/html/finebench.html

Abs · PDF · Code1 · Code2 · Project1

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.

Summary / 总结

Semantic Granularity Navigation in Image Editing

Authors: Liangsi Lu, Minzhe Guo, Xuhang Chen, Yang Shi

Venue: ICML 2026

First: 2026-05-20T13:53:13+00:00 · Latest: 2026-05-20T13:53:13+00:00

Comments: Accepted by ICML 2026

Abs · PDF · Code1 · Code2

Abstract

Despite the generative capabilities of diffusion and flow models, real-image editing remains constrained by a persistent trade-off between semantic editability and structural fidelity. We trace a primary cause of this limitation to the implicit coupling of edit progress with model scale in existing paradigms. Under this coupling, stronger edits typically require visiting noisier states, which spends computation on destabilizing layout before the semantic change is well localized. We introduce NaviEdit, a training-free inference-time controller that decouples edit progress from model scale traversal through a strict self-consistency contract. NaviEdit operates at the rollout level and leaves the underlying pretrained model unchanged. It treats scale as a control input and reallocates a fixed step budget toward semantically responsive intermediate scales instead of destructive high-noise regimes. Experiments show positive average gains across compatible editors and flow backbones, supporting decoupling as a portable inference-time control principle.

Summary / 总结

Despite the generative capabilities of diffusion and flow models, real-image editing remains constrained by a persistent trade-off between semantic editability and structural fidelity.

Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models

Authors: Luigi Curini, Alfio Ferrara, Giovanni Pagano, Sergio Picascia

First: 2026-03-30T07:06:49+00:00 · Latest: 2026-05-20T13:17:12+00:00

Comments: to be published in: ParlaCLARIN V: Interoperability, Multilinguality, and Multimodality in Parliamentary Corpora, organized within the 15th Language Resource and Evaluation Conference (2026)

Abs · PDF · Code1 · Code2

Abstract

Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies knowledge base through SPARQL queries and a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark demonstrates substantial improvements both in transcription quality and speaker tagging.

Summary / 总结

Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents.

SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

Authors: Jingyi He, Yue Zhou, Long Bai, Kun Yuan, Nassir Navab, Yuan Bi

First: 2026-05-20T13:04:39+00:00 · Latest: 2026-05-20T13:04:39+00:00

Abs · PDF · Code1 · Code2

Abstract

Understanding surgical workflow in real time is fundamental for intelligent surgical embodiment, where AI systems continuously perceive and respond as surgery proceeds. In the operating room, critical decisions depend on subtle, moment-to-moment changes, such as fine instrument movements and evolving tissue states, where even slight perceptual delays can limit assistance or compromise safety. Yet existing methods remain offline or operate at coarse temporal scales, generating descriptions only after processing clips, preventing immediate reaction. We address this by proposing SurgOnAir, a streaming vision-language model that processes frames sequentially without future access and progressively generates narration tokens as visual input arrives. SurgOnAir achieves fine-grained frame-to-token generation, enabling instant responsiveness to evolving surgical dynamics. Built upon our curated hierarchical dataset SurgOnAir-11k spanning action-, step-, and phase-level supervision, the model is trained to produce multi-level textual responses that reflect the inherent hierarchy of surgical procedures. Furthermore, special transition tokens are generated to explicitly mark state changes, allowing SurgOnAir to capture and signal key workflow transitions as they occur. Experiments show that SurgOnAir enables real-time understanding through a single vision-language model that unifies streaming across multiple hierarchies of the surgical workflow, generating superior and hierarchy-aware narrations. Code and dataset will be public.

Summary / 总结

Understanding surgical workflow in real time is fundamental for intelligent surgical embodiment, where AI systems continuously perceive and respond as surgery proceeds.

Epistemic Uncertainty Quantification for Pre-trained VLMs via Riemannian Flow Matching

Authors: Li Ju, Mayank Nautiyal, Andreas Hellander, Ekta Vats, Prashant Singh

Venue: Forty-Third International Conference on Machine Learning, 2026

First: 2026-01-29T12:58:42+00:00 · Latest: 2026-05-20T13:03:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) are typically deterministic in nature and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model's lack of knowledge or ignorance of its own representations. We theoretically motivate negative log-density of an embedding as a proxy for the epistemic uncertainty, where low-density regions signify model ignorance. The proposed method REPVLM computes the probability density on the hyperspherical manifold of the VLM embeddings using Riemannian Flow Matching. We empirically demonstrate that REPVLM achieves near-perfect correlation between uncertainty and prediction error, significantly outperforming existing baselines. Beyond classification, we also demonstrate that the model also provides a scalable metric for out-of-distribution detection and automated data curation.

Summary / 总结

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

Authors: Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma, Xiaohui Li

First: 2026-03-15T02:21:05+00:00 · Latest: 2026-05-20T12:33:19+00:00

Abs · PDF · Code1 · Code2

Abstract

Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.

Summary / 总结

Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks.

How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

Authors: Roberto Brusnicki, Mattia Piccinini, Johannes Betz

First: 2026-04-08T07:14:55+00:00 · Latest: 2026-05-20T12:28:41+00:00

Comments: 8 pages, 5 figures

Abs · PDF · Code1 · Code2 · Project1

Abstract

Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos, and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal how even top models achieve only 57% accuracy, not matching human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel with static object detection but struggle with understanding vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material available at https://TUM-AVS.github.io/VENUSS/.

Summary / 总结

Pix2Fact: When Vision Is Not Enough -- Benchmarking Fine-Grained VQA with Web Verification on High-Resolution Real-World Scenes

Authors: Yifan Jiang, Cong Zhang, Bofei Zhang, Qiaofeng Zheng, Yifan Yang, Bingzhang Wang, Yew-Soon Ong

First: 2026-01-31T08:18:34+00:00 · Latest: 2026-05-20T11:27:52+00:00

Abs · PDF · Code1 · Code2

Abstract

Despite progress on general tasks, vision-language models (VLMs) still struggle with challenges that demand both fine-grained visual grounding and external knowledge, a synergy overlooked by existing benchmarks that evaluate these abilities in isolation. To fill this void, we introduce Pix2Fact, a visual question-answering benchmark designed to assess expert-level visual perception and knowledge search. Pix2Fact comprises 1,000 high-resolution (4K+) images spanning eight scenarios. Its questions and answers are meticulously crafted by PhD-holding annotators from top global universities across diverse disciplines. Each question requires detailed visual grounding and the integration of external knowledge. Evaluating ten state-of-the-art VLMs, including proprietary models such as Gemini-3.1-Pro and GPT-5.4, we find that Pix2Fact poses a formidable challenge: the most advanced model (Gemini-3.1-Pro) achieves only 51.7% average accuracy, even with access to visual ground truth and search tools. Our analysis attributes this low accuracy to three factors, frequent visual grounding errors even with visual ground truth, shallow search harnessing, and VLM's inability to retrieve long-tail, unstructured local information. This striking gap exposes the limitations of current models in assisting humans with real-world scenarios that demand overwhelming visual comprehension. We believe Pix2Fact will serve as a critical benchmark to drive the next generation of language-vision agents that seamlessly integrate fine-grained perception with robust knowledge search.

Summary / 总结

Beyond Attention Scores: SVD-Based Vision Token Pruning for Efficient Vision-Language Models

Authors: Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa

First: 2026-04-13T14:30:13+00:00 · Latest: 2026-05-20T11:23:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) have revolutionized multi-modal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-k tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.

Summary / 总结

Vision-Language Models (VLMs) have revolutionized multi-modal learning by jointly processing visual and textual information.

The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

Authors: Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgöz, Esam Ghaleb

First: 2025-10-09T17:21:59+00:00 · Latest: 2026-05-20T11:08:35+00:00

Abs · PDF · Code1 · Code2

Abstract

Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.

Summary / 总结

Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding.

Unlocking Dense Metric Depth Estimation in VLMs

Authors: Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke

First: 2026-05-15T11:54:17+00:00 · Latest: 2026-05-20T09:56:58+00:00

Comments: Project Page: https://depthvlm.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified multimodal foundation model. The project page is available at https://depthvlm.github.io/

Summary / 总结

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding.

Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy

Authors: Yutong Xie, Zhenglin Hua, Ran Wang, Wing W. Y. Ng, Xizhao Wang, Yuheng Jia

Venue: ICML 2026

First: 2026-05-20T09:50:13+00:00 · Latest: 2026-05-20T09:50:13+00:00

Comments: Accepted by ICML 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx-ML/ILVAD.

Summary / 总结

Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks.

DrawMotion: Generating 3D Human Motions by Freehand Drawing

Authors: Tao Wang, Lei Jin, Zhihua Wu, Qiaozhi He, Jiaming Chu, Yu Cheng, Junliang Xing, Jian Zhao, Shuicheng Yan, Li Wang

First: 2026-05-20T09:43:50+00:00 · Latest: 2026-05-20T09:43:50+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Text-to-motion generation, which translates textual descriptions into human motions, faces the challenge that users often struggle to precisely convey their intended motions through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions based on both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives: 1) freehand drawing condition. To accurately capture users' intended motions without requiring tedious textual input, we develop an algorithm to automatically generate hand-drawn stickman sketches across different dataset formats; 2) multi-condition fusion. We propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared to conventional approaches; and 3) training-free guidance. Notably, the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, allowing classifier-guidance gradients to update the features and thereby aligning the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces user time by approximately 46.7% when generating motions aligned with their imagination. The code, demos, and relevant data are publicly available at https://github.com/InvertedForest/DrawMotion.

Summary / 总结

Text-to-motion generation, which translates textual descriptions into human motions, faces the challenge that users often struggle to precisely convey their intended motions through text alone.

Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

Authors: Yulin Zhao, Yun Wang, Dehua Zheng, Borui jiang, Zheng Zhang

First: 2026-05-20T09:37:53+00:00 · Latest: 2026-05-20T09:37:53+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference. Existing vision token reduction methods alleviate this burden, but they unintentionally preserve the isolated visual subject strictly aligned with the user's query, which fails to substantially explore salient subjects and their contextual relationships. In this paper, we propose SPpruner, a subject-centric progressive reduction paradigm that emulates the \textit{Focus-then-Context} mechanism of the human visual perception system. Specifically, we first construct a focus identification module to explicitly model the interplay between visual saliency and semantic relevance. Herein, it can excavate the comprehensive visual subject spectrum to ensure a high-fidelity representation of visual input. Subsequently, a context-aware structural scanning module is developed to aggregate contextual cues from neighboring regions. As such, it can effectively restore global relational dependencies to uphold the structural integrity of the preserved subjects. Extensive experiments demonstrate that our paradigm consistently outperforms SOTA methods, achieving up to 2.53 times speedup with only 22.2% of visual tokens retained in Qwen2.5-VL and a 67% FLOPs reduction on LLaVA with a negligible 0.6% accuracy drop.

Summary / 总结

Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference.

Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding

Authors: Lena Wild, Katie Z Luo, Marco Pavone

First: 2026-05-20T09:28:06+00:00 · Latest: 2026-05-20T09:28:06+00:00

Abs · PDF · Code1 · Code2

Abstract

Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving. While vision-language models (VLMs) offer promising semantic flexibility, they lack the geometric and relational grounding required for precise road reasoning. Conversely, traditional modular systems, e.g., HD maps and topological road graphs, provide structural precision but remain semantically rigid. To bridge this gap, we introduce the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation. CRS enables the automatic generation of compositionally complex and linguistically varied question-answer pairs via recursive graph queries, augmented with a "grounding for free" mechanism that ensures logical traceability to specific map elements, and procedurally extracted chain-of-thought supervision traces. We demonstrate that state-of-the-art VLMs - including large, closed-source models - struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Analysis of model behavior via verifiable reasoning traces reveals a systematic shift in failure modes: whereas baseline models fail at relational scene understanding, CRS-trained models reduce failures to attribute recognition, suggesting that the primary bottleneck in road understanding is not model scale, but the absence of structured supervision.

Summary / 总结

Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving.

VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

Authors: Yihong Guo, Youwei Lyu, Jiajun Tang, Yizhuo Zhou, Hongliang Wang, Jinwei Chen, Changqing Zou, Qingnan Fan

First: 2026-04-30T03:39:32+00:00 · Latest: 2026-05-20T09:24:51+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, give reasoning processes, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, creating optimization barriers and suffering from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. Furthermore, we develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Furthermore, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available at https://github.com/OpenVeraTeam/VeraRetouch.

Summary / 总结

Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, give reasoning processes, and execute precise retouching enhancements.

RISE: Reliable Improvement in Self-Evolving Vision-Language Models

Authors: Chaoran Xu, Yingmao Miao, Pengfei Zhang, Hao Dou, Lei Sun, Xiangxiang Chu

First: 2026-05-20T08:57:57+00:00 · Latest: 2026-05-20T08:57:57+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimodal tasks where questions, answers, and feedback signals must be carefully designed. This motivates self-evolving learning, where a model improves itself through a dual-role closed loop: a questioner autonomously poses questions and a solver learns to solve them. However, we observe that current VLM self-evolving methods still face three major challenges: coarse-grained role alternation delays the interaction between question generation and solver adaptation; generated questions can progressively degrade in quality; and question types may collapse toward a narrow distribution. These issues limit the efficiency and reliability of self-evolution. Thus, we propose \textbf{RISE}, a reliable self-evolving framework for vision-language models. RISE is built on three complementary designs: fine-grained role alternation, which shortens the feedback loop between the questioner and the solver to improve efficiency; a quality supervisor, which improves question validity and pseudo-label reliability; and skill-aware dynamic balancing, which mitigates mode collapse and maintains broad skill coverage during evolution. Together, these components enable more reliable and effective self-evolution from unlabeled images. Experiments on two VLM backbones across seven benchmarks show that RISE consistently improves the base models, yielding broad and sustained gains. Our code is publicly available at https://github.com/AMAP-ML/RISE.

Summary / 总结

Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training.

FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

Authors: Jangho Park, Geon Yeong Park, Gihyun Kwon, Jong Chul Ye

First: 2026-05-20T08:55:37+00:00 · Latest: 2026-05-20T08:55:37+00:00

Comments: Project Page: https://flowlong-video.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via \emph{Tweedie matching} to enforce both \textbf{manifold constraint and temporal consistency} across overlap regions. \emph{Stochastic early-phase sampling} then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and further extends to audio-video joint generation and text-to-3DGS without any fine-tuning.

Summary / 总结

Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge.

History

20260521_0508 20260520_0449 20260519_0442 20260518_0409 20260517_0425 20260516_0436 20260515_0457 20260514_0507 20260513_0503 20260512_0505 20260511_0418 20260510_0414 20260509_0426 20260508_0435 20260507_0454 20260506_0427 20260505_0436 20260504_0410 20260503_0414 20260502_0426 20260501_0429 20260430_0430 20260429_0437 20260428_0429 20260427_0405 20260426_0404 20260425_0410 20260424_0430 20260423_0426 20260422_0424 20260421_0418 20260420_0359 20260419_0358 20260418_0415 20260417_0421 20260416_0425 20260415_0426 20260414_0423 20260413_0352 20260412_0347 20260411_0356 20260410_0412 20260409_0411 20260407_0404 20260406_0347 20260405_0344 20260404_0350 20260403_0400 20260401_0408 20260331_0407 20260329_0347 20260328_0350 20260326_0357 20260325_0405 20260324_0400 20260323_0342 20260322_0340 20260321_0347 20260320_0356 20260319_0358 20260318_0405 20260317_0401 20260316_0343 20260315_0341 20260314_0344 20260313_0352 20260312_0352 20260311_0347 20260310_0350 20260309_0338 20260308_0337 20260307_0347 20260306_0402 20260305_0348 20260304_0348 20260303_0348 20260302_0336 20260301_0339 20260228_0348 20260227_0354 20260226_0402 20260225_0404 20260224_0406 20260223_0338 20260222_0339 20260221_0345 20260220_0348 20260219_0358 20260218_0358 20260217_0343 20260216_0339 20260215_0338 20260213_0401 20260212_0404 20260210_0409 20260208_0339 20260207_0349 20260206_0347 20260205_0346 20260204_0354 20260202_0337 20260201_0333 20260131_0345 20260130_0341 20260129_0344 20260128_0341 20260127_0338 20260126_0330 20260125_0329 20260124_0337 20260123_0337 20260122_0343 20260121_0424 20260119_0329 20260118_0327 20260117_0332 20260116_0339 20260115_0334 20260114_0333 20260113_0334 20260112_0331 20260111_0329 20260110_0333 20260109_0334 20260108_0335 20260107_0330 20260106_0336 20260105_0328 20260104_0328 20260103_0325 20260102_0339 20260101_0329 20251231_0333 20251230_0332 20251229_0329 20251228_0332 20251227_0329 20251226_0330 20251225_0329 20251224_0331 20251223_0332 20251222_0328 20251221_0329 20251220_0330 20251219_0330 20251218_0345 20251217_0332 20251216_0333 20251215_0333 20251214_0327 20251212_0333 20251211_0331 20251210_0332 20251209_0331 20251208_0328 20251207_0327 20251206_0330 20251205_0331 20251204_0331 20251203_0333 20251202_0335 20251201_0328 20251130_0327 20251129_0328 20251128_0327 20251127_0327 20251126_0329 20251125_0327 20251124_0327 20251123_0326 20251122_0328 20251121_0328 20251120_0329 20251119_0328 20251118_0328 20251117_0326 20251116_0325 20251115_0327 20251114_0328 20251113_0330 20251112_0329 20251111_0328 20251110_0325 20251109_0326 20251108_0328 20251107_0328 20251106_0329 20251105_0326 20251104_0327 20251103_0324 20251102_0326 20251101_0324 20251031_0328 20251030_0330 20251029_0329 20251028_0329 20251027_0322 20251026_0327 20251025_0331 20251024_0329 20251023_0329 20251022_0330 20251021_0331 20251020_0328 20251019_0321 20251018_0327 20251017_0320 20251016_0328 20251015_0328 20251014_0323 20251011_0328 20251010_0330 20251009_0321 20251008_0343 20251007_0353 20251006_0325 20251005_0350 20251004_0352 20251003_0352 20251002_0356 20251001_0321 20250925_0335 20250924_0350 20250923_0348 20250922_0346 20250921_0345 20250920_0342 20250919_0346 20250918_0342 20250917_0336 20250916_0333 20250915_0333 20250914_0328 20250913_0322 20250912_0335 20250911_0337 20250910_0338 20250909_0341 20250908_0342 20250907_0333 20250906_0350 20250905_0319 20250904_0323 20250903_0355 20250902_0325 20250901_0355 20250831_0355 20250830_0356 20250829_0355 20250828_0333 20250827_1654 20250827_1602 20250827_1557 20250827_0320 20250826_0320 20250825_1752 20250825_1709 20250825_1652 20250825_1647 20250825_1645 20250825_1631 20250825_1606 20250825_1559 20250825_1558 20250825_1556 20250825_1531 20250825_1525 20250825_1516 20250825_1450 20250825_1444 20250825_1438 20250825_1414 20250825_1413 20250825_1410 20250825_1408 20250825_1405 20250825_1401 20250825_1355 20250825_1347 20250825_1345 20250825_1344 20250825_1343 20250825_1340 20250825_1339 20250825_1333 20250825_1323 20250825_1317 20250825_1243 20250824_0342 20250823_0343 20250823_0142 20250822_2331 20250822_2308 20250822_2258 20250822_2241 20250822_2228 20250822_2206 20250822_2147 20250822_2111 20250822_1259 20250822_1233 20250822_1229 20250822_1223 20250822_1210 20250822_1201 20250822_1111 20250822_1058 20250822_1052 20250822_1045 20250822_0657 20250822_0553