arXiv Paper Digest

2025-10-24 03:29
Snapshot: 20251024_0329
Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
Authors: Yuanli Wu, Long Zhang, Yue Du, Bin Li
First: 2025-10-20T12:54:32+00:00 · Latest: 2025-10-22T17:54:43+00:00
Abstract
We propose a rubric-guided, pseudo-labeled, and prompt-driven zero-shot video summarization framework that bridges large language models with structured semantic reasoning. A small subset of human annotations is converted into high-confidence pseudo labels and organized into dataset-adaptive rubrics defining clear evaluation dimensions such as thematic relevance, action detail, and narrative progression. During inference, boundary scenes, including the opening and closing segments, are scored independently based on their own descriptions, while intermediate scenes incorporate concise summaries of adjacent segments to assess narrative continuity and redundancy. This design enables the language model to balance local salience with global coherence without any parameter tuning. Across three benchmarks, the proposed method achieves stable and competitive results, with F1 scores of 57.58 on SumMe, 63.05 on TVSum, and 53.79 on QFVS, surpassing zero-shot baselines by +0.85, +0.84, and +0.37, respectively. These outcomes demonstrate that rubric-guided pseudo labeling combined with contextual prompting effectively stabilizes LLM-based scoring and establishes a general, interpretable, and training-free paradigm for both generic and query-focused video summarization.
中文标题/摘要
标题:基于上下文感知伪标签评分的零样本视频摘要
我们提出了一种基于评分准则、伪标签和提示驱动的零样本视频摘要框架,将大型语言模型与结构化语义推理相结合。一小部分人类注释被转换为高置信度的伪标签,并组织成数据集自适应的评分准则,定义清晰的评估维度,如主题相关性、动作细节和叙事进展。在推理过程中,边界场景,包括开头和结尾部分,根据自身的描述独立评分,而中间场景则结合相邻段落的简洁摘要来评估叙事连贯性和冗余性。这种设计使语言模型能够在不进行任何参数调整的情况下平衡局部显著性和全局一致性。在三个基准测试中,所提出的方法在SumMe上的F1得分为57.58,在TVSum上的得分为63.05,在QFVS上的得分为53.79,分别超越零样本基线0.85、0.84和0.37。这些结果表明,基于评分准则的伪标签结合上下文提示有效地稳定了基于LLM的评分,并为通用和查询导向的视频摘要建立了一种通用、可解释且无需训练的范式。
Summary / 总结
The paper proposes a rubric-guided pseudo-labeled and prompt-driven zero-shot video summarization framework that uses a small set of human annotations to generate high-confidence pseudo labels and evaluates video segments based on thematic relevance, action detail, and narrative progression. The method scores boundary scenes independently and intermediate scenes based on summaries of adjacent segments, allowing the language model to balance local salience and global coherence. The proposed method achieves competitive results on three benchmarks, outperforming zero-shot baselines on SumMe, TVSum, and QFVS by +0.85, +0.84, and +0.37 respectively in F1 scores.
论文提出了一种基于评分标准的伪标签和提示驱动的零样本视频摘要框架,利用少量的人工注释生成高置信度的伪标签,并根据主题相关性、动作细节和叙事进展来评估视频片段。该方法独立评分边界场景,并基于相邻片段的摘要评分中间场景,使语言模型能够在局部显著性和全局连贯性之间取得平衡。该方法在三个基准上取得了竞争力的结果,分别在SumMe、TVSum和QFVS上比零样本基线高出+0.85、+0.84和+0.37的F1分数。
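The context rule described in the abstract (boundary scenes scored from their own description, intermediate scenes scored with summaries of their neighbours) can be sketched as a prompt builder. This is a minimal illustration, not the authors' implementation: the function name `build_scoring_prompts`, the prompt wording, and the idea of passing the rubric as a plain string are all assumptions.

```python
from typing import List


def build_scoring_prompts(
    descriptions: List[str], summaries: List[str], rubric: str
) -> List[str]:
    """Build one scoring prompt per scene (hypothetical prompt layout)."""
    if len(descriptions) != len(summaries):
        raise ValueError("need exactly one summary per scene description")

    prompts = []
    last = len(descriptions) - 1
    for i, desc in enumerate(descriptions):
        parts = [f"Rubric:\n{rubric}", f"Scene {i} description:\n{desc}"]
        # Only intermediate scenes receive neighbour context; the first and
        # last scenes are scored from their own description alone.
        if 0 < i < last:
            parts.append(f"Previous scene summary:\n{summaries[i - 1]}")
            parts.append(f"Next scene summary:\n{summaries[i + 1]}")
        parts.append("Return a single importance score for this scene.")
        prompts.append("\n\n".join(parts))
    return prompts
```

Each returned prompt would then be sent to a language model of the reader's choice. With fewer than three scenes every scene counts as a boundary scene, so no neighbour context is added.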
Semantic World Models
Authors: Jacob Berg, Chuning Zhu, Yanda Bao, Ishan Durugkar, Abhishek Gupta
First: 2025-10-22T17:53:45+00:00 · Latest: 2025-10-22T17:53:45+00:00
Abstract
Planning with world models offers a powerful paradigm for robotic control. Conventional approaches train a model to predict future frames conditioned on current frames and actions, which can then be used for planning. However, the objective of predicting future pixels is often at odds with the actual planning objective; strong pixel reconstruction does not always correlate with good planning decisions. This paper posits that instead of reconstructing future frames as pixels, world models only need to predict task-relevant semantic information about the future. For such prediction the paper poses world modeling as a visual question answering problem about semantic information in future frames. This perspective allows world modeling to be approached with the same tools underlying vision language models. Thus vision language models can be trained as "semantic" world models through a supervised finetuning process on image-action-text data, enabling planning for decision-making while inheriting many of the generalization and robustness properties from the pretrained vision-language models. The paper demonstrates how such a semantic world model can be used for policy improvement on open-ended robotics tasks, leading to significant generalization improvements over typical paradigms of reconstruction-based action-conditional world modeling. Website available at https://weirdlabuw.github.io/swm.
中文标题/摘要
标题:语义世界模型
使用世界模型进行规划为机器人控制提供了一种强大的范式。 传统方法训练模型根据当前帧和动作预测未来帧,然后用于规划。然而,预测未来像素的目标往往与实际规划目标相矛盾;强大的像素重建并不总是与良好的规划决策相关。本文提出,世界模型不需要重建未来的像素,而是只需要预测与任务相关的未来语义信息。对于这种预测,本文将世界建模视为关于未来帧中语义信息的视觉问答问题。这种视角使得世界建模可以使用支撑视觉语言模型的相同工具来处理。因此,视觉语言模型可以通过监督微调过程在图像-动作-文本数据上训练为“语义”世界模型,从而在继承预训练视觉语言模型的泛化和鲁棒性特性的同时,实现决策制定中的规划。本文展示了如何使用这样的语义世界模型在开放性机器人任务中改进策略,从而在基于重建的条件动作世界建模的典型范式中实现显著的泛化改进。有关网站:https://weirdlabuw.github.io/swm/
Summary / 总结
This paper addresses the challenge of using world models for robotic planning by proposing a semantic approach. Instead of predicting future pixel frames, the model predicts task-relevant semantic information about the future. This is achieved by framing world modeling as a visual question answering problem and training vision-language models as "semantic" world models through supervised fine-tuning on image-action-text data. The key finding is that, when used for policy improvement on open-ended robotics tasks, the semantic world model generalizes significantly better than typical reconstruction-based action-conditional world models.
本文提出了一种语义方法来解决使用世界模型进行机器人规划的问题。该方法不预测未来的像素帧,而是专注于预测与任务相关的语义信息。这通过将任务视为视觉问答问题并使用监督微调训练视觉语言模型来实现“语义”世界模型。关键发现是,这种语义世界模型在泛化到新任务方面优于传统的基于像素重建的动作条件世界模型,在开放性机器人任务中的策略改进中表现出显著的性能提升。
olmOCR 2: Unit Test Rewards for Document OCR
Authors: Jake Poznanski, Luca Soldaini, Kyle Lo
First: 2025-10-22T17:53:02+00:00 · Latest: 2025-10-22T17:53:02+00:00
Comments: https://olmocr.allen.ai/
Abstract
We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
中文标题/摘要
标题:olmOCR 2:文档OCR单元测试奖励
我们介绍了olmOCR 2,这是我们家族中最新的一款强大的OCR系统,用于将数字化印刷文档,如PDF,转换为干净、自然排序的纯文本。olmOCR 2由olmOCR-2-7B-1025驱动,这是一种专门的7B视觉语言模型(VLM),使用可验证奖励(RLVR)的强化学习进行训练,其中我们的奖励是一组多样化的二元单元测试。为了扩展单元测试的创建,我们开发了一条生成具有多样性和挑战性布局的合成文档的管道,已知的地面真实HTML源代码和提取的测试用例。我们展示了在这些测试用例上进行RL训练的结果,在olmOCR-Bench,我们的英文OCR基准测试中达到了最先进的性能,与之前的版本相比,在数学公式转换、表格解析和多列布局方面取得了最大的改进。我们以宽松的开源许可发布了我们的模型、数据和代码。
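The idea of using binary unit tests as a verifiable reward can be illustrated with a small sketch. The abstract only states that the rewards are a diverse set of binary unit tests; the two test types below (`text_present`, `appears_before`) and the choice to aggregate them as the fraction of passing tests are assumptions made for illustration, not the olmOCR 2 implementation.

```python
from typing import Callable, List

# A unit test is any predicate over the OCR output text (hypothetical type).
UnitTest = Callable[[str], bool]


def text_present(snippet: str) -> UnitTest:
    """Pass if the snippet occurs anywhere in the output."""
    return lambda output: snippet in output


def appears_before(first: str, second: str) -> UnitTest:
    """Pass if both snippets occur and `first` precedes `second`."""
    def check(output: str) -> bool:
        i, j = output.find(first), output.find(second)
        return i != -1 and j != -1 and i < j
    return check


def unit_test_reward(output: str, tests: List[UnitTest]) -> float:
    """Assumed aggregation: fraction of binary tests that pass."""
    if not tests:
        return 0.0
    return sum(1 for test in tests if test(output)) / len(tests)
```

Because each test is a pure function of the output text, the reward can be checked without a learned judge, which is what makes it "verifiable" in the RLVR sense.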
Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models
Authors: Xiaozhen Qiao, Jingkai Zhao, Yuqiu Jiang, Xianda Guo, Zhe Sun, Hongyuan Zhang, Xuelong Li
First: 2025-10-22T17:38:35+00:00 · Latest: 2025-10-22T17:38:35+00:00
Abstract
Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose Class-Aware Prototype Learning with Negative Contrast (CPL-NC), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a Class-Aware Prototype Cache module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. Additionally, a Negative Contrastive Learning mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.
中文标题/摘要
标题:面向类别的原型学习与负对比度测试时适应性调整视觉-语言模型
视觉-语言模型(VLMs)通过大规模图像-文本预训练展示了令人印象深刻的零样本泛化能力,但在部署分布与训练分布不一致时,其性能会下降。为解决这一问题,测试时适应性(TTA)方法使用未标记的目标数据更新模型。然而,现有方法往往忽视了两个关键挑战:长尾分布中的原型退化和语义相似类别之间的混淆。为应对这些问题,我们提出了带负对比的类意识原型学习(Class-Aware Prototype Learning with Negative Contrast,CPL-NC),这是一种专为VLMs设计的轻量级TTA框架,旨在在分布转移下增强泛化能力。CPL-NC引入了一个类意识原型缓存模块,根据测试时频率和激活历史动态调整每个类别的容量,并通过一种不活跃类的再生机制保留稀有类别的知识。此外,一种负对比学习机制识别并限制难以区分的视觉-文本负样本,以提高类别可分性。该框架采用非对称优化,仅细化文本原型,同时锚定稳定的视觉特征。在15个基准测试上的实验表明,CPL-NC在ResNet-50和ViT-B/16两个骨干网络上均优于先前的TTA方法。
Summary / 总结
The research aims to improve the zero-shot generalization of Vision-Language Models (VLMs) by addressing prototype degradation and class confusion during test-time adaptation. The proposed CPL-NC framework introduces a Class-Aware Prototype Cache Module and a Negative Contrastive Learning Mechanism to dynamically adjust prototype capacity and improve class separability. Experiments on 15 benchmarks demonstrate that CPL-NC outperforms existing TTA methods across different backbone models.
研究旨在通过解决原型退化和类别混淆问题来提高Vision-Language Models (VLMs)的零样本泛化能力。提出的CPL-NC框架引入了Class-Aware Prototype Cache模块和Negative Contrastive Learning机制,以动态调整原型容量并提高类别可分性。实验结果显示,CPL-NC在不同骨干模型上均优于现有TTA方法。
Training-Free Constrained Generation With Stable Diffusion Models
Authors: Stefano Zampini, Jacob K. Christopher, Luca Oneto, Davide Anguita, Ferdinando Fioretto
Venue: NeurIPS 2025 Spotlight
First: 2025-02-08T16:11:17+00:00 · Latest: 2025-10-22T17:02:03+00:00
Comments: Spotlight at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract
Stable diffusion models represent the state-of-the-art in data synthesis across diverse domains and hold transformative potential for applications in science and engineering, e.g., by facilitating the discovery of novel solutions and simulating systems that are computationally intractable to model explicitly. While there is increasing effort to incorporate physics-based constraints into generative models, existing techniques are either limited in their applicability to latent diffusion frameworks or lack the capability to strictly enforce domain-specific constraints. To address this limitation, this paper proposes a novel integration of stable diffusion models with constrained optimization frameworks, enabling the generation of outputs satisfying stringent physical and functional requirements. The effectiveness of this approach is demonstrated through material design experiments requiring adherence to precise morphometric properties, challenging inverse design tasks involving the generation of materials inducing specific stress-strain responses, and copyright-constrained content generation tasks. All code has been released at https://github.com/RAISELab-atUVA/Constrained-Stable-Diffusion.
中文标题/摘要
标题:无训练约束生成与稳定扩散模型
稳定扩散模型在不同领域数据合成方面代表了最先进的技术水平,并且具有在科学和工程应用中变革性的潜力,例如通过促进新型解决方案的发现和模拟计算上难以明确建模的系统。虽然越来越多的努力致力于将基于物理的约束纳入生成模型中,但现有技术要么局限于潜在扩散框架的应用范围,要么缺乏严格实施特定领域约束的能力。为解决这一局限性,本文提出了一种将稳定扩散模型与约束优化框架相结合的新颖方法,从而能够生成满足严格物理和功能要求的输出。通过材料设计实验、需要严格形态学属性的逆向设计任务以及受版权约束的内容生成任务,证明了该方法的有效性。所有代码已发布在 https://github.com/RAISELab-atUVA/Constrained-Stable-Diffusion。
Summary / 总结
This paper addresses the challenge of incorporating physical constraints into generative models, particularly stable diffusion models, which are state-of-the-art in data synthesis. The authors propose integrating these models with constrained optimization frameworks to generate outputs that strictly adhere to specific physical and functional requirements. The effectiveness of this approach is shown through experiments in material design, inverse design tasks, and content generation with copyright constraints.
本文解决了将物理约束融入生成模型,特别是稳定扩散模型的问题,这些模型在数据合成方面处于领先地位。作者提出将这些模型与约束优化框架结合,以生成严格符合特定物理和功能要求的输出。该方法的有效性通过材料设计实验、逆向设计任务和版权受限的内容生成任务得到了验证。
A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
Authors: Jiacheng Liu, Xinyu Wang, Yuqi Lin, Zhikai Wang, Peiru Wang, Peiliang Cai, Qinming Zhou, Zhengan Yan, Zexuan Yan, Zhengyi Shi, Chang Zou, Yue Ma, Linfeng Zhang
First: 2025-10-22T16:46:05+00:00 · Latest: 2025-10-22T16:46:05+00:00
Comments: 22 pages, 2 figures
Abstract
Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent multi-step iterations and complex backbone networks lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, Diffusion Caching offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from static reuse to dynamic prediction. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both theory and practice of Efficient Generative Intelligence.
中文标题/摘要
标题:关于扩散模型中缓存方法的综述:朝向高效的多模态生成
扩散模型已成为现代生成AI的基石,因其卓越的生成质量和可控性。然而,它们固有的多步迭代和复杂骨干网络导致了巨大的计算开销和生成延迟,成为实时应用的主要瓶颈。尽管现有加速技术取得了一定进展,但仍面临适用性有限、高训练成本或质量下降等问题。在此背景下,扩散缓存提供了一种无需训练、架构无关且高效的推理范式。其核心机制识别并重用了扩散过程中的内在计算冗余。通过在特征级别实现跨步重用和跨层调度,它减少了计算量而不修改模型参数。本文系统地回顾了扩散缓存的理论基础及其演变,并提出了一种统一的分类和分析框架。通过对代表性方法的比较分析,我们表明扩散缓存从静态重用发展到动态预测。这一趋势增强了缓存的灵活性,使其适用于各种任务,并能够与其他加速技术如采样优化和模型蒸馏集成,为未来的多模态和交互应用提供统一、高效的推理框架。我们认为,这一范式将成为实时和高效生成AI的关键使能器,为高效生成智能的理论和实践注入新的活力。
Summary / 总结
The paper addresses the computational challenges of diffusion models, which are known for their high-quality and controllable generation but suffer from significant computational overhead. It introduces diffusion caching as a training-free, architecture-agnostic method that reuses computational redundancies to reduce inference latency. The study reviews the evolution of diffusion caching from static reuse to dynamic prediction, demonstrating its effectiveness in enhancing flexibility and integration with other acceleration techniques, thus paving the way for efficient multimodal generation.
论文探讨了使用扩散缓存来解决扩散模型的计算挑战,尽管这些模型在生成质量和可控性方面表现出色,但计算开销巨大。方法通过识别和重用扩散过程中的计算冗余来减少计算量,而不改变模型参数。关键发现表明,扩散缓存从静态重用发展到动态预测,增强了灵活性并能够与其他加速技术(如采样优化和模型蒸馏)集成,从而为多模态和交互式应用中的高效推理铺平了道路。
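The "static reuse" end of the spectrum described in the survey can be sketched as a wrapper that recomputes an expensive block only every few denoising steps and otherwise returns the cached feature. This is a toy sketch under stated assumptions: the class name `StepCache`, the fixed `refresh_interval` schedule, and the use of plain floats in place of feature tensors are all hypothetical and do not correspond to any specific method covered by the survey.

```python
from typing import Callable


class StepCache:
    """Reuse an expensive block's output across adjacent diffusion steps."""

    def __init__(self, expensive_block: Callable[[float], float],
                 refresh_interval: int) -> None:
        if refresh_interval < 1:
            raise ValueError("refresh_interval must be at least 1")
        self.expensive_block = expensive_block
        self.refresh_interval = refresh_interval
        self.compute_calls = 0      # how often the block really ran
        self._has_cache = False
        self._cached = 0.0

    def __call__(self, x: float, step: int) -> float:
        # Static schedule: recompute on every refresh_interval-th step,
        # otherwise return the feature cached at the last refresh.
        if not self._has_cache or step % self.refresh_interval == 0:
            self._cached = self.expensive_block(x)
            self._has_cache = True
            self.compute_calls += 1
        return self._cached
```

A "dynamic prediction" method would replace the fixed schedule with a decision or forecast based on how much the features are changing; that logic is deliberately left out here. The model parameters are never touched, which matches the training-free property stressed in the abstract.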
3D Visual Illusion Depth Estimation
Authors: Chengtang Yao, Zhidan Liu, Jiaxi Zeng, Lidong Yu, Yuwei Wu, Yunde Jia
Venue: NeurIPS 2025
First: 2025-05-19T12:51:03+00:00 · Latest: 2025-10-22T16:13:49+00:00
Comments: NeurIPS 2025, Project: https://github.com/YaoChengTang/3D-Visual-Illusion-Depth-Estimation
Abstract
3D visual illusion is a perceptual phenomenon where a two-dimensional plane is manipulated to simulate three-dimensional spatial relationships, making a flat artwork or object look three-dimensional in the human visual system. In this paper, we reveal that the machine visual system is also seriously fooled by 3D visual illusions, including monocular and binocular depth estimation. In order to explore and analyze the impact of 3D visual illusion on depth estimation, we collect a large dataset containing almost 3k scenes and 200k images to train and evaluate SOTA monocular and binocular depth estimation methods. We also propose a 3D visual illusion depth estimation framework that uses common sense from the vision language model to adaptively fuse depth from binocular disparity and monocular depth. Experiments show that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by various 3D visual illusions, while our method achieves SOTA performance.
中文标题/摘要
标题:3D视觉错觉深度估计
3D视觉错觉是一种知觉现象,通过操纵二维平面来模拟三维空间关系,使平面的艺术作品或物体在人类视觉系统中看起来具有三维效果。在本文中,我们揭示了机器视觉系统也会被3D视觉错觉严重欺骗,包括单眼和双眼深度估计。为了探索和分析3D视觉错觉对深度估计的影响,我们收集了一个包含近3000个场景和20万张图像的大规模数据集,用于训练和评估最先进的单眼和双眼深度估计方法。我们还提出了一种3D视觉错觉深度估计框架,该框架利用视觉语言模型中的常识,自适应地融合来自双眼视差和单眼深度的深度信息。实验表明,最先进的单眼、双眼和多视图深度估计方法都会被各种3D视觉错觉欺骗,而我们的方法则达到了最先进的性能。
Summary / 总结
This paper investigates how 3D visual illusions, which make flat artwork or objects look three-dimensional to human viewers, affect depth estimation in machine vision systems. The authors collected a large dataset of almost 3k scenes and 200k images and used it to train and evaluate state-of-the-art (SOTA) monocular and binocular depth estimation methods. They found that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by 3D visual illusions. The proposed framework, which uses common sense from a vision language model to adaptively fuse depth from binocular disparity and monocular depth, achieves SOTA performance.
该研究探讨了3D视觉错觉对机器视觉系统深度估计的影响。为此,作者收集了一个包含近3k个场景和200k张图像的大规模数据集,用于训练和评估最先进的单目和双目深度估计方法。研究发现,现有最先进的单目、双目和多视图深度估计方法都会被3D视觉错觉欺骗。提出的框架利用视觉语言模型中的常识,自适应地融合来自双目视差和单目深度的深度信息,达到了最先进的性能。
AgentSense: LLMs Empower Generalizable and Explainable Web-Based Participatory Urban Sensing
Authors: Xusen Guo, Mingxing Peng, Xixuan Hao, Xingchen Zou, Qiongyan Wang, Sijie Ruan, Yuxuan Liang
First: 2025-10-22T15:06:26+00:00 · Latest: 2025-10-22T15:06:26+00:00
Comments: 13 pages, 10 pages
Abstract
Web-based participatory urban sensing has emerged as a vital approach for modern urban management by leveraging mobile individuals as distributed sensors. However, existing urban sensing systems struggle with limited generalization across diverse urban scenarios and poor interpretability in decision-making. In this work, we introduce AgentSense, a hybrid, training-free framework that integrates large language models (LLMs) into participatory urban sensing through a multi-agent evolution system. AgentSense initially employs a classical planner to generate baseline solutions and then iteratively refines them to adapt sensing task assignments to dynamic urban conditions and heterogeneous worker preferences, while producing natural language explanations that enhance transparency and trust. Extensive experiments across two large-scale mobility datasets and seven types of dynamic disturbances demonstrate that AgentSense offers distinct advantages in adaptivity and explainability over traditional methods. Furthermore, compared to single-agent LLM baselines, our approach achieves better performance and robustness, while delivering more reasonable and transparent explanations. These results position AgentSense as a significant advancement towards deploying adaptive and explainable urban sensing systems on the web.
中文标题/摘要
标题:AgentSense:大规模语言模型赋能的通用化和可解释的基于网络的参与式城市感知
基于网络的参与式城市感知已成为现代城市管理的重要方法,通过利用移动个体作为分布式传感器。然而,现有的城市感知系统在多种城市场景下的泛化能力有限,并且在决策中的可解释性较差。在本工作中,我们引入了AgentSense,这是一种无需训练的混合框架,通过多智能体演化系统将大规模语言模型(LLMs)集成到参与式城市感知中。AgentSense 初始使用经典规划器生成基线解决方案,然后迭代优化它们,以适应动态城市条件和异质工作者偏好,并生成自然语言解释以增强透明度和信任。在两个大规模移动数据集和七种动态干扰类型上的广泛实验表明,AgentSense 在适应性和可解释性方面优于传统方法。此外,与单智能体LLM基线相比,我们的方法在性能和鲁棒性方面表现更优,同时提供更合理和透明的解释。这些结果使AgentSense 成为部署适应性和可解释的城市感知系统的重大进展。
Summary / 总结
AgentSense is a hybrid framework that integrates large language models (LLMs) into participatory urban sensing through a multi-agent evolution system. It uses classical planners to generate baseline solutions and iteratively refines them to adapt to dynamic urban conditions and worker preferences, providing natural language explanations for enhanced transparency. Experiments show that AgentSense outperforms traditional methods in adaptivity and explainability, and performs better than single-agent LLM baselines in both performance and robustness, with more reasonable explanations. This positions AgentSense as a significant advancement in adaptive and explainable urban sensing systems.
AgentSense 是一个无需训练、将大型语言模型集成到参与式城市感知中的混合框架,旨在增强适应性和可解释性。它先用经典规划器生成基线方案,再通过多智能体演化系统根据动态城市条件和工作者偏好迭代优化感知任务分配,并提供自然语言解释。实验表明,AgentSense 在适应性和可解释性方面优于传统方法,并且与单智能体 LLM 基线相比,在性能和鲁棒性上表现更优,同时提供了更透明和合理的解释。
MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom
Authors: Yifan Li, Fenghe Tang, Yingtai Li, Shaohua Kevin Zhou
First: 2025-10-22T14:21:59+00:00 · Latest: 2025-10-22T14:21:59+00:00
Comments: The code, checkpoints, and dataset are available at: https://github.com/Leevan001/MedReason-R1
Abstract
General-purpose large Vision-Language Models (VLMs) demonstrate strong capabilities in generating detailed descriptions for natural images. However, their performance in the medical domain remains suboptimal, even for relatively straightforward tasks, primarily due to the lack of large-scale, high-quality, specialized medical imaging datasets and the neglect of the diagnostic process that progresses from coarse to fine-grained. To address the first issue, we construct the CT-RATE-VQA dataset, which has 84K QA pairs. For the second issue, we propose MedReason-R1, a medical VLM with explicit reasoning process for disease diagnosis. MedReason-R1 incorporates a novel strategy that embeds zoom-in disease region-of-interest areas into the image, highlighting the crucial role of both global localization and disease-specific details in enhancing the model's diagnostic performance. Furthermore, we introduce the GRPO reinforcement learning framework to MedReason-R1, which enables effective reasoning without relying on costly manual annotations. Compared to recent general-purpose and medical VLMs, MedReason-R1 achieves state-of-the-art performance in CT disease diagnosis while retaining generalization. The code, checkpoints, and dataset are available at: https://github.com/Leevan001/MedReason-R1
中文标题/摘要
标题:MedReason-R1:利用强化学习和局部放大进行CT诊断推理
通用大型视觉-语言模型(VLMs)在生成自然图像详细描述方面表现出强大的能力。然而,在医疗领域,即使是相对简单的任务,其性能仍然不尽如人意,主要原因是缺乏大规模、高质量的专业医学成像数据集,以及忽视了从粗略到精细的诊断过程。为了解决第一个问题,我们构建了CT-RATE-VQA数据集,包含84K问答对。为了解决第二个问题,我们提出了MedReason-R1,这是一种具有明确推理过程的医疗VLM,用于疾病诊断。MedReason-R1结合了一种新颖的策略,即将疾病区域的局部放大嵌入到图像中,强调了全局定位和疾病特定细节在提高模型诊断性能中的关键作用。此外,我们引入了GRPO强化学习框架到MedReason-R1中,这使得有效的推理不再依赖于昂贵的手动注释。与最近的通用和医疗VLMs相比,MedReason-R1在CT疾病诊断方面取得了最先进的性能,同时保持了泛化能力。代码、检查点和数据集可在以下链接获取:https://github.com/Leevan001/MedReason-R1
Summary / 总结
The research aims to improve the performance of Vision-Language Models (VLMs) in the medical domain, particularly for CT diagnosis. To address the lack of specialized medical datasets and the need for diagnostic reasoning, the authors created the CT-RATE-VQA dataset and proposed MedReason-R1, a VLM with an explicit reasoning process. MedReason-R1 uses a zoom-in strategy to highlight disease-specific details and a GRPO reinforcement learning framework for effective reasoning. The model outperforms existing VLMs in CT disease diagnosis while maintaining generalization capabilities.
MedReason-R1通过构建CT-RATE-VQA数据集和提出具有明确推理过程的新型医疗VLM来解决通用Vision-Language模型在医疗领域的局限性。该模型采用区域放大策略突出疾病关键细节,并使用GRPO强化学习框架提高诊断性能,无需人工标注。MedReason-R1在CT疾病诊断中表现出色,同时保持了泛化能力。
XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography
Authors: Haozhe Luo, Shelley Zixin Shu, Ziyu Zhou, Sebastian Otalora, Mauricio Reyes
First: 2025-10-22T13:52:19+00:00 · Latest: 2025-10-22T13:52:19+00:00
Abstract
Vision-language models (VLMs) have recently shown remarkable zero-shot performance in medical image understanding, yet their grounding ability, the extent to which textual concepts align with visual evidence, remains underexplored. In the medical domain, however, reliable grounding is essential for interpretability and clinical adoption. In this work, we present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style VLM variants. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies. Our analysis reveals that: (1) while all VLM variants demonstrate reasonable localization for large and well-defined pathologies, their performance substantially degrades for small or diffuse lesions; (2) models that are pretrained on chest X-ray-specific datasets exhibit improved alignment compared to those trained on general-domain data; and (3) the overall recognition ability and grounding ability of the model are strongly correlated. These findings underscore that current VLMs, despite their strong recognition ability, still fall short in clinically reliable grounding, highlighting the need for targeted interpretability benchmarks before deployment in medical practice. XBench code is available at https://github.com/Roypic/Benchmarkingattention
中文标题/摘要
标题:XBench:胸部X光视觉-语言解释的综合基准
视觉-语言模型(VLMs)在医学图像理解方面最近展示了令人瞩目的零样本性能,但它们的定位能力,即文本概念与视觉证据的对齐程度,仍处于未被充分探索的状态。然而,在医学领域,可靠的定位对于可解释性和临床应用至关重要。在本研究中,我们首次系统地提出了一个针对七种CLIP风格的VLM变体在胸部X光片上跨模态可解释性的基准测试。我们使用跨注意力和相似性定位图生成视觉解释,并定量评估其与放射科医生标注区域在多种病理学上的对齐程度。我们的分析表明:(1)尽管所有VLM变体在大型和明确的病理学上表现出合理的定位能力,但它们在小或弥漫性病灶上的表现显著下降;(2)在胸部X光特定数据集上预训练的模型相比在通用领域数据上训练的模型,其对齐程度有所提高;(3)模型的整体识别能力和定位能力之间存在强烈的相关性。这些发现强调,尽管当前的VLMs在识别能力方面很强,但在临床可靠的定位方面仍然存在不足,突显了在医学实践中部署前需要有针对性的可解释性基准测试的必要性。XBench代码可在https://github.com/Roypic/Benchmarkingattention获取
Summary / 总结
XBench is a comprehensive benchmark for evaluating the interpretability of vision-language models in chest radiography. It assesses the alignment between textual descriptions and visual evidence across seven VLM variants using cross-attention and similarity-based localization maps. The study finds that while VLMs perform reasonably well for large pathologies, their performance drops significantly for small or diffuse lesions. Additionally, models pretrained on chest X-ray data show better alignment with radiologist annotations compared to general-domain models, and the overall recognition ability is closely related to grounding ability, indicating the need for targeted interpretability benchmarks in medical applications.
XBench 是一个用于评估胸部X光图像中视觉-语言模型(VLMs)可解释性的基准。研究使用交叉注意力和相似性定位图生成视觉解释,并将其与放射科医生标注的区域进行比较。主要发现包括:VLMs 对于大病灶表现良好,但对于小或弥漫性病灶表现较差;以及在胸部X光数据上预训练的模型比在通用领域数据上训练的模型有更好的对齐效果。这表明当前的VLMs尽管在识别能力上很强,但在临床可靠的定位方面仍然不足,强调了在医疗实践中部署前需要更多针对性的可解释性基准。
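Quantitatively comparing a localization map against a radiologist-annotated region can be done in several ways. The sketch below uses intersection-over-union after thresholding the map. The abstract does not name the metric XBench uses, so the choice of IoU, the function name `localization_iou`, and the default threshold of 0.5 are all assumptions for illustration.

```python
from typing import List


def localization_iou(
    heatmap: List[List[float]],
    mask: List[List[bool]],
    threshold: float = 0.5,
) -> float:
    """IoU between a thresholded localization map and an annotation mask."""
    intersection = 0
    union = 0
    for heat_row, mask_row in zip(heatmap, mask):
        for heat, annotated in zip(heat_row, mask_row):
            predicted = heat >= threshold
            # Count pixels where both agree, and pixels where either fires.
            intersection += int(predicted and annotated)
            union += int(predicted or annotated)
    # An empty prediction and an empty annotation give IoU 0 by convention here.
    return intersection / union if union else 0.0
```

A metric of this kind is sensitive to lesion size: a map that is slightly too broad costs little against a large region but a lot against a small one, which is one plausible reason small or diffuse lesions are harder to localize well.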
Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
Authors: Su Ho Han, Jeongseok Hyun, Pilhyeon Lee, Minho Shim, Dongyoon Wee, Seon Joo Kim
Venue: www
First: 2025-10-22T13:42:59+00:00 · Latest: 2025-10-22T13:42:59+00:00
Comments: Project page: https://www.jshyun.me/projects/decaf
Abstract
Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via a rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.
中文标题/摘要
标题:MLLMs中的分解注意力融合用于训练-free视频推理分割
多模态大型语言模型(MLLMs)通过关注与文本查询相关的视觉标记,展示了强大的视频理解能力。为了以训练-free的方式直接适应这一能力进行定位,我们将视频推理分割重新定义为视频问答任务,并通过展开机制提取注意力图。然而,原始的注意力图是嘈杂的,并且与对象区域对齐不良。我们提出了分解注意力融合(DecAF),通过两种机制对这些图进行细化:(1)对比对象-背景融合和(2)互补视频帧融合。该方法抑制了无关激活,并增强了对象聚焦的线索,使注意力图可以直接转换为粗略的分割掩码。此外,我们引入了注意力引导的SAM2提示,以获得细粒度掩码。与现有方法联合训练MLLMs和SAM不同,我们的方法完全不需要重新训练。DecAF在训练-free方法中表现出色,并在引用和推理VOS基准上达到了与训练基于方法相当的性能。代码将在https://github.com/HYUNJS/DecAF上提供。
Summary / 总结
The research aims to enhance video reasoning and segmentation using decomposed attention fusion in multimodal large language models (MLLMs) without retraining. The method, Decomposed Attention Fusion (DecAF), refines raw attention maps through contrastive object-background fusion and complementary video-frame fusion, enabling direct conversion into coarse segmentation masks. DecAF outperforms existing training-free methods and achieves performance comparable to training-based methods on referring and reasoning video object segmentation benchmarks.
该论文利用分解注意力融合(DecAF)在多模态大型语言模型(MLLMs)中解决无需训练的视频推理分割问题。该方法通过展开(rollout)机制提取注意力图,并使用对比对象-背景融合和互补视频-帧融合来改进其与对象区域的对齐。DecAF 在引用和推理视频对象分割基准测试中优于现有的无需训练的方法,并达到了与基于训练的方法相当的性能。
Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection
Authors: Ariana Yi, Ce Zhou, Liyang Xiao, Qiben Yan
First: 2025-10-22T13:27:02+00:00 · Latest: 2025-10-22T13:27:02+00:00
Abstract
As object detection models are increasingly deployed in cyber-physical systems such as autonomous vehicles (AVs) and surveillance platforms, ensuring their security against adversarial threats is essential. While prior work has explored adversarial attacks in the image domain, those attacks in the video domain remain largely unexamined, especially in the no-box setting. In this paper, we present α-Cloak, the first no-box adversarial attack on object detectors that operates entirely through the alpha channel of RGBA videos. α-Cloak exploits the alpha channel to fuse a malicious target video with a benign video, resulting in a fused video that appears innocuous to human viewers but consistently fools object detectors. Our attack requires no access to model architecture, parameters, or outputs, and introduces no perceptible artifacts. We systematically study the support for alpha channels across common video formats and playback applications, and design a fusion algorithm that ensures visual stealth and compatibility. We evaluate α-Cloak on five state-of-the-art object detectors, a vision-language model, and a multi-modal large language model (Gemini-2.0-Flash), demonstrating a 100% attack success rate across all scenarios. Our findings reveal a previously unexplored vulnerability in video-based perception systems, highlighting the urgent need for defenses that account for the alpha channel in adversarial settings.
中文标题/摘要
标题:你能信任你看到的吗?基于alpha通道无框攻击的视频对象检测
随着目标检测模型在自动驾驶车辆(AVs)和监控平台等网络物理系统中的广泛应用,确保其在对抗性威胁下的安全性变得至关重要。尽管先前的工作已经探索了图像域中的对抗性攻击,但在视频域中的攻击,尤其是在无框设置下,仍然鲜有研究。在本文中,我们提出了α-Cloak,这是第一个完全通过RGBA视频的alpha通道对目标检测器进行的无框对抗性攻击。α-Cloak 利用alpha通道将恶意目标视频与良性视频融合,生成对人类观众看似无害但能一致欺骗目标检测器的融合视频。我们的攻击无需访问模型架构、参数或输出,且不会引入任何可感知的伪影。我们系统地研究了常见视频格式和播放应用中alpha通道的支持情况,并设计了一种融合算法,确保视觉隐身和兼容性。我们在五种最先进的目标检测器、一个视觉语言模型以及一个多模态大语言模型(Gemini-2.0-Flash)上评估了α-Cloak,证明了在所有场景中100%的攻击成功率。我们的研究结果揭示了视频感知系统中一个此前未被探索的漏洞,强调了在对抗性环境中需要考虑alpha通道的防御措施的迫切性。
Summary / 总结
This paper introduces α-Cloak, the first no-box adversarial attack on object detectors using the alpha channel of RGBA videos. Motivated by the need to ensure security of object detection models in cyber-physical systems like autonomous vehicles, the method exploits the alpha channel to fuse a malicious target video with a benign one, fooling object detectors without perceptible artifacts. The attack achieves a 100% success rate across five state-of-the-art object detectors and other models, revealing a new vulnerability in video-based perception systems.
论文通过引入α-Cloak攻击,展示了如何通过操纵RGBA视频的alpha通道来欺骗对象检测器而不产生可见的伪影。该方法不需要访问模型细节,并在各种对象检测器和模型中实现了100%的攻击成功率。这项工作揭示了基于视频的感知系统中的新漏洞,强调了在对抗性环境中需要针对alpha通道的防御措施。
A Matter of Time: Revealing the Structure of Time in Vision-Language Models
Authors: Nidham Tekaya, Manuela Waldner, Matthias Zeppelzauer
First: 2025-10-22T13:14:02+00:00 · Latest: 2025-10-22T13:14:02+00:00
Abstract
Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs by a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit "timeline" representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.
中文标题/摘要
标题:时间之谜:揭示视觉语言模型中的时间结构
大规模的视觉语言模型(VLMs)如CLIP因其泛化和表达能力强而广受欢迎。通过利用包含多样文本元数据的大规模训练数据,VLMs获得了开放词汇能力,能够解决超出其训练范围的任务。本文研究了VLMs的时间意识,评估了它们将视觉内容定位在时间中的能力。我们引入了TIME10k,这是一个包含超过10,000张具有时间真实性的图像基准数据集,并通过一种新的方法评估了37个VLMs的时间意识。我们的研究揭示了时间信息在VLM嵌入空间中沿着一个低维度的非线性流形结构化。基于这一洞察,我们提出了从嵌入空间中推导出显式的“时间线”表示的方法。这些表示模型时间及其时间顺序,从而促进时间推理任务。我们的时间线方法在计算效率方面优于基于提示的基线方法,同时具有竞争力甚至更优的准确性。所有代码和数据可在https://tekayanidham.github.io/timeline-page/获取。
Summary / 总结
This paper investigates the temporal awareness of large-scale vision-language models (VLMs) by introducing TIME10k, a benchmark dataset with temporal ground truth. The study evaluates 37 VLMs and finds that temporal information is structured in a low-dimensional, non-linear manifold within the VLM embedding space. The authors propose methods to derive explicit timeline representations, which improve temporal reasoning tasks with competitive to superior accuracy compared to a prompt-based approach, while maintaining computational efficiency. All code and data are available online.
本文通过引入包含时间真实标注的TIME10k基准数据集,评估了37个大型视觉-语言模型(VLMs)的时间感知能力。研究发现,时间信息在VLM嵌入空间中以低维度、非线性的流形结构存在。作者提出了从嵌入空间中提取显式时间线表示的方法,这些表示能够建模时间及其时间顺序,从而促进时间推理任务。这些方法在与提示基线相比时,表现出了竞争力甚至更优的准确性,同时计算效率较高。所有代码和数据可在https://tekayanidham.github.io/timeline-page/获取。
[De|Re]constructing VLMs' Reasoning in Counting
Authors: Simone Alghisi, Gabriel Roccabruna, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi
First: 2025-10-22T13:08:47+00:00 · Latest: 2025-10-22T13:08:47+00:00
Comments: This work has been submitted to the IEEE for possible publication
Abstract
Vision-Language Models (VLMs) have recently gained attention due to their competitive performance on multiple downstream tasks, achieved by following user-input instructions. However, VLMs still exhibit several limitations in visual reasoning, such as difficulties in identifying relations (e.g., spatial, temporal, and among objects), understanding temporal sequences (e.g., frames), and counting objects. In this work, we go beyond score-level benchmark evaluations of VLMs by investigating the underlying causes of their failures and proposing a targeted approach to improve their reasoning capabilities. We study the reasoning skills of seven state-of-the-art VLMs in the counting task under controlled experimental conditions. Our experiments show that VLMs are highly sensitive to the number and type of objects, their spatial arrangement, and the co-occurrence of distractors. A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space. Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%. We corroborate these findings by achieving consistent improvements on real-world datasets.
中文标题/摘要
标题:[解|重]构VLMs的计数推理
视觉-语言模型(VLMs)由于在多个下游任务中表现出竞争力,最近引起了关注,这些任务是通过遵循用户输入的指令实现的。然而,VLMs在视觉推理方面仍然存在一些局限性,例如在识别关系(例如空间、时间以及对象之间)方面存在困难,理解时间序列(例如帧)以及计数物体。在本文中,我们超越了VLMs的得分级别基准评估,通过调查其失败的根本原因并提出有针对性的方法来提高其推理能力。我们在受控实验条件下研究了七个最先进的VLMs在计数任务中的推理技能。我们的实验表明,VLMs对物体的数量和类型、它们的空间排列以及干扰物的共现非常敏感。逐层分析表明,错误是由于最后一层表示错误映射到输出空间造成的。我们的针对性训练表明,仅微调输出层可以提高准确率高达21%。我们通过在真实世界数据集上取得一致的改进来证实这些发现。
Summary / 总结
This study investigates the reasoning capabilities of Vision-Language Models (VLMs) in the counting task, focusing on their limitations in visual reasoning. By analyzing seven state-of-the-art VLMs under controlled conditions, the research reveals that VLMs are sensitive to the number and type of objects, their spatial arrangement, and the presence of distractors. The experiments show that errors are primarily due to incorrect mapping in the last layer. Targeted fine-tuning of the output layer improves accuracy by up to 21%, demonstrating the effectiveness of this approach in enhancing VLMs' reasoning abilities.
本研究通过在受控条件下分析Vision-Language Models (VLMs)的性能,探讨了其在计数任务中视觉推理的局限性。研究发现,VLMs 对物体的数量和类型、空间排列以及干扰物的存在非常敏感。通过仅微调输出层,VLMs 的准确性可以提高多达 21%。这些发现表明,定向训练可以增强 VLMs 在计数任务中的推理能力。
RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning
Authors: Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, Haoran Li
First: 2025-10-16T16:04:35+00:00 · Latest: 2025-10-22T13:03:47+00:00
Abstract
Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-view manipulation tasks successfully. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue facing challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model's shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraint in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
中文标题/摘要
标题:RoboGPT-R1:增强机器人规划的强化学习
提高具身代理的推理能力对于机器人在长期操作任务中成功完成复杂的人类指令至关重要。尽管基于监督微调(SFT)的大语言模型和视觉语言模型在规划任务中取得了成功,但在复杂现实环境中的长期操作任务中,它们仍然面临挑战,因为它们的常识和推理能力有限。鉴于通过监督微调将通用视觉语言模型对齐到机器人规划任务在泛化能力和物理理解方面存在不足,我们提出RoboGPT-R1,一种两阶段微调框架,用于具身规划。在这个框架中,监督训练通过专家序列获取基础知识,随后通过强化学习解决模型在视觉空间理解和推理方面的不足。为了在多步推理任务中实现物理理解和动作序列一致性,我们设计了一种基于规则的奖励函数,同时考虑长期性能和环境中的动作约束。基于Qwen2.5-VL-3B训练的推理模型显著优于更大规模的模型GPT-4o-mini,高出21.33%,并在具身Bench基准测试中超越其他基于Qwen2.5-VL-7B训练的工作20.33%。
Summary / 总结
The research aims to enhance robots' reasoning capabilities for long-horizon manipulation tasks. It introduces RoboGPT-R1, a two-stage fine-tuning framework in which supervised training on expert sequences is followed by reinforcement learning with a rule-based reward. The model, trained on Qwen2.5-VL-3B, outperforms GPT-4o-mini by 21.33% and other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
研究旨在通过增强机器人的推理能力来完成复杂的操作任务。RoboGPT-R1 是一种两阶段微调框架,结合监督训练和强化学习,以提高视觉空间理解和推理能力。该模型在 Qwen2.5-VL-3B 上训练,分别比 GPT-4o-mini 和其他在 Qwen2.5-VL-7B 上训练的工作在 EmbodiedBench 基准上高出 21.33% 和 20.33%。
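The abstract mentions a rule-based reward that considers both long-horizon performance and action constraints, but gives no formula. The following is only a minimal sketch of what such a reward could look like; the function name `plan_reward`, the prefix-match measure, the validity check, and the equal weights are all illustrative assumptions, not the authors' definition.

```python
from typing import Sequence, Set


def plan_reward(
    predicted: Sequence[str],
    expert: Sequence[str],
    valid_actions: Set[str],
    w_prefix: float = 0.5,  # hypothetical weight for the long-horizon term
    w_valid: float = 0.5,   # hypothetical weight for the constraint term
) -> float:
    """Score a predicted plan in [0, 1] using two hand-written rules."""
    if not expert:
        raise ValueError("expert plan must not be empty")

    # Rule 1 (long-horizon term): fraction of the expert plan that is
    # reproduced before the first deviation.
    matched = 0
    for pred_step, expert_step in zip(predicted, expert):
        if pred_step != expert_step:
            break
        matched += 1
    prefix_score = matched / len(expert)

    # Rule 2 (action-constraint term): fraction of predicted steps that
    # belong to the environment's allowed action set.
    if predicted:
        valid_score = sum(step in valid_actions for step in predicted) / len(predicted)
    else:
        valid_score = 0.0

    return w_prefix * prefix_score + w_valid * valid_score


expert_plan = ["find cup", "pick cup", "find sink", "put cup"]
allowed = set(expert_plan)
print(plan_reward(expert_plan, expert_plan, allowed))
print(plan_reward(["find cup", "fly away"], expert_plan, allowed))
```

A prefix-based term is one simple way to reward plans that stay correct for longer; other choices (for example, edit distance to the expert plan) would fit the same description.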
Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition
Authors: Yu Li, Jin Jiang, Jianhua Zhu, Shuai Peng, Baole Wei, Yuxuan Zhou, Liangcai Gao
Venue: NeurIPS 2025 spotlight
First: 2025-05-29T15:41:00+00:00 · Latest: 2025-10-22T12:23:23+00:00
Comments: Accepted by NeurIPS 2025 as a spotlight
Abstract
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layouts and variability in handwriting styles. Prior methods have faced performance bottlenecks by proposing isolated architectural modifications, making them difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves super state-of-the-art performance, outperforming the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% under zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER
中文标题/摘要
标题:Uni-MuMER:统一多任务微调视觉语言模型进行手写数学表达式识别
手写数学表达式识别(HMER)仍然是光学字符识别(OCR)中的一个持久性挑战,由于符号布局的固有自由度和手写风格的多样性。先前的方法通过提出孤立的架构修改来应对性能瓶颈,这使得它们难以以连贯的方式整合到统一框架中。同时,预训练视觉语言模型(VLMs)的最新进展展示了强大的跨任务泛化能力,为开发统一解决方案提供了有希望的基础。在本文中,我们介绍了Uni-MuMER,它完全微调了一个VLM以进行HMER任务,而不修改其架构,有效地将领域特定知识注入到通用框架中。我们的方法整合了三个数据驱动的任务:树感知链式思维(Tree-CoT)用于结构化空间推理,错误驱动学习(EDL)用于减少视觉相似字符之间的混淆,以及符号计数(SC)以提高长表达式识别的一致性。在CROHME和HME100K数据集上的实验表明,Uni-MuMER在零样本设置下实现了超越现有最佳轻量级专门模型SSAN 16.31%和顶级VLM Gemini2.5-flash 24.42%的性能。我们的数据集、模型和代码已开源于:https://github.com/BFlameSwift/Uni-MuMER
Summary / 总结
Uni-MuMER is a unified multi-task fine-tuning approach for Handwritten Mathematical Expression Recognition (HMER) using a vision-language model. It integrates three tasks: Tree-CoT for structured reasoning, EDL for character differentiation, and SC for expression consistency. Experiments on the CROHME and HME100K datasets show that Uni-MuMER outperforms the best lightweight specialized model, SSAN, by 16.31% and the top-performing VLM, Gemini2.5-flash, by 24.42% under a zero-shot setting.
Uni-MuMER 是一种使用视觉语言模型的统一多任务微调方法,用于手写数学表达式识别 (HMER)。它结合了三个数据驱动的任务:Tree-CoT 用于结构化空间推理,EDL 用于减少相似字符之间的混淆,SC 用于提高长表达式的识别一致性。在 CROHME 和 HME100K 数据集上的实验显示,在零样本设置下,Uni-MuMER 分别比最佳轻量级专用模型 SSAN 和表现最好的 VLM Gemini2.5-flash 高出 16.31% 和 24.42%。
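The abstract names Symbol Counting (SC) as an auxiliary task but does not say how counts are produced. As a rough illustration only, a count target for a LaTeX label could be derived as below; the tokenizer regex, the choice to ignore `{`, `}`, `^`, and `_`, and the name `count_symbols` are assumptions made for this sketch, not the paper's procedure.

```python
import re
from collections import Counter

# Hypothetical tokenizer: a LaTeX command (\frac, \alpha, ...) or any
# single non-whitespace character.
_TOKEN = re.compile(r"\\[A-Za-z]+|\S")

# Assumption: grouping and script markers are layout, not symbols.
_STRUCTURAL = {"{", "}", "^", "_"}


def count_symbols(latex: str) -> Counter:
    """Count visible symbols in a LaTeX expression string."""
    tokens = _TOKEN.findall(latex)
    return Counter(tok for tok in tokens if tok not in _STRUCTURAL)


counts = count_symbols(r"\frac{x+1}{x-1}")
print(counts)
```

A target of this shape gives the model an explicit tally to agree with, which is one plausible reading of how counting could encourage consistency on long expressions.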
Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Authors: Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong
First: 2025-08-21T13:42:49+00:00 · Latest: 2025-10-22T12:05:11+00:00
Abstract
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
中文标题/摘要
标题:无需反向传播的测试时自适应通过概率高斯对齐
测试时自适应(TTA)通过在推理过程中利用未标记的测试数据来增强零样本鲁棒性,从而在分布偏移下提高鲁棒性。尽管取得了显著进展,但几个挑战仍然限制了其更广泛的适用性。首先,大多数方法依赖于反向传播或迭代优化,这限制了可扩展性并阻碍了实时部署。其次,它们缺乏对类条件特征分布的显式建模。这种建模对于生成可靠决策边界和校准预测至关重要,但由于缺乏源数据和测试时的监督,这种建模仍然未被充分探索。在本文中,我们提出了一种无需反向传播的高级分布感知测试时自适应方法ADAPT。我们将TTA重新定义为一个高斯概率推理任务,通过使用逐渐更新的类均值和共享协方差矩阵来建模类条件似然性。这使得可以进行闭式、无需训练的推理。为了纠正潜在的似然偏差,我们引入了由CLIP先验和历史知识库引导的轻量级正则化。ADAPT不需要源数据、不需要梯度更新,也不需要完全访问目标数据,支持在线和归纳设置。在多种基准上的广泛实验表明,我们的方法在各种分布偏移下实现了最先进的性能,具有更好的可扩展性和鲁棒性。
Summary / 总结
The paper addresses the challenges of test-time adaptation (TTA) by proposing ADAPT, which avoids backpropagation and iterative optimization, thus improving scalability and easing real-time deployment. ADAPT models class-conditional likelihoods as Gaussians with gradually updated class means and a shared covariance matrix, enabling closed-form, training-free inference. It introduces lightweight regularization guided by CLIP priors and a historical knowledge bank to correct likelihood bias. Experiments show that ADAPT achieves state-of-the-art performance across diverse benchmarks, offering superior scalability and robustness under distribution shifts.
论文提出了一种名为ADAPT的方法,通过将TTA重新定义为高斯概率推断任务来解决其挑战。ADAPT避免了反向传播和迭代优化,从而实现可扩展性和实时部署。它使用更新后的类均值和共享协方差矩阵来建模类条件似然性,允许闭式推理。ADAPT还引入了使用CLIP先验和历史知识库的正则化来纠正似然偏差。实验表明,ADAPT在各种基准测试中表现出色,提供了在分布偏移下更优的可扩展性和鲁棒性。
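To see why Gaussian class-conditionals with a shared covariance admit closed-form inference, the textbook derivation is sketched here. The notation (feature $x$, class means $\mu_k$, shared covariance $\Sigma$, class priors $\pi_k$) is assumed for illustration, and the CLIP-guided regularization described in the abstract is not modeled; this is the standard linear discriminant result, not necessarily the authors' exact formulation.

```latex
% Class-conditional log-likelihood with a shared covariance \Sigma:
\log \mathcal{N}(x;\mu_k,\Sigma)
  = -\tfrac{1}{2}(x-\mu_k)^{\top}\Sigma^{-1}(x-\mu_k) + \text{const}
  = -\tfrac{1}{2}x^{\top}\Sigma^{-1}x
    + \mu_k^{\top}\Sigma^{-1}x
    - \tfrac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k + \text{const}.

% The quadratic term in x is identical for every class, so it cancels
% in the posterior, which becomes a softmax over linear scores:
p(y=k \mid x)
  = \frac{\exp\!\left(w_k^{\top}x + b_k\right)}
         {\sum_{j=1}^{K}\exp\!\left(w_j^{\top}x + b_j\right)},
\qquad
w_k = \Sigma^{-1}\mu_k,
\quad
b_k = -\tfrac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k + \log \pi_k .
```

Because $w_k$ and $b_k$ depend only on running estimates of $\mu_k$ and $\Sigma$, updating those statistics as test samples arrive is enough to update the classifier, with no gradient step involved.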
CARES: Context-Aware Resolution Selector for VLMs
Authors: Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz
First: 2025-10-22T11:44:31+00:00 · Latest: 2025-10-22T11:44:31+00:00
Abstract
Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens, often to 97-99% of total tokens, resulting in high compute and latency even when low-resolution images would suffice. We introduce CARES (Context-Aware Resolution Selector), a lightweight preprocessing module that, given an image-query pair, predicts the minimal sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
中文标题/摘要
标题:CARES:面向VLMs的上下文感知分辨率选择器
大型视觉-语言模型(VLMs)通常以原生或高分辨率处理图像,以在各种任务中保持有效性。这导致视觉标记通常占总标记的97-99%,即使低分辨率图像足以使用时也会导致高计算量和延迟。我们引入了CARES,即上下文感知分辨率选择器(Context-Aware Resolution Selector),这是一个轻量级的预处理模块,给定图像-查询对,预测最小的足够输入分辨率。CARES使用一个紧凑的VLM(350M)提取特征,并预测目标预训练VLM响应收敛到其峰值正确回答能力时的分辨率。尽管作为离散分类器在一组可选分辨率上进行训练,但在推理时CARES可以插值连续分辨率以实现细粒度控制。在涵盖文档和自然图像的五个跨模态基准以及各种目标VLMs中,CARES在减少高达80%的计算量的同时保持了任务性能。
Summary / 总结
CARES is a lightweight preprocessing module that predicts the minimal sufficient input resolution for vision-language models to achieve peak performance, reducing compute by up to 80% without compromising task performance across various benchmarks and VLMs.
CARES 是一种上下文感知的分辨率选择器,用于 VLMs,它预测图像-查询对的最小必要输入分辨率,通过一个紧凑的 VLM 提取特征并预测目标 VLM 的响应何时达到最佳水平。CARES 在各种基准测试中将计算量最多减少 80%,同时保持任务性能。
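The abstract says CARES is trained as a discrete classifier yet interpolates continuous resolutions at inference, without stating how. One simple possibility, shown here purely as an assumption, is to take the probability-weighted average of the candidate resolutions; the function name `interpolate_resolution` and the candidate values are hypothetical.

```python
from typing import Sequence


def interpolate_resolution(probs: Sequence[float], resolutions: Sequence[int]) -> float:
    """Return the probability-weighted mean of the candidate resolutions."""
    if not probs or len(probs) != len(resolutions):
        raise ValueError("probs and resolutions must be non-empty and equal in length")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Normalize by the total in case the scores are not exactly normalized.
    return sum(p * r for p, r in zip(probs, resolutions)) / total


# Hypothetical candidate resolutions (pixels on the longer side).
candidates = [256, 512, 1024]
print(interpolate_resolution([0.5, 0.5, 0.0], candidates))
```

With this rule, a classifier that is torn between two adjacent classes yields a resolution between them, which is the kind of fine-grained control the abstract describes.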
Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
Authors: Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, Rushuai Yang, Arctanx An, Leqi Zheng, Weijie Wang, Shawn Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo
First: 2025-10-22T09:20:09+00:00 · Latest: 2025-10-22T09:20:09+00:00
Comments: The project and benchmark are publicly available at https://github.com/microsoft/MV-RoboBench
Abstract
Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating CoT-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.
中文标题/摘要
标题:跨越视角:视觉语言模型在机器人场景中空间推理能力基准测试
视觉语言模型(VLMs)是嵌入式人工智能的关键,使机器人能够在复杂环境中感知、推理和行动。它们也是最近视觉语言行动(VLA)模型的基础。然而,大多数对VLMs的评估集中在单视角设置上,而对其整合多视角信息的能力则鲜有探索。同时,多摄像头设置在机器人平台上越来越普遍,因为它们提供了互补的视角,以减轻遮挡和深度模糊。因此,VLMs能否有效利用此类多视角输入进行机器人推理仍是一个开放问题。为了弥合这一差距,我们引入了MV-RoboBench,这是一个专门设计来评估VLMs在机器人操作中多视角空间推理能力的基准测试。MV-RoboBench包含1700个手工策划的问答项,分为八个子任务,分为两大类:空间理解与机器人执行。我们评估了一组现有的多样化VLMs,包括开源和闭源模型,以及结合了CoT启发技术的增强版本。结果显示,最先进的模型仍远低于人类表现,突显了VLMs在多视角机器人感知方面面临的巨大挑战。此外,我们的分析揭示了两个关键发现:(i)在多视角机器人场景中,空间智能与机器人任务执行呈正相关;(ii)在现有通用单视角空间理解基准测试中表现出色并不一定能成功完成我们的基准测试中评估的机器人空间任务。我们公开发布MV-RoboBench作为资源,以促进空间化视觉语言模型和VLAs的发展,不仅提供数据,还提供多视角嵌入式推理的标准评估协议。
Summary / 总结
The research aims to evaluate the spatial reasoning capabilities of vision-language models (VLMs) in multi-view robotic scenes, addressing the underexplored aspect of integrating multi-view information. The study introduces MV-RoboBench, a benchmark consisting of 1,700 QA items across eight subtasks, evaluating various VLMs including enhanced versions with CoT-inspired techniques. The results indicate that state-of-the-art models perform significantly below human levels, highlighting the challenges in multi-view robotic perception. Additionally, the study finds that spatial intelligence and robotic task execution are positively correlated, and strong performance on single-view benchmarks does not reliably translate to success in robotic spatial tasks.
研究旨在评估视觉语言模型(VLMs)在多视角机器人场景中的空间推理能力,解决多视角信息整合不足的问题。研究引入了MV-RoboBench基准,包含1,700个问答项,覆盖八个子任务,评估了包括增强版的VLMs在内的多种模型。结果表明,最先进的模型在多视角机器人感知方面的表现远低于人类水平,突显了多视角机器人感知的挑战。此外,研究发现空间智能和机器人任务执行在多视角场景中呈正相关,且在单视角基准上的强大表现并不一定能成功完成我们的基准评估的机器人空间任务。
Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey
Authors: Weifan Guan, Qinghao Hu, Aosheng Li, Jian Cheng
First: 2025-10-20T02:59:45+00:00 · Latest: 2025-10-22T09:13:46+00:00
Abstract
Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real-time performance. Addressing this tension has become a central focus of recent research. In light of the growing efforts toward more efficient and scalable VLA systems, this survey provides a systematic review of approaches for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. We categorize existing solutions into four dimensions: model architecture, perception feature, action generation, and training/inference strategies, summarizing representative techniques within each category. Finally, we discuss future trends and open challenges, highlighting directions for advancing efficient embodied intelligence.
中文标题/摘要
标题:高效视觉-语言-行动模型在嵌入式操作中的应用:系统综述
视觉-语言-行动(VLA)模型将视觉-语言模型扩展到嵌入式控制,通过将自然语言指令和视觉观察映射到机器人行动。尽管它们具有这些能力,但VLA系统由于其巨大的计算和内存需求而面临重大挑战,这与边缘平台(如车载移动操作器)的实时性能要求相冲突。解决这一矛盾已成为最近研究的中心焦点。鉴于对更高效和可扩展的VLA系统的日益努力,本文综述了提高VLA效率的方法,重点在于减少延迟、内存占用和训练及推理成本。我们按照模型架构、感知特征、行动生成和训练/推理策略四个维度对现有解决方案进行了分类,总结了每个类别中的代表性技术。最后,我们讨论了未来趋势和开放挑战,指出了推进高效嵌入式智能的方向。
Training-Free Label Space Alignment for Universal Domain Adaptation
Authors: Dujin Lee, Sojung An, Jungmyung Wi, Kuniaki Saito, Donghyun Kim
First: 2025-09-22T07:46:10+00:00 · Latest: 2025-10-22T09:13:29+00:00
Comments: 22 pages, 12 figures
Abstract
Universal domain adaptation (UniDA) transfers knowledge from a labeled source domain to an unlabeled target domain, where label spaces may differ and the target domain may contain private classes. Previous UniDA methods primarily focused on visual space alignment but often struggled with visual ambiguities due to content differences, which limited their robustness and generalizability. To overcome this, we introduce a novel approach that leverages the strong zero-shot capabilities of recent vision-language foundation models (VLMs) like CLIP, concentrating solely on label space alignment to enhance adaptation stability. CLIP can generate task-specific classifiers based only on label names. However, adapting CLIP to UniDA is challenging because the label space is not fully known in advance. In this study, we first utilize generative vision-language models to identify unknown categories in the target domain. Noise and semantic ambiguities in the discovered labels -- such as those similar to source labels (e.g., synonyms, hypernyms, hyponyms) -- complicate label alignment. To address this, we propose a training-free label-space alignment method for UniDA. Our method aligns label spaces instead of visual spaces by filtering and refining noisy labels between the domains. We then construct a universal classifier that integrates both shared knowledge and target-private class information, thereby improving generalizability under domain shifts. The results reveal that the proposed method considerably outperforms existing UniDA techniques across key DomainBed benchmarks, delivering an average improvement of +7.9% in H-score and +6.1% in H$^3$-score. Furthermore, incorporating self-training further enhances performance and achieves an additional +1.6% increment in both H- and H$^3$-scores.
Summary / 总结
The paper addresses the challenge of universal domain adaptation (UniDA) by focusing on label space alignment rather than visual space alignment. It proposes a training-free method that leverages the zero-shot capabilities of vision-language models like CLIP to identify and align label spaces between source and target domains. The method filters and refines noisy labels to construct a universal classifier that integrates shared and private class information. Experimental results show that this approach significantly outperforms existing UniDA techniques, improving H-score and H$^3$-score by 7.9% and 6.1% respectively, and further enhancing performance with self-training.
论文针对通用领域适应(UniDA)中的挑战,专注于标签空间对齐而非视觉空间对齐。它提出了一种无需训练的方法,利用如CLIP这样的视觉语言模型的零样本能力来识别和对齐源和目标域之间的标签。该方法过滤和精炼噪声标签,构建一个综合共享和私有类信息的通用分类器。实验结果表明,这种方法显著优于现有UniDA技术,H-score提高了7.9%,H$^3$-score提高了6.1%,并在关键基准测试中进一步通过自训练提升了性能。
DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection
Authors: Chiara Cappellino, Gianluca Mancusi, Matteo Mosconi, Angelo Porrello, Simone Calderara, Rita Cucchiara
Venue: NeurIPS 2025
First: 2025-03-12T11:15:34+00:00 · Latest: 2025-10-22T09:08:21+00:00
Comments: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract
Open-Vocabulary object detectors can generalize to an unrestricted set of categories through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on multiple specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in Object Detection. Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance. For more details, visit our project page: https://aimagelab.github.io/DitHub/
中文标题/摘要
标题:DitHub:增量开放词汇目标检测的模块化框架
开放词汇的目标检测器可以通过简单的文本提示泛化到不受限制的类别集合。然而,将这些模型适应稀有类别或增强其在多个专门领域的能力仍然是必不可少的。虽然最近的方法依赖于单一权重的庞大适应策略,但我们采用了模块化深度学习。我们引入了DitHub,一个旨在构建和维护高效适应模块库的框架。受到版本控制系统的影响,DitHub将专家模块视为分支,可以根据需要获取和合并。这种模块化方法使我们能够深入探索适应模块的组合特性,这是目标检测领域的首次此类研究。我们的方法在ODinW-13基准和新引入的ODinW-O基准上达到了最先进的性能,后者旨在评估类别的再现性。欲了解更多信息,请访问我们的项目页面:https://aimagelab.github.io/DitHub/
Summary / 总结
DitHub is a modular framework for incremental open-vocabulary object detection, which allows for efficient adaptation to new or rare classes through a library of expert modules managed like branches in a version control system. It achieves state-of-the-art performance on ODinW-13 and ODinW-O benchmarks, demonstrating its effectiveness in handling class reappearance and multiple specialized domains.
DitHub 是一个模块化框架,用于增量开放词汇目标检测,通过类似版本控制系统的方法,实现对稀有类别或专业领域的高效适应。它引入了一个适应模块库,可以根据需要拉取和合并这些模块,从而对这些模块的组合特性进行了详细研究。该方法在评估新类别泛化能力的 ODinW-13 和类再现的 ODinW-O 基准测试中达到了最先进的性能。
A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP
Authors: Ying Dai, Wei Yu Chen
First: 2025-10-22T07:54:18+00:00 · Latest: 2025-10-22T07:54:18+00:00
Abstract
This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition (OVSR), which leverages EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition. The proposed framework adopts a two-stage pipeline: unsupervised image segmentation followed by segment-level recognition via vision-language alignment. In the first stage, pixel-wise features extracted from EfficientNetB0 are decomposed using singular value decomposition to obtain latent representations, which are then clustered using hierarchical clustering to segment semantically meaningful regions. The number of clusters is adaptively determined by the distribution of singular values. In the second stage, the segmented regions are localized and encoded into image embeddings using the Vision Transformer backbone of CLIP. Text embeddings are precomputed using CLIP's text encoder from category-specific prompts, including a generic "something else" prompt to support open-set recognition. The image and text embeddings are concatenated and projected into a shared latent feature space via SVD to enhance cross-modal alignment. Recognition is performed by computing the softmax over the similarities between the projected image and text embeddings. The proposed method is evaluated on standard benchmarks, including COCO, ADE20K, and PASCAL VOC, achieving state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and F1-score. These results demonstrate the effectiveness, flexibility, and generalizability of the proposed framework.
中文标题/摘要
标题:一种基于EfficientNet和CLIP的无训练框架,用于开放词汇图像分割和识别
本文提出了一种新颖的无训练框架,用于开放词汇图像分割和对象识别(OVSR),该框架利用EfficientNetB0卷积神经网络进行无监督分割,并利用CLIP视觉语言模型进行开放词汇对象识别。所提出的框架采用两阶段管道:无监督图像分割,随后是通过视觉语言对齐的分割级别识别。在第一阶段,从EfficientNetB0提取的像素级特征通过奇异值分解进行分解,以获得潜在表示,然后使用层次聚类进行聚类以分割语义有意义的区域。聚类的数量通过奇异值的分布自适应确定。在第二阶段,分割的区域被定位并使用CLIP的视觉变换器主干编码为图像嵌入。文本嵌入使用CLIP的文本编码器从类别特定的提示中预计算,包括一个通用的其他提示以支持开放集识别。图像和文本嵌入通过奇异值分解连接并投影到共享的潜在特征空间以增强跨模态对齐。识别通过计算投影图像和文本嵌入之间的相似性的softmax来进行。所提出的方法在标准基准上进行了评估,包括COCO、ADE20K和PASCAL VOC,从匈牙利mIoU、精确度、召回率和F1分数方面达到了最先进的性能。这些结果表明所提出框架的有效性、灵活性和泛化能力。
Summary / 总结
The paper introduces a training-free framework for open-vocabulary image segmentation and recognition (OVSR) that uses EfficientNetB0 for unsupervised segmentation and CLIP for open-vocabulary object recognition. The method consists of two stages: first, unsupervised image segmentation using EfficientNetB0 and singular value decomposition, followed by segment-level recognition via vision-language alignment with CLIP. The framework achieves state-of-the-art performance on COCO, ADE20K, and PASCAL VOC benchmarks, as measured by Hungarian mIoU, precision, recall, and F1-score, highlighting its effectiveness and generalizability.
本文提出了一种无需训练的框架,用于开放词汇量的图像分割和识别(OVSR),该框架结合了EfficientNetB0进行无监督分割和CLIP进行开放词汇量对象识别。该框架分为两个阶段:无监督图像分割和通过视觉-语言对齐进行的分割级别识别。第一阶段中,从EfficientNetB0提取的像素级特征被分解和聚类,以分割出语义上有意义的区域。第二阶段中,CLIP的Vision Transformer将分割出的区域编码成图像嵌入,然后使用SVD与预计算的文本嵌入进行对齐。该方法在COCO、ADE20K和PASCAL VOC基准测试上取得了最先进的性能,表明其有效性和普适性。
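The final recognition step, a softmax over similarities between a segment embedding and per-category text embeddings (including a catch-all "something else" prompt), can be sketched as follows. This is a simplified illustration under stated assumptions: the vectors are toy values rather than CLIP embeddings, cosine similarity and the `temperature` value are assumed, and the SVD projection into a shared space described in the abstract is omitted.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def recognize(
    segment_embedding: Sequence[float],
    prompt_embeddings: Dict[str, Sequence[float]],
    temperature: float = 0.1,  # hypothetical scaling factor
) -> Dict[str, float]:
    """Softmax over scaled similarities; returns one probability per label."""
    scores = {
        label: cosine(segment_embedding, emb) / temperature
        for label, emb in prompt_embeddings.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    norm = sum(exps.values())
    return {label: e / norm for label, e in exps.items()}


# Toy 2-D embeddings standing in for text-encoder outputs.
prompts = {
    "cat": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "something else": [0.6, 0.6],
}
probs = recognize([0.9, 0.1], prompts)
print(max(probs, key=probs.get))
```

Including the "something else" entry in the same softmax means a segment that matches no named category can still receive most of the probability mass, which is how a catch-all prompt supports open-set recognition.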
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
Authors: Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, Yueh-Hua Wu
Venue: CVPR 2025
First: 2024-12-02T18:58:25+00:00 · Latest: 2025-10-22T07:28:21+00:00
Comments: CVPR 2025, Project page: https://byungkwanlee.github.io/VLsI-page/
Abstract
The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs across various model sizes. However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. VLsI leverages a unique, layer-wise distillation process, introducing intermediate "verbalizers" that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLMs' layer-wise progression with that of the large ones. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling, merging, or architectural changes.
中文标题/摘要
标题:VLsI:从大到小的视觉语言模型的口头化层到交互
最近,来自闭源视觉语言模型(VLMs)如GPT-4V的高质量视觉指令调优样本激增,加速了各种模型规模的开源VLMs的发布。然而,使用更大规模的模型来扩展VLMs以提高性能带来了重大的计算挑战,特别是在移动平台和机器人等资源受限设备上的部署。为了解决这一问题,我们提出了VLsI:口头化层到交互,这是一种新的2B和7B模型规模的VLM家族,它优先考虑效率而不牺牲准确性。VLsI利用了一种独特的、逐层的蒸馏过程,引入了中间的“口头化器”,将每层的特征映射到自然语言空间,使较小的VLMs能够灵活地与较大VLMs的推理过程对齐。这种方法缓解了输出模仿中常见的训练不稳定性,并超越了典型的最终层调优,通过使小VLMs的逐层进展与大VLMs的对齐来实现。我们通过十个具有挑战性的视觉语言基准验证了VLsI,实现了显著的性能提升(2B模型为11.0%,7B模型为17.4%),而无需进行模型扩展、合并或架构更改。
Summary / 总结
VLsI is a new family of vision-language models in 2B and 7B model sizes that prioritize efficiency while maintaining accuracy. It uses a layer-wise distillation process with intermediate 'verbalizers' to map features from each layer to natural language space, enabling smaller models to align with larger ones. Experiments on ten benchmarks show significant performance gains of 11.0% for 2B and 17.4% for 7B models over GPT-4V without requiring model scaling or architectural changes.
VLsI 是一种新的 2B 和 7B 模型大小的视觉语言模型系列,注重效率同时保持准确性。它使用逐层蒸馏过程和中间的‘语言化器’将每一层的特征映射到自然语言空间,使较小的模型能够与较大的模型对齐。VLsI 在十个具有挑战性的视觉语言基准测试中实现了显著的性能提升(2B 模型提升 11.0%,7B 模型提升 17.4%),无需进行模型缩放、合并或架构更改。
Unified Reinforcement and Imitation Learning for Vision-Language Models
Authors: Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
Venue: NeurIPS 2025
First: 2025-10-22T07:12:14+00:00 · Latest: 2025-10-22T07:12:14+00:00
Comments: NeurIPS 2025, Project page: https://byungkwanlee.github.io/RIL-page
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both reinforcement and imitation, empowers student models to achieve significant performance gains, making them competitive with leading closed-source VLMs. Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them.
中文标题/摘要
标题:统一强化学习和模仿学习的视觉语言模型
视觉语言模型(VLMs)已经取得了显著的进步,但其大规模往往使其在资源受限的环境中不切实际。本文介绍了统一强化学习和模仿学习(RIL),这是一种新颖且高效的训练算法,旨在创建强大的轻量级VLMs。RIL独特地结合了强化学习和对抗性模仿学习的优势。这使得较小的学生VLM不仅能模仿大型教师模型的复杂文本生成,还能通过强化信号系统地提高其生成能力。我们模仿框架的关键在于基于LLM的判别器,它能够熟练地区分学生和教师的输出,并由多个大型教师VLM的指导确保多样化的学习。这种结合强化学习和模仿学习的统一学习策略,使学生模型能够实现显著的性能提升,使其与领先的闭源VLMs竞争。在多种视觉语言基准上的广泛实验表明,RIL显著缩小了与最先进的开源和闭源VLMs的性能差距,并在某些情况下超越了它们。
Summary / 总结
This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel training algorithm for Vision-Language Models (VLMs) that combines reinforcement learning with adversarial imitation learning to create lightweight yet powerful VLMs. The key method involves using a large language model-based discriminator to distinguish between student and teacher outputs, with guidance from multiple large teacher VLMs. Experiments show that RIL significantly improves the performance of student models, making them competitive with leading VLMs.
本文提出了一种名为统一强化和模仿学习(RIL)的新颖训练算法,该算法结合了强化学习和对抗性模仿学习,旨在创建既轻量级又强大的视觉-语言模型(VLMs)。关键组件是一个基于LLM的判别器,它帮助学生模型从模仿和强化信号中学习,从而实现显著的性能提升。实验表明,RIL在各种视觉-语言基准测试中缩小了与最先进的VLMs之间的性能差距,并在某些情况下超越了它们。
Hierarchical DLO Routing with Reinforcement Learning and In-Context Vision-language Models
Authors: Mingen Li, Houjian Yu, Yixuan Huang, Youngjin Hong, Changhyun Choi
First: 2025-10-22T05:57:23+00:00 · Latest: 2025-10-22T05:57:23+00:00
Comments: 8 pages, 6 figures, 3 tables
Abstract
Long-horizon routing tasks of deformable linear objects (DLOs), such as cables and ropes, are common in industrial assembly lines and everyday life. These tasks are particularly challenging because they require robots to manipulate DLOs with long-horizon planning and reliable skill execution. Successfully completing such tasks demands adapting to their nonlinear dynamics, decomposing abstract routing goals, and generating multi-step plans composed of multiple skills, all of which require accurate high-level reasoning during execution. In this paper, we propose a fully autonomous hierarchical framework for solving challenging DLO routing tasks. Given an implicit or explicit routing goal expressed in language, our framework leverages vision-language models (VLMs) for in-context high-level reasoning to synthesize feasible plans, which are then executed by low-level skills trained via reinforcement learning. To improve robustness in long horizons, we further introduce a failure recovery mechanism that reorients the DLO into insertion-feasible states. Our approach generalizes to diverse scenes involving object attributes, spatial descriptions, as well as implicit language commands. It outperforms the next best baseline method by nearly 50% and achieves an overall success rate of 92.5% across long-horizon routing scenarios.
中文标题/摘要
标题:基于强化学习和上下文视觉语言模型的分层DLO路由
工业装配线和日常生活中常见的可变形线性对象(DLO)的长时距路由任务,如电缆和绳索,具有挑战性,因为它们要求机器人进行长时距规划和可靠的技能执行。成功完成这些任务需要适应非线性动力学、分解抽象的路由目标并生成由多个技能组成的多步计划,所有这些都需要在执行过程中进行准确的高层次推理。本文提出了一种完全自主的分层框架,用于解决具有挑战性的DLO路由任务。给定语言表达的隐式或显式路由目标,我们的框架利用上下文视觉语言模型(VLMs)进行高层次推理以合成可行的计划,然后由通过强化学习训练的低级技能执行。为了提高长时距中的鲁棒性,我们进一步引入了一种故障恢复机制,重新定向DLO进入插入可行状态。我们的方法适用于涉及对象属性、空间描述以及隐式语言命令的多种场景。它在长时距路由场景中的性能优于下一个最佳基线方法近50%,总体成功率为92.5%。
Summary / 总结
This paper addresses the challenge of long-horizon routing tasks for deformable linear objects (DLOs) using a hierarchical framework that combines reinforcement learning and in-context vision-language models. The framework synthesizes feasible plans from high-level reasoning provided by VLMs, which are then executed by low-level skills trained via reinforcement learning. A failure recovery mechanism that reorients the DLO into insertion-feasible states improves robustness over long horizons. The approach generalizes across scenes involving object attributes, spatial descriptions, and implicit language commands, outperforming the next-best baseline by nearly 50% and achieving a 92.5% overall success rate in long-horizon routing scenarios.
研究提出了一种结合强化学习和上下文视觉语言模型的层次框架,以解决变形线性物体(DLO)的长期路径规划任务。方法利用视觉语言模型进行高层次推理,合成可行的计划,然后由低级技能执行。该方法包含一个故障恢复机制,以提高长期过程中的鲁棒性。该框架在各种场景中表现出强大的泛化能力,并在长期路径规划场景中比现有方法高出近50%,成功率达到92.5%。
Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning
Authors: Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, Kai Chen
First: 2025-07-22T17:59:34+00:00 · Latest: 2025-10-22T05:34:50+00:00
Abstract
Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow thinking ability because the rollout space is restricted by its initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. The LVLM then learns slow-thinking reasoning ability from the obtained reasoning trajectories using propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 with 8B and 38B sizes show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% on average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.
中文标题/摘要
标题:半离策强化学习在视觉语言慢思考推理中的应用
增强大型视觉语言模型(LVLM)的视觉慢思考推理能力对于解决复杂的多模态任务至关重要。然而,由于LVLM主要通过视觉语言对齐进行训练,采用在策强化学习(RL)来发展慢思考能力是困难的,因为其展开空间受限于其初始能力。离策RL提供了一种超越当前策略的方法,但直接从外部模型提取轨迹可能会由于模型间视觉感知能力的不匹配而导致视觉幻觉。为了解决这些问题,本文提出SOPHIA,一种简单且可扩展的半离策RL方法,用于视觉语言慢思考推理。SOPHIA通过结合可训练LVLM的在策视觉理解与语言模型的离策慢思考推理构建半离策行为模型,基于推理结果分配奖励,并反向传播视觉奖励。然后,LVLM使用离策RL算法中的传播奖励学习慢思考推理能力。使用InternVL2.5和InternVL3.0(8B和38B大小)进行的大量实验表明SOPHIA的有效性。值得注意的是,SOPHIA将InternVL3.0-38B的性能平均提高了8.50%,在多个多模态推理基准测试中达到开源LVLM的最先进性能,并且在MathVision和OlympiadBench等具有挑战性的任务中甚至超过了某些闭源模型(如GPT-4.1),分别达到49.08%和49.95%的pass@1准确率。分析表明,SOPHIA优于监督微调和直接在策RL方法,为后续的在策训练提供了更好的策略初始化。
Summary / 总结
This paper addresses the challenge of enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning by proposing SOPHIA, a semi-off-policy reinforcement learning method. SOPHIA combines on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, and uses outcome-based rewards to train the LVLM via off-policy RL algorithms. Experiments with InternVL2.5 and InternVL3.0 show that SOPHIA significantly improves the slow-thinking reasoning ability of LVLMs, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks and outperforming some closed-source models on MathVision and OlympiadBench.
SOPHIA 是一种半离策强化学习方法,旨在增强大型视觉语言模型(LVLM)的慢思考推理能力。它结合了可训练的 LVLM 的在线策略视觉理解以及语言模型的离策慢思考推理,并使用基于结果的奖励通过离策 RL 算法训练 LVLM。实验表明,SOPHIA 使 InternVL3.0-38B 的慢思考推理能力提高了 8.50%,在多个跨模态推理基准上达到了最先进的性能,并且在 MathVision 和 OlympiadBench 上甚至超过了某些闭源模型。
Probing Perceptual Constancy in Large Vision-Language Models
Authors: Haoran Sun, Bingyang Wang, Suyang Yu, Yijiang Li, Qingying Gao, Haiyun Lyu, Hokin Deng, Dezhi Luo
Venue: ICML 2025
First: 2025-02-14T16:31:43+00:00 · Latest: 2025-10-22T03:57:59+00:00
Comments: Accepted by ICML 2025 Workshop Building Physically Plausible World Models
Abstract
Perceptual constancy is the ability to maintain stable perceptions of objects despite changes in sensory input, such as variations in distance, angle, or lighting. This ability is crucial for visual understanding in a dynamic world. Here, we explored such ability in current Vision Language Models (VLMs). In this study, we evaluated 155 VLMs using 236 experiments across three domains: color, size, and shape constancy. The experiments included single-image and video adaptations of classic cognitive tasks, along with novel tasks in in-the-wild conditions. We found significant variability in VLM performance across these domains, with model performance in shape constancy clearly dissociated from that of color and size constancy.
中文标题/摘要
标题:探究大型视觉语言模型的知觉恒常性
知觉恒常性是指在感官输入发生变化(如距离、角度或光照变化)时,维持对物体稳定感知的能力。这种能力对于动态世界中的视觉理解至关重要。在这里,我们探讨了当前视觉语言模型(VLMs)的这种能力。在本研究中,我们使用236项实验评估了155个VLMs,涵盖三个领域:颜色、大小和形状恒常性。实验包括单张图像和视频改编的经典认知任务,以及野外条件下的新型任务。我们发现这些领域中VLM的表现存在显著差异,形状恒常性的模型表现与颜色和大小恒常性明显不同。
Summary / 总结
This study investigates the perceptual constancy abilities of Vision Language Models (VLMs) across color, size, and shape constancy through 236 experiments involving 155 models. The research reveals that VLMs exhibit varying performance across different constancy domains, with particularly distinct results in shape constancy compared to color and size constancy.
研究通过评估155个模型在236项实验中的表现,探讨了大型视觉语言模型(VLM)在颜色、大小和形状恒常性方面的能力。研究结果表明,VLMs在这三个领域中的表现存在差异,特别是在形状恒常性方面的表现与其他两个领域有明显区别。
PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning
Authors: Fengyuan Sun, Hui Chen, Xinhao Xu, Dandan Zheng, Jingdong Chen, Jun Zhou, Jungong Han, Guiguang Ding
First: 2025-10-22T02:41:07+00:00 · Latest: 2025-10-22T02:41:07+00:00
Abstract
While multi-modal large language models (MLLMs) have made significant progress in recent years, the issue of hallucinations remains a major challenge. To mitigate this phenomenon, existing solutions either introduce additional data for further training or incorporate external or internal information during inference. However, these approaches inevitably introduce extra computational costs. In this paper, we observe that hallucinations in MLLMs are strongly associated with insufficient attention allocated to visual tokens. In particular, the presence of redundant visual tokens disperses the model's attention, preventing it from focusing on the most informative ones. As a result, critical visual cues are often under-attended, which in turn exacerbates the occurrence of hallucinations. Building on this observation, we propose PruneHal, a training-free, simple yet effective method that leverages adaptive KV cache pruning to enhance the model's focus on critical visual information, thereby mitigating hallucinations. To the best of our knowledge, we are the first to apply token pruning for hallucination mitigation in MLLMs. Notably, our method does not require additional training and incurs nearly no extra inference cost. Moreover, PruneHal is model-agnostic and can be seamlessly integrated with different decoding strategies, including those specifically designed for hallucination mitigation. We evaluate PruneHal on several widely used hallucination evaluation benchmarks using four mainstream MLLMs, achieving robust and outstanding results that highlight the effectiveness and superiority of our method. Our code will be publicly available.
中文标题/摘要
标题:PruneHal:通过自适应KV缓存剪枝减少多模态大型语言模型中的幻觉
尽管多模态大型语言模型(MLLMs)在近年来取得了显著进展,幻觉问题仍然是一个重大挑战。为减轻这一现象,现有解决方案要么引入额外数据进行进一步训练,要么在推理过程中引入外部或内部信息。然而,这些方法不可避免地引入了额外的计算成本。在本文中,我们观察到MLLMs中的幻觉与视觉标记分配的关注不足密切相关。特别是,冗余视觉标记的出现分散了模型的注意力,使其无法专注于最相关信息。结果,关键视觉线索往往被忽视,从而加剧了幻觉的发生。基于这一观察,我们提出了一种名为PruneHal的无训练、简单而有效的方法,利用自适应KV缓存剪枝来增强模型对关键视觉信息的关注,从而减轻幻觉。据我们所知,我们是首次将标记剪枝应用于MLLMs中的幻觉缓解。值得注意的是,我们的方法不需要额外的训练,几乎不增加推理成本。此外,PruneHal具有模型无关性,可以无缝集成到不同的解码策略中,包括那些专门设计用于幻觉缓解的策略。我们在四个主流MLLMs上使用几个广泛使用的幻觉评估基准对PruneHal进行了评估,取得了稳健且出色的结果,突显了我们方法的有效性和优越性。我们的代码将公开发布。
Summary / 总结
PruneHal is a training-free method that uses adaptive KV cache pruning to reduce hallucinations in multi-modal large language models (MLLMs) by enhancing the model's focus on critical visual information. It addresses the issue of insufficient attention to visual tokens, which can lead to hallucinations. Experiments on various benchmarks with four MLLMs show that PruneHal effectively mitigates hallucinations with minimal computational overhead.
PruneHal 是一种无需训练的方法,通过自适应 KV 缓存剪枝来减少多模态大型语言模型(MLLMs)中的幻觉现象。它通过增强模型对关键视觉信息的关注来减轻幻觉,无需额外训练且几乎不增加推理成本。在四个主流 MLLMs 上对多种基准测试的评估显示,PruneHal 达到了稳健且优越的结果,使其成为一种适用于幻觉缓解的模型无感知解决方案。
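The idea of pruning redundant visual tokens from a KV cache can be illustrated with a minimal sketch. The abstract does not state PruneHal's actual pruning criterion or how it adapts, so everything below is an assumption made for illustration: the function names `prune_visual_tokens` and `apply_pruning` are hypothetical, each cache position is assumed to carry one aggregated attention score, and a fixed `keep_ratio` stands in for whatever adaptive rule the paper uses.

```python
import math


def prune_visual_tokens(attn_scores, visual_positions, keep_ratio=0.5):
    """Return the cache positions to keep, in their original order.

    attn_scores: one aggregated attention score per cache position
        (assumed here; how scores are aggregated is not given in the abstract).
    visual_positions: indices of the positions that hold visual tokens.
    keep_ratio: fraction of visual tokens to retain (hypothetical fixed value).
    """
    visual = set(visual_positions)
    if not visual:
        return list(range(len(attn_scores)))
    n_keep = max(1, math.ceil(keep_ratio * len(visual)))
    # Rank visual tokens by the attention they receive, highest first.
    ranked = sorted(visual, key=lambda i: attn_scores[i], reverse=True)
    kept_visual = set(ranked[:n_keep])
    # Text tokens are always kept; visual tokens only if highly attended.
    return [i for i in range(len(attn_scores))
            if i not in visual or i in kept_visual]


def apply_pruning(kv_cache, keep_idx):
    """Drop every cached (key, value) entry whose position is not kept."""
    return [kv_cache[i] for i in keep_idx]


# Toy example: positions 1-4 are visual tokens, 0 and 5 are text tokens.
scores = [0.30, 0.02, 0.20, 0.01, 0.15, 0.32]
keep = prune_visual_tokens(scores, visual_positions=[1, 2, 3, 4])
cache = [("k%d" % i, "v%d" % i) for i in range(6)]
pruned = apply_pruning(cache, keep)
```

In this toy case the two least-attended visual tokens (positions 1 and 3) are dropped, so the remaining attention is spread over fewer, more informative visual entries.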
Preliminary Use of Vision Language Model Driven Extraction of Mouse Behavior Towards Understanding Fear Expression
Authors: Paimon Goulart, Jordan Steinhauser, Kylene Shuler, Edward Korzus, Jia Chen, Evangelos E. Papalexakis
First: 2025-10-22T01:33:39+00:00 · Latest: 2025-10-22T01:33:39+00:00
Abstract
Integration of diverse data will be a pivotal step towards improving scientific explorations in many disciplines. This work establishes a vision-language model (VLM) that encodes videos with text input in order to classify various behaviors of a mouse existing in and engaging with its environment. Importantly, this model produces a behavioral vector over time for each subject and for each session the subject undergoes. The output is a valuable dataset that few programs are able to produce with as high accuracy and with minimal user input. Specifically, we use the open-source Qwen2.5-VL model and enhance its performance through prompts, in-context learning (ICL) with labeled examples, and frame-level preprocessing. We found that each of these methods contributes to improved classification, and that combining them results in strong F1 scores across all behaviors, including rare classes like freezing and fleeing, without any model fine-tuning. Overall, this model will support interdisciplinary researchers studying mouse behavior by enabling them to integrate diverse behavioral features, measured across multiple time points and environments, into a comprehensive dataset that can address complex research questions.
中文标题/摘要
标题:视觉语言模型驱动的小鼠行为提取初步应用以理解恐惧表达
整合多种数据将是许多学科中提高科学研究探索的关键步骤。本研究建立了一个视觉语言模型(VLM),该模型通过文本输入编码视频,以分类小鼠在其环境中的各种行为。重要的是,该模型为每个受试者和每个受试者经历的每个会话生成了随时间变化的行为向量。输出是一个有价值的数据库集,很少有程序能够以如此高的准确度生成,并且需要最少的用户输入。具体而言,我们使用开源的Qwen2.5-VL模型,并通过提示、带标签示例的上下文学习(ICL)和帧级预处理来增强其性能。我们发现这些方法中的每一种都提高了分类效果,而将它们结合起来则在所有行为上(包括冻结和逃跑等稀有类别)都产生了强大的F1分数,无需任何模型微调。总体而言,该模型将支持研究小鼠行为的跨学科研究人员,使他们能够整合多个时间点和环境中的多种行为特征,形成一个可以解决复杂研究问题的综合数据库集。
Summary / 总结
This study aims to enhance the understanding of mouse fear expression by integrating video and text data using a vision-language model. The model, enhanced with prompts, in-context learning, and frame-level preprocessing, successfully classifies various mouse behaviors with high accuracy. The method produces detailed behavioral vectors over time, which is a valuable dataset for interdisciplinary research, especially for rare behaviors like freezing and fleeing, without requiring model fine-tuning.
这项工作引入了一个视觉-语言模型(VLM)来分类小鼠在环境中的行为,为每个受试者生成一个随时间变化的行为向量。该模型使用了Qwen2.5-VL,并通过提示、上下文学习和帧级预处理进行增强,实现了对各种行为(包括罕见行为如静止和逃跑)的强大F1分数,无需微调。这支持跨学科研究,通过整合多个时间点和环境下的多种行为特征,形成一个全面的数据集。
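Because the abstract reports F1 per behavior, including rare classes, it may help to see how per-class F1 is computed from frame-level labels. The sketch below uses the standard definition F1 = 2TP / (2TP + FP + FN); the behavior labels and the toy sequences are made up for illustration and are not data from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong for this frame
            fn[truth] += 1  # true label was missed for this frame
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical frame-level labels for one short session.
truth = ["walk", "walk", "freeze", "walk", "flee", "walk"]
preds = ["walk", "walk", "freeze", "freeze", "walk", "walk"]
scores = per_class_f1(truth, preds)
overall = macro_f1(truth, preds)
```

Here the common class `walk` scores 0.75 while the missed rare class `flee` scores 0, which is why per-class and macro-averaged F1 are more informative than plain accuracy when behaviors are imbalanced.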