arXiv 论文速递

2025-11-17 03:26
Snapshot: 20251117_0326
Querying Labeled Time Series Data with Scenario Programs
Authors: Edward Kim, Devan Shanker, Varun Bharadwaj, Hongbeen Park, Jinkyu Kim, Hazem Torfah, Daniel J Fremont, Sanjit A Seshia
Venue: NASA Formal Methods Conference 2025
First: 2025-11-13T18:52:27+00:00 · Latest: 2025-11-13T18:52:27+00:00
Abstract
Simulation-based testing has become a crucial complement to road testing for ensuring the safety of cyber-physical systems (CPS). As a result, significant research efforts have been directed toward identifying failure scenarios within simulation environments. However, a critical question remains. Are the autonomous vehicle (AV) failure scenarios discovered in simulation reproducible on actual systems in the real world? The sim-to-real gap caused by differences between simulated and real sensor data means that failure scenarios identified in simulation might either be artifacts of synthetic sensor data or actual issues that also occur with real sensor data. To address this, an effective approach to validating simulated failure scenarios is to locate occurrences of these scenarios within real-world datasets and verify whether the failure persists on the datasets. To this end, we introduce a formal definition of how labeled time series sensor data can match an abstract scenario, represented as a scenario program using the Scenic probabilistic programming language. We present a querying algorithm that, given a scenario program and a labeled dataset, identifies the subset of data that matches the specified scenario. Our experiment shows that our algorithm is more accurate and orders of magnitude faster in querying scenarios than the state-of-the-art commercial vision large language models, and can scale with the duration of queried time series data.
中文标题/摘要
标题:使用情景程序查询标记的时间序列数据
基于仿真的测试已成为道路测试的重要补充,用于确保信息物理系统(CPS)的安全。因此,大量研究工作集中在识别仿真环境中的故障情景。然而,一个关键问题仍然存在:在仿真中发现的自动驾驶汽车(AV)故障情景在现实世界的实际系统中是否可重现?由仿真与真实传感器数据之间的差异造成的仿真到现实差距(sim-to-real gap)意味着,仿真中发现的故障情景可能只是合成传感器数据的产物,也可能是在真实传感器数据中同样存在的实际问题。为了解决这一问题,验证仿真故障情景的一种有效方法是在真实数据集中定位这些情景的出现位置,并验证故障是否在这些数据上依然存在。为此,我们给出了一个形式化定义,说明带标签的时间序列传感器数据如何与以Scenic概率编程语言编写的情景程序所表示的抽象情景相匹配。我们提出了一种查询算法,给定一个情景程序和一个带标签的数据集,该算法可以识别出与指定情景匹配的数据子集。我们的实验表明,与最先进的商业视觉大语言模型相比,我们的算法在查询情景方面更准确且快了几个数量级,并且可以随所查询时间序列数据的时长扩展。
Summary / 总结
The research aims to validate failure scenarios found in autonomous vehicle simulation by locating matching occurrences in labeled real-world sensor data. The method formally defines when labeled time series data matches a scenario written as a Scenic scenario program, and provides a querying algorithm that returns the matching subset of a labeled dataset. In the reported experiment, the algorithm is more accurate and orders of magnitude faster than state-of-the-art commercial vision large language models, and it scales with the duration of the queried time series data.
研究旨在通过查询实际数据集来验证自动驾驶汽车(AV)的模拟故障场景。方法是使用Scenic定义场景程序,并开发查询算法来识别匹配的数据。关键发现表明,该算法比最先进的商业视觉大语言模型更准确,且快几个数量级,并且能够随所查询时间序列数据的时长扩展。
Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals
Authors: Shruti Singh Baghel, Yash Pratap Singh Rathore, Sushovan Jena, Anurag Pradhan, Amit Shukla, Arnav Bhavsar, Pawan Goyal
First: 2025-11-13T18:45:39+00:00 · Latest: 2025-11-13T18:45:39+00:00
Comments: 8 pages
Abstract
Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions but their high memory, computation, and deployment demands hinder practical use particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor), and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework evaluating spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework focusing on mobility-critical information. Additionally, we conduct a systematic evaluation of four different prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.
中文标题/摘要
标题:轻量级VLM和自定义LLM评估向盲人和低视力用户无障碍的迈进
大型视觉-语言模型(VLMs)在理解和生成视频描述方面表现出色,但其高内存、计算和部署需求阻碍了其在盲人和低视力(BLV)用户中的实际应用,这些用户依赖于详细且上下文相关的描述。为了研究模型规模对无障碍描述质量的影响,我们评估了具有500M和2.2B参数的SmolVLM2变体在两个不同数据集上的表现:AVCaps(户外)和Charades(室内)。在本文中,我们引入了两个专门为BLV无障碍评估设计的新颖评估框架:多上下文BLV框架,评估空间定向、社会互动、动作事件和氛围上下文;以及导航辅助框架,专注于移动性关键信息。此外,我们系统地评估了四种不同的提示设计策略,并在智能手机上部署了这两种模型,评估了FP32和INT8精度变体,以评估资源受限的移动设备上的实际性能限制。
Summary / 总结
This study examines how model size affects accessibility-focused video description quality for blind and low-vision (BLV) users by comparing SmolVLM2 variants with 500M and 2.2B parameters on the AVCaps (outdoor) and Charades (indoor) datasets. It introduces two evaluation frameworks: the Multi-Context BLV Framework, which covers spatial orientation, social interaction, action events, and ambience contexts, and the Navigational Assistance Framework, which focuses on mobility-critical information. The study also compares four prompt design strategies and deploys both models on a smartphone in FP32 and INT8 precision to assess real-world performance constraints on resource-limited mobile devices.
本研究旨在通过评估SmolVLM2的500M和2.2B参数变体在AVCaps和Charades数据集上的表现,提高大型视觉语言模型(VLMs)对盲和低视力用户的可访问性。引入了多上下文BLV和导航辅助两个新的评估框架,分别评估空间定向、社交互动、动作事件、环境氛围和移动关键信息。研究还评估了四种不同的提示设计策略,并在智能手机上测试了FP32和INT8精度变体,以了解资源受限设备上的实际性能限制。
From 2D to 3D Without Extra Baggage: Data-Efficient Cancer Detection in Digital Breast Tomosynthesis
Authors: Yen Nhi Truong Vu, Dan Guo, Sripad Joshi, Harshit Kumar, Jason Su, Thomas Paul Matthews
Venue: In Machine Learning for Health (ML4H). PMLR 297, 2025
First: 2025-11-13T18:35:45+00:00 · Latest: 2025-11-13T18:35:45+00:00
Abstract
Digital Breast Tomosynthesis (DBT) enhances finding visibility for breast cancer detection by providing volumetric information that reduces the impact of overlapping tissues; however, limited annotated data has constrained the development of deep learning models for DBT. To address data scarcity, existing methods attempt to reuse 2D full-field digital mammography (FFDM) models by either flattening DBT volumes or processing slices individually, thus discarding volumetric information. Alternatively, 3D reasoning approaches introduce complex architectures that require more DBT training data. Tackling these drawbacks, we propose M&M-3D, an architecture that enables learnable 3D reasoning while remaining parameter-free relative to its FFDM counterpart, M&M. M&M-3D constructs malignancy-guided 3D features, and 3D reasoning is learned through repeatedly mixing these 3D features with slice-level information. This is achieved by modifying operations in M&M without adding parameters, thus enabling direct weight transfer from FFDM. Extensive experiments show that M&M-3D surpasses 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification. Additionally, M&M-3D outperforms complex 3D reasoning variants by 20-47% for localization and 2-10% for classification in the low-data regime, while matching their performance in high-data regime. On the popular BCS-DBT benchmark, M&M-3D outperforms previous top baseline by 4% for classification and 10% for localization.
中文标题/摘要
标题:从2D到3D无需额外负担:数字乳腺断层成像中的数据高效癌症检测
数字乳腺断层成像(DBT)通过提供体积信息来增强乳腺癌检测的可见性,从而减少重叠组织的影响;然而,有限的标注数据限制了DBT深度学习模型的发展。为解决数据稀缺问题,现有方法尝试通过将DBT体积展平或逐片处理来重用全视野数字乳腺摄影(FFDM)模型,从而丢弃体积信息。或者,3D推理方法引入了复杂架构,需要更多DBT训练数据。为克服这些缺点,我们提出了一种M&M-3D架构,该架构能够在保持参数数量与FFDM模型M&M相当的情况下实现可学习的3D推理。M&M-3D构建了恶性肿瘤导向的3D特征,并通过反复将这些3D特征与切片级信息混合来学习3D推理。这通过修改M&M的操作实现,而不增加参数,从而允许直接从FFDM转移权重。大量实验表明,M&M-3D在定位和分类上的表现分别优于2D投影和基于切片的3D方法11-54%和3-10%。此外,在数据稀缺的情况下,M&M-3D在定位和分类上的表现分别优于复杂3D推理变体20-47%和2-10%,而在数据丰富的情况下,其性能与这些方法相当。在流行的BCS-DBT基准测试中,M&M-3D在分类上的表现优于之前的最佳基线4%,在定位上的表现优于10%。
Summary / 总结
The paper addresses the challenge of limited annotated data in Digital Breast Tomosynthesis (DBT) for breast cancer detection. It proposes M&M-3D, an architecture that enables learnable 3D reasoning without increasing parameters compared to its 2D counterpart, M&M. M&M-3D constructs malignancy-guided 3D features and learns 3D reasoning through mixing these features with slice-level information. Experiments show that M&M-3D outperforms 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification, and surpasses complex 3D reasoning variants by 20-47% for localization and 2-10% for classification in the low-data regime, while matching their performance in high-data regime. On the BCS-DBT benchmark, M&M-3D outperforms previous top baselines by 4% for classification and 10% for localization.
论文针对数字乳腺断层成像(DBT)中由于标注数据有限导致的乳腺癌检测挑战,提出了一种名为M&M-3D的架构,该架构在参数上与2D的M&M模型相当,但能够进行可学习的3D推理。M&M-3D通过构建恶性肿瘤导向的3D特征,并通过将这些特征与切片级信息混合来学习3D推理。实验结果显示,M&M-3D在定位和分类上的表现分别比2D投影和3D切片方法高出11-54%和3-10%,在低数据条件下,M&M-3D比复杂的3D推理变体高出20-47%的定位和2-10%的分类表现,而在高数据条件下,其表现与这些复杂模型相当。在BCS-DBT基准上,M&M-3D在分类和定位上的表现分别优于之前的最佳基线4%和10%。
Mined Prompting and Metadata-Guided Generation for Wound Care Visual Question Answering
Authors: Bavana Durgapraveen, Sornaraj Sivasankaran, Abhinand Balachandran, Sriram Rajkumar
First: 2025-11-13T18:28:58+00:00 · Latest: 2025-11-13T18:28:58+00:00
Comments: 2 figures, 11 pages
Abstract
The rapid expansion of asynchronous remote care has intensified provider workload, creating demand for AI systems that can assist clinicians in managing patient queries more efficiently. The MEDIQA-WV 2025 shared task addresses this challenge by focusing on generating free-text responses to wound care queries paired with images. In this work, we present two complementary approaches developed for the English track. The first leverages a mined prompting strategy, where training data is embedded and the top-k most similar examples are retrieved to serve as few-shot demonstrations during generation. The second approach builds on a metadata ablation study, which identified four metadata attributes that consistently enhance response quality. We train classifiers to predict these attributes for test cases and incorporate them into the generation pipeline, dynamically adjusting outputs based on prediction confidence. Experimental results demonstrate that mined prompting improves response relevance, while metadata-guided generation further refines clinical precision. Together, these methods highlight promising directions for developing AI-driven tools that can provide reliable and efficient wound care support.
中文标题/摘要
标题:挖掘提示和元数据引导生成在伤口护理视觉问答中的应用
异步远程护理的迅速扩展加剧了提供者的负担,从而产生了需要能够帮助临床医生更高效地管理患者查询的AI系统的需求。MEDIQA-WV 2025 共享任务通过专注于生成配图的伤口护理查询的自由文本回答来应对这一挑战。在本文中,我们为英语赛道提出了两种互补的方法。第一种方法利用了挖掘提示策略,其中训练数据被嵌入,并检索最相似的前k个示例作为生成过程中的少样本示范。第二种方法基于元数据消融研究,该研究确定了四个始终能提升回答质量的元数据属性。我们训练分类器来预测这些属性在测试案例中的情况,并将它们整合到生成管道中,根据预测置信度动态调整输出。实验结果表明,挖掘提示提高了回答的相关性,而元数据引导的生成进一步提高了临床精度。这些方法共同展示了开发能够提供可靠和高效伤口护理支持的AI驱动工具的有希望的方向。
Summary / 总结
This study addresses the need for AI systems to assist clinicians in managing wound care queries more efficiently. Two approaches were developed: one uses mined prompting to retrieve similar examples for generating relevant responses, and the other incorporates metadata to enhance clinical precision. Experimental results show that mined prompting improves response relevance, while metadata-guided generation further refines clinical precision, demonstrating promising directions for AI-driven wound care support.
该研究旨在通过AI系统帮助临床医生更高效地管理伤口护理查询。开发了两种方法:一种是使用挖掘提示检索类似示例以生成相关响应,另一种是利用元数据预测并提升响应质量。实验结果表明,挖掘提示提高了响应的相关性,而元数据引导的生成进一步提升了临床精度,展示了AI驱动的伤口护理支持的有前途的方向。
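The retrieval step behind mined prompting can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the toy embeddings, the `(embedding, question, answer)` tuple layout, and the function names `mine_demonstrations` and `build_prompt` are all assumptions made here; a real system would embed text and images with a learned encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mine_demonstrations(query_vec, train_set, k=2):
    """Return the k training examples most similar to the query.

    train_set: list of (embedding, question, answer) tuples (hypothetical layout).
    """
    ranked = sorted(train_set, key=lambda ex: cosine(query_vec, ex[0]), reverse=True)
    return ranked[:k]

def build_prompt(query_text, demos):
    """Place the retrieved examples ahead of the query as few-shot demonstrations."""
    parts = [f"Q: {q}\nA: {a}" for _, q, a in demos]
    parts.append(f"Q: {query_text}\nA:")
    return "\n\n".join(parts)

# Toy embeddings and placeholder texts, for illustration only.
train = [
    ([1.0, 0.0, 0.0], "question A", "answer A"),
    ([0.9, 0.1, 0.0], "question B", "answer B"),
    ([0.0, 1.0, 0.0], "question C", "answer C"),
]
demos = mine_demonstrations([1.0, 0.05, 0.0], train, k=2)
prompt = build_prompt("new question", demos)
```

The two nearest examples become the demonstrations, and the unanswered query is appended last so the generator completes it.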
Impact of Layer Norm on Memorization and Generalization in Transformers
Authors: Rishi Singhal, Jung-Eun Kim
Venue: NeurIPS 2025
First: 2025-11-13T18:07:07+00:00 · Latest: 2025-11-13T18:07:07+00:00
Comments: NeurIPS 2025
Abstract
Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning for Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers, it impacts memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models, it effectively mitigates memorization by restoring genuine labels. We further identify that LayerNorm in the early layers is more critical than in the middle or later layers, and that its influence varies across Pre- and Post-LayerNorm models. We validate these findings on 13 models across 6 vision and language datasets. These insights shed new light on the role of LayerNorm in shaping memorization and learning in transformers.
中文标题/摘要
标题:层规范化对Transformer中记忆与泛化的影响
层规范化(LayerNorm)是Transformer中的一项基本组件,能够稳定训练并改善优化。近年来,由于其稳定的梯度流动,Pre-LayerNorm Transformer成为了Post-LayerNorm Transformer的首选。然而,LayerNorm对这些架构中学习和记忆的影响仍然不清楚。在本研究中,我们探讨了LayerNorm如何影响Pre-和Post-LayerNorm Transformer中的记忆和学习。我们发现,对于Pre-LayerNorm Transformer,LayerNorm是稳定学习的关键因素,而在Post-LayerNorm Transformer中,它影响记忆。我们的分析表明,在Pre-LayerNorm模型中消除LayerNorm参数会加剧记忆并使学习不稳定,而在Post-LayerNorm模型中,它通过恢复真实标签有效地减轻了记忆。我们进一步精确地发现,早期层的LayerNorm是最重要的,其影响在Pre和Post LayerNorm模型中有所不同。我们通过13个模型在6个视觉和语言数据集上进行了验证。这些见解为理解LayerNorm在塑造Transformer中记忆和学习方面的作用提供了新的视角。
Summary / 总结
This study investigates the impact of Layer Normalization (LayerNorm) on memorization and learning in both Pre- and Post-LayerNorm transformers. The research finds that LayerNorm is crucial for stable learning in Pre-LayerNorm models, whereas it affects memorization in Post-LayerNorm models. Removing LayerNorm parameters in Pre-LayerNorm models increases memorization and destabilizes learning, while in Post-LayerNorm models, it mitigates memorization by restoring genuine labels. The study validates these findings across 13 models on six Vision and Language datasets.
研究探讨了层规范化(LayerNorm)对Pre-LayerNorm和Post-LayerNorm Transformer中记忆与学习的影响。研究发现,LayerNorm对Pre-LayerNorm模型的稳定学习至关重要,而在Post-LayerNorm模型中则影响记忆。移除Pre-LayerNorm模型中的LayerNorm参数会加剧记忆并使学习不稳定,而在Post-LayerNorm模型中,移除这些参数可通过恢复真实标签来减轻记忆。研究通过六个视觉和语言数据集上的13个模型验证了这些发现。
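For readers unfamiliar with the two layouts compared above, the sketch below shows where LayerNorm sits in a Pre-LayerNorm versus a Post-LayerNorm residual block. It is a pure-Python illustration under stated assumptions: `sublayer` is a hypothetical stand-in for attention or an MLP, and treating "LayerNorm parameters" as the learnable scale (`gamma`) and shift (`beta`) is one reading of the abstract, not something the listing confirms.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize to zero mean and unit variance, then apply the optional affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    y = [(v - mean) / math.sqrt(var + eps) for v in x]
    if gamma is not None:
        y = [g * v for g, v in zip(gamma, y)]
    if beta is not None:
        y = [v + b for v, b in zip(y, beta)]
    return y

def sublayer(x):
    """Hypothetical stand-in for attention or an MLP: an elementwise nonlinearity."""
    return [math.tanh(2.0 * v) for v in x]

def pre_ln_block(x):
    # Pre-LN: normalize the sublayer input; the residual path is left un-normalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_block(x):
    # Post-LN: add the residual first, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [0.5, -1.0, 2.0, 4.0]
pre_out = pre_ln_block(x)    # keeps the scale and offset of x on the residual path
post_out = post_ln_block(x)  # re-centered and re-scaled after every block
```

Calling `layer_norm` with `gamma=None` and `beta=None`, as here, corresponds to a LayerNorm without learnable parameters under the assumption stated above.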
OmniVGGT: Omni-Modality Driven Visual Geometry Grounded
Authors: Haosong Peng, Hao Li, Yalun Dai, Yushi Lan, Yihang Luo, Tianyu Qi, Zhengshen Zhang, Yufeng Zhan, Junfei Zhang, Wenchao Xu, Ziwei Liu
First: 2025-11-13T17:59:01+00:00 · Latest: 2025-11-13T17:59:01+00:00
Comments: Project Page: https://livioni.github.io/OmniVGGT-offcial/
Abstract
General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models. The enhanced VLA model by OmniVGGT not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
中文标题/摘要
标题:OmniVGGT:全方位驱动的视觉几何导向
通用3D基础模型已经开始引领统一多种视觉任务的趋势,但大多数模型仅假设RGB输入,忽略了可用的几何线索(例如,相机内参、外参和深度图)。为了解决这一问题,我们引入了OmniVGGT,这是一种新型框架,可以在训练和推理过程中有效利用任意数量的辅助几何模态。在我们的框架中,提出了一种GeoAdapter来将深度和相机内参/外参编码到空间基础模型中。它使用零初始化卷积逐步注入几何信息,而不破坏基础模型的表示空间。这种设计确保了优化的稳定性,几乎没有额外开销,即使有多个附加输入,推理速度也与VGGT相当。此外,还提出了一种随机多模态融合机制,在训练过程中按实例随机采样模态子集。这使得在测试过程中可以使用任意数量的模态输入,并促进学习稳健的空间表示,而不是过度拟合辅助线索。在单目/多视图深度估计、多视图立体和相机姿态估计的全面实验中,OmniVGGT在有辅助输入的情况下优于先前方法,并且即使仅使用RGB输入也能达到最先进的结果。为了进一步突出其实用性,我们将OmniVGGT集成到视觉-语言-动作(VLA)模型中。通过OmniVGGT增强的VLA模型不仅在主流基准上优于基于点云的基线模型,而且能够有效利用可获取的辅助输入,在机器人任务中实现一致的性能提升。
Summary / 总结
OmniVGGT is a framework that integrates geometric cues (depth, camera intrinsics/extrinsics) with visual inputs to improve 3D vision tasks. It uses a GeoAdapter to encode geometric information into a spatial foundation model without disrupting its representation space. OmniVGGT demonstrates superior performance in monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation, both with and without additional inputs. Additionally, integrating OmniVGGT into vision-language-action models enhances their performance on robotic tasks by effectively utilizing available auxiliary inputs.
OmniVGGT 是一种新型框架,将几何信息(深度、相机内参/外参)与空间基础模型结合以提升多种视觉任务。它使用 GeoAdapter 编码几何信息而不破坏模型的表示空间,并采用随机多模态融合机制以实现稳健学习。实验表明,OmniVGGT 在有辅助输入的情况下优于先前方法,并在仅使用 RGB 输入时达到最先进的性能。它还增强了视觉-语言-动作模型,在机器人任务中提供了持续的性能提升。
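Two ideas from the abstract, zero-initialized injection and per-instance random modality subsets, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the scalar-weight `ZeroInitAdapter` stands in for the zero-initialized convolutions, the 0.5 inclusion probability is arbitrary, and the feature vectors are made up.

```python
import random

def sample_modalities(available, rng, keep_prob=0.5):
    """Pick a random, possibly empty, subset of auxiliary modalities for one training instance."""
    return [m for m in available if rng.random() < keep_prob]

class ZeroInitAdapter:
    """Adds a weighted copy of an auxiliary feature to a base feature.

    The weights start at zero, so the adapter leaves the base feature unchanged
    until training updates them (hypothetical stand-in for a zero-initialized convolution).
    """
    def __init__(self, dim):
        self.weight = [0.0] * dim

    def __call__(self, base, aux):
        return [b + w * a for b, w, a in zip(base, self.weight, aux)]

rng = random.Random(0)
adapters = {name: ZeroInitAdapter(3) for name in ("depth", "intrinsics", "pose")}
rgb_feature = [0.2, -0.4, 1.0]
aux_features = {name: [1.0, 2.0, 3.0] for name in adapters}

chosen = sample_modalities(list(adapters), rng)
fused = rgb_feature
for name in chosen:
    fused = adapters[name](fused, aux_features[name])
```

At initialization `fused` equals `rgb_feature` whichever modalities were sampled, which is the sense in which the injection does not disturb the base representation.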
Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models
Authors: Alexander Htet Kyaw, Richa Gupta, Dhruv Shah, Anoop Sinha, Kory Mathewson, Stefanie Pender, Sachin Chitta, Yotto Koga, Faez Ahmed, Lawrence Sass, Randall Davis
Venue: NeurIPS 2025
First: 2025-11-04T01:02:21+00:00 · Latest: 2025-11-13T17:46:04+00:00
Comments: Accepted to NeurIPS 2025, Conference on Neural Information Processing Systems, Creative AI Track
Abstract
Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects involving multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models using predefined structural and panel components. We demonstrate that a VLM is capable of determining which mesh regions need panel components in addition to structural components, based on the object's geometry and functionality. Evaluation across test objects shows that users preferred the VLM-generated assignments 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.
中文标题/摘要
标题:使用3D生成AI和视觉语言模型从文本构建多组件物体的机器人装配
3D生成AI的进步使得从文本提示创建物理对象成为可能,但在创建涉及多种组件类型的对象时仍面临挑战。我们提出了一种将3D生成AI与视觉语言模型(VLMs)结合的管道,以使自然语言生成多组件物体的机器人装配成为可能。我们的方法利用VLMs进行零样本、多模态的几何和功能推理,将生成的网格分解为使用预定义结构和面板组件的多组件3D模型。我们证明VLM能够根据物体的几何形状和功能确定哪些网格区域需要面板组件。在测试对象上的评估显示,用户中有90.6%的时间更喜欢VLM生成的分配,而基于规则的分配为59.4%,随机分配为2.5%。最后,该系统允许用户通过对话反馈来细化组件分配,从而在使用生成AI和机器人技术制作物理对象时赋予更大的人类控制权和自主权。
Summary / 总结
This research addresses the challenge of creating multi-component objects from text prompts by integrating 3D generative AI with vision-language models. The method uses VLMs for zero-shot reasoning to decompose AI-generated meshes into multi-component 3D models. Evaluation shows that VLM-generated assignments were preferred 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignments. Users can also refine component assignments through conversational feedback, enhancing human control in the process.
该研究通过将3D生成AI与视觉语言模型相结合,解决了从文本提示创建多组件物体的挑战。方法利用VLM进行零样本多模态推理,将生成的AI网格分解为多组件3D模型。评估结果显示,VLM生成的分配被首选90.6%,而基于规则的分配为59.4%,随机分配仅为2.5%。用户还可以通过对话反馈进一步细化组件分配,增强人类在生成AI和机器人过程中的控制权。
SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation
Authors: Wei Li, Renshan Zhang, Rui Shao, Zhijian Fang, Kaiwen Zhou, Zhuotao Tian, Liqiang Nie
Venue: AAAI 2026 Oral
First: 2025-11-13T17:24:37+00:00 · Latest: 2025-11-13T17:24:37+00:00
Comments: Accepted to AAAI 2026 (Oral), Project Page: https://github.com/JiuTian-VL/SemanticVLA
Abstract
Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: 1) perceptual redundancy, where irrelevant visual inputs are processed inefficiently, and 2) superficial instruction-vision alignment, which hampers semantic grounding of actions. In this paper, we propose SemanticVLA, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: 1) To sparsify redundant perception while preserving semantic alignment, the Semantic-guided Dual Visual Pruner (SD-Pruner) combines two components: an Instruction-driven Pruner (ID-Pruner) that extracts global action cues and local semantic anchors in SigLIP, and a Spatial-aggregation Pruner (SA-Pruner) that compacts geometry-rich features into task-adaptive tokens in DINOv2. 2) To exploit sparsified features and integrate semantics with spatial geometry, the Semantic-complementary Hierarchical Fuser (SH-Fuser) fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. 3) To enhance the transformation from perception to action, the Semantic-conditioned Action Coupler (SA-Coupler) replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency. SemanticVLA surpasses OpenVLA on the LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold, respectively. SemanticVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/SemanticVLA
中文标题/摘要
标题:SemanticVLA:语义对齐的稀疏化与增强以实现高效的机器人操作
视觉-语言-动作(VLA)模型在机器人操作方面取得了进展,但实际部署仍受到两个关键限制的阻碍:1)感知冗余,其中无关的视觉输入被无效处理;2)表面的指令-视觉对齐,这妨碍了动作的语义定位。本文提出了一种新的VLA框架——SemanticVLA,以实现语义对齐的稀疏化与增强以实现高效的机器人操作。具体而言:1)为了稀疏化冗余感知同时保持语义对齐,语义引导的双视觉剪枝器(SD-Pruner)执行:指令驱动剪枝器(ID-Pruner)从SigLIP中提取全局动作线索和局部语义锚点;空间聚合剪枝器(SA-Pruner)将DINOv2中的几何丰富特征压缩成任务适配的令牌。2)为了利用稀疏特征并整合语义与空间几何,语义互补的分层融合器(SH-Fuser)在SigLIP和DINOv2之间融合密集补丁和稀疏令牌,以实现连贯的表示。3)为了增强从感知到动作的转换,语义条件的动作耦合器(SA-Coupler)替代了传统的观测到自由度的方法,从而为操作任务提供更高效和可解释的行为建模。在模拟和实际任务上的广泛实验表明,SemanticVLA在性能和效率方面均达到了新的SOTA。SemanticVLA在LIBERO基准测试上的成功率比OpenVLA高出21.1%,同时将训练成本和推理延迟分别降低了3.0倍和2.7倍。SemanticVLA已开源并公开发布在https://github.com/JiuTian-VL/SemanticVLA
Summary / 总结
The research aims to improve the efficiency and effectiveness of Vision-Language-Action (VLA) models in robotic manipulation by addressing perceptual redundancy and superficial instruction-vision alignment. The proposed SemanticVLA framework includes a Semantic-guided Dual Visual Pruner (SD-Pruner) for sparsifying redundant perception while preserving semantic alignment, a Semantic-complementary Hierarchical Fuser (SH-Fuser) for integrating semantics with spatial geometry, and a Semantic-conditioned Action Coupler (SA-Coupler) for enhancing the transformation from perception to action. Experiments show that SemanticVLA outperforms OpenVLA on the LIBERO benchmark with a 21.1% higher success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold, respectively.
研究旨在通过解决感知冗余和表层指令-视觉对齐问题,提高Vision-Language-Action (VLA) 模型在机器人操作中的效率和效果。提出的SemanticVLA框架包括一个语义引导的双视觉剪枝器(SD-Pruner),用于稀疏化冗余感知并保持语义对齐;一个语义互补的层次融合器(SH-Fuser),用于将语义与空间几何体融合;以及一个语义条件的动作耦合器(SA-Coupler),用于增强从感知到动作的转换。实验结果表明,SemanticVLA在LIBERO基准测试中比OpenVLA的成功率高出21.1%,同时将训练成本和推理延迟分别降低了3.0倍和2.7倍。
Drifting Away from Truth: GenAI-Driven News Diversity Challenges LVLM-Based Misinformation Detection
Authors: Fanxiao Li, Jiaying Wu, Tingchao Fu, Yunyun Dong, Bingbing Song, Wei Zhou
First: 2025-08-18T08:19:43+00:00 · Latest: 2025-11-13T16:57:54+00:00
Abstract
The proliferation of multimodal misinformation poses growing threats to public discourse and societal trust. While Large Vision-Language Models (LVLMs) have enabled recent progress in multimodal misinformation detection (MMD), the rise of generative AI (GenAI) tools introduces a new challenge: GenAI-driven news diversity, characterized by highly varied and complex content. We show that this diversity induces multi-level drift, comprising (1) model-level misperception drift, where stylistic variations disrupt a model's internal reasoning, and (2) evidence-level drift, where expression diversity degrades the quality or relevance of retrieved external evidence. These drifts significantly degrade the robustness of current LVLM-based MMD systems. To systematically study this problem, we introduce DriftBench, a large-scale benchmark comprising 16,000 news instances across six categories of diversification. We design three evaluation tasks: (1) robustness of truth verification under multi-level drift; (2) susceptibility to adversarial evidence contamination generated by GenAI; and (3) analysis of reasoning consistency across diverse inputs. Experiments with six state-of-the-art LVLM-based detectors show substantial performance drops (average F1 -14.8%) and increasingly unstable reasoning traces, with even more severe failures under adversarial evidence injection. Our findings uncover fundamental vulnerabilities in existing MMD systems and suggest an urgent need for more resilient approaches in the GenAI era.
中文标题/摘要
标题:远离真相:由GenAI驱动的新闻多样性挑战LVLM基的误信息检测
多模态误信息的泛滥对公共话语和社会信任构成了日益增长的威胁。虽然大型视觉-语言模型(LVLM)在多模态误信息检测(MMD)方面取得了近期进展,但生成型AI(GenAI)工具的兴起引入了一个新的挑战:由GenAI驱动的新闻多样性,其特征是内容高度多样化和复杂化。我们表明,这种多样性导致了多级漂移,包括(1)模型级感知漂移,其中风格变化干扰了模型的内部推理,以及(2)证据级漂移,其中表达多样性降低了检索外部证据的质量或相关性。这些漂移显著削弱了当前基于LVLM的MMD系统的稳健性。为了系统地研究这一问题,我们引入了DriftBench,这是一个包含16,000个新闻实例的大规模基准,涵盖了六类多样化的类别。我们设计了三个评估任务:(1)在多级漂移下的事实验证稳健性;(2)对抗由GenAI生成的虚假证据污染的易感性;以及(3)对多样输入推理一致性的分析。六种最先进的基于LVLM的检测器的实验显示,性能下降显著(平均F1 -14.8%),推理轨迹越来越不稳定,并且在对抗虚假证据注入下表现更加严重。我们的研究揭示了现有MMD系统中的根本性漏洞,并建议在GenAI时代迫切需要更稳健的方法。
Summary / 总结
The paper addresses the challenge of GenAI-driven news diversity in multimodal misinformation detection, which causes multi-level drift in LVLM-based systems. It introduces DriftBench, a benchmark with 16,000 news instances, and evaluates six state-of-the-art LVLM-based detectors, showing significant performance drops and unstable reasoning traces. The study highlights the need for more robust MMD systems in the GenAI era.
论文探讨了GenAI驱动的新闻多样性对多模态虚假信息检测的挑战,这种多样性会导致基于LVLM的系统出现多级漂移。它引入了包含16,000个新闻实例的DriftBench基准,并评估了六种最先进的基于LVLM的检测器,在多样输入和对抗性证据注入下显示出显著的性能下降和不稳定的推理轨迹。这项研究突显了在GenAI时代需要更稳健的MMD系统。
vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs
Authors: Minye Shao, Sihan Guo, Xinrun Li, Xingyu Miao, Haoran Duan, Yang Long
Venue: AAAI 2026 Oral Presentation
First: 2025-11-12T18:38:33+00:00 · Latest: 2025-11-13T16:56:35+00:00
Comments: Accepted as an Oral Presentation at AAAI 2026 Main Technical Track (this version is not peer-reviewed; it is the extended version)
Abstract
Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work aims to continuously expand to encompass more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.
中文标题/摘要
标题:vMFCoOp:在统一超球面流形上朝均衡方向努力以促进生物医学VLM
基于大型语言模型(LLM)提炼的医学语义先验的上下文优化(CoOp)的最新进展为使用生物医学CLIP基视觉语言模型(VLMs)进行手动提示工程和全面微调提供了可扩展的替代方案。然而,这种上下文中的提示学习受到LLM和CLIP变体之间语义不匹配的挑战,这归因于不同的训练语料库和模型架构;此外,它在不断演进的基础模型家族中缺乏可扩展性。更严重的是,通过传统的欧几里得空间优化进行的两模态对齐缺乏建模统一表示或应用局部几何约束的能力,这在复杂的生物医学成像中往往会放大模态差距并破坏少量样本的适应性。在本文中,我们提出vMFCoOp框架,该框架在共享的超球面流形上逆向估计von Mises-Fisher(vMF)分布,通过统一语义锚点对任意LLM和CLIP主干之间的语义偏差进行对齐,以实现稳健的生物医学提示和优越的少量样本分类。基于三个互补约束,vMFCoOp在14个医学数据集、12种医学成像模态和13个解剖区域上表现出一致的改进,超越了最先进的方法在准确度、泛化能力和临床应用方面。本文旨在不断扩展以涵盖更多的下游应用,相应的资源将通过https://github.com/VinyehShaw/UniEqui共享。
Summary / 总结
vMFCoOp proposes a framework for aligning semantic biases between LLMs and CLIP backbones using von Mises-Fisher distributions on a shared Hyperspherical Manifold, achieving robust biomedical prompting and superior few-shot classification across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming existing methods in accuracy and generalization. This work aims to continuously expand to more downstream applications.
vMFCoOp 提出了一种框架,使用共享超球面流形上的 von Mises-Fisher 分布来对齐 LLM 和 CLIP 后端之间的语义偏见,实现了在 14 个医学数据集、12 种医学成像模态和 13 个解剖区域中稳健的生物医学提示和优越的少量样本分类,优于现有方法在准确性和泛化性方面。
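For reference, the von Mises-Fisher distribution named above is the standard directional distribution on the unit hypersphere. The density below is the textbook form, given only as background; how vMFCoOp estimates or constrains it is not described in this listing.

```latex
% von Mises-Fisher density on the unit sphere S^{p-1} \subset \mathbb{R}^p,
% with mean direction \mu (\|\mu\| = 1) and concentration \kappa \ge 0.
f_p(\mathbf{x};\, \boldsymbol{\mu}, \kappa)
  = C_p(\kappa)\, \exp\!\left(\kappa\, \boldsymbol{\mu}^{\top}\mathbf{x}\right),
\qquad
C_p(\kappa) = \frac{\kappa^{\,p/2-1}}{(2\pi)^{p/2}\, I_{p/2-1}(\kappa)},
\qquad \|\mathbf{x}\| = 1 .
```

Here $I_{\nu}$ is the modified Bessel function of the first kind; larger $\kappa$ concentrates mass around $\boldsymbol{\mu}$, and $\kappa = 0$ gives the uniform distribution on the sphere.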
Intrinsic Dimensionality as a Model-Free Measure of Class Imbalance
Authors: Çağrı Eser, Zeynep Sonat Baltacı, Emre Akbaş, Sinan Kalkan
First: 2025-11-13T16:41:37+00:00 · Latest: 2025-11-13T16:41:37+00:00
Comments: 45 pages, 11 figures
Abstract
Imbalance in classification tasks is commonly quantified by the cardinalities of examples across classes. This, however, disregards the presence of redundant examples and inherent differences in the learning difficulties of classes. Alternatively, one can use complex measures such as training loss and uncertainty, which, however, depend on training a machine learning model. Our paper proposes using data Intrinsic Dimensionality (ID) as an easy-to-compute, model-free measure of imbalance that can be seamlessly incorporated into various imbalance mitigation methods. Our results across five different datasets with a diverse range of imbalance ratios show that ID consistently outperforms cardinality-based re-weighting and re-sampling techniques used in the literature. Moreover, we show that combining ID with cardinality can further improve performance. Code: https://github.com/cagries/IDIM.
中文标题/摘要
标题:固有维数作为无模型类不平衡度量
分类任务中的不平衡通常通过各类别示例的数量来量化。然而,这种方法忽略了冗余示例的存在以及类间学习难度的内在差异。相反,可以使用训练损失和不确定性等复杂度量,但这些度量依赖于训练机器学习模型。我们论文提出使用数据固有维数(ID)作为易于计算、无模型的不平衡度量,可以无缝融入各种不平衡缓解方法中。我们在五个不同数据集上的结果表明,ID在各种不平衡比率下始终优于文献中使用的基于数量的重权和重采样技术。此外,我们展示了将ID与数量结合使用可以进一步提高性能。代码:https://github.com/cagries/IDIM。
Summary / 总结
The paper addresses the issue of class imbalance in classification tasks by proposing the use of Intrinsic Dimensionality (ID) as a model-free measure. Unlike traditional methods that rely on the cardinalities of examples, ID captures the inherent learning difficulty of classes and the presence of redundant examples. Experiments across five datasets demonstrate that ID outperforms cardinality-based re-weighting and re-sampling techniques, and combining ID with cardinality further enhances performance.
论文通过提出使用内在维数(ID)作为无模型的方法来解决分类任务中的类别不平衡问题。与依赖于示例数量的传统方法不同,ID 能够捕捉数据的内在复杂性和冗余性。实验结果显示,ID 在五个不同数据集上的表现优于基于数量的重加权和重采样技术,并且将 ID 与数量结合使用可以进一步提高性能。
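To make the idea of a model-free, ID-based imbalance measure concrete, here is a minimal sketch. It assumes the Levina-Bickel maximum-likelihood estimator for intrinsic dimensionality; the abstract does not say which estimator the authors use. The function `id_class_weights` and its "weight proportional to ID" rule are hypothetical illustrations, not the paper's weighting scheme, and the toy data is invented.

```python
import math
import random

def mle_intrinsic_dim(points, k=10):
    """Levina-Bickel MLE estimate of intrinsic dimensionality, averaged over points."""
    estimates = []
    for i, p in enumerate(points):
        # Distances from p to every other point, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        knn = dists[:k]
        # Mean log-ratio of the k-th neighbour distance to each closer neighbour.
        log_ratio = sum(math.log(knn[-1] / d) for d in knn[:-1]) / (k - 1)
        estimates.append(1.0 / log_ratio)
    return sum(estimates) / len(estimates)

def id_class_weights(data_by_class, k=10):
    """Hypothetical rule: weight each class in proportion to its ID (mean weight 1)."""
    ids = {c: mle_intrinsic_dim(x, k) for c, x in data_by_class.items()}
    mean_id = sum(ids.values()) / len(ids)
    return {c: v / mean_id for c, v in ids.items()}

rng = random.Random(0)
# Toy classes embedded in 3-D: one lies on a line, the other on a plane.
line = [(t, 2 * t, -t) for t in (rng.random() for _ in range(200))]
plane = [(rng.random(), rng.random(), 0.0) for _ in range(200)]
weights = id_class_weights({"line": line, "plane": plane})
```

Both toy classes have the same cardinality, so a count-based weighting would treat them identically, while the ID-based weights differ. No model is trained at any point.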
Surrogate Quantum Circuit Design for the Lattice Boltzmann Collision Operator
Authors: Monica Lăcătuş, Matthias Möller
First: 2025-07-16T14:02:01+00:00 · Latest: 2025-11-13T15:44:42+00:00
Comments: 54 pages, 18 figures
Abstract
This study introduces a framework for learning a low-depth surrogate quantum circuit (SQC) that approximates the nonlinear, dissipative, and hence non-unitary Bhatnagar-Gross-Krook (BGK) collision operator in the lattice Boltzmann method (LBM) for the D2Q9 lattice. By appropriately selecting the quantum state encoding, circuit architecture, and measurement protocol, non-unitary dynamics emerge naturally within the physical population space. This approach removes the need for probabilistic algorithms relying on any ancilla qubits and post-selection to reproduce dissipation, or for multiple state copies to capture nonlinearity. The SQC is designed to preserve key physical properties of the BGK operator, including mass conservation, scale equivariance, and D8 equivariance, while momentum conservation is encouraged through penalization in the training loss. When compiled to the IBM Heron quantum processor's native gate set, assuming all-to-all qubit connectivity, the circuit requires only 724 native gates and operates locally on the velocity register, making it independent of the lattice size. The learned SQC is validated on two benchmark cases, the Taylor-Green vortex decay and the lid-driven cavity, showing accurate reproduction of vortex decay and flow recirculation. While integration of the SQC into a quantum LBM framework presently requires measurement and re-initialization at each timestep, the necessary steps towards a measurement-free formulation are outlined.
中文标题/摘要
标题:晶格玻尔兹曼碰撞算子的代理量子电路设计
本研究提出了一种框架,用于学习一个低深度的代理量子电路(SQC),以近似晶格玻尔兹曼方法(LBM)中D2Q9晶格上的非线性、耗散且因此非幺正的Bhatnagar-Gross-Krook(BGK)碰撞算子。通过适当选择量子态编码、电路架构和测量协议,非幺正动力学自然地在物理群体空间中出现。这种方法消除了依赖辅助量子位和后选择的概率算法以再现耗散,或依赖多个状态副本以捕捉非线性的需求。SQC被设计为保留BGK算子的关键物理特性,包括质量守恒、尺度不变性和D8不变性,同时通过训练损失中的惩罚鼓励动量守恒。当编译为IBM Heron量子处理器的本征门集时,假设所有到所有量子位连接,该电路仅需724个本征门,并且仅在速度寄存器上本地操作,使其与晶格大小无关。所学的SQC在两个基准案例——Taylor-Green涡旋衰减和盖板驱动的腔室——上进行了验证,显示出涡旋衰减和流体再循环的准确再现。虽然将SQC集成到量子LBM框架中目前需要在每个时间步进行测量和重新初始化,但实现无测量形式的必要步骤已被概述。
Summary / 总结
This study presents a framework for learning a low-depth surrogate quantum circuit (SQC) that approximates the nonlinear, non-unitary BGK collision operator of the lattice Boltzmann method (LBM) on the D2Q9 lattice. By choosing the quantum state encoding, circuit architecture, and measurement protocol appropriately, the approach captures non-unitary dynamics without ancilla qubits, post-selection, or multiple state copies. The SQC preserves mass conservation, scale equivariance, and D8 equivariance by design, while momentum conservation is only encouraged through a penalty in the training loss. Compiled to the IBM Heron native gate set under an all-to-all connectivity assumption, the circuit requires 724 native gates and acts only on the velocity register, so its size is independent of the lattice size. Validation on the Taylor-Green vortex decay and the lid-driven cavity shows accurate vortex decay and flow recirculation, though integration into a quantum LBM framework still requires measurement and re-initialization at each timestep.
该研究提出了一种框架,用于学习低深度的代理量子电路(SQC),以近似格子玻尔兹曼方法(LBM)中D2Q9格子上的BGK碰撞算子。该方法通过特定的量子态编码、电路架构和测量协议,自然地引入非幺正动力学,无需辅助量子位或后选择。SQC在设计上保留了质量守恒、尺度等变性和D8等变性,而动量守恒仅通过训练损失中的惩罚项加以鼓励。在假设量子位全连接的前提下,电路编译到IBM Heron量子处理器的本机门集后仅需724个本机门,并且只在速度寄存器上局部操作,因此与格子规模无关。SQC在Taylor-Green涡旋衰减和顶盖驱动方腔两个基准案例中得到验证,准确再现了涡旋衰减和流体回流。目前将其集成到量子LBM框架中仍需要在每个时间步进行测量和重新初始化,论文概述了迈向无测量形式所需的步骤。
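For reference, the classical operator that the surrogate circuit approximates can be written in a few lines. The sketch below is the standard textbook D2Q9 BGK collision step, not the authors' quantum circuit. The population values and the relaxation time `tau` are made-up inputs for illustration.

```python
# D2Q9 lattice: discrete velocities and their quadrature weights.
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4

def moments(f):
    """Density and velocity (zeroth and first moments) of the populations."""
    rho = sum(f)
    ux = sum(fi * cx for fi, (cx, _) in zip(f, C)) / rho
    uy = sum(fi * cy for fi, (_, cy) in zip(f, C)) / rho
    return rho, ux, uy

def equilibrium(rho, ux, uy):
    """Second-order equilibrium populations; quadratic in velocity, hence nonlinear in f."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(C, W):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1 + 3 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq

def bgk_collide(f, tau):
    """Relax every population toward its local equilibrium with relaxation time tau."""
    rho, ux, uy = moments(f)
    feq = equilibrium(rho, ux, uy)
    return [fi - (fi - fe) / tau for fi, fe in zip(f, feq)]

f = [0.40, 0.12, 0.10, 0.11, 0.09, 0.03, 0.02, 0.03, 0.04]  # illustrative populations
f_post = bgk_collide(f, tau=0.8)
```

The classical step conserves mass and momentum exactly because the equilibrium shares the zeroth and first moments of the input. The relaxation toward equilibrium is what makes the map dissipative and therefore non-unitary.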
MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns
Authors: Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, Zeen Wang, Qiangjun Ji, Fanxi Zhou, Qi Zhang, Yuanrui Hu, Jiahao Liu, Zhang Li, Ziyang Zhang, Qiang Liu, Xiang Bai
First: 2025-11-13T15:12:17+00:00 · Latest: 2025-11-13T15:12:17+00:00
Abstract
Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.
中文标题/摘要
标题:MonkeyOCR v1.5 技术报告:解锁复杂模式下的稳健文档解析
文档解析是文档智能的核心任务,支持信息提取、检索增强生成和自动化文档分析等应用。然而,现实中的文档往往具有复杂的布局,包括多级表格、嵌入的图像或公式以及跨页结构,这些对现有的OCR系统来说仍然是挑战。我们介绍了MonkeyOCR v1.5,这是一种统一的视觉-语言框架,通过两阶段解析管道增强布局理解和内容识别。第一阶段使用大型多模态模型联合预测文档布局和阅读顺序,利用视觉信息确保结构和顺序的一致性。第二阶段在检测到的区域内局部识别文本、公式和表格,保持高视觉保真度的同时减少错误传播。为了解决复杂的表格结构,我们提出了一种基于视觉一致性的强化学习方案,通过渲染和比较对齐来评估识别质量,从而提高结构准确性,无需手动注释。此外,还引入了两个专门模块——图像解耦表格解析和类型引导表格合并,以实现包含嵌入图像的表格的可靠解析以及跨页或跨列表格的重建。在OmniDocBench v1.5上的全面实验表明,MonkeyOCR v1.5达到了最先进的性能,优于PPOCR-VL和MinerU 2.5,并且在视觉复杂的文档场景中表现出色。
Summary / 总结
MonkeyOCR v1.5 is a unified vision-language framework designed to handle complex document layouts with multi-level tables, embedded images, and cross-page structures. It uses a two-stage parsing pipeline: the first stage predicts layout and reading order, ensuring structural and sequential consistency, while the second stage recognizes text, formulas, and tables with high visual fidelity. The framework includes a visual consistency-based reinforcement learning scheme and specialized modules for parsing tables with embedded images and reconstructing multi-page tables, achieving state-of-the-art performance on OmniDocBench v1.5.
MonkeyOCR v1.5 是一个统一的视觉-语言框架,通过两阶段管道提高复杂布局的文档解析能力。第一阶段使用多模态模型预测布局和阅读顺序,确保结构和顺序的一致性。第二阶段以高视觉保真度识别文本、公式和表格。它引入了用于复杂表格结构的强化学习方案,并提出了专门模块来解析包含嵌入图像的表格和跨页或列的表格。实验表明,MonkeyOCR v1.5 在视觉复杂文档场景中优于现有系统。
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
Authors: Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, Chenglu Wen
First: 2025-11-13T14:51:21+00:00 · Latest: 2025-11-13T14:51:21+00:00
Comments: 10 pages
Abstract
Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. We also identify the last-mile problem in zero-shot navigation, namely determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on the GOAT-Bench and HM3D-OVON datasets. The code will be made publicly available.
中文标题/摘要
标题:MSGNav:释放多模态3D场景图的潜力以实现零样本具身导航
具身导航是机器人智能体的一项基本能力。实际部署需要开放词汇泛化和低训练开销,因此推动了零样本方法而不是特定任务的RL训练。然而,现有的构建显式3D场景图的零样本方法通常将丰富的视觉观察压缩为仅文本关系,导致高构建成本、不可逆的视觉证据损失和受限的词汇量。为了解决这些限制,我们引入了多模态3D场景图(M3DSG),通过用动态分配的图像替换文本关系边来保留视觉线索。基于M3DSG,我们提出了MSGNav,这是一种零样本导航系统,包括一个关键子图选择模块以实现高效推理、一个自适应词汇更新模块以支持开放词汇以及一个闭环推理模块以实现准确的探索推理。此外,我们进一步识别了零样本导航中的最后一英里问题,即确定具有合适最终视角的可行目标位置,并提出了一种基于可见性的视角决策模块以明确解决该问题。全面的实验结果表明,MSGNav在GOAT-Bench和HM3D-OVON数据集上达到了最先进的性能。开源代码将公开。
Summary / 总结
MSGNav addresses the limitations of existing zero-shot embodied navigation methods by introducing the Multi-modal 3D Scene Graph (M3DSG) that preserves visual cues. MSGNav includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration. Additionally, it proposes a Visibility-based Viewpoint Decision module to determine feasible target locations. Experimental results show that MSGNav outperforms existing methods on GOAT-Bench and HM3D-OVON datasets.
研究旨在开发一种零样本室内导航系统,能够在未见过的环境中和任务中进行泛化,并且具有较低的训练开销。方法引入了多模态3D场景图(M3DSG)以保留视觉线索,并提出了一种名为MSGNav的零样本导航系统,该系统包括高效推理模块、开放词汇支持模块和准确探索推理模块。MSGNav在GOAT-Bench和HM3D-OVON数据集上达到了最先进的性能,解决了先前方法在构建成本和词汇量限制方面的局限性。
Rethinking Visual Information Processing in Multimodal LLMs
Authors: Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C
First: 2025-11-13T13:36:30+00:00 · Latest: 2025-11-13T13:36:30+00:00
Abstract
Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design inherently struggles to effectively integrate visual features due to the inherent mismatch between text and vision modalities. We tackle this issue from a novel perspective in which the LLM not only serves as a language model but also a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks, even surpassing models with double its parameter count, establishing a more effective approach to vision-language modeling.
中文标题/摘要
标题:重新思考多模态LLM中的视觉信息处理
尽管LLaVA架构在视觉语言任务中取得了显著的成功,但由于文本和视觉模态之间的固有不匹配,其设计本质上难以有效地整合视觉特征。我们从一个新颖的角度出发,使LLM不仅作为语言模型,还作为强大的视觉编码器。为此,我们提出了LLaViT——大型语言模型作为扩展的视觉变换器——这使得LLM能够通过三种关键修改同时作为视觉编码器运行:(1) 为视觉模态学习独立的QKV投影,(2) 允许视觉标记的双向注意,(3) 结合全局和局部视觉表示。通过在广泛范围的LLM上进行大量的受控实验,我们证明LLaViT在多种基准测试中显著优于基线LLaVA方法,甚至超越了参数量是其两倍的模型,从而确立了一种更有效的视觉语言建模方法。
Summary / 总结
This paper addresses the challenge of integrating visual features in vision-language tasks by proposing LLaViT, which modifies the LLaVA architecture. LLaViT introduces three key changes: separate QKV projections for vision, bidirectional attention on visual tokens, and the use of both global and local visual representations. Extensive experiments show that LLaViT outperforms the original LLaVA and even models with twice the parameters across various benchmarks, demonstrating a more effective vision-language modeling approach.
研究旨在通过重新思考LLM的角色来改善多模态LLM中视觉特征的整合。方法包括对LLaVA架构进行三项关键修改:为视觉模态学习独立的QKV投影、在视觉标记上启用双向注意以及结合全局和局部视觉表示。实验表明,LLaViT在各种基准测试中显著优于原始的LLaVA方法,甚至超越了参数量翻倍的模型,证明了更有效的视觉-语言建模方法。
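The second modification listed above, bidirectional attention on visual tokens, can be pictured as a change to the attention mask. The following is a minimal sketch under an assumed masking rule: text tokens stay causal, and visual tokens may additionally attend to every other visual token. The function name and the example token layout are hypothetical, and the paper's exact rule (for example with multiple images) may differ.

```python
def build_attention_mask(is_visual):
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumed rule: the usual causal constraint (j <= i) applies everywhere,
    except that a visual token may also look ahead to later visual tokens.
    """
    n = len(is_visual)
    return [
        [j <= i or (is_visual[i] and is_visual[j]) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical sequence: one text token, three image tokens, two text tokens.
mask = build_attention_mask([False, True, True, True, False, False])
```

Under this rule the image block forms a fully connected sub-matrix, as in a Vision Transformer, while text generation remains autoregressive.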
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models
Authors: Zhengtao Zou, Ya Gao, Jiarui Guan, Bin Li, Pekka Marttinen
First: 2025-11-13T13:29:38+00:00 · Latest: 2025-11-13T13:29:38+00:00
Comments: Under review
Abstract
Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs, which can critically undermine their reliability. Existing inference-time interventions to mitigate this issue present a challenging trade-off: while methods that steer internal states or adjust output logits can be effective, they often incur substantial computational overhead, typically requiring extra forward passes. This efficiency bottleneck can limit their practicality for real-world, latency-sensitive deployments. In this work, we aim to address this trade-off with Residual-Update Directed DEcoding Regulation (RUDDER), a low-overhead framework that steers LVLMs towards visually-grounded generation. RUDDER is built on two key innovations: (1) Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single, standard forward pass. (2) A Bayesian-inspired adaptive gate that performs token-wise injection, applying a corrective signal whose strength is conditioned on the model's deviation from the visual context. Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER achieves performance comparable to state-of-the-art methods while introducing negligible computational latency, validating RUDDER as a pragmatic and effective approach for improving LVLMs' reliability without a significant compromise on efficiency.
中文标题/摘要
标题:低开销残差更新引导在大型视觉语言模型中减少幻觉的自适应方法
大型视觉-语言模型(LVLMs)经常遭受对象幻觉的问题,生成与视觉输入不一致的文本,这会严重影响其可靠性。现有的推理时干预措施缓解这一问题存在挑战:虽然可以有效引导内部状态或调整输出概率的方法存在,但它们通常会带来显著的计算开销,通常需要额外的前向传递。这种效率瓶颈限制了它们在实际、对延迟敏感的部署中的实用性。在本文中,我们旨在通过残差更新导向解码调节(RUDDER)框架解决这一权衡问题,RUDDER是一种低开销框架,可以引导LVLMs生成与视觉内容一致的文本。RUDDER基于两项创新:(1)上下文激活残差方向(CARD)向量,这是一种从单个标准前向传递中自注意力层的残差更新中提取的每样本视觉证据向量。(2)一种贝叶斯启发的自适应门,进行逐词注入,应用强度根据模型与视觉上下文的偏差进行调整的矫正信号。在POPE和CHAIR等关键幻觉基准上的广泛实验表明,RUDDER在引入几乎可以忽略的计算延迟的同时,实现了与最先进的方法相当的性能,验证了RUDDER作为一种实用且有效的改进LVLMs可靠性的方法的有效性。
Summary / 总结
This work addresses the issue of object hallucination in large vision-language models by introducing RUDDER, a low-overhead framework that uses a single forward pass to extract a contextual activation residual direction vector and a Bayesian-inspired adaptive gate to steer the model towards visually-grounded generation. Experiments show that RUDDER achieves performance comparable to state-of-the-art methods while introducing minimal computational latency, making it a practical solution for improving LVLMs' reliability without significant efficiency trade-offs.
本文提出了一种低开销框架RUDDER,通过引导LVLMs生成与视觉输入一致的内容来解决大型视觉-语言模型中的对象幻觉问题。RUDDER利用Contextual Activation Residual Direction (CARD)向量和一个贝叶斯启发式的自适应门来纠正模型的输出,而不增加显著的计算延迟。实验表明,RUDDER在性能上与最先进的方法相当,同时保持了微乎其微的计算延迟,使其成为提高LVLMs可靠性的实用解决方案。
VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations
Authors: Qianqian Qiao, DanDan Zheng, Yihang Bo, Bao Peng, Heng Huang, Longteng Jiang, Huaye Wang, Jingdong Chen, Jun Zhou, Xin Jin
First: 2025-10-29T07:37:08+00:00 · Latest: 2025-11-13T13:06:51+00:00
Abstract
Video aesthetic assessment, a vital area in multimedia computing, integrates computer vision with human cognition. Its progress is limited by the lack of standardized datasets and robust models, as the temporal dynamics of video and multimodal fusion challenges hinder direct application of image-based methods. This study introduces VADB, the largest video aesthetic database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions, including overall and attribute-specific aesthetic scores, rich language comments and objective tags. We propose VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks. The dataset and source code are available at https://github.com/BestiVictory/VADB.
中文标题/摘要
标题:VADB:大规模视频美学数据库及其专业多维度注释
视频美学评估是多媒体计算中的一个重要领域,它将计算机视觉与人类认知相结合。由于缺乏标准化数据集和稳健的模型,视频的时间动态和多模态融合的挑战限制了图像方法的直接应用。本研究介绍了VADB,这是一个包含10,490个多样视频的大规模视频美学数据库,这些视频由37名专业人士从多个美学维度进行标注,包括总体和属性特定的美学评分、丰富的语言评论和客观标签。我们提出了VADB-Net,这是一种双模态预训练框架,具有两阶段训练策略,在评分任务中优于现有的视频质量评估模型,并支持下游视频美学评估任务。数据集和源代码可在https://github.com/BestiVictory/VADB获取。
Summary / 总结
The study addresses the need for a standardized video aesthetic database and robust assessment models due to the challenges in temporal dynamics and multimodal fusion. VADB, a large-scale database with 10,490 videos annotated by 37 professionals, is introduced. The researchers also propose VADB-Net, a dual-modal pre-training framework that outperforms existing models in scoring tasks and supports video aesthetic assessment. The dataset and source code are available online.
研究旨在解决视频美学评估中标准化数据集和稳健模型的缺乏问题,这对于多媒体计算至关重要。该研究引入了VADB,一个包含10,490个视频的大规模数据库,这些视频由37位专业人士从多个美学维度进行标注。研究提出了VADB-Net,这是一种双模态预训练框架,其在评分任务中优于现有模型,并支持后续的美学评估任务。
PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning
Authors: Yanbei Jiang, Chao Lei, Yihao Ding, Krista Ehinger, Jey Han Lau
First: 2025-11-13T13:06:12+00:00 · Latest: 2025-11-13T13:06:12+00:00
Abstract
Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art, establishing a strong reasoning and generalization capability for visual reasoning tasks. The code is available at: https://github.com/YanbeiJiang/PROPA.
中文标题/摘要
标题:PROPA:通过强化学习实现视觉推理过程级优化
尽管取得了显著进展,视觉语言模型(VLMs)仍然难以应对复杂的视觉推理任务,其中多步依赖关系会导致早期错误在推理链中累积。现有的后训练范式有限:监督微调(SFT)依赖于昂贵的步骤级注释,而验证奖励的强化学习(RLVR)方法如GRPO只能提供稀疏的结果级反馈,阻碍了稳定优化。我们提出了PROPA(过程级推理优化与交错策略对齐),这是一种新颖的框架,将蒙特卡洛树搜索(MCTS)与GRPO结合,生成密集的过程级奖励,并在无需人工注释的情况下优化每个中间步骤的推理。为了解决冷启动问题,PROPA将GRPO更新与SFT交错进行,使模型能够从成功的和失败的推理轨迹中学习。进一步训练了一个过程奖励模型(PRM)来指导推理时的搜索,使测试时的搜索与训练信号对齐。在七个基准和四种VLM主干网络上,PROPA始终优于SFT-和RLVR基线。与现有最先进的技术相比,它在领域内任务上实现了高达17.0%的提升,在领域外任务上实现了高达21.0%的提升,建立了强大的推理和泛化能力。代码可在:https://github.com/YanbeiJiang/PROPA 获取。
Summary / 总结
The research aims to improve the process-level reasoning capability of Vision-Language Models (VLMs) in complex visual reasoning tasks. PROPA integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense process-level rewards and optimize reasoning at each step without human annotations. It interleaves GRPO updates with Supervised Fine-Tuning (SFT) to learn from both successful and failed reasoning trajectories. PROPA outperforms both SFT- and RLVR-based baselines, achieving up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art methods, demonstrating strong reasoning and generalization capabilities.
研究旨在提高复杂场景下的视觉推理能力,其中多步依赖可能导致级联错误。PROPA 使用结合 Monte Carlo Tree Search (MCTS) 和 GRPO 的框架生成密集的过程级奖励,无需人工注释即可优化每一步的推理。它将 GRPO 更新与 Supervised Fine-Tuning (SFT) 交错进行,以从成功和失败的轨迹中学习。PROPA 在七个基准测试中均优于 SFT- 和 RLVR 基线方法,相比现有最先进的模型,在跨域任务中可实现高达 21.0% 的性能提升。
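The sparse, outcome-level signal that the abstract attributes to GRPO can be seen in the group-relative advantage itself. The sketch below shows only that standard normalization step, not PROPA's MCTS-based process rewards, which the abstract does not specify. The reward values are invented, and the population standard deviation is one of several conventions used in practice.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one question, scored only on the final outcome.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When every sample has the same outcome, the group carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Every step of a trajectory inherits the same scalar advantage here, and a group with uniform outcomes yields all-zero advantages. Both effects motivate denser step-level rewards.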
Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features
Authors: Shihao Ji, Zihui Song
First: 2025-10-19T10:13:34+00:00 · Latest: 2025-11-13T13:02:07+00:00
Comments: This paper is being withdrawn because we have identified a significant error in the implementation of our self-supervised clustering approach. Specifically, our feature aggregation step inadvertently leaked temporal information across frames, which violates the core assumption of our training-free method. We sincerely apologize to the research community
Abstract
The remarkable zero-shot reasoning capabilities of large-scale Visual Language Models (VLMs) on static images have yet to be fully translated to the video domain. Conventional video understanding models often rely on extensive, task-specific training on annotated datasets, a process that is both costly and limited in scalability. This paper introduces a novel, training-free framework for video understanding that circumvents end-to-end training by synergistically combining the rich semantic priors of pre-trained VLMs with classic machine learning algorithms for pattern discovery. Our core idea is to reframe video understanding as a self-supervised spatio-temporal clustering problem within a high-dimensional semantic feature space. The proposed pipeline first transforms a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal Segmentation (KTS), a robust machine learning technique, to partition the continuous feature stream into discrete, semantically coherent event segments. These segments are then subjected to unsupervised density-based clustering to identify recurring macroscopic scenes and themes throughout the video. By selecting representative keyframes from each discovered cluster and leveraging the VLM's generative capabilities for textual description, our framework automatically produces a structured, multi-modal summary of the video content. This approach provides an effective, interpretable, and model-agnostic pathway for zero-shot, automated structural analysis of video content.
中文标题/摘要
标题:Xiaoice:通过语义特征的自监督时空聚类实现无需训练的视频理解
大规模视觉语言模型(VLMs)在静态图像上的出色零样本推理能力尚未完全扩展到视频领域。传统视频理解模型通常依赖于大量特定任务的标注数据集进行训练,这一过程既昂贵又在可扩展性上有限。本文提出了一种无需训练的视频理解新框架,通过将预训练VLM丰富的语义先验与经典模式发现的机器学习算法相结合,绕过了端到端的训练。我们的核心思想是将视频理解重新定义为高维语义特征空间内的自监督时空聚类问题。所提出的流水线首先使用预训练VLM的冻结视觉编码器将视频流转换为语义特征轨迹。随后,我们采用内核时间分割(KTS)这一稳健的机器学习技术,将连续的特征流分割成离散的、语义上一致的事件段。这些段落随后接受无监督的基于密度的聚类,以识别视频中反复出现的宏观场景和主题。通过从每个发现的聚类中选择代表性的关键帧,并利用VLM的生成能力进行文本描述,我们的框架可以自动产生视频内容的结构化、多模态摘要。这种方法为视频内容的零样本、自动结构分析提供了一条有效、可解释且模型无关的途径。
Summary / 总结
The paper proposes a training-free framework for video understanding that combines the semantic priors of pre-trained Visual Language Models (VLMs) with self-supervised spatio-temporal clustering. The method transforms a video stream into a semantic feature trajectory using a frozen VLM visual encoder and applies Kernel Temporal Segmentation (KTS) to partition it into semantically coherent segments. These segments are then clustered with an unsupervised density-based method to identify recurring scenes and themes, and representative keyframes are described by the VLM to produce a structured, multi-modal summary. The authors are withdrawing the paper, however: the feature aggregation step inadvertently leaked temporal information across frames, which violates the core assumption of the training-free method.
该论文提出了一种无需训练的视频理解框架,将预训练视觉语言模型(VLM)的语义先验与自监督时空聚类相结合。方法使用冻结的VLM视觉编码器将视频流转换为语义特征轨迹,并使用核时序分割(KTS)将其分割成语义上连贯的片段。随后通过无监督的基于密度的聚类识别出重复出现的场景和主题,并由VLM对代表性关键帧进行描述,生成结构化的多模态摘要。然而,作者正在撤回该论文:特征聚合步骤无意中在帧间泄露了时间信息,违反了该无需训练方法的核心假设。
Causal-HalBench: Uncovering LVLMs Object Hallucinations Through Causal Intervention
Authors: Zhe Xu, Zhicai Wang, Junkang Wu, Jinda Lu, Xiang Wang
Venue: AAAI
First: 2025-11-13T12:53:03+00:00 · Latest: 2025-11-13T12:53:03+00:00
Comments: accepted for publication in the Association for the Advancement of Artificial Intelligence (AAAI), 2026
Abstract
Large Vision-Language Models (LVLMs) often suffer from object hallucination, making erroneous judgments about the presence of objects in images. We propose this primarily stems from spurious correlations arising when models strongly associate highly co-occurring objects during training, leading to hallucinated objects influenced by visual context. Current benchmarks mainly focus on hallucination detection but lack a formal characterization and quantitative evaluation of spurious correlations in LVLMs. To address this, we introduce causal analysis into the object recognition scenario of LVLMs, establishing a Structural Causal Model (SCM). Utilizing the language of causality, we formally define spurious correlations arising from co-occurrence bias. To quantify the influence induced by these spurious correlations, we develop Causal-HalBench, a benchmark specifically constructed with counterfactual samples and integrated with comprehensive causal metrics designed to assess model robustness against spurious correlations. Concurrently, we propose an extensible pipeline for the construction of these counterfactual samples, leveraging the capabilities of proprietary LVLMs and Text-to-Image (T2I) models for their generation. Our evaluations on mainstream LVLMs using Causal-HalBench demonstrate these models exhibit susceptibility to spurious correlations, albeit to varying extents.
中文标题/摘要
标题:Causal-HalBench:通过因果干预揭示LVLMs对象幻觉
大型视觉-语言模型(LVLMs)经常遭受对象幻觉的困扰,对图像中对象的存在做出错误判断。我们主要认为这是由于在训练过程中模型强烈关联高共现对象时产生的虚假相关性导致的,从而导致受视觉上下文影响的幻觉对象。当前的基准测试主要集中在幻觉检测,但缺乏对LVLMs中虚假相关性的正式表征和定量评估。为了解决这一问题,我们引入因果分析到LVLMs的对象识别场景中,建立了结构因果模型(SCM)。利用因果语言,我们正式定义了由共现偏差引起的虚假相关性。为了量化这些虚假相关性的影响,我们开发了Causal-HalBench,这是一个专门使用反事实样本构建并结合了全面的因果度量基准,旨在评估模型在虚假相关性方面的鲁棒性。同时,我们提出了一个可扩展的管道,用于这些反事实样本的构建,利用专有的LVLMs和文本到图像(T2I)模型生成它们。使用Causal-HalBench对主流LVLMs的评估表明,这些模型在不同程度上对虚假相关性表现出敏感性。
Summary / 总结
The research aims to address the issue of object hallucination in Large Vision-Language Models (LVLMs) by identifying spurious correlations during training. The study introduces a Structural Causal Model (SCM) and develops Causal-HalBench, a benchmark that evaluates models' robustness against spurious correlations through counterfactual samples. The findings show that mainstream LVLMs are susceptible to these correlations to varying degrees.
研究旨在通过识别训练中的虚假关联来解决大型视觉-语言模型(LVLM)中的物体幻觉问题。该研究引入了结构因果模型(SCM)和名为Causal-HalBench的基准,以量化这些关联的影响。关键发现表明,LVLMs对虚假关联具有易感性,不同模型的易感性程度不同。
Preconditioned Inexact Stochastic ADMM for Deep Model
Authors: Shenglong Zhou, Ouya Wang, Ziyan Luo, Yongxu Zhu, Geoffrey Ye Li
First: 2025-02-15T12:28:51+00:00 · Latest: 2025-11-13T12:49:21+00:00
Abstract
The recent advancement of foundation models (FMs) has brought about a paradigm shift, revolutionizing various sectors worldwide. The popular optimizers used to train these models are stochastic gradient descent-based algorithms, which face inherent limitations, such as slow convergence and stringent assumptions for convergence. In particular, data heterogeneity arising from distributed settings poses significant challenges to their theoretical and numerical performance. This paper develops an algorithm, PISA (Preconditioned Inexact Stochastic Alternating Direction Method of Multipliers). Grounded in rigorous theoretical guarantees, the algorithm converges under the sole assumption of Lipschitz continuity of the gradient on a bounded region, thereby removing the need for other conditions commonly imposed by stochastic methods. This capability enables the proposed algorithm to tackle the challenge of data heterogeneity effectively. Moreover, the algorithmic architecture enables scalable parallel computing and supports various preconditions, such as second-order information, second moment, and orthogonalized momentum by Newton-Schulz iterations. Incorporating the latter two preconditions in PISA yields two computationally efficient variants: SISA and NSISA. Comprehensive experimental evaluations for training or fine-tuning diverse deep models, including vision models, large language models, reinforcement learning models, generative adversarial networks, and recurrent neural networks, demonstrate superior numerical performance of SISA and NSISA compared to various state-of-the-art optimizers.
中文标题/摘要
标题:预条件不精确随机ADMM算法在深度模型中的应用
基础模型(FMs)的最新进展带来了范式的转变,革新了全球各个领域。用于训练这些模型的常用优化器是基于随机梯度下降的算法,这些算法存在固有的局限性,如收敛速度慢和对收敛的严格假设。特别是,分布式设置中出现的数据异质性对它们的理论和数值性能构成了重大挑战。本文开发了一种算法,PISA(预条件不精确随机交替方向乘子法)。该算法基于严格的理论保证,在梯度在有界区域内Lipschitz连续的唯一假设下收敛,从而消除了其他随机方法通常需要的其他条件。这种能力使所提出的算法能够有效应对数据异质性的挑战。此外,该算法的架构支持可扩展的并行计算,并支持各种预条件,如二阶信息、二阶矩和通过牛顿-舒尔兹迭代正交化动量。在PISA中结合后两种预条件产生了两种计算效率高的变体:SISA和NSISA。针对包括视觉模型、大型语言模型、强化学习模型、生成对抗网络和递归神经网络在内的多种深度模型的训练或微调进行全面的实验评估表明,SISA和NSISA在数值性能上优于各种最先进的优化器。
Summary / 总结
This paper addresses the limitations of traditional optimizers for training foundation models, particularly their slow convergence and sensitivity to data heterogeneity. It introduces PISA, a preconditioned inexact stochastic ADMM algorithm that converges under mild conditions and supports various preconditions. Experimental results show that SISA and NSISA, two efficient variants of PISA, outperform state-of-the-art optimizers in training and fine-tuning various deep models.
该论文针对传统优化器在训练基础模型时存在的收敛慢和对数据异质性敏感的问题,提出了一种预条件不精确随机交替方向乘子算法PISA,该算法在较宽松的条件下即可收敛,并支持多种预条件。实验结果表明,SISA和NSISA这两种PISA的高效变体在训练和微调各种深度模型时,相较于最先进的优化器具有更好的数值性能。
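The "orthogonalized momentum by Newton-Schulz iterations" mentioned in the abstract refers to approximating the orthogonal polar factor of a matrix using only matrix products. Below is a minimal sketch of the classical cubic Newton-Schulz iteration on a small matrix. It is not the NSISA update itself, whose coefficients and surrounding ADMM steps are not given in this digest, and the example `momentum` matrix is made up.

```python
def matmul(a, b):
    """Plain-Python matrix product for small dense matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g via X <- 1.5 X - 0.5 X X^T X."""
    # Scale by the Frobenius norm so all singular values lie in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x

momentum = [[2.0, 1.0], [0.0, 1.0]]  # illustrative full-rank momentum matrix
q = newton_schulz_orthogonalize(momentum)
```

The iteration drives every singular value toward 1 while leaving the singular vectors unchanged, so the result keeps the direction structure of the input but equalizes its scale. It uses only matrix multiplications and needs no explicit SVD.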
Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis
Authors: Jiulong Wu, Yucheng Shen, Lingyong Yan, Haixin Sun, Deguo Xia, Jizhou Huang, Min Cao
Venue: AAAI 2026
First: 2025-11-13T12:40:21+00:00 · Latest: 2025-11-13T12:40:21+00:00
Comments: This paper has been accepted by AAAI 2026. 16 pages, 3 figures, 10 tables
Abstract
Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.
中文标题/摘要
标题:Facial-R1:面部情感分析中推理与识别的对齐
面部情感分析(FEA)通过融入可解释的细粒度推理扩展了传统的面部情感识别。该任务整合了三个子任务:情感识别、面部动作单元(AU)识别和基于AU的情感推理,以联合建模情感状态。尽管最近的方法利用了视觉-语言模型(VLMs)并取得了令人鼓舞的结果,但它们面临两个关键限制:(1)幻觉推理,由于缺乏特定的情感知识,VLMs生成的解释看似合理但并不准确;(2)情感推理与识别之间的不一致,其原因是观察到的面部特征与最终标签之间的联系是割裂的。我们提出了Facial-R1,这是一种三阶段对齐框架,能够以最少的监督有效解决这两个问题。首先,我们采用指令微调来建立基本的情感推理能力。其次,我们引入以情感和AU标签作为奖励信号的强化训练,明确地将生成的推理过程与预测的情感对齐。第三,我们设计了一个数据合成管道,通过迭代利用先前阶段来扩展训练数据集,使模型能够实现可扩展的自我改进。基于此框架,我们引入了FEA-20K基准数据集,包含17,737个训练样本和1,688个测试样本,带有细粒度的情感分析注释。在八个标准基准上的广泛实验表明,Facial-R1在FEA中达到了最先进的性能,具有强大的泛化能力和稳健的可解释性。
Summary / 总结
Facial-R1 addresses the limitations of recent approaches in Facial Emotion Analysis by proposing a three-stage alignment framework. It employs instruction fine-tuning, reinforcement training, and a data synthesis pipeline to improve reasoning and recognition. Facial-R1 outperforms existing methods on eight standard benchmarks, showing strong generalization and interpretability. The framework introduces FEA-20K, a benchmark dataset for fine-grained emotion analysis with 17,737 training and 1,688 test samples.
Facial-R1 提出了一种三阶段对齐框架来解决面部情感分析中现有方法的局限性。该框架通过指令微调、强化训练和数据合成管道来增强推理和识别。Facial-R1 在八个标准基准上优于现有方法,展示了强大的泛化能力和可解释性。该框架还引入了包含 17,737 个训练样本和 1,688 个测试样本的 FEA-20K 基准数据集,用于细粒度的情感分析。
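The abstract says the reinforcement stage uses emotion and AU labels as reward signals but does not give the reward formula. The sketch below is one plausible shape for such a reward, not the authors' definition: the function names, the 0/1 emotion term, the F1 score over AU sets, and the `au_weight` mixing factor are all assumptions made for illustration.

```python
def au_f1(predicted_aus, gold_aus):
    """F1 overlap between predicted and annotated Action Unit sets."""
    predicted, gold = set(predicted_aus), set(gold_aus)
    if not predicted and not gold:
        return 1.0  # nothing to predict and nothing predicted
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def label_reward(pred_emotion, pred_aus, gold_emotion, gold_aus, au_weight=0.5):
    """Hypothetical reward: weighted mix of emotion correctness and AU overlap."""
    emotion_term = 1.0 if pred_emotion == gold_emotion else 0.0
    return (1.0 - au_weight) * emotion_term + au_weight * au_f1(pred_aus, gold_aus)


# Correct emotion, one spurious AU (25) next to the two annotated ones (6, 12).
example = label_reward("happiness", {6, 12, 25}, "happiness", {6, 12})
print(round(example, 3))  # 0.9
```

Because the AU term scores the units cited in the generated reasoning and the emotion term scores the final label, a reward of this kind penalizes explanations that do not match the observed facial evidence even when the label happens to be right.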
TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding
Authors: Jinxuan Li, Yi Zhang, Jian-Fang Hu, Chaolei Tan, Tianming Liang, Beihao Xia
First: 2025-11-13T12:15:23+00:00 · Latest: 2025-11-13T12:15:23+00:00
Abstract
Spatio-Temporal Video Grounding (STVG) aims to localize a spatio-temporal tube that corresponds to a given language query in an untrimmed video. This is a challenging task since it involves complex vision-language understanding and spatio-temporal reasoning. Recent works have explored the weakly-supervised setting in STVG to eliminate reliance on fine-grained annotations like bounding boxes or temporal stamps. However, they typically follow a simple late-fusion manner, which generates tubes independent of the text description, often resulting in failed target identification and inconsistent target tracking. To address this limitation, we propose a Tube-conditioned Reconstruction with Mutual Constraints (TubeRMC) framework that generates text-conditioned candidate tubes with pre-trained visual grounding models and further refines them via tube-conditioned reconstruction with spatio-temporal constraints. Specifically, we design three reconstruction strategies from temporal, spatial, and spatio-temporal perspectives to comprehensively capture rich tube-text correspondences. Each strategy is equipped with a Tube-conditioned Reconstructor, utilizing spatio-temporal tubes as the condition to reconstruct the key clues in the query. We further introduce mutual constraints between spatial and temporal proposals to enhance their quality for reconstruction. TubeRMC outperforms existing methods on two public benchmarks, VidSTG and HCSTVG. Further visualization shows that TubeRMC effectively mitigates both target identification errors and inconsistent tracking.
中文标题/摘要
标题:TubeRMC:基于管条件重构与相互约束的弱监督时空视频定位
时空视频定位(STVG)旨在在一个未剪辑的视频中定位与给定语言查询对应的时空管。这是一个具有挑战性的任务,因为它涉及复杂的视觉-语言理解和时空推理。最近的研究在STVG中探索了弱监督设置,以消除对细粒度注释(如边界框或时间戳)的依赖。然而,它们通常遵循简单的后期融合方式,生成与文本描述无关的管,经常导致目标识别失败和不一致的目标跟踪。为了解决这一局限性,我们提出了一种基于管条件重构与相互约束(Tube-conditioned Reconstruction with Mutual Constraints,TubeRMC)的框架,该框架使用预训练的视觉定位模型生成文本条件下的候选管,并通过时空约束下的管条件重构进一步优化它们。具体而言,我们从时间、空间和时空三个角度设计了三种重构策略,以全面捕捉丰富的管-文本对应关系。每种策略都配备了基于管条件的重构器,利用时空管作为条件来重构查询中的关键线索。我们还引入了空间和时间提案之间的相互约束,以提高它们用于重构的质量。TubeRMC在两个公开基准VidSTG和HCSTVG上优于现有方法。进一步的可视化显示,TubeRMC有效地缓解了目标识别错误和不一致跟踪的问题。
Summary / 总结
The research aims to improve spatio-temporal video grounding by addressing the limitations of weakly-supervised methods that often generate tubes independent of text descriptions. The proposed TubeRMC framework uses a pre-trained visual grounding model to generate text-conditioned candidate tubes and refines them through tube-conditioned reconstruction with spatio-temporal constraints. This method introduces three reconstruction strategies and mutual constraints to enhance tube-text correspondences, leading to better target identification and tracking. Experimental results show that TubeRMC outperforms existing methods on VidSTG and HCSTVG benchmarks.
研究旨在通过解决弱监督方法生成的管状体与文本描述独立的问题,改进时空视频定位。提出的TubeRMC框架使用预训练的视觉定位模型生成文本条件下的候选管状体,并在时空约束下进行管状体条件下的重构。设计了三种重构策略以全面捕捉管状体与文本的对应关系,并引入了时空提案之间的相互约束以提高重构质量。在VidSTG和HCSTVG基准上的实验表明,TubeRMC优于现有方法,并有效缓解了目标识别错误和不一致跟踪的问题。
Remodeling Semantic Relationships in Vision-Language Fine-Tuning
Authors: Xiangyang Wu, Liu Liu, Baosheng Yu, Jiayan Qiu, Zhenwei Shi
First: 2025-11-11T13:37:13+00:00 · Latest: 2025-11-13T12:01:57+00:00
Abstract
Vision-language fine-tuning has emerged as an efficient paradigm for constructing multimodal foundation models. While textual context often highlights semantic relationships within an image, existing fine-tuning methods typically overlook this information when aligning vision and language, thus leading to suboptimal performance. Toward solving this problem, we propose a method that can improve multimodal alignment and fusion based on both semantics and relationships. Specifically, we first extract multilevel semantic features from different vision encoders to capture more visual cues of the relationships. Then, we learn to project the vision features to group related semantics, among which relationships are more likely to exist. Finally, we fuse the visual features with the textual features by using inheritable cross-attention, where we globally remove the redundant visual relationships by discarding visual-language feature pairs with low correlation. We evaluate our proposed method on eight foundation models and two downstream tasks, visual question answering and image captioning, and show that it outperforms all existing methods.
中文标题/摘要
标题:视觉语言微调中语义关系的重塑
视觉语言微调已成为构建多模态基础模型的有效范式。尽管文本上下文通常强调图像中的语义关系,但现有的微调方法通常在对齐视觉和语言时忽略了这些信息,从而导致性能不佳。为了解决这一问题,我们提出了一种方法,该方法可以在语义和关系的基础上改进多模态对齐和融合。具体而言,我们首先从不同的视觉编码器中提取多层次的语义特征,以捕获更多关于关系的视觉线索。然后,我们学习将视觉特征投影到相关语义组中,其中更有可能存在关系。最后,我们通过使用可继承的跨注意力机制将视觉特征与文本融合,其中我们通过丢弃低相关性的视觉-语言特征对来全局去除冗余的视觉关系。我们在八个基础模型和两个下游任务(视觉问答和图像字幕)上评估了我们提出的方法,并表明它优于所有现有方法。
Summary / 总结
This paper addresses the issue of suboptimal performance in vision-language fine-tuning due to the neglect of semantic relationships in existing methods. The authors propose a method that enhances multimodal alignment and fusion by extracting multilevel semantic features and learning to project vision features to group related semantics. They then use inheritable cross-attention to fuse visual and textual features, removing redundant visual relationships with low correlation. Experiments on eight foundation models and two downstream tasks show improved performance over existing methods.
研究旨在通过利用图像中的语义关系来提升视觉-语言微调。方法提取多层级语义特征并将视觉特征投影到相关语义组中,然后使用可继承的交叉注意力融合视觉和文本特征,同时去除冗余的视觉关系。实验表明,在视觉问答和图像字幕两个下游任务上,该方法在八个基础模型中均优于现有技术。
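The last step of the method, discarding visual-language feature pairs with low correlation, can be illustrated with a small sketch. The abstract does not say how correlation is measured or how the cut-off is chosen, so the cosine similarity, the fixed `threshold`, and the function names below are assumptions rather than the paper's procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def keep_correlated_pairs(visual_feats, text_feats, threshold):
    """Return (visual_index, text_index) pairs whose similarity reaches the threshold."""
    kept = []
    for i, v in enumerate(visual_feats):
        for j, t in enumerate(text_feats):
            if cosine(v, t) >= threshold:
                kept.append((i, j))
    return kept


visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0]]
print(keep_correlated_pairs(visual, text, threshold=0.5))  # [(0, 0), (2, 0)]
```

In this toy input the second visual feature is orthogonal to the text feature, so its pair is dropped before fusion; the other two pairs survive.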
An Efficient Training Pipeline for Reasoning Graphical User Interface Agents
Authors: Georgios Pantazopoulos, Eda B. Özyiğit
First: 2025-11-11T12:36:04+00:00 · Latest: 2025-11-13T11:27:31+00:00
Abstract
Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, then removing misaligned instances, and finally selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baselines on benchmarks such as ScreenSpot, Multimodal-Mind2Web, and AndroidControl. These results demonstrate that principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.
中文标题/摘要
标题:一种高效的推理图形用户界面代理训练管道
视觉定位是将自然语言查询中的图像区域进行定位的任务,对于具备推理能力的图形用户界面代理至关重要。许多现有方法依赖于大量嘈杂的合成数据集。本研究引入了一种高效的训练管道,结合基于模型的数据过滤与参数高效的微调。从480万合成示例中,通过首先识别具有挑战性的案例、移除对齐错误,然后选择多样化的多模态实例,筛选出1.2万干净且多样的实例。在此数据上,使用三种模式进行训练:监督微调、带有链式思考增强的微调以及通过组相对策略优化的强化学习。使用筛选数据和轻量级训练策略训练的模型在ScreenSpot、Multimodal-Mind2Web和AndroidControl等基准测试中与更大规模的基线模型相当或超越。这些结果表明,合理的数据筛选和稳健的适应可以与大规模训练相媲美,从而实现紧凑但功能强大的多模态推理代理。
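The third training regime, Group Relative Policy Optimization (GRPO), scores each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that group-relative advantage, in its commonly used mean/standard-deviation form, is shown below. The reward values in the example are hypothetical (for instance, 1 if a predicted click lands inside the target element and 0 otherwise); the digest does not state which reward the authors use.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, two of them rewarded.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

When all responses in a group receive the same reward, every advantage is zero, so that group contributes no policy gradient signal.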
LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization
Authors: Ronghuan Wu, Wanchao Su, Jing Liao
First: 2025-05-29T17:58:03+00:00 · Latest: 2025-11-13T10:26:57+00:00
Comments: Project Page: https://layerpeeler.github.io/
Abstract
Image vectorization is a powerful technique that converts raster images into vector graphics, enabling enhanced flexibility and interactivity. However, popular image vectorization tools struggle with occluded regions, producing incomplete or fragmented shapes that hinder editability. While recent advancements have explored optimization-based and learning-based layer-wise image vectorization, these methods face limitations in vectorization quality and flexibility. In this paper, we introduce LayerPeeler, a novel layer-wise image vectorization approach that addresses these challenges through a progressive simplification paradigm. The key to LayerPeeler's success lies in its autoregressive peeling strategy: by identifying and removing the topmost non-occluded layers while recovering underlying content, we generate vector graphics with complete paths and coherent layer structures. Our method leverages vision-language models to construct a layer graph that captures occlusion relationships among elements, enabling precise detection and description for non-occluded layers. These descriptive captions are used as editing instructions for a finetuned image diffusion model to remove the identified layers. To ensure accurate removal, we employ localized attention control that precisely guides the model to target regions while faithfully preserving the surrounding content. To support this, we contribute a large-scale dataset specifically designed for layer peeling tasks. Extensive quantitative and qualitative experiments demonstrate that LayerPeeler significantly outperforms existing techniques, producing vectorization results with superior path semantics, geometric regularity, and visual fidelity.
中文标题/摘要
标题:LayerPeeler:自回归去层的逐层图像矢量化
图像矢量化是一种强大的技术,能够将位图图像转换为矢量图形,从而增强灵活性和交互性。然而,流行的图像矢量化工具在处理遮挡区域时存在困难,导致不完整或碎片化的形状,影响编辑性。虽然最近的研究探索了基于优化和基于学习的逐层图像矢量化方法,但这些方法在矢量化质量和灵活性方面存在局限性。在本文中,我们介绍了一种名为LayerPeeler的新颖逐层图像矢量化方法,通过渐进简化范式来应对这些挑战。LayerPeeler的成功关键在于其自回归去层策略:通过识别并移除最顶层的非遮挡层并恢复底层内容,我们生成了具有完整路径和连贯层结构的矢量图形。我们的方法利用视觉-语言模型构建一个层图,捕捉元素之间的遮挡关系,从而实现精确的检测和描述非遮挡层。这些描述性标题被用作微调图像扩散模型的编辑指令,以移除识别的层。为了确保准确移除,我们采用局部注意力控制,精确引导模型对目标区域进行操作,同时忠实保留周围内容。为此,我们贡献了一个专门用于层剥离任务的大规模数据集。广泛的定量和定性实验表明,LayerPeeler显著优于现有技术,生成具有更优路径语义、几何规则性和视觉保真的矢量化结果。
Summary / 总结
LayerPeeler is a novel layer-wise image vectorization method that addresses the limitations of existing tools by using an autoregressive peeling strategy. It constructs a layer graph to identify and remove non-occluded layers while preserving underlying content, resulting in complete and coherent vector graphics. Experiments show that LayerPeeler outperforms existing techniques in terms of path semantics, geometric regularity, and visual fidelity.
LayerPeeler 是一种新型的分层图像矢量化方法,通过使用自回归剥离策略来解决现有技术的局限性。它构建了一层图来识别并移除非遮挡层,同时保留底层内容,从而生成具有完整路径和一致结构的矢量图形。实验表明,LayerPeeler 在路径语义、几何规律性和视觉保真度方面优于现有方法。
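The peeling order implied by a layer graph can be sketched independently of the models involved. The code below only covers the ordering logic: repeatedly take the layers that nothing else occludes, remove them, and continue with what remains. The dictionary representation and the function name are assumptions for illustration; in the paper, layer detection is done by a vision-language model and removal by a finetuned diffusion model, neither of which is modeled here.

```python
def peel_order(occluded_by):
    """Group layers into peeling rounds, topmost first.

    occluded_by maps each layer name to the set of layers drawn on top of it.
    Every layer must appear as a key, including layers that occlude others.
    """
    remaining = {layer: set(above) for layer, above in occluded_by.items()}
    rounds = []
    while remaining:
        # Layers with no remaining occluder are fully visible and can be peeled.
        top = sorted(layer for layer, above in remaining.items() if not above)
        if not top:
            raise ValueError("no non-occluded layer left; the graph has a cycle")
        rounds.append(top)
        for layer in top:
            del remaining[layer]
        for above in remaining.values():
            above.difference_update(top)
    return rounds


graph = {
    "background": {"body", "hat"},
    "body": {"hat"},
    "hat": set(),
    "shadow": set(),
}
print(peel_order(graph))  # [['hat', 'shadow'], ['body'], ['background']]
```

Each round corresponds to one autoregressive step: the layers in it are removed together, and the content underneath becomes the input for the next step.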
Intelligence Foundation Model: A New Perspective to Approach Artificial General Intelligence
Authors: Borui Cai, Yao Zhao
First: 2025-11-13T09:28:41+00:00 · Latest: 2025-11-13T09:28:41+00:00
Abstract
We propose a new perspective for approaching artificial general intelligence (AGI) through an intelligence foundation model (IFM). Unlike existing foundation models (FMs), which specialize in pattern learning within specific domains such as language, vision, or time series, IFM aims to acquire the underlying mechanisms of intelligence by learning directly from diverse intelligent behaviors. Vision, language, and other cognitive abilities are manifestations of intelligent behavior; learning from this broad range of behaviors enables the system to internalize the general principles of intelligence. Based on the fact that intelligent behaviors emerge from the collective dynamics of biological neural systems, IFM consists of two core components: a novel network architecture, termed the state neural network, which captures neuron-like dynamic processes, and a new learning objective, neuron output prediction, which trains the system to predict neuronal outputs from collective dynamics. The state neural network emulates the temporal dynamics of biological neurons, allowing the system to store, integrate, and process information over time, while the neuron output prediction objective provides a unified computational principle for learning these structural dynamics from intelligent behaviors. Together, these innovations establish a biologically grounded and computationally scalable foundation for building systems capable of generalization, reasoning, and adaptive learning across domains, representing a step toward true AGI.
中文标题/摘要
标题:智能基础模型:接近人工通用智能的新视角
我们提出了一种通过智能基础模型(IFM)来接近人工通用智能(AGI)的新视角。与现有的专注于特定领域如语言、视觉或时间序列中的模式学习的基础模型(FM)不同,IFM旨在通过直接从多样化的智能行为中学习来获取智能的基本机制。视觉、语言和其他认知能力是智能行为的表现形式;从这一广泛的行为中学习使系统能够内化智能的一般原则。基于智能行为源自生物神经系统的集体动力学这一事实,IFM由两个核心组件组成:一种称为状态神经网络的新型网络架构,用于捕捉神经元样动态过程,以及一种新的学习目标——神经元输出预测,用于训练系统从集体动力学中预测神经元输出。状态神经网络模拟了生物神经元的时间动态,使系统能够存储、整合和处理信息,而神经元输出预测目标则提供了一个统一的计算原则,用于从智能行为中学习这些结构动力学。这些创新共同建立了一个基于生物学和计算上可扩展的基础,用于构建能够在不同领域进行泛化、推理和适应性学习的系统,代表着向真正AGI迈出的一步。
Summary / 总结
The paper introduces an intelligence foundation model (IFM) aimed at approaching artificial general intelligence (AGI) by learning from diverse intelligent behaviors rather than specializing in specific domains. IFM consists of a state neural network that captures neuron-like dynamics and a neuron output prediction learning objective. The authors argue that this biologically grounded design lets the system internalize general principles of intelligence and supports generalization, reasoning, and adaptive learning across domains, representing a step toward AGI.
论文提出了一种智能基础模型(IFM),旨在通过学习多样化的智能行为来实现人工通用智能(AGI),而不是专注于特定领域。IFM 包含一个能够捕捉神经元动态过程的状态神经网络和一个神经元输出预测的学习目标。这种方法使系统能够内化智能的基本原理,并在跨领域执行任务,代表了向真正AGI迈出的一步。关键发现包括系统能够存储、整合和处理时间信息,以及从智能行为中学习的统一计算原理。
MTAttack: Multi-Target Backdoor Attacks against Large Vision-Language Models
Authors: Zihan Wang, Guansong Pang, Wenjun Miao, Jin Zheng, Xiao Bai
First: 2025-11-13T09:00:21+00:00 · Latest: 2025-11-13T09:00:21+00:00
Comments: AAAI2026, with supplementary material
Abstract
Recent advances in Large Visual Language Models (LVLMs) have demonstrated impressive performance across various vision-language tasks by leveraging large-scale image-text pretraining and instruction tuning. However, the security vulnerabilities of LVLMs have become increasingly concerning, particularly their susceptibility to backdoor attacks. Existing backdoor attacks focus on single-target attacks, i.e., targeting a single malicious output associated with a specific trigger. In this work, we uncover multi-target backdoor attacks, where multiple independent triggers corresponding to different attack targets are added in a single pass of training, posing a greater threat to LVLMs in real-world applications. Executing such attacks in LVLMs is challenging since there can be many incorrect trigger-target mappings due to severe feature interference among different triggers. To address this challenge, we propose MTAttack, the first multi-target backdoor attack framework for enforcing accurate multiple trigger-target mappings in LVLMs. The core of MTAttack is a novel optimization method with two constraints, namely Proxy Space Partitioning constraint and Trigger Prototype Anchoring constraint. It jointly optimizes multiple triggers in the latent space, with each trigger independently mapping clean images to a unique proxy class while at the same time guaranteeing their separability. Experiments on popular benchmarks demonstrate a high success rate of MTAttack for multi-target attacks, substantially outperforming existing attack methods. Furthermore, our attack exhibits strong generalizability across datasets and robustness against backdoor defense strategies. These findings highlight the vulnerability of LVLMs to multi-target backdoor attacks and underscore the urgent need for mitigating such threats. Code is available at https://github.com/mala-lab/MTAttack.
中文标题/摘要
标题:MTAttack:针对大型视觉语言模型的多目标后门攻击
大型视觉语言模型(LVLMs)的最新进展通过大规模图像-文本预训练和指令调优,在各种视觉语言任务中展示了令人印象深刻的性能。然而,LVLMs的安全漏洞日益引起关注,特别是它们对后门攻击的易感性。现有的后门攻击主要集中在单目标攻击上,即针对特定触发器关联的单一恶意输出。在本文中,我们揭示了多目标后门攻击,其中在一次训练过程中添加了多个独立的触发器,对应不同的攻击目标,这在实际应用中对LVLMs构成了更大的威胁。在LVLMs中执行此类攻击具有挑战性,因为不同触发器之间可能存在严重的特征干扰,导致许多不正确的触发器-目标映射。为了解决这一挑战,我们提出了MTAttack,这是第一个用于在LVLMs中强制执行准确的多触发器-目标映射的多目标后门攻击框架。MTAttack的核心是一种新颖的优化方法,包含两个约束,即代理空间分区约束和触发器原型锚定约束。该方法在潜在空间中联合优化多个触发器,每个触发器独立地将干净图像映射到唯一的代理类别,同时保证它们的可分性。在流行的基准测试上的实验表明,MTAttack在多目标攻击中的成功率很高,显著优于现有攻击方法。此外,我们的攻击在不同数据集上表现出很强的泛化能力和对后门防御策略的鲁棒性。这些发现突显了LVLMs对多目标后门攻击的脆弱性,并强调了缓解此类威胁的紧迫性。代码可在https://github.com/mala-lab/MTAttack/获取。
Summary / 总结
MTAttack is a novel multi-target backdoor attack framework for Large Visual Language Models (LVLMs) that addresses the challenge of enforcing accurate multiple trigger-target mappings. It uses a novel optimization method with two constraints to jointly optimize multiple triggers in the latent space, ensuring each trigger maps clean images to a unique proxy class while maintaining separability. Experiments show MTAttack achieves high success rates in multi-target attacks, outperforming existing methods and demonstrating strong generalizability and robustness against defenses.
MTAttack 是一种针对大型视觉语言模型(LVLM)的多目标后门攻击框架,解决了不同触发器之间严重特征干扰的挑战。它引入了一种两约束优化方法,以确保多个触发器目标映射的准确性。实验表明,MTAttack 在多目标攻击中的成功率很高,并且具有很强的跨数据集的一般化能力和对防御策略的鲁棒性,突显了 LVLM 对此类攻击的脆弱性。
GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs
Authors: Yuxiang Duan, Ao Li, Yingqin Li, Luyu Li, Pengwei Wang
First: 2025-11-13T08:35:39+00:00 · Latest: 2025-11-13T08:35:39+00:00
Abstract
Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to ("where to look") before deciding which specific elements within those regions to process in detail ("what to select"). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing "what to select", typically using attention scores or similarity metrics. They rarely consider "where to look", which has been shown to lead to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens. In this paper, we propose GridPrune, a method that replaces the global Top-K mechanism with a "guide-globally, select-locally" zonal selection system. GridPrune splits the pruning process into two steps: first, it uses text-conditional guidance to dynamically allocate a token budget across spatial zones; and then, it performs local selection within each budgeted zone. Experimental results demonstrate that GridPrune achieves superior performance across various MLLM architectures. On LLaVA-NeXT-7B, GridPrune retains 96.98% of the full performance while using 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.
中文标题/摘要
标题:GridPrune:从“看哪里”到“选什么”在视觉标记剪枝中的应用
多模态大型语言模型(MLLMs)在视觉语言任务中展现了显著的能力。然而,大量的视觉标记引入了显著的计算开销。为了解决这一问题,视觉标记剪枝已成为提高MLLMs效率的关键技术。在认知科学中,人类倾向于首先确定场景中需要关注的区域(“看哪里”),然后再决定在这些区域中需要详细处理的具体元素(“选什么”)。这种两阶段策略使视觉系统能够在粗略的空间层次上高效地分配注意力,然后再进行精细的选择。然而,现有的剪枝方法主要集中在直接优化“选什么”,通常使用注意力分数或相似度度量。它们很少考虑“看哪里”,这已被证明会导致空间分配效率低下、位置偏见以及保留无关或冗余的标记。在本文中,我们提出了一种名为GridPrune的方法,该方法用“全局引导,局部选择”的区域选择系统取代了全局Top-K机制。GridPrune将剪枝过程分为两步:首先,使用文本条件引导动态分配空间区域的标记预算;然后,在每个预算区域中进行局部选择。实验结果表明,GridPrune在各种MLLM架构中均表现出优越的性能。在LLaVA-NeXT-7B上,GridPrune保留了96.98%的全性能,同时仅使用11.1%的标记,与最佳基线相比,在相同的剪枝率下性能高出2.34%。
Summary / 总结
GridPrune is a method designed to improve the efficiency of multimodal large language models (MLLMs) by addressing the computational overhead caused by visual tokens. Inspired by human visual attention, GridPrune introduces a two-step pruning process: first, it uses text-conditional guidance to allocate a token budget across spatial zones, and then it performs local selection within each zone. This approach outperforms existing methods, achieving 96.98% of full performance with only 11.1% of the tokens on LLaVA-NeXT-7B, surpassing the best baseline by 2.34% at the same pruning rate.
GridPrune通过引入两阶段选择过程来解决MLLM中视觉标记剪枝的效率问题,该过程借鉴了人类视觉注意力机制。首先使用文本条件指导动态分配空间区域的标记预算,然后在每个预算区域内进行局部选择。该方法在LLaVA-NeXT-7B上实现了96.98%的全性能,仅使用11.1%的标记,优于最佳基线2.34%,在相同的剪枝率下表现出色。
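The two steps described above, allocating a token budget across spatial zones and then selecting locally within each zone, can be sketched as follows. This is a simplified illustration under stated assumptions, not the GridPrune implementation: the proportional allocation rule, the largest-remainder rounding, and the names `allocate_budget` and `zonal_prune` are hypothetical, and the zone relevance scores stand in for whatever text-conditional guidance the paper computes.

```python
def allocate_budget(zone_scores, total_budget):
    """Split total_budget across zones in proportion to non-negative zone scores."""
    total = sum(zone_scores)
    if total <= 0:
        shares = [1.0 / len(zone_scores)] * len(zone_scores)
    else:
        shares = [s / total for s in zone_scores]
    raw = [share * total_budget for share in shares]
    budgets = [int(r) for r in raw]
    # Hand leftover tokens to the zones with the largest fractional remainder.
    leftover = total_budget - sum(budgets)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def zonal_prune(token_scores, zone_of, zone_scores, total_budget):
    """Keep the best-scoring tokens inside each zone, up to that zone's budget."""
    budgets = allocate_budget(zone_scores, total_budget)
    kept = []
    for zone, budget in enumerate(budgets):
        members = [i for i, z in enumerate(zone_of) if z == zone]
        members.sort(key=lambda i: token_scores[i], reverse=True)
        kept.extend(members[:budget])  # a zone smaller than its budget keeps all tokens
    return sorted(kept)


scores = [0.9, 0.8, 0.1, 0.2, 0.7, 0.3, 0.6, 0.5]
zones = [0, 0, 0, 0, 1, 1, 1, 1]
print(zonal_prune(scores, zones, zone_scores=[1.0, 3.0], total_budget=4))  # [0, 4, 6, 7]
```

A global Top-K over the same scores would keep tokens 0, 1, 4 and 6. The zonal version instead shifts one slot from zone 0 to zone 1 because the guidance rates zone 1 as three times more relevant.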
VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System
Authors: Gwangyeon Ahn, Jiwan Seo, Joonhyuk Kang
Venue: NeurIPS 2025
First: 2025-11-13T08:29:32+00:00 · Latest: 2025-11-13T08:29:32+00:00
Comments: To appear in the AI4NextG Workshop at NeurIPS 2025
Abstract
We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.
中文标题/摘要
标题:VLF-MSC:基于视觉语言特征的多模态语义通信系统
我们提出了一种基于视觉语言特征的多模态语义通信(VLF-MSC)系统,该系统通过无线信道传输单一紧凑的视觉语言表示,以支持接收端的图像和文本生成。与现有技术分别处理每种模态不同,VLF-MSC 使用预训练的视觉语言模型(VLM)将源图像编码为视觉语言语义特征(VLF),并通过无线信道传输。在接收端,基于解码器的语言模型和基于扩散的图像生成器都以VLF为条件,生成描述性文本和语义对齐的图像。这种统一表示消除了对特定模态流或重传的需求,提高了频谱效率和适应性。通过利用基础模型,该系统在保持语义保真度的同时,对信道噪声具有鲁棒性。实验表明,VLF-MSC 在低信噪比下优于仅文本和仅图像基线,显著减少带宽的同时实现了更高的语义准确性。
Summary / 总结
VLF-MSC is a unified system that uses a pre-trained vision-language model to encode an image into a compact vision-language semantic feature (VLF) for transmission, which is then decoded by a language model and an image generator to produce text and images. This approach improves spectral efficiency and semantic accuracy, especially under low signal-to-noise ratio conditions, outperforming text-only and image-only baselines with reduced bandwidth requirements.
VLF-MSC 是一种统一系统,使用预训练的视觉-语言模型将图像编码为紧凑的视觉-语言语义特征,然后传输到接收端。在这里,解码器语言模型和扩散图像生成器都基于此特征生成文本和图像。这种方法与模态特定的方法相比,提高了频谱效率和适应性。实验表明,VLF-MSC 在低信噪比下比纯文本和纯图像基线表现出更高的语义准确性,并且带宽显著减少。
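To make the "low SNR" setting concrete, the sketch below passes a feature vector through an additive white Gaussian noise channel at a chosen signal-to-noise ratio. The abstract does not specify the channel model, so the AWGN choice, the real-valued feature, and the function name `awgn_channel` are assumptions for illustration only.

```python
import math
import random


def awgn_channel(feature, snr_db, rng):
    """Add Gaussian noise so that signal power / noise power equals the given SNR in dB.

    feature must be a non-empty list of floats; rng is a random.Random instance.
    """
    signal_power = sum(x * x for x in feature) / len(feature)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in feature]


rng = random.Random(0)
vlf = [0.5, -1.2, 0.3, 0.9]          # stand-in for a transmitted feature vector
received = awgn_channel(vlf, snr_db=0.0, rng=rng)
print(len(received))  # 4
```

At 0 dB the noise carries as much power as the feature itself, which is the regime where the receiver's language model and image generator have to recover meaning from a heavily corrupted representation.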