arXiv 论文速递

2025-12-02 03:35
Snapshot: 20251202_0335
Visual Generation Tuning
Authors: Jiahao Guo, Sinan Du, Jingfeng Yao, Wenyu Liu, Bo Li, Haoxiang Cao, Kun Gai, Chun Yuan, Kai Wu, Xinggang Wang
First: 2025-11-28T18:57:13+00:00 · Latest: 2025-11-28T18:57:13+00:00
Abstract
Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying capabilities of visual generation within any vision language model. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate the alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dismiss the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE through aligning the semantic encoders from pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art outcomes among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, our proposed VGT showcases significant scaling promise and is versatile for endowing any VLM trained for multimodal understanding with the capabilities of visual generation, paving a new avenue for exploring next-generation unified multimodal foundation models. Models and code are available at https://github.com/hustvl/VGT.
中文标题/摘要
标题:视觉生成调优
大规模视觉语言模型(VLMs)通过广泛的预训练有效地弥合了模态差距,获得了与语言对齐的复杂视觉表示。然而,尚未充分探索这些优化用于多模态理解任务的表示是否内在地具有视觉生成的潜力。在本文中,我们提出了VGT(视觉生成调优),这是一种新型范式,旨在激发视觉生成的潜在能力在任何视觉语言模型中。通过对充分预训练的VLMs进行高效的视觉生成调优,我们显著降低了对齐成本并加速了自回归建模在连续空间中的收敛(20倍加速)。具体而言,我们摒弃了为扩散变换器设计的纠缠像素级VAEs,并通过将预训练VLMs的语义编码器与像素解码器的潜在表示对齐来形成VGT-AE。在图像重建任务中,我们实现了26.67 PSNR和0.50 rFID,在28倍压缩比下优于专门的VAEs;在视觉生成任务中,我们在自回归模型中实现了最先进的结果,GenEval上为0.77,DPG-Bench上为78.73。此外,我们提出的VGT展示了显著的扩展潜力,并且可以赋予任何用于多模态理解训练的VLMs视觉生成的能力,这开辟了探索下一代统一多模态基础模型的新途径。模型和代码可在https://github.com/hustvl/VGT/获取。
Summary / 总结
This paper introduces VGT (Visual Generation Tuning), a paradigm for eliciting visual generation capabilities from pretrained vision language models (VLMs). Its autoencoder, VGT-AE, aligns the semantic encoders of pretrained VLMs with the latent representations of pixel decoders, which lowers alignment costs and speeds up the convergence of autoregressive modeling in continuous space by 20x. The method achieves 26.67 PSNR and 0.50 rFID at a 28x compression ratio in image reconstruction, and state-of-the-art results among autoregressive models in visual generation, with 0.77 on GenEval and 78.73 on DPG-Bench. The authors also report promising scaling behavior and present VGT as applicable to any VLM trained for multimodal understanding.
本文提出了一种名为VGT(Visual Generation Tuning)的新方法,通过将预训练的Vision Language Models(VLMs)的语义编码器与像素解码器对齐来增强其视觉生成能力。VGT显著提高了自回归建模的速度和性能,在图像重建任务中实现了26.67 PSNR和0.50 rFID,在视觉生成任务中取得了0.77的GenEval和78.73的DPG-Bench的最优结果。该方法为VLMs赋予视觉生成能力提供了可扩展的解决方案,为下一代统一多模态基础模型开辟了新的途径。
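The reconstruction metric quoted in the VGT entry above, PSNR, is derived from the mean squared error between an image and its reconstruction: PSNR = 10 * log10(MAX^2 / MSE). The sketch below is a generic illustration in plain Python, assuming 8-bit pixel values (MAX = 255) stored as flat lists. The function name `psnr` is hypothetical, and this is not the evaluation code used by the VGT authors.

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumption: pixels are plain numbers in [0, max_value]; 255 matches 8-bit images.
    """
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a closer reconstruction; an error of one gray level per pixel on 8-bit data corresponds to roughly 48 dB.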
LFM2 Technical Report
Authors: Alexander Amini, Anna Banaszak, Harold Benoit, Arthur Böök, Tarek Dakhran, Song Duong, Alfred Eng, Fernando Fernandes, Marc Härkönen, Anne Harrington, Ramin Hasani, Saniya Karwa, Yuri Khrustalev, Maxime Labonne, Mathias Lechner, Valentine Lechner, Simon Lee, Zetian Li, Noel Loo, Jacob Marks, Edoardo Mosca, Samuel J. Paech, Paul Pak, Rom N. Parnichkun, Alex Quach, Ryan Rogers, Daniela Rus, Nayan Saxena, Bettina Schlager, Tim Seyde, Jimmy T. H. Smith, Aditya Tadimeti, Neehal Tumma
First: 2025-11-28T17:56:35+00:00 · Latest: 2025-11-28T17:56:35+00:00
Abstract
We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. Using hardware-in-the-loop architecture search under edge latency and memory constraints, we obtain a compact hybrid backbone that combines gated short convolutions with a small number of grouped query attention blocks, delivering up to 2x faster prefill and decode on CPUs compared to similarly sized models. The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active), all with 32K context length. LFM2's training pipeline includes a tempered, decoupled Top-K knowledge distillation objective that avoids support mismatch; curriculum learning with difficulty-ordered data; and a three-stage post-training recipe of supervised fine-tuning, length-normalized preference optimization, and model merging. Pre-trained on 10-12T tokens, LFM2 models achieve strong results across diverse benchmarks; for example, LFM2-2.6B reaches 79.56% on IFEval and 82.41% on GSM8K. We further build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval. LFM2-VL supports tunable accuracy-latency tradeoffs via token-efficient visual processing, while LFM2-Audio separates audio input and output pathways to enable real-time speech-to-speech interaction competitive with models 3x larger. LFM2-ColBERT provides a low-latency encoder for queries and documents, enabling high-performance retrieval across multiple languages. All models are released with open weights and deployment packages for ExecuTorch, llama.cpp, and vLLM, making LFM2 a practical base for edge applications that need fast, memory-efficient inference and strong task capabilities.
中文标题/摘要
标题:LFM2 技术报告
我们提出了LFM2,一种为高效设备端部署和强大任务能力设计的液态基础模型家族。通过在边缘延迟和内存约束下使用硬件在环的架构搜索,我们获得了一个紧凑的混合骨干,结合了门控短卷积和少量分组查询注意模块,相比同等规模的模型,CPU上的预填充和解码速度可提高2倍。LFM2家族包括350M-8.3B参数,包括密集模型(350M,700M,1.2B,2.6B)和专家混合变体(总计8.3B,活跃1.5B),所有模型的上下文长度均为32K。LFM2的训练管道包括一个温和的、解耦的Top-K知识蒸馏目标,避免支持不匹配;难度排序的数据的课程学习;以及三个阶段的后训练食谱:监督微调、长度归一化的偏好优化和模型合并。预训练在10-12T令牌上,LFM2模型在多种基准测试中表现出色;例如,LFM2-2.6B在IFEval上达到79.56%,在GSM8K上达到82.41%。我们进一步构建了多模态和检索变体:LFM2-VL用于视觉语言任务,LFM2-Audio用于语音,LFM2-ColBERT用于检索。LFM2-VL通过高效的视觉处理支持可调的准确率-延迟权衡,而LFM2-Audio分离音频输入和输出路径,以实现与大3倍的模型竞争的实时语音到语音交互。LFM2-ColBERT提供了一个低延迟的查询和文档编码器,能够在多种语言中实现高性能检索。所有模型均以开放权重和部署包的形式发布,支持ExecuTorch、llama.cpp和vLLM,使LFM2成为需要快速、内存高效推理和强大任务能力的边缘应用的实用基础。
Summary / 总结
LFM2 is a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. Using hardware-in-the-loop architecture search under edge latency and memory constraints, LFM2 achieves up to 2x faster prefill and decode on CPUs compared to similarly sized models. Key findings include strong performance across diverse benchmarks, such as LFM2-2.6B achieving 79.56% on IFEval and 82.41% on GSM8K. LFM2 includes multimodal and retrieval variants, such as LFM2-VL and LFM2-ColBERT, which offer tunable accuracy-latency tradeoffs and low-latency encoders, respectively. All models are released with open weights and deployment packages for ExecuTorch, llama.cpp, and vLLM, making LFM2 practical for edge applications requiring fast, memory-efficient inference and strong task capabilities.
LFM2 是一种液态基础模型家族,旨在实现高效的设备端部署和强大的任务能力。通过在边缘延迟和内存约束下使用硬件在环架构搜索,LFM2 在 CPU 上实现了比同等大小模型快 2 倍的预填充和解码速度。关键发现包括在多种基准测试中取得优异结果,例如 LFM2-2.6B 在 IFEval 中达到 79.56%,在 GSM8K 中达到 82.41%。LFM2 还包括多模态和检索变体,如 LFM2-VL 和 LFM2-ColBERT,它们提供了可调的准确率-延迟权衡和用于高性能检索的低延迟编码器。
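The LFM2 abstract names a tempered, decoupled Top-K knowledge distillation objective but does not state its formula. The sketch below therefore shows only the generic idea behind Top-K distillation with a temperature: the teacher and student distributions are both renormalized over the teacher's K highest-scoring tokens, so the KL divergence is computed on a shared support. The function names, the default `k`, and the default `temperature` are assumptions for illustration, and the decoupling described in the report is not reproduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def topk_distill_loss(teacher_logits, student_logits, k=2, temperature=2.0):
    """KL(teacher || student) restricted to the teacher's top-k token indices.

    Illustrative only: a generic top-k distillation loss, not the LFM2 objective.
    """
    # Indices of the k largest teacher logits.
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]
    # Renormalize both distributions over the same k-token support.
    p = softmax([teacher_logits[i] for i in top], temperature)
    q = softmax([student_logits[i] for i in top], temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the student matches the teacher on the selected tokens and positive otherwise.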
Configurable Fairness: Direct Optimization of Parity Metrics via Vision-Language Models
Authors: Miao Zhang, Rumi Chunara
First: 2024-03-15T18:37:15+00:00 · Latest: 2025-11-28T17:33:28+00:00
Abstract
Performance disparities of image recognition across demographic groups are known to exist in deep learning-based models, due to imbalanced group representations or spurious correlation between group and target labels. Previous work has addressed such challenges without relying on expensive group labels, typically by upweighting high-loss samples or balancing discovered clusters. However, these heuristic strategies lack direct connection to specific fairness metrics and cannot guarantee optimization of parity-based criteria like equal opportunity, which ensures equal chance to receive positive outcomes across groups. In this work, we propose a novel paradigm that directly optimizes parity-based fairness metrics through specifically designed training objectives, without requiring group labels. We leverage vision-language models to analyze sensitive attribute relevancy for individual samples, then formulate loss functions that mathematically connect to each target fairness metric. This enables flexible optimization of different fairness criteria based on application needs. Experiments on multiple image classification datasets show that our metric-specific approach significantly improves parity-based fairness criteria and outperforms existing methods.
中文标题/摘要
标题:可配置的公平性:通过视觉语言模型直接优化平等度量
在基于深度学习的模型中,图像识别在不同人口群体之间的性能差异是已知存在的,这归因于群体表示的不平衡或群体和目标标签之间的虚假相关性。先前的工作通过加权高损失样本或平衡发现的聚类来解决此类挑战,而无需依赖昂贵的群体标签。然而,这些启发式策略缺乏与特定公平性指标的直接联系,并不能保证优化基于平等性的标准,如平等机会,这确保了不同群体中获得积极结果的机会均等。在本工作中,我们提出了一种新的范式,通过专门设计的训练目标直接优化基于平等性的公平性指标,而无需要求群体标签。我们利用视觉语言模型分析敏感属性对个体样本的相关性,然后制定损失函数,使其数学上与每个目标公平性指标相连。这使得根据应用需求灵活优化不同的公平性标准成为可能。在多个图像分类数据集上的实验表明,我们的指标特定方法显著提高了基于平等性的公平性标准,并优于现有方法。
Summary / 总结
This work addresses performance disparities in image recognition across demographic groups by directly optimizing parity-based fairness metrics using vision-language models. It proposes a novel training objective that connects to specific fairness criteria without needing group labels. Experiments demonstrate that this metric-specific approach significantly improves parity-based fairness and outperforms existing methods on multiple image classification datasets.
该研究通过提出一种新的方法,利用视觉-语言模型直接优化基于公平性的度量标准,来解决图像识别中不同群体之间的性能差异问题。与之前的启发式策略不同,该方法不需要群体标签,并通过数学方式将损失函数与特定的公平性指标联系起来。实验结果显示,所提出的方法在多个图像分类数据集上显著提高了基于公平性的度量标准,并优于现有方法。
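Equal opportunity, the parity criterion named in the abstract above, asks that the true positive rate be the same across groups. A minimal sketch of how the resulting gap could be measured at evaluation time follows. It assumes binary labels and predictions and, unlike the training method described in the paper, uses group labels, since measuring a parity gap requires them. The function names are hypothetical.

```python
def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives (label 1) that are predicted positive."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_on_positives:
        raise ValueError("no positive examples in this group")
    return sum(preds_on_positives) / len(preds_on_positives)

def equal_opportunity_gap(y_true, y_pred, groups):
    """Largest difference in true positive rate between any two groups."""
    rates = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rates[g] = true_positive_rate([y_true[i] for i in idx],
                                      [y_pred[i] for i in idx])
    return max(rates.values()) - min(rates.values())
```

A gap of 0 means every group's positives are recognized at the same rate; larger values indicate a larger disparity.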
DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline
Authors: Rui Zhang, Hongxia Wang, Hangqing Liu, Yang Zhou, Qiang Zeng
First: 2025-11-28T17:22:07+00:00 · Latest: 2025-11-28T17:22:07+00:00
Comments: 13 pages, 12 figures
Abstract
Diffusion-based image editing has made semantic-level image manipulation easy for general users, but it also enables realistic local forgeries that are hard to localize. Existing benchmarks mainly focus on the binary detection of generated images or the localization of manually edited regions and do not reflect the properties of diffusion-based edits, which often blend smoothly into the original content. We present Diffusion-Based Image Editing Area Localization Dataset (DEAL-300K), a large-scale dataset for diffusion-based image manipulation localization (DIML) with more than 300,000 annotated images. We build DEAL-300K by using a multi-modal large language model to generate editing instructions, a mask-free diffusion editor to produce manipulated images, and an active-learning change detection pipeline to obtain pixel-level annotations. On top of this dataset, we propose a localization framework that uses a frozen Visual Foundation Model (VFM) together with Multi Frequency Prompt Tuning (MFPT) to capture both semantic and frequency-domain cues of edited regions. Trained on DEAL-300K, our method reaches a pixel-level F1 score of 82.56% on our test split and 80.97% on the external CoCoGlide benchmark, providing strong baselines and a practical foundation for future DIML research. The dataset can be accessed via https://github.com/ymhzyj/DEAL-300K.
中文标题/摘要
标题:DEAL-300K:基于扩散的编辑区域定位,附30万规模数据集及频率提示基线
基于扩散的图像编辑使普通用户能够轻松进行语义级别的图像操作,但也使得难以定位的现实局部伪造成为可能。现有基准主要集中在生成图像的二元检测或手动编辑区域的定位上,未能反映基于扩散的编辑特性,这些编辑通常会平滑地融入原始内容。我们提出了基于扩散的图像编辑区域定位数据集(DEAL-300K),这是一个用于基于扩散的图像操作定位(DIML)的大规模数据集,包含超过30万张标注图像。我们通过使用多模态大型语言模型生成编辑指令、无掩码扩散编辑器生成处理后的图像,以及主动学习变化检测流水线获得像素级标注。在此数据集基础上,我们提出了一种定位框架,该框架结合冻结的视觉基础模型(VFM)和多频率提示调优(MFPT),以捕捉编辑区域的语义和频域线索。在DEAL-300K上训练,我们的方法在我们的测试分割上达到像素级F1分数82.56%,在外部CoCoGlide基准上达到80.97%,为未来的DIML研究提供了强大的基线和实用的基础。数据集可通过https://github.com/ymhzyj/DEAL-300K访问。
Summary / 总结
The paper introduces DEAL-300K, a large-scale dataset for localizing diffusion-based image manipulations, addressing the challenge of localizing realistic forgeries. It uses a multi-modal LLM for instructions, a mask-free diffusion editor, and an active-learning pipeline for pixel-level annotations. The proposed localization framework, combining a frozen Visual Foundation Model with Multi Frequency Prompt Tuning, achieves a pixel-level F1 score of 82.56% on the test split and 80.97% on the CoCoGlide benchmark, providing strong baselines for future research.
研究旨在解决由扩散基础图像编辑生成的现实局部伪造难以定位的问题。作者开发了包含超过300,000个标注图像的DEAL-300K大型数据集,使用多模态大语言模型生成指令、无掩码扩散编辑器和主动学习变更检测流水线进行像素级标注。他们提出了一种结合冻结视觉基础模型和多频率提示调优的定位框架,以捕捉编辑区域的语义和频域特征。该方法在测试分割上的像素级F1分数为82.56%,在外部CoCoGlide基准上的分数为80.97%,为未来的研究提供了强大的基线和实用的基础。
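The DEAL-300K results above are reported as pixel-level F1, which treats every pixel as a binary decision (edited or not) and combines precision and recall: F1 = 2TP / (2TP + FP + FN). The sketch below is a generic illustration on flattened binary masks. The function name `pixel_f1` and the convention of returning 1.0 when both masks are empty are assumptions, and the paper's exact evaluation protocol (thresholding, per-image versus dataset-level averaging) is not specified in the abstract.

```python
def pixel_f1(pred_mask, true_mask):
    """F1 score over flattened binary masks (1 = edited pixel, 0 = untouched)."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    fp = sum(1 for p, t in zip(pred_mask, true_mask) if p and not t)
    fn = sum(1 for p, t in zip(pred_mask, true_mask) if not p and t)
    denom = 2 * tp + fp + fn
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if denom == 0 else 2 * tp / denom
```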
FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis
Authors: Xichen Xu, Yanshu Wang, Jinbao Wang, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu
First: 2025-09-24T16:28:15+00:00 · Latest: 2025-11-28T16:42:48+00:00
Abstract
Industrial anomaly segmentation relies heavily on pixel-level annotations, yet real-world anomalies are often scarce, diverse, and costly to label. Segmentation-oriented industrial anomaly synthesis (SIAS) has emerged as a promising alternative; however, existing methods struggle to balance sampling efficiency and generation quality. Moreover, most approaches treat all spatial regions uniformly, overlooking the distinct statistical differences between anomaly and background areas. This uniform treatment hinders the synthesis of controllable, structure-specific anomalies tailored for segmentation tasks. In this paper, we propose FAST, a foreground-aware diffusion framework featuring two novel modules: the Anomaly-Informed Accelerated Sampling (AIAS) and the Foreground-Aware Reconstruction Module (FARM). AIAS is a training-free sampling algorithm specifically designed for segmentation-oriented industrial anomaly synthesis, which accelerates the reverse process through coarse-to-fine aggregation and enables the synthesis of state-of-the-art segmentation-oriented anomalies in as few as 10 steps. Meanwhile, FARM adaptively adjusts the anomaly-aware noise within the masked foreground regions at each sampling step, preserving localized anomaly signals throughout the denoising trajectory. Extensive experiments on multiple industrial benchmarks demonstrate that FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks. We release the code at: https://github.com/Chhro123/fast-foreground-aware-anomaly-synthesis.
中文标题/摘要
标题:FAST:面向分割的加速采样轨迹与前景感知扩散异常合成框架
工业异常分割严重依赖于像素级注释,但实际中的异常往往稀缺、多样且标注成本高昂。分割导向的工业异常合成(SIAS)已成为一种有前景的替代方案;然而,现有方法难以在采样效率和生成质量之间取得平衡。此外,大多数方法对所有空间区域处理一致,忽视了异常区域与背景区域之间的统计差异。这种一致处理阻碍了为分割任务定制的、结构特定的异常合成。本文提出了一种前景感知扩散框架FAST,包含两个新型模块:异常导向加速采样(AIAS)和前景感知重构模块(FARM)。AIAS 是一种无需训练的采样算法,专门针对分割导向的工业异常合成,通过从粗到细的聚合加速逆过程,并能在不到10步内合成最先进的分割导向异常。同时,FARM 在每次采样步骤中自适应调整掩码前景区域内的异常感知噪声,确保在去噪轨迹中保留局部异常信号。在多个工业基准上的广泛实验表明,FAST 在下游分割任务中始终优于现有异常合成方法。我们已将代码发布在:https://github.com/Chhro123/fast-foreground-aware-anomaly-synthesis。
Summary / 总结
FAST is a foreground-aware diffusion framework designed to synthesize segmentation-oriented industrial anomalies efficiently. It includes the Anomaly-Informed Accelerated Sampling (AIAS) and the Foreground-Aware Reconstruction Module (FARM). AIAS accelerates the synthesis process with coarse-to-fine aggregation, while FARM adjusts noise in masked foreground regions to preserve localized anomaly signals. Experiments show that FAST outperforms existing methods in segmentation tasks across multiple industrial benchmarks.
FAST 是一种前景感知的扩散框架,旨在提高工业异常分割的效率和质量。它引入了两个新型模块:AIAS 用于加速采样,FARM 用于前景感知重构。实验表明,FAST 在分割任务中优于现有方法,仅需 10 步即可完成合成。该方法通过关注异常和背景之间的差异,解决了均匀处理空间区域的局限性,从而生成更可控和结构特定的异常。
Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach
Authors: Haruki Sakajo, Hiroshi Takato, Hiroshi Tsutsui, Komei Soda, Hidetaka Kamigaito, Taro Watanabe
First: 2025-11-28T16:09:36+00:00 · Latest: 2025-11-28T16:09:36+00:00
Comments: Accepted to MMLoSo 2025
Abstract
Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous driving. For example, LVLMs can generate safety-oriented descriptions of videos captured by road-facing cameras. However, ensuring comprehensive safety also requires monitoring driver-facing views to detect risky events, such as mobile phone use while driving. Thus, the ability to process synchronized inputs from both driver-facing and road-facing cameras is necessary. In this study, we develop models and investigate the capabilities of LVLMs by constructing a dataset and evaluating their performance on it. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions. Nonetheless, several challenges remain, particularly in detecting subtle or complex events in the video. Our findings and error analysis provide valuable insights that can contribute to the improvement of LVLM-based systems in this domain.
中文标题/摘要
标题:迈向自动安全驾驶指导:大规模视觉语言模型方法
大规模视觉语言模型(LVLMs)在需要视觉信息的任务中表现出先进的能力,包括物体检测。这些能力在各个工业领域中具有潜在的应用前景,例如自动驾驶。例如,LVLMs可以生成由道路朝向摄像头捕获的视频的安全导向描述。然而,确保全面的安全性还需要监控驾驶员朝向的视图以检测诸如使用手机等危险事件。因此,处理来自驾驶员朝向和道路朝向摄像头的同步输入的能力是必要的。在本研究中,我们开发了模型并通过对数据集的构建和评估其在该数据集上的性能来研究LVLMs的能力。我们的实验结果表明,虽然预训练的LVLMs效果有限,但微调的LVLMs可以生成准确且安全导向的驾驶指令。然而,仍存在一些挑战,特别是在检测视频中的细微或复杂事件方面。我们的研究结果和错误分析提供了有价值的见解,可以为该领域的LVLM基系统改进做出贡献。
Summary / 总结
This study aims to leverage large-scale Vision Language Models (LVLMs) for generating safe driving instructions by processing synchronized inputs from both driver-facing and road-facing cameras. The research constructs a dataset and evaluates LVLMs, finding that fine-tuned models can produce accurate and safety-aware instructions, though challenges persist in detecting subtle or complex events.
本研究旨在利用大规模视觉语言模型(LVLM)通过同时处理驾驶员视角和道路视角的摄像头输入来生成安全驾驶指令。研究构建了一个数据集并评估了LVLMs,发现经过微调的模型可以生成准确且安全的指令,但仍存在检测细微或复杂事件的挑战。
Transformer-Driven Triple Fusion Framework for Enhanced Multimodal Author Intent Classification in Low-Resource Bangla
Authors: Ariful Islam, Tanvir Mahmud, Md Rifat Hossen
First: 2025-11-28T15:44:42+00:00 · Latest: 2025-11-28T15:44:42+00:00
Comments: Accepted at the 28th International Conference on Computer and Information Technology (ICCIT 2025). To be published in IEEE proceedings
Abstract
The expansion of the Internet and social networks has led to an explosion of user-generated content. Author intent understanding plays a crucial role in interpreting social media content. This paper addresses author intent classification in Bangla social media posts by leveraging both textual and visual data. Recognizing limitations in previous unimodal approaches, we systematically benchmark transformer-based language models (mBERT, DistilBERT, XLM-RoBERTa) and vision architectures (ViT, Swin, SwiftFormer, ResNet, DenseNet, MobileNet), utilizing the Uddessho dataset of 3,048 posts spanning six practical intent categories. We introduce a novel intermediate fusion strategy that significantly outperforms early and late fusion on this task. Experimental results show that intermediate fusion, particularly with mBERT and Swin Transformer, achieves 84.11% macro-F1 score, establishing a new state-of-the-art with an 8.4 percentage-point improvement over prior Bangla multimodal approaches. Our analysis demonstrates that integrating visual context substantially enhances intent classification. Cross-modal feature integration at intermediate levels provides optimal balance between modality-specific representation and cross-modal learning. This research establishes new benchmarks and methodological standards for Bangla and other low-resource languages. We call our proposed framework BangACMM (Bangla Author Content MultiModal).
中文标题/摘要
标题:基于Transformer的三模态融合框架在低资源孟加拉语作者意图分类中的增强多模态分析
互联网和社交网络的扩展导致用户生成内容的爆炸式增长。作者意图理解在解释社交媒体内容中起着关键作用。本文通过利用文本和视觉数据,解决孟加拉语社交媒体帖子的作者意图分类问题。鉴于之前单模态方法的局限性,我们系统地评估了基于变换器的语言模型(mBERT、DistilBERT、XLM-RoBERTa)和视觉架构(ViT、Swin、SwiftFormer、ResNet、DenseNet、MobileNet),使用包含3,048个帖子的Uddessho数据集,这些帖子覆盖了六个实际意图类别。我们提出了一种新颖的中间融合策略,在此任务上显著优于早期和晚期融合。实验结果表明,中间融合,特别是与mBERT和Swin变换器结合,实现了84.11%的宏F1分数,建立了新的最先进的水平,比之前的孟加拉语多模态方法提高了8.4个百分点。我们的分析表明,整合视觉上下文显著增强了意图分类。跨模态特征在中间层的集成提供了模态特定表示和跨模态学习之间的最佳平衡。这项研究为孟加拉语和其他低资源语言建立了新的基准和方法论标准。我们称之为BangACMM(孟加拉语作者内容多模态)的提议框架。
Summary / 总结
This paper aims to enhance author intent classification in Bangla social media posts by integrating textual and visual data. It benchmarks transformer-based language models and vision architectures, introducing an intermediate fusion strategy that outperforms early and late fusion. The proposed BangACMM framework achieves an 84.11% macro-F1 score, surpassing previous approaches by 8.4 percentage points, and demonstrates the importance of integrating visual context for better intent classification.
该研究提出了一种基于Transformer的三重融合框架,用于增强孟加拉语社交媒体帖子的作者意图分类,结合文本和视觉数据。它对各种基于变换器的语言模型和视觉架构进行了基准测试,通过中间融合实现了84.11%的宏F1分数,比之前的方法提高了8.4个百分点。研究强调了整合视觉上下文对于更好地进行意图分类的重要性,并为孟加拉语和其他低资源语言设立了新的基准。
OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning
Authors: Timothy Ossowski, Sheng Zhang, Qianchu Liu, Guanghui Qin, Reuben Tan, Tristan Naumann, Junjie Hu, Hoifung Poon
First: 2025-11-28T15:21:51+00:00 · Latest: 2025-11-28T15:21:51+00:00
Abstract
High-quality and carefully curated data is a cornerstone of training medical large language models, as it directly impacts both generalization and robustness to unseen clinical tasks. We investigate strategies for training and data curation to develop a robust multimodal reasoning model in the medical domain. Our work focuses on supervised fine-tuning (SFT) and explores data recipes that leverage structured reasoning traces. Using our proposed data recipe, we scale experiments to a dataset of over 8 million examples and 6.8 billion response tokens, achieving state-of-the-art performance among open-source models across diverse out-of-distribution medical benchmark tasks. Our results further indicate that curating a high-quality, diverse training dataset with varying structured reasoning trace lengths enables the fine-tuned model to self-calibrate its reasoning trajectory lengths based on the downstream task, without explicit supervision. We present key insights, describe the data curation strategy, and outline next steps toward developing robust medical vision-language reasoning systems.
中文标题/摘要
标题:OctoMed:前沿多模态医疗推理的数据食谱
高质量且精心挑选的数据是训练医疗大型语言模型的基础,因为它直接影响模型的泛化能力和对未见过的临床任务的鲁棒性。我们研究了训练和数据整理策略,以开发医疗领域的稳健多模态推理模型。我们的工作集中在监督微调(SFT)上,并探索利用结构化推理痕迹的数据食谱。使用我们提出的数据食谱,我们将实验扩展到包含超过800万例和68亿个响应标记的数据集,实现了开源模型在多种分布外医疗基准任务中的最佳性能。我们的结果进一步表明,整理高质量、多样化的训练数据集,其中包含不同长度的结构化推理痕迹,可以使微调后的模型根据下游任务自校准其推理轨迹长度,而无需显式监督。我们呈现了关键见解,描述了数据整理策略,并概述了开发稳健的医疗视觉-语言推理系统的方法。
Adapting Like Humans: A Metacognitive Agent with Test-time Reasoning
Authors: Yang Li, Zhiyuan He, Yuxuan Huang, Zhuhanling Xiao, Chao Yu, Meng Fang, Kun Shao, Jun Wang
First: 2025-11-28T15:15:47+00:00 · Latest: 2025-11-28T15:15:47+00:00
Abstract
Recent Vision-Language Models (VLMs) exhibit strong perceptual reasoning abilities, yet they often struggle to adapt efficiently when encountering novel tasks at test time. In contrast, humans leverage the metacognitive model with memory, enabling continuous strategy refinement through metacognitive control when faced with new challenges. To bridge this gap, we propose metacognitive test-time reasoning (MCTR), a framework that equips models with the ability to learn, adapt, and improve during test time through metacognitive self-updating. Inspired by the dual structure of human metacognition, MCTR comprises meta-level and object-level VLM reasoning modules, each equipped with dedicated memory systems for hierarchical adaptive reasoning. Specifically, MCTR consists of (1) a meta-reasoning module which incrementally builds a structured memory by discovering and storing task-relevant rules, environmental patterns, and action-outcome relationships from test-time observations as natural language descriptions; and (2) an action-reasoning module that determines optimal actions through context-aware perception and strategic reasoning by dynamically retrieving and integrating knowledge from memory. The action-reasoning module continuously updates its policy through proposed metacognitive test-time reinforcement learning, adapting as knowledge memory evolves. We evaluate MCTR on 45 Atari games (33 seen, 12 unseen). MCTR demonstrates robust test-time adaptation, achieving 9/12 top-1 results on unseen games compared with baselines. Analyses through ablations, learning dynamics, and case studies reveal the complementary contributions of both components and show meta-reasoning evolving toward human-like adaptation strategies.
中文标题/摘要
标题:像人类一样适应:具有测试时推理的元认知代理
近期的视觉-语言模型(VLMs)在感知推理方面表现出色,但在遇到新的测试任务时,它们往往难以高效地适应。相比之下,人类利用元认知模型和记忆,能够在面对新挑战时通过元认知控制不断优化策略。为了弥合这一差距,我们提出了元认知测试时推理(MCTR),这是一种框架,使模型能够在测试时通过元认知自我更新来学习、适应和改进。受人类元认知的双重结构启发,MCTR 包含元级和对象级的 VLM 推理模块,每个模块都配备了专用的记忆系统以实现分层适应推理。具体来说,MCTR 包括(1)一个元推理模块,该模块通过从测试时观察中发现并存储与任务相关规则、环境模式和动作-结果关系,并以自然语言描述的形式构建结构化记忆,逐步构建结构化记忆;以及(2)一个行动推理模块,该模块通过动态检索和整合记忆中的知识来基于上下文感知和战略推理来确定最优行动。行动推理模块通过提出的元认知测试时强化学习不断更新其策略,随着知识记忆的演变进行适应。我们在 45 个雅达利游戏(33 个已见过,12 个未见过)上评估了 MCTR。MCTR 展现出稳健的测试时适应能力,在未见过的游戏上取得了 9/12 的顶级结果,优于基线。通过消融分析、学习动态和案例研究,我们揭示了两个组件的互补贡献,并展示了元推理向人类似的适应策略演变。
Summary / 总结
The research aims to enhance the adaptability of Vision-Language Models (VLMs) by introducing metacognitive test-time reasoning (MCTR), which allows models to learn and improve during test time. MCTR consists of a meta-reasoning module that builds structured memory from test-time observations and an action-reasoning module that uses this memory for strategic reasoning and action selection. Experiments on 45 Atari games (33 seen, 12 unseen) show that MCTR achieves top-1 results on 9 of the 12 unseen games compared with baselines. Ablations, learning dynamics, and case studies indicate that the two modules make complementary contributions and that meta-reasoning evolves toward human-like adaptation strategies.
本文旨在解决视觉-语言模型在测试时对新任务进行高效适应的问题。提出了元认知测试时推理(MCTR)框架,使模型能够通过元认知自我更新来学习、适应和改进。MCTR 包含一个用于构建结构化记忆的元推理模块和一个用于确定最优行动的行动推理模块。该模型在45个雅达利游戏上进行了评估,MCTR 在12个未见过的游戏中的9个中取得了第一名的结果,优于基线模型,展示了强大的测试时适应能力。
AgriCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture
Authors: Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Jing Wu, Zurong Mai, Shuohong Lou, Yuhang Chen, Henglian Huang, Xiaoya Fan, Yang Zhang, Lingyuan Zhao, Haohuan Fu, Huang Jianxi, Juepeng Zheng
First: 2025-11-28T15:02:19+00:00 · Latest: 2025-11-28T15:02:19+00:00
Abstract
Recent advancements in Vision-Language Models (VLMs) have significantly transformed various industries. In agriculture, these dual-modal capabilities offer promising applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. While several Visual Question Answering (VQA) datasets and benchmarks have been developed to evaluate VLM performance, they often fail to adequately assess the critical reasoning and problem-solving skills required in complex agricultural contexts. To address this gap, we introduce AgriCoT, a VQA dataset that incorporates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,535 carefully curated samples, AgriCoT offers a comprehensive and robust evaluation of reasoning abilities for VLMs, particularly in zero-shot scenarios, by focusing on their capacity to engage in logical reasoning and effective problem-solving. Our evaluations, conducted with 26 representative VLMs, including both proprietary and open-source models, reveal that while some proprietary models excel at answering questions, there is a notable gap in their reasoning capabilities. This underscores the importance of incorporating CoT for more precise and effective assessments. Our dataset is available at https://huggingface.co/datasets/wenyb/AgriCoT.
中文标题/摘要
标题:AgriCoT:农业领域视觉语言模型推理能力评估的数据集
近年来,视觉语言模型(VLMs)在各个行业中的应用取得了显著进展。在农业领域,这些双模态能力提供了诸如精准农业、作物监测、病虫害检测和环境可持续性等有前景的应用。尽管已经开发了多种视觉问答(VQA)数据集和基准来评估VLM性能,但它们往往未能充分评估在复杂农业环境中所需的关键推理和问题解决能力。为解决这一问题,我们引入了AgriCoT,这是一个结合了推理链(CoT)的VQA数据集,专门用于评估VLM的推理能力。AgriCoT包含4,535个精心策划的样本,提供了对VLM推理能力的全面和稳健评估,特别是在零样本场景中,通过关注它们进行逻辑推理和有效问题解决的能力。我们的评估使用了26个代表性VLM,包括专有和开源模型,结果显示,尽管一些专有模型在回答问题方面表现出色,但在推理能力方面存在明显的显著差距。这突显了引入CoT进行更精确和有效评估的重要性。我们的数据集可在https://huggingface.co/datasets/wenyb/AgriCoT获取。
Summary / 总结
AgriCoT is a VQA dataset designed to evaluate the reasoning capabilities of Vision-Language Models (VLMs) in agricultural contexts. It contains 4,535 curated samples that incorporate Chain-of-Thought (CoT) reasoning and targets logical reasoning and problem-solving, particularly in zero-shot scenarios. Evaluations of 26 proprietary and open-source VLMs show that even models that answer questions well exhibit a notable gap in reasoning capability, which supports using CoT for more precise assessment.
AgriCoT 是一个 VQA 数据集,旨在评估 Vision-Language 模型(VLMs)在农业领域的推理能力。它包含 4,535 个样本,需要进行链式思考(CoT)推理,解决了现有基准在捕捉复杂农业推理方面的局限性。对 26 个 VLMs 的评估显示,在推理能力方面存在显著差距,突显了在评估 VLMs 用于精准农业应用时需要包含 CoT 的重要性。
Unlocking Multilingual Reasoning Capability of LLMs and LVLMs through Representation Engineering
Authors: Qiming Li, Xiaocheng Feng, Yixuan Ma, Zekai Ye, Ruihan Chen, Xiachong Feng, Bing Qin
First: 2025-11-28T14:40:27+00:00 · Latest: 2025-11-28T14:40:27+00:00
Abstract
Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) demonstrate strong reasoning capabilities, yet their performance in English significantly outperforms that in low-resource languages, raising fairness concerns in multilingual applications. Existing approaches either rely on costly multilingual training or employ prompting with external translation tools, both of which are resource-intensive and sensitive to translation quality. To address these limitations, we propose a training-free inference-time method to enhance Multilingual Reasoning capabilities via Representation Engineering (MRRE) without using any additional training data or tools. MRRE sequentially injects two precomputed vectors at specific layers during inference processing: cross-lingual reasoning enhancement vectors, which steer non-English reasoning representations toward English space to unlock multilingual reasoning, and target-language output anchoring vectors, which restore the distribution of the target language to preserve input-output language consistency. Comprehensive experiments across six advanced LLMs and LVLMs on four reasoning benchmarks demonstrate that MRRE consistently enhances non-English reasoning by an average gain of 5.48% and up to 7.54% in low-resource languages (Thai and Swahili), while improving input-output language consistency by 3.78%.
中文标题/摘要
标题:通过表示工程解锁LLMs和LVLMs的多语言推理能力
大型语言模型(LLMs)和大型视觉-语言模型(LVLMs)展示了强大的推理能力,但在低资源语言中的表现显著低于英语,这在多语言应用中引发了公平性问题。现有方法要么依赖昂贵的多语言训练,要么使用外部翻译工具进行提示,这两种方法都资源密集且对翻译质量敏感。为了解决这些限制,我们提出了一种无需额外训练数据或工具的训练免费推理时方法,通过表示工程(MRRE)增强多语言推理能力。MRRE在推理处理过程中在特定层逐步注入两个预计算向量:跨语言推理增强向量,引导非英语推理表示向英语空间以解锁多语言推理;目标语言输出锚定向量,恢复目标语言的分布以保持输入输出语言一致性。在六种高级LLMs和LVLMs上对四个推理基准的全面实验表明,MRRE平均提高了非英语推理能力5.48%,在低资源语言(泰语和斯瓦希里语)中最多提高了7.54%,同时提高了输入输出语言一致性3.78%。
Summary / 总结
This paper addresses the fairness issue in multilingual applications by proposing MRRE, a training-free inference-time method that enhances the multilingual reasoning capabilities of LLMs and LVLMs. MRRE sequentially injects two precomputed vectors at specific layers during inference: one steers non-English reasoning representations toward the English space, and the other restores the target-language distribution. Across six models and four reasoning benchmarks, it improves non-English reasoning by 5.48% on average and by up to 7.54% in low-resource languages (Thai and Swahili), while improving input-output language consistency by 3.78%.
论文针对大型语言模型(LLMs)和大型视觉-语言模型(LVLMs)在低资源语言中与英语相比的表现差异。它提出了一种名为MRRE的训练免费方法,在推理过程中注入两个预计算向量以增强多语言推理能力。MRRE在六个LLMs和LVLMs上四个推理基准测试中,将非英语推理的平均提升幅度提高到5.48%,最多7.54%(针对泰语和斯瓦希里语等低资源语言),同时提高了输入输出语言一致性3.78%。
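The MRRE entry above describes injecting precomputed vectors into hidden states at specific layers during inference. The abstract does not say how those vectors are computed, so the sketch below makes a common representation-engineering assumption: the steering vector is the difference between the mean hidden state of target-space (here, English) examples and that of source-space (non-English) examples, and injection is a scaled addition. All function names and the scaling factor `alpha` are hypothetical, and this is not the authors' implementation.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(source_reps, target_reps):
    """Assumed construction: mean(target) - mean(source), e.g. English minus non-English."""
    src = mean_vector(source_reps)
    tgt = mean_vector(target_reps)
    return [t - s for s, t in zip(src, tgt)]

def inject(hidden_state, vector, alpha=1.0):
    """Shift one layer's hidden state along the steering direction by a factor alpha."""
    return [h + alpha * v for h, v in zip(hidden_state, vector)]
```

In a setup like the one described, one such vector would be applied at a reasoning layer and a second, target-language vector at a later layer to keep the output in the input language; which layers to use would have to be determined empirically.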
Fault-Tolerant MARL for CAVs under Observation Perturbations for Highway On-Ramp Merging
Authors: Yuchen Shi, Huaxin Pei, Yi Zhang, Danya Yao
First: 2025-11-28T13:57:21+00:00 · Latest: 2025-11-28T13:57:21+00:00
Abstract
Multi-Agent Reinforcement Learning (MARL) holds significant promise for enabling cooperative driving among Connected and Automated Vehicles (CAVs). However, its practical application is hindered by a critical limitation, i.e., insufficient fault tolerance against observational faults. Such faults, which appear as perturbations in the vehicles' perceived data, can substantially compromise the performance of MARL-based driving systems. Addressing this problem presents two primary challenges. One is to generate adversarial perturbations that effectively stress the policy during training, and the other is to equip vehicles with the capability to mitigate the impact of corrupted observations. To overcome the challenges, we propose a fault-tolerant MARL method for cooperative on-ramp vehicles incorporating two key agents. First, an adversarial fault injection agent is co-trained to generate perturbations that actively challenge and harden the vehicle policies. Second, we design a novel fault-tolerant vehicle agent equipped with a self-diagnosis capability, which leverages the inherent spatio-temporal correlations in vehicle state sequences to detect faults and reconstruct credible observations, thereby shielding the policy from misleading inputs. Experiments in a simulated highway merging scenario demonstrate that our method significantly outperforms baseline MARL approaches, achieving near-fault-free levels of safety and efficiency under various observation fault patterns.
中文标题/摘要
标题:在观测扰动下具有容错性的MARL在高速公路入匝道合并中的应用
多智能体强化学习(MARL)在使连接和自动化车辆(CAVs)实现协同驾驶方面具有巨大潜力。然而,其实际应用受到一个关键限制的阻碍,即对观测故障的不足容错性。这些故障表现为车辆感知数据中的扰动,会严重损害基于MARL的驾驶系统的性能。解决这一问题面临两大挑战。一是生成有效的对抗性扰动,在训练过程中对策略进行压力测试;二是使车辆具备减轻受污染观测影响的能力。为克服这些挑战,我们提出了一种结合两个关键代理的具有容错性的MARL方法。首先,一个对抗性故障注入代理与车辆策略共同训练,以生成能够积极挑战和强化车辆策略的扰动。其次,我们设计了一种新型具有自我诊断能力的容错车辆代理,利用车辆状态序列中的固有时空相关性来检测故障并重建可信观测,从而防止策略受到误导性输入的影响。在模拟高速公路合并场景中的实验表明,我们的方法在各种观测故障模式下显著优于基线MARL方法,实现了接近无故障的水平的安全性和效率。
Summary / 总结
The paper addresses the issue of fault tolerance in Multi-Agent Reinforcement Learning (MARL) for cooperative driving among Connected and Automated Vehicles (CAVs), particularly focusing on highway on-ramp merging. It proposes a fault-tolerant MARL method involving an adversarial fault injection agent to generate perturbations and a self-diagnosis fault-tolerant vehicle agent to detect and mitigate the impact of corrupted observations. The experiments show that this method significantly improves safety and efficiency under various observation fault patterns compared to baseline MARL approaches.
研究旨在解决多智能体强化学习(MARL)在连接和自动化车辆(CAVs)协同驾驶中的容错性不足问题。提出了一种容错MARL方法,包含两个关键代理:一个对抗性故障注入代理,在训练过程中生成干扰以挑战策略,另一个具备自我诊断能力的容错车辆代理,能够检测并缓解被篡改观测数据的影响。实验结果表明,该方法在各种观测故障模式下显著提高了安全性和效率,优于基线MARL方法。
Obstruction reasoning for robotic grasping
Authors: Runyu Jiao, Matteo Bortolon, Francesco Giuliari, Alice Fasoli, Sergio Povoli, Guofeng Mei, Yiming Wang, Fabio Poiesi
First: 2025-11-28T13:53:12+00:00 · Latest: 2025-11-28T13:53:12+00:00
Abstract
Successful robotic grasping in cluttered environments not only requires a model to visually ground a target object but also to reason about obstructions that must be cleared beforehand. While current vision-language embodied reasoning models show emergent spatial understanding, they remain limited in terms of obstruction reasoning and accessibility planning. To bridge this gap, we present UNOGrasp, a learning-based vision-language model capable of performing visually-grounded obstruction reasoning to infer the sequence of actions needed to unobstruct the path and grasp the target object. We devise a novel multi-step reasoning process based on obstruction paths originated by the target object. We anchor each reasoning step with obstruction-aware visual cues to incentivize reasoning capability. UNOGrasp combines supervised and reinforcement finetuning through verifiable reasoning rewards. Moreover, we construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2, with over 100k obstruction paths annotated by humans with obstruction ratios, contact points, and natural-language instructions. Extensive experiments and real-robot evaluations show that UNOGrasp significantly improves obstruction reasoning and grasp success across both synthetic and real-world environments, outperforming generalist and proprietary alternatives. Project website: https://tev-fbk.github.io/UnoGrasp/.
中文标题/摘要
标题:机器人抓取中的障碍推理
在杂乱环境中成功进行机器人抓取,不仅需要模型在视觉上定位目标物体,还需要推理出必须事先清除的障碍物。尽管当前的视觉-语言具身推理模型已展现出涌现的空间理解能力,但在障碍推理和可达性规划方面仍然有限。为解决这一问题,我们提出了UNOGrasp,这是一种基于学习的视觉-语言模型,能够进行基于视觉定位的障碍推理,以推断清除障碍并抓取目标物体所需的操作序列。我们设计了一种基于目标物体引发的障碍路径的多步推理过程,并通过障碍感知的视觉提示来锚定每一步推理,以激励推理能力。UNOGrasp通过可验证的推理奖励结合监督和强化微调。此外,我们基于MetaGraspNetV2构建了UNOBench,这是一个大规模数据集,用于训练和基准测试,包含超过10万条由人类标注的障碍路径,包括障碍比例、接触点和自然语言指令。广泛的实验和真实机器人评估表明,UNOGrasp在合成和真实环境中显著提高了障碍推理和抓取成功率,优于通用和专有替代方案。项目网站:https://tev-fbk.github.io/UnoGrasp/
Summary / 总结
UNOGrasp is a learning-based vision-language model that performs visually-grounded obstruction reasoning for robotic grasping in cluttered environments. It combines supervised and reinforcement finetuning with verifiable reasoning rewards to infer the sequence of actions needed to clear obstructions and grasp the target object. The authors also build UNOBench, a dataset based on MetaGraspNetV2 with over 100k human-annotated obstruction paths. In experiments and real-robot evaluations, UNOGrasp improves obstruction reasoning and grasp success in both synthetic and real-world environments, outperforming generalist and proprietary alternatives.
UNOGrasp 是一种基于学习的视觉-语言模型,旨在在杂乱环境中执行基于视觉的障碍物推理,以实现机器人抓取。它结合了监督学习和强化学习,以推断清除障碍物并抓取目标物体所需的序列动作。UNOGrasp 在合成和真实世界环境中均表现出色,显著提高了障碍物推理和抓取成功率。该模型通过广泛的实验和真实机器人测试进行了评估,展示了其在处理复杂场景方面的有效性。
Source-free Video Domain Adaptation by Learning from Noisy Labels
Authors: Avijit Dasgupta, C. V. Jawahar, Karteek Alahari
Venue: Pattern Recognition, 161, p.111328 (2025)
First: 2023-11-30T14:06:27+00:00 · Latest: 2025-11-28T13:52:31+00:00
Comments: Our extended ICVGIP paper is now accepted in Pattern Recognition
Abstract
Despite the progress seen in classification methods, current approaches for handling videos with distribution shifts in source and target domains remain source-dependent as they require access to the source data during the adaptation stage. In this paper, we present a self-training based source-free video domain adaptation approach to address this challenge by bridging the gap between the source and the target domains. We use the source pre-trained model to generate pseudo-labels for the target domain samples, which are inevitably noisy. Thus, we treat the problem of source-free video domain adaptation as learning from noisy labels and argue that the samples with correct pseudo-labels can help us in adaptation. To this end, we leverage the cross-entropy loss as an indicator of the correctness of the pseudo-labels and use the resulting small-loss samples from the target domain for fine-tuning the model. We further enhance the adaptation performance by implementing a teacher-student (TS) framework, in which the teacher, which is updated gradually, produces reliable pseudo-labels. Meanwhile, the student undergoes fine-tuning on the target domain videos using these generated pseudo-labels to improve its performance. Extensive experimental evaluations show that our methods, termed as CleanAdapt, CleanAdapt + TS, achieve state-of-the-art results, outperforming the existing approaches on various open datasets. Our source code is publicly available at https://avijit9.github.io/CleanAdapt.
中文标题/摘要
标题:通过从噪声标签中学习实现无源视频域自适应
尽管分类方法取得了进展,但当前处理源域和目标域之间存在分布偏移的视频的方法仍然依赖于源数据,因为它们在适应阶段需要访问源数据。本文提出了一种基于自训练的无源视频域自适应方法,通过缩小源域和目标域之间的差距来解决这一挑战。我们使用在源域上预训练的模型为目标域样本生成伪标签,这些伪标签不可避免地带有噪声。因此,我们将无源视频域自适应问题视为从噪声标签中学习的问题,并认为具有正确伪标签的样本有助于适应。为此,我们利用交叉熵损失作为伪标签正确性的指标,并使用目标域中损失较小的样本对模型进行微调。我们还通过实现教师-学生(TS)框架进一步提高了适应性能,在该框架中,逐渐更新的教师生成可靠的伪标签,而学生则使用这些生成的伪标签对目标域视频进行微调以提高其性能。广泛的实验评估表明,我们的方法CleanAdapt和CleanAdapt + TS达到了最先进的效果,在各种公开数据集上优于现有方法。我们的源代码已公开发布在https://avijit9.github.io/CleanAdapt/。
Summary / 总结
The paper addresses the challenge of source-free video domain adaptation by proposing a self-training approach that uses a pre-trained model to generate pseudo-labels for the target domain, treating the problem as learning from noisy labels. The method leverages the cross-entropy loss to identify reliable pseudo-labels and fine-tunes the model on these samples, with an additional teacher-student framework to enhance performance. Experiments demonstrate that the proposed methods, CleanAdapt and CleanAdapt + TS, achieve state-of-the-art results on various open datasets, outperforming existing approaches.
本文提出了一种源数据无依赖的视频域适应方法CleanAdapt及其增强版本CleanAdapt + TS。动机是解决在没有访问源数据的情况下处理分布偏移的问题。该方法使用预训练模型为目标域生成伪标签,将问题视为从噪声标签学习。该方法利用交叉熵损失选择小损失样本进行微调,并采用教师-学生框架生成可靠的伪标签。实验结果表明,CleanAdapt和CleanAdapt + TS在各种公开数据集上优于现有方法,取得了最先进的结果。
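The small-loss selection step and the gradually updated teacher described for CleanAdapt can be sketched as follows. This is an illustration, not the authors' implementation: the fixed `keep_ratio`, the use of the model's own argmax as the pseudo-label, and the exponential-moving-average teacher update are assumptions, since the abstract only says the teacher "is updated gradually".

```python
import math
from typing import List, Tuple


def pseudo_label(probs: List[float]) -> int:
    """Pseudo-label = class with the highest predicted probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def cross_entropy(probs: List[float], label: int) -> float:
    """Cross-entropy of a predicted distribution against a hard label."""
    return -math.log(probs[label])


def select_small_loss(batch_probs: List[List[float]],
                      keep_ratio: float) -> List[Tuple[int, int]]:
    """Return (sample index, pseudo-label) for the lowest-loss fraction of the batch."""
    scored = []
    for index, probs in enumerate(batch_probs):
        label = pseudo_label(probs)
        scored.append((cross_entropy(probs, label), index, label))
    scored.sort()
    keep = max(1, int(len(scored) * keep_ratio))
    return [(index, label) for _, index, label in scored[:keep]]


def ema_update(teacher: List[float], student: List[float],
               momentum: float = 0.99) -> List[float]:
    """Assumed teacher update: exponential moving average of student weights."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Hypothetical target-domain predictions from the source-pretrained model.
batch = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8], [0.5, 0.5]]
selected = select_small_loss(batch, keep_ratio=0.5)
print(selected)
```

Only the selected samples would then be used to fine-tune the student; the low-confidence predictions (here the second and fourth samples) are treated as likely noisy and left out.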
Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search
Authors: Yunqi Zhou, Chengjie Jiang, Chun Yuan, Jing Li
First: 2025-11-25T16:25:54+00:00 · Latest: 2025-11-28T13:03:57+00:00
Comments: 17 pages, 8 figures
Abstract
With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained and answer-critical details. In this context, guiding the model to look where it matters before prediction becomes crucial. Therefore, we present ZoomSearch, a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS. Meanwhile, it achieves much higher inference efficiency, outperforming prior search-based methods by 20%~44% in speed.
中文标题/摘要
标题:关注关键处:基于自适应缩放搜索的无需训练超高分辨率遥感VQA
随着卫星星座、传感器技术和成像流水线的进步,超高分辨率(Ultra-HR)遥感图像变得越来越普遍。然而,当前的遥感基础模型对这种输入并不适合:全图像编码会耗尽标记和内存预算,而基于重采样的预处理会丢失关键的细节信息。在此背景下,指导模型在预测前关注关键处变得至关重要。因此,我们提出了ZoomSearch,这是一种无需训练、即插即用的管道,将“关注何处”与“如何回答”解耦,适用于超高分辨率遥感视觉问答(RS-VQA)。ZoomSearch 结合了自适应多分支缩放搜索,它在图像块上进行分层搜索以定位查询相关的区域,以及布局感知块重组,它将选定的块重新组织成一个紧凑且布局忠实的画布。我们在超高分辨率RS-VQA基准MME-RealWorld-RS和LRS-VQA上进行了全面的实验,与(i)强大的通用基础模型,(ii)遥感基础模型,(iii)超高分辨率RS-VQA方法,以及(iv)基于搜索的视觉问答方法进行了比较。当与LLaVA-ov结合使用时,ZoomSearch在各种任务中达到了最先进的准确率,在LRS-VQA上提高了LLaVA-ov基线26.3%,在MME-RealWorld-RS上提高了114.8%。同时,它实现了更高的推理效率,在速度上比之前的基于搜索的方法快20%~44%。
Summary / 总结
The paper introduces ZoomSearch, a training-free, plug-and-play pipeline for Ultra-HR remote sensing visual question answering (RS-VQA) that decouples 'where to look' from 'how to answer'. It uses Adaptive Multi-Branch Zoom Search to search hierarchically over image patches and Layout-Aware Patch Reassembly to reorganize the selected patches into a compact canvas. When integrated with LLaVA-ov, ZoomSearch improves the baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS, and it runs 20%~44% faster than previous search-based methods.
研究提出了ZoomSearch,一种无需训练的方法,将“看哪里”与“如何回答”分离,以处理超高清遥感图像的视觉问答(RS-VQA)。它使用Adaptive Multi-Branch Zoom Search在图像块中进行搜索,并使用Layout-Aware Patch Reassembly重新组织选定的块。实验表明,ZoomSearch在超高清RS-VQA基准上优于现有方法,实现了最先进的准确性和更高的推理效率。
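The idea of a hierarchical, multi-branch search over image patches can be sketched as a quadtree descent that keeps only the most relevant sub-regions at each level. Everything specific here is an assumption: the relevance grid stands in for a model-based patch-to-query score, the branch count is fixed rather than adaptive, and the quadrant split is one of many possible patch layouts.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def region_score(relevance: List[List[float]], box: Box) -> float:
    """Stand-in relevance score: the best cell inside the region."""
    x0, y0, x1, y1 = box
    return max(relevance[y][x] for y in range(y0, y1) for x in range(x0, x1))


def split(box: Box) -> List[Box]:
    """Split a region into four quadrants."""
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    return [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]


def zoom_search(relevance: List[List[float]], box: Box,
                min_size: int, branches: int) -> List[Box]:
    """Descend into the `branches` highest-scoring quadrants until regions
    are at most `min_size` wide or tall, and return the leaf regions."""
    x0, y0, x1, y1 = box
    if x1 - x0 <= min_size or y1 - y0 <= min_size:
        return [box]
    ranked = sorted(split(box), key=lambda b: region_score(relevance, b), reverse=True)
    leaves: List[Box] = []
    for child in ranked[:branches]:
        leaves.extend(zoom_search(relevance, child, min_size, branches))
    return leaves


# Hypothetical 4x4 relevance map with one query-relevant cell in the top-right corner.
relevance = [
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
patches = zoom_search(relevance, (0, 0, 4, 4), min_size=1, branches=1)
print(patches)
```

The selected leaf boxes are what a reassembly step would then arrange into a compact canvas for the answering model; that step is not sketched here.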
Activation Quantization of Vision Encoders Needs Prefixing Registers
Authors: Seunghyeon Kim, Jinho Kim, Taesun Yeom, Wonpyo Park, Kyuyeun Kim, Jaeho Lee
First: 2025-10-06T07:27:46+00:00 · Latest: 2025-11-28T12:54:42+00:00
Comments: 19 pages, 8 figures
Abstract
Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose RegCache, a training-free algorithm that mitigates outliers in large-scale pretrained vision encoders and serves as a plug-in module that can be applied on top of other quantization methods. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.
中文标题/摘要
标题:视觉编码器的激活量化需要前缀寄存器
基于变换器的视觉编码器(如CLIP)是多模态智能的核心,推动着从自主网络代理到机器人控制的各种应用。由于这些应用通常需要实时处理大量视觉数据,因此降低视觉编码器的推理成本至关重要。量化提供了一条可行的路径,但由于大规模激活(即异常值)的存在,即使在8位精度下也仍然具有挑战性。在本文中,我们提出了一种无需训练的算法RegCache,该算法可以缓解大规模预训练视觉编码器中的异常值,并作为可插拔模块应用于其他量化方法之上。所提出的RegCache在目标视觉编码器中引入了易产生异常但无语义意义的前缀标记,从而防止其他标记产生异常值。值得注意的是,我们观察到视觉编码器中的异常值与语言模型中的异常值行为不同,这促使我们提出了两种技术创新:中间层前缀和标记删除。实验表明,我们的方法在文本监督和自监督的视觉编码器中一致地提高了量化模型的准确性。
Summary / 总结
This paper addresses the challenge of quantizing transformer-based vision encoders, which are crucial for real-time processing of visual data in applications like autonomous web agents and robotic control. The authors propose RegCache, a training-free method that introduces prefix tokens to mitigate outliers in large-scale pretrained vision encoders. This method improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders by preventing other tokens from having outliers. The key technical innovations include middle-layer prefixing and token deletion, which are motivated by the different behavior of outliers in vision encoders compared to language models.
本文解决了基于Transformer的视觉编码器量化的问题,这些编码器对于实时多模态应用至关重要。提出的RegCache方法引入了前缀标记以缓解大型预训练视觉编码器中的异常值,从而提高量化模型的准确性,涵盖文本监督和自监督的视觉编码器。
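Why a single massive activation makes 8-bit quantization hard can be shown with plain symmetric per-tensor quantization. This sketch illustrates the outlier problem only, not RegCache itself; the activation values and the per-tensor max-based scale are assumptions chosen for the demonstration.

```python
from typing import List


def quantize_dequantize(values: List[float], bits: int = 8) -> List[float]:
    """Symmetric per-tensor quantization: one scale set by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0.0 for _ in values]
    scale = peak / qmax
    # Round to the nearest integer level, clamp, then map back to floats.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(reference: List[float], approx: List[float]) -> float:
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


# Hypothetical activations in [-1, 1], with and without one massive outlier.
typical = [0.1 * i for i in range(-10, 11)]
err_clean = mean_abs_error(typical, quantize_dequantize(typical))
err_outlier = mean_abs_error(typical, quantize_dequantize(typical + [100.0])[:-1])
print(err_clean, err_outlier)
```

With the outlier present, the shared scale grows by a factor of 100, so the ordinary activations collapse onto only a few integer levels and their reconstruction error rises sharply. Moving outliers into dedicated tokens, as the abstract describes, keeps them out of the range that the remaining tokens must share.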
Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
Authors: Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
First: 2025-11-26T16:53:05+00:00 · Latest: 2025-11-28T12:25:17+00:00
Abstract
The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
中文标题/摘要
标题:Harmony:通过跨任务协同协调音频与视频生成
同步音频-视觉内容的合成是生成式AI中的一个关键挑战,开源模型在鲁棒的音频-视频对齐方面面临困难。我们的分析表明,这一问题源于联合扩散过程中的三个基本挑战:(1)对应关系漂移,同时演化的噪声潜在变量阻碍了对齐的稳定学习;(2)低效的全局注意力机制,无法捕捉细粒度的时间线索;(3)传统无分类器引导(CFG)的模态内偏差,增强了条件性但未提高跨模态同步。为克服这些挑战,我们引入了Harmony,一种从机制上强制音频-视觉同步的新框架。我们首先提出了一种跨任务协同训练范式,通过利用音频驱动的视频生成和视频驱动的音频生成任务提供的强监督信号来减轻漂移。然后,我们设计了一种全局-局部解耦交互模块,以实现高效和精确的时间-风格对齐。最后,我们提出了一种新的同步增强无分类器引导(SyncCFG),在推理过程中明确隔离并放大对齐信号。广泛的实验表明,Harmony建立了新的最先进水平,在生成保真度以及尤为关键的细粒度音频-视觉同步两方面均显著优于现有方法。
Summary / 总结
The research addresses the challenge of robust audio-video alignment in generative AI by identifying three key issues: correspondence drift, inefficient global attention, and intra-modal bias in Classifier-Free Guidance. To tackle these, the authors propose Harmony, a framework that includes a Cross-Task Synergy training paradigm, a Global-Local Decoupled Interaction Module, and a Synchronization-Enhanced Classifier-Free Guidance. Experiments show that Harmony improves generation fidelity and achieves better fine-grained audio-visual synchronization compared to existing methods.
论文解决了生成式AI模型中音频-视频对齐的鲁棒性问题,识别出三个关键问题:对应关系漂移、全局注意力机制效率低下,以及无分类器引导(CFG)的模态内偏差。为了解决这些问题,作者引入了Harmony框架,包括跨任务协同训练范式、全局-局部解耦交互模块以及同步增强的无分类器引导(SyncCFG)。实验表明,Harmony在生成质量和细粒度的音频-视频同步方面均优于现有方法。
MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning?
Authors: Yuandong Wang, Yao Cui, Yuxin Zhao, Zhen Yang, Yangfu Zhu, Zhenzhou Shao
First: 2025-11-28T11:55:05+00:00 · Latest: 2025-11-28T11:55:05+00:00
Comments: 32 pages, 15 figures, 9 tables, includes appendix. Project page: https://cnu-bot-group.github.io/MathSight/
Abstract
Recent advances in Vision-Language Models (VLMs) have achieved impressive progress in multimodal mathematical reasoning. Yet, how much visual information truly contributes to reasoning remains unclear. Existing benchmarks report strong overall performance but seldom isolate the role of the image modality, leaving open whether VLMs genuinely leverage visual understanding or merely depend on linguistic priors. To address this, we present MathSight, a university-level multimodal mathematical reasoning benchmark designed to disentangle and quantify the effect of visual input. Each problem includes multiple visual variants -- original, hand-drawn, photo-captured -- and a text-only condition for controlled comparison. Experiments on state-of-the-art VLMs reveal a consistent trend: the contribution of visual information diminishes with increasing problem difficulty. Remarkably, Qwen3-VL without any image input surpasses both its multimodal variants and GPT-5, underscoring the need for benchmarks like MathSight to advance genuine vision-grounded reasoning in future models.
中文标题/摘要
标题:MathSight:探索视觉语言模型在大学级数学推理中是否真正见过
视觉语言模型(VLMs)在多模态数学推理方面取得了令人印象深刻的进展。然而,视觉信息在推理中真正起到了多大作用仍然不清楚。现有的基准报告了整体性能的强大,但很少孤立地隔离图像模态的作用,这使得人们质疑VLMs是否真正利用了视觉理解,还是仅仅依赖于语言先验。为了解决这个问题,我们提出了MathSight,这是一个大学级的多模态数学推理基准,旨在分离并量化视觉输入的影响。每个问题都包括多种视觉变体——原始的、手绘的、照片捕捉的——以及一个纯文本条件,以便进行受控比较。对最先进的VLMs的实验揭示了一个一致的趋势:随着问题难度的增加,视觉信息的贡献逐渐减少。令人惊讶的是,没有图像输入的Qwen3-VL超过了其多模态变体和GPT-5,这强调了需要像MathSight这样的基准来推动未来模型中真正的视觉基础推理的发展。
Summary / 总结
MathSight is a benchmark designed to evaluate the role of visual information in university-level mathematical reasoning. Each problem comes in multiple visual variants (original, hand-drawn, photo-captured) plus a text-only condition that isolates the visual contribution. Experiments on state-of-the-art VLMs show that the contribution of visual information diminishes as problem difficulty increases. Notably, Qwen3-VL without any image input outperforms its multimodal variants and GPT-5, highlighting the need for benchmarks like MathSight to promote genuine vision-grounded reasoning in future models.
MathSight 是一个用于评估视觉信息在大学级数学推理中作用的基准。它包括每个问题的多种视觉变体和文本条件。实验显示,随着问题难度的增加,视觉信息的贡献逐渐减少,而 Qwen3-VL 不使用任何图像输入的表现超过了其多模态变体和 GPT-5,突显了需要像 MathSight 这样的基准来推动模型中的视觉导向推理的发展。
SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
Authors: Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, Zizhuang Wei
First: 2025-11-28T11:04:21+00:00 · Latest: 2025-11-28T11:04:21+00:00
Abstract
Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.
中文标题/摘要
标题:SpaceMind:基于相机引导的模态融合在视觉-语言模型中进行空间推理
大型视觉-语言模型(VLMs)在多模态理解方面表现出色,但在三维空间推理方面仍然存在困难,例如距离估计、大小比较和跨视图一致性。现有的三维感知方法要么依赖于辅助的三维信息,要么通过浅层特征融合将几何编码器增强到仅基于RGB的VLMs中。我们提出了一种名为SpaceMind的多模态大型语言模型,该模型专门从RGB输入中进行空间推理。该模型采用双编码器架构,将VGGT作为空间理解编码器,将InternViT作为二维视觉编码器。关键思想是将相机表示视为一种主动引导模态,而不是被动元数据。具体而言,SpaceMind在语言模型之前引入了一个轻量级的相机引导模态融合模块,以替代浅层融合。它对空间标记应用相机条件偏差,分配反映其几何重要性的查询独立权重,并使用相机嵌入门控融合表示。实验结果表明,SpaceMind在VSI-Bench、SQA3D和SPBench上建立了新的最佳结果,分别在VSI-Bench和SPBench上大幅超越了开源和专有系统,并在SQA3D上实现了最佳性能。这些结果表明,相机引导的模态融合是为VLMs提供真正空间化智能的有效且实用的归纳偏置。我们将发布代码和模型检查点以支持未来的研究。
Summary / 总结
SpaceMind is a multimodal large language model designed for spatial reasoning using RGB inputs. It employs a dual-encoder architecture with VGGT for spatial understanding and InternViT for 2D vision. The model introduces a Camera-Guided Modality Fusion module that uses camera-conditioned biasing and geometric importance weights to enhance spatial reasoning. Empirically, SpaceMind outperforms existing methods on VSI-Bench, SQA3D, and SPBench, demonstrating the effectiveness of camera-guided modality fusion for spatially grounded intelligence in VLMs.
SpaceMind 是一种使用 RGB 输入增强视觉语言模型空间推理能力的多模态大语言模型。它采用 VGGT 进行空间理解、InternViT 进行 2D 视觉处理的双编码器架构。关键创新在于一个基于相机引导的模态融合模块,利用相机表示对空间标记进行偏置并控制融合表示。实验表明,SpaceMind 在 VSI-Bench、SQA3D 和 SPBench 上超越现有方法,建立了新的最佳性能,证明了基于相机引导的融合对于在 VLM 中实现真正空间导向智能的有效性和实用性。
Buffer replay enhances the robustness of multimodal learning under missing-modality
Authors: Hongye Zhu, Xuan Liu, Yanwen Ba, Jingye Xue, Shigeng Zhang
First: 2025-11-28T10:55:31+00:00 · Latest: 2025-11-28T10:55:31+00:00
Abstract
Missing modalities consistently lead to significant performance degradation in multimodal models. Existing approaches either synthesize missing modalities at high computational cost or apply prompt-based fine-tuning that relies only on adjacent-layer features and overlooks long-distance contextual information, which may offer additional tolerance to errors when one or more modalities are missing. To address this, we introduce REplay Prompting (REP): (1) construct modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers, mitigating information loss as network depth increases; (2) employ a private-shared feature decoupling strategy, where private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics; and (3) design a task-aware dynamic initialization mechanism to configure these buffers differently, improving stability and generalization under diverse missing-modality conditions. Experiments on vision-language, vision-language-audio, and temporal multimodal benchmarks demonstrate that REP consistently outperforms prior methods under both single- and multi-modality missing scenarios, while introducing only negligible parameter overhead. These results establish REP as a lightweight and effective paradigm for robust multimodal learning in challenging missing-modality environments.
中文标题/摘要
标题:缓冲重放增强多模态学习在缺失模态下的鲁棒性
缺失模态在多模态模型中会导致显著的性能下降。现有方法要么以高计算成本合成缺失模态,要么采用基于提示的微调,后者仅依赖相邻层特征而忽视了长距离上下文信息,而这类信息可能在一种或多种模态缺失时提供额外的容错能力。为解决这一问题,我们引入了重放提示(REplay Prompting, REP):(1) 通过残差旁路构建模态特定的特征缓冲区,缓存早期层表示并在更深的层中重放,以减轻网络深度增加带来的信息损失;(2) 使用私有-共享特征解耦策略,其中私有缓冲区保留模态特定信号,共享缓冲区编码跨模态语义;(3) 设计任务感知的动态初始化机制,根据不同情况配置这些缓冲区,提高在多种缺失模态条件下的稳定性和泛化能力。在视觉-语言、视觉-语言-音频和时序多模态基准上的实验表明,REP 在单模态和多模态缺失场景下均优于先前方法,同时仅引入微不足道的参数开销。这些结果确立了REP 作为在具有挑战性的缺失模态环境中轻量且有效的多模态学习范式的地位。
Summary / 总结
The paper addresses the issue of performance degradation in multimodal models when one or more modalities are missing. It proposes REplay Prompting (REP), which constructs modality-wise feature buffers to cache early-layer representations and replay them in deeper layers, and employs a private-shared feature decoupling strategy to handle cross-modal semantics. The method also includes a task-aware dynamic initialization mechanism. Experimental results on various benchmarks show that REP outperforms existing methods with minimal parameter overhead, enhancing the robustness of multimodal learning under missing-modality conditions.
论文针对多模态模型在某一或多个模态缺失时性能下降的问题,提出了REplay Prompting (REP) 方法,通过构建模态特定的特征缓冲区来缓存早期层的表示并在更深的层中重放,同时采用私有-共享特征解耦策略以增强鲁棒性。实验结果表明,REP 在单模态和多模态缺失场景下均优于先前方法,并且参数开销很小。
MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
Authors: Ruoxuan Zhang, Qiyun Zheng, Zhiyu Zhou, Ziqi Liao, Siyu Wu, Jian-Yu Jiang-Lin, Bin Wen, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng
First: 2025-11-28T10:24:44+00:00 · Latest: 2025-11-28T10:24:44+00:00
Abstract
Theory of Mind (ToM) refers to the ability to infer others' mental states, such as beliefs, desires, and intentions. Current vision-language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent's own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
中文标题/摘要
标题:MindPower:在基于VLM的具身代理中实现心智理论推理
心智理论(ToM)是指推断他人心理状态的能力,如信念、欲望和意图。当前的视觉-语言具身代理缺乏基于ToM的决策能力,现有的基准测试仅关注人类的心理状态而忽视了代理自身的视角,阻碍了连贯决策和行动的生成。为了解决这一问题,我们提出了MindPower,一种以机器人为中心的框架,整合了感知、心理推理、决策和行动。给定多模态输入,MindPower首先感知环境和人类状态,然后进行ToM推理以建模自我和他人,并最终根据推断的心理状态生成决策和行动。此外,我们引入了Mind-Reward,一种新的优化目标,鼓励VLMs产生一致的ToM推理和行为。我们的模型在决策上比GPT-4o高出12.77%,在行动生成上高出12.49%。
Summary / 总结
The research aims to enhance vision-language embodied agents with Theory of Mind (ToM) reasoning to better model and respond to human mental states. MindPower, a Robot-Centric framework, integrates perception, mental reasoning, decision making, and action generation. It processes multimodal inputs to perceive the environment and human states, performs ToM reasoning to model both self and others, and generates decisions and actions based on inferred mental states. The model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
MindPower 是一个以机器人为中心的框架,整合了感知、心理推理、决策和行动,以在视觉-语言具身代理中实现心智理论(ToM)推理。它处理多模态输入以感知环境和人类状态,进行ToM推理以建模自我和他人,并根据推断的心理状态生成决策和行动。该模型在决策上比 GPT-4o 高出 12.77%,在行动生成上高出 12.49%。
From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning
Authors: Changpeng Wang, Haozhe Wang, Xi Chen, Junhan Liu, Taofeng Xue, Chong Peng, Donglian Qi, Fangzhen Lin, Yunfeng Yan
First: 2025-11-28T09:52:56+00:00 · Latest: 2025-11-28T09:52:56+00:00
Comments: 19 pages, 15 figures
Abstract
Recent advances in vision-language reasoning underscore the importance of thinking with images, where models actively ground their reasoning in visual evidence. Yet, prevailing frameworks treat visual actions as optional tools, boosting metrics but leaving reasoning ungrounded and crops ineffective. This gap gives rise to the illusion of thinking with images: models seem visually grounded but rely on context-agnostic actions that neither refine perception nor guide reasoning toward correct answers. We address this problem by reframing visual actions as core reasoning primitives rather than optional tools, which we term visual rationalization, the visual analogue of textual Chain-of-Thought. Building on this insight, we propose Visual Rationale Learning (ViRL), an end-to-end paradigm that grounds training in the visual rationale itself. ViRL integrates (1) Process Supervision with ground-truth rationales, (2) Objective Alignment via step-level reward shaping, and (3) Fine-Grained Credit Assignment to distinguish correct, redundant, and erroneous actions. By ensuring each action contributes meaningfully to the reasoning chain, ViRL enables models to "get the right answer for the right visual reason". Trained purely with end-to-end RL, ViRL achieves state-of-the-art results across benchmarks spanning perception, hallucination, and reasoning. This work establishes visual rationalization as a task-agnostic, process-grounded paradigm for building transparent, verifiable, and trustworthy vision-language models.
中文标题/摘要
标题:从幻觉到意图:面向视觉语言推理的视觉理据学习
视觉语言推理的最新进展强调了图像思维的重要性,其中模型在其推理中积极地将视觉证据作为基础。然而,现有的框架将视觉操作视为可选工具,虽然可以提升指标,但会使推理缺乏根基,裁剪无效。这种差距导致了图像思维的幻觉:模型看似视觉上得到了支撑,但实际上依赖于与上下文无关的操作,这些操作既不改进感知,也不引导推理走向正确答案。我们通过将视觉操作重新定义为核心推理原语,而不是可选工具,来解决这一问题,这被称为视觉理据化,是文本链式思考的视觉对应。基于这一见解,我们提出了视觉理据学习(ViRL),这是一种端到端的范式,将训练基础放在视觉理据本身上。ViRL 结合了(1)基于真实理据的过程监督,(2)通过步骤级奖励塑造的目标对齐,以及(3)细粒度的功劳分配,以区分正确的、冗余的和错误的操作。通过确保每个操作对推理链都有所贡献,ViRL 使模型能够“为正确的视觉原因给出正确的答案”。仅通过端到端的强化学习训练,ViRL 在涵盖感知、幻觉和推理的基准测试中达到了最先进的结果。这项工作确立了视觉理据化作为一种任务无关的、过程基础的范式,用于构建透明、可验证和可信赖的视觉语言模型。
Summary / 总结
The research addresses the "illusion of thinking with images": vision-language reasoning models appear visually grounded but rely on context-agnostic visual actions that neither refine perception nor guide reasoning toward correct answers. The authors propose Visual Rationale Learning (ViRL), which treats visual actions as core reasoning primitives and integrates process supervision with ground-truth rationales, objective alignment via step-level reward shaping, and fine-grained credit assignment that distinguishes correct, redundant, and erroneous actions. ViRL achieves state-of-the-art results across benchmarks spanning perception, hallucination, and reasoning.
研究解决了视觉语言推理模型在视觉证据上不接地的问题,这导致了视觉推理的表象。提出Visual Rationale Learning (ViRL),将视觉动作视为核心推理原语,整合过程监督、目标对齐和细粒度的信用分配。ViRL在各种基准测试中取得了最先进的成果,展示了其通过视觉证据引导推理并确保模型提供正确答案的能力。
Guiding Visual Autoregressive Models through Spectrum Weakening
Authors: Chaoyang Wang, Tianmeng Yang, Jingdong Wang, Yunhai Tong
First: 2025-11-28T08:52:50+00:00 · Latest: 2025-11-28T08:52:50+00:00
Abstract
Classifier-free guidance (CFG) has become a widely adopted and practical approach for enhancing generation quality and improving condition alignment. Recent studies have explored guidance mechanisms for unconditional generation, yet these approaches remain fundamentally tied to assumptions specific to diffusion models. In this work, we propose a spectrum-weakening framework for visual autoregressive (AR) models. This method works without the need for re-training, specific conditions, or any architectural modifications. It achieves this by constructing a controllable weak model in the spectral domain. We theoretically show that invertible spectral transformations preserve information, while selectively retaining only a subset of the spectrum introduces controlled information reduction. Based on this insight, we perform spectrum selection along the channel dimension of internal representations, which avoids the structural constraints imposed by diffusion models. We further introduce two spectrum renormalization strategies that ensure numerical stability during the weakening process. Extensive experiments were conducted on both discrete and continuous AR models, with text or class conditioning. The results demonstrate that our method enables high-quality unconditional generation while maintaining strong prompt alignment for conditional generation.
中文标题/摘要
标题:通过频谱减弱引导视觉自回归模型
无分类器引导(CFG)已成为提高生成质量和改善条件对齐的广泛采用和实用方法。近期研究探索了无条件生成的引导机制,但这些方法仍从根本上依赖于扩散模型特有的假设。在本文中,我们提出了一种适用于视觉自回归(AR)模型的频谱减弱框架。该方法无需重新训练、特定条件或任何架构修改,而是通过在频谱域中构建可控的弱模型来实现。我们从理论上证明,可逆的频谱变换保留信息,而选择性地只保留频谱的一个子集则引入可控的信息减少。基于这一见解,我们在内部表示的通道维度上进行频谱选择,从而避免了扩散模型施加的结构约束。我们还引入了两种频谱重归一化策略,以确保减弱过程中的数值稳定性。我们在离散和连续AR模型上进行了广泛的实验,条件包括文本或类别。结果表明,我们的方法能够实现高质量的无条件生成,同时在条件生成中保持良好的提示对齐。
Summary / 总结
This paper introduces a spectrum-weakening framework for visual autoregressive models to enhance generation quality and condition alignment without re-training or architectural modifications. By selectively retaining a subset of the spectrum in the spectral domain, the method achieves controlled information reduction and avoids the structural constraints of diffusion models. Experiments on both discrete and continuous AR models show that the proposed method enables high-quality unconditional generation and strong prompt alignment for conditional generation.
本文提出了一种用于视觉自回归模型的频谱减弱框架,以提高生成质量和条件对齐,无需重新训练或架构修改。该方法在频谱域中选择性地只保留频谱的一个子集,从而实现可控的信息减少,并避免扩散模型的结构约束。实验表明,该方法在离散和连续自回归模型上实现了高质量的无条件生成,并在条件生成中保持良好的提示对齐。
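The spectral idea in the abstract above (an invertible transform preserves information, keeping only part of the spectrum reduces it, and renormalization keeps magnitudes stable) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not the paper's method: the orthonormal DCT-II as the transform, keeping the lowest `keep` coefficients, norm matching as the renormalization, and the generic extrapolation rule in `guide`.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis (assumed transform); rows are basis vectors."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in v))

def weaken(x, keep):
    """Weaken one channel vector by keeping only its first `keep` spectral coefficients."""
    n = len(x)
    m = dct_matrix(n)
    # Forward transform: project x onto each basis vector.
    coeffs = [sum(m[k][i] * x[i] for i in range(n)) for k in range(n)]
    # Spectrum selection (assumption: drop everything past index `keep`).
    kept = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
    # Inverse transform: the basis is orthonormal, so the inverse is the transpose.
    weak = [sum(m[k][i] * kept[k] for k in range(n)) for i in range(n)]
    # Renormalization (assumption: rescale to the original vector's norm).
    weak_norm = norm(weak)
    if weak_norm > 0.0:
        scale = norm(x) / weak_norm
        weak = [scale * a for a in weak]
    return weak

def guide(strong, weak, w):
    """Generic guidance extrapolation away from the weak output (assumed form)."""
    return [s + w * (s - k) for s, k in zip(strong, weak)]
```

With `keep` equal to the vector length, `weaken` returns the input up to floating-point error, which is the information-preserving case; smaller values of `keep` give a progressively weaker representation.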
HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model
Authors: Chen Li, Eric Peh, Basura Fernando
First: 2025-11-28T08:06:20+00:00 · Latest: 2025-11-28T08:06:20+00:00
Abstract
Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs at the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images include a top-down perspective and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we introduce a hierarchical feature representation that aggregates patch-level image features into view-level and scene-level representations, enabling the model to reason over both local and global scene context. Experimental results on both situated 3D Q&A and general 3D Q&A benchmarks demonstrate the effectiveness of our approach.
中文标题/摘要
标题:HMR3D:基于大型视觉-语言模型的3D场景理解分层多模态表示
大型视觉-语言模型(VLMs)的最新进展显示了其在3D场景理解方面的巨大潜力。现有的基于VLM的方法通常将3D场景特征与VLM的嵌入空间对齐。然而,这种隐式的对齐往往由于3D数据的稀缺性和3D环境中的空间关系的内在复杂性而表现不佳。为了解决这些限制,我们提出了一种新的用于3D场景推理的分层多模态表示方法,通过利用多视角图像和文本描述在输入空间中显式地与VLM对齐。文本描述通过引用检测到的对象的3D坐标来捕捉空间关系,而多视角图像包括俯视图和四个方向视图(前、左、右、后),确保场景的全面覆盖。此外,我们引入了一种分层特征表示,将图像块级(patch-level)图像特征聚合到视图级和场景级表示中,使模型能够推理局部和全局场景上下文。在情境化3D问答和通用3D问答基准测试上的实验结果表明了我们方法的有效性。
Summary / 总结
This paper aims to improve 3D scene understanding with vision-language models (VLMs), where implicitly aligning 3D scene features with the VLM's embedding space is limited by scarce 3D data and complex spatial relationships. The proposed HMR3D method instead aligns with the VLM explicitly at the input space, using multi-view images (a top-down view and four directional views) together with text descriptions that reference the 3D coordinates of detected objects. A hierarchical feature representation aggregates patch-level image features into view-level and scene-level representations. Experimental results on situated and general 3D question-answering benchmarks demonstrate the effectiveness of the approach.
该论文提出了HMR3D方法,通过结合多视角图像和文本描述,在输入空间中显式地与大型视觉-语言模型对齐。该方法通过引用检测到的对象3D坐标的文本描述捕捉空间关系,并使用分层特征表示将图像块级特征聚合为视图级和场景级表示。在情境化和通用3D问答基准测试上的实验结果表明了该方法的有效性。
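The patch-to-view-to-scene aggregation described in the abstract above can be sketched as two pooling stages. This is only an illustration under stated assumptions: mean pooling stands in for whatever aggregation the paper uses, and the function names and view names are hypothetical.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length feature vectors, dimension by dimension."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]

def hierarchical_features(patch_features):
    """Aggregate patch-level features into view-level and scene-level features.

    patch_features maps a view name (for example "top_down" or "forward")
    to the list of patch feature vectors extracted from that view.
    Mean pooling is an assumed stand-in for the paper's aggregation.
    """
    # Stage 1: pool the patches of each view into one view-level vector.
    view_level = {name: mean_pool(patches) for name, patches in patch_features.items()}
    # Stage 2: pool the view-level vectors into one scene-level vector.
    scene_level = mean_pool(list(view_level.values()))
    return view_level, scene_level
```

Keeping both levels lets a downstream model attend to local context (per view) and global context (per scene), which is the motivation the abstract gives for the hierarchy.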
A Trainable Centrality Framework for Modern Data
Authors: Minh Duc Vu, Mingshuo Liu, Doudou Zhou
First: 2025-11-28T08:04:38+00:00 · Latest: 2025-11-28T08:04:38+00:00
Abstract
Measuring how central or typical a data point is underpins robust estimation, ranking, and outlier detection, but classical depth notions become expensive and unstable in high dimensions and are hard to extend beyond Euclidean data. We introduce Fused Unified centrality Score Estimation (FUSE), a neural centrality framework that operates on top of arbitrary representations. FUSE combines a global head, trained from pairwise distance-based comparisons to learn an anchor-free centrality score, with a local head, trained by denoising score matching to approximate a smoothed log-density potential. A single parameter between 0 and 1 interpolates between these calibrated signals, yielding depth-like centrality from different views via one forward pass. Across synthetic distributions, real images, time series, and text data, and standard outlier detection benchmarks, FUSE recovers meaningful classical ordering, reveals multi-scale geometric structures, and attains competitive performance with strong classical baselines while remaining simple and efficient.
中文标题/摘要
标题:一种可训练的中心性框架用于现代数据
衡量数据点的中心性或典型性是稳健估计、排名和异常检测的基础,但经典的深度概念在高维空间中变得昂贵且不稳定,并且难以扩展到非欧几里得数据。我们引入了融合统一中心性分数估计(FUSE),这是一种基于任意表示的神经中心性框架。FUSE 结合了一个全局头部,通过基于成对距离的比较进行训练以学习无锚点的中心性分数,以及一个局部头部,通过去噪评分匹配进行训练以近似平滑的对数密度势。介于 0 和 1 之间的单个参数在这些校准信号之间进行插值,通过一次前向传播从不同视角获得类似深度的中心性。在合成分布、真实图像、时间序列和文本数据以及标准异常检测基准中,FUSE 恢复了有意义的经典排序,揭示了多尺度几何结构,并在保持简单高效的同时达到了与强大经典基线相当的性能。
Summary / 总结
The paper addresses the challenge of measuring the centrality of data points in high-dimensional spaces, where classical depth notions become expensive and unstable. It introduces Fused Unified centrality Score Estimation (FUSE), a neural framework that combines a global head trained from pairwise distance-based comparisons with a local head trained by denoising score matching, and interpolates between the two with a single parameter between 0 and 1. Across synthetic distributions, images, time series, text, and outlier detection benchmarks, FUSE recovers meaningful classical ordering, reveals multi-scale geometric structures, and is competitive with strong classical baselines.
论文提出了FUSE(Fused Unified centrality Score Estimation)框架,用于衡量数据点的中心性。FUSE结合了通过成对距离比较训练的全局头和通过去噪评分匹配训练的局部头,并通过一个参数在两者之间进行插值。实验表明,FUSE在各种数据类型和基准测试上恢复了经典排序,揭示了多尺度几何结构,并在保持简单高效的同时取得了与强经典基线相当的性能。
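The single-parameter interpolation between the two heads can be sketched as follows. This is a minimal illustration, not the FUSE implementation: the rank-based calibration, the direction of the parameter (0 selects the global signal, 1 the local one), and all function names are assumptions.

```python
def rank_calibrate(scores):
    """Map raw scores to [0, 1] by rank so two heads become comparable (assumed calibration)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    calibrated = [0.0] * n
    for rank, i in enumerate(order):
        # Ties are broken by input order because sorted() is stable.
        calibrated[i] = rank / (n - 1) if n > 1 else 0.0
    return calibrated

def fused_centrality(global_scores, local_scores, lam):
    """Interpolate between calibrated global and local centrality signals.

    lam = 0.0 returns the global signal and lam = 1.0 the local one;
    this orientation is an assumption made for the sketch.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    g = rank_calibrate(global_scores)
    l = rank_calibrate(local_scores)
    return [(1.0 - lam) * a + lam * b for a, b in zip(g, l)]
```

Because both inputs would come from one forward pass of the two heads, sweeping `lam` only re-mixes scores that are already computed, which matches the abstract's claim of obtaining several views of centrality cheaply.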
PhysX-3D: Physical-Grounded 3D Asset Generation
Authors: Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu
Venue: NeurIPS 2025 Spotlight
First: 2025-07-16T17:59:35+00:00 · Latest: 2025-11-28T07:27:23+00:00
Comments: Accepted by NeurIPS 2025 (Spotlight). Project page: https://physx-3d.github.io/
Abstract
3D modeling is moving from virtual to physical. Existing 3D generation primarily emphasizes geometries and textures while neglecting physical-grounded modeling. Consequently, despite the rapid development of 3D generative models, the synthesized 3D assets often overlook rich and important physical properties, hampering their real-world application in physical domains like simulation and embodied AI. As an initial attempt to address this challenge, we propose PhysX-3D, an end-to-end paradigm for physical-grounded 3D asset generation. 1) To bridge the critical gap in physics-annotated 3D datasets, we present PhysXNet, the first physics-grounded 3D dataset systematically annotated across five foundational dimensions: absolute scale, material, affordance, kinematics, and function description. In particular, we devise a scalable human-in-the-loop annotation pipeline based on vision-language models, which enables efficient creation of physics-first assets from raw 3D assets. 2) Furthermore, we propose PhysXGen, a feed-forward framework for physics-grounded image-to-3D asset generation, injecting physical knowledge into the pre-trained 3D structural space. Specifically, PhysXGen employs a dual-branch architecture to explicitly model the latent correlations between 3D structures and physical properties, thereby producing 3D assets with plausible physical predictions while preserving the native geometry quality. Extensive experiments validate the superior performance and promising generalization capability of our framework. All the code, data, and models will be released to facilitate future research in generative physical AI.
中文标题/摘要
标题:PhysX-3D:基于物理的3D资产生成
3D建模正从虚拟转向物理。现有的3D生成主要强调几何和纹理,而忽视了基于物理的建模。因此,尽管3D生成模型迅速发展,合成的3D资产往往忽略了丰富而重要的物理特性,阻碍了它们在物理领域如仿真和具身AI中的实际应用。作为应对这一挑战的初步尝试,我们提出PhysX-3D,一种端到端的基于物理的3D资产生成范式。1) 为弥合物理标注3D数据集的关键缺口,我们提出了PhysXNet,这是首个在五个基础维度上系统标注的物理标注3D数据集:绝对尺度、材料、可供性、运动学和功能描述。特别是,我们基于视觉语言模型设计了一种可扩展的人机协作标注流水线,从而能够从原始3D资产高效创建物理优先的资产。2) 此外,我们提出了PhysXGen,一种用于基于物理的图像到3D资产生成的前馈框架,将物理知识注入预训练的3D结构空间。具体而言,PhysXGen 使用双分支架构显式建模3D结构与物理属性之间的潜在关联,从而生成具有合理物理预测的3D资产,同时保留原始几何质量。大量实验验证了我们框架的优越性能和良好的泛化能力。所有代码、数据和模型将被发布,以促进生成式物理AI领域的未来研究。
Summary / 总结
The research aims to address the lack of physical properties in 3D assets generated by existing models, which limits their application in physical domains such as simulation and embodied AI. To this end, the authors propose PhysX-3D, an end-to-end paradigm for physical-grounded 3D asset generation. They introduce PhysXNet, a physics-grounded 3D dataset annotated across five dimensions (absolute scale, material, affordance, kinematics, and function description), and PhysXGen, a feed-forward image-to-3D framework that injects physical knowledge into the pre-trained 3D structural space through a dual-branch architecture. The experiments show that the framework generates 3D assets with plausible physical predictions while preserving geometry quality, and that it generalizes well.
研究旨在解决现有3D生成模型忽视物理属性的问题,阻碍了其在实际应用中的发展。为此,作者提出了PhysX-3D,一种端到端的物理导向3D资产生成范式。他们引入了PhysXNet,这是第一个在五个维度上系统标注的物理导向3D数据集,以及PhysXGen,这是一种前馈框架,将物理知识注入3D结构空间,生成具有合理物理预测的资产。实验表明,所提出框架具有优越的性能和良好的泛化能力。
Artwork Interpretation with Vision Language Models: A Case Study on Emotions and Emotion Symbols
Authors: Sebastian Padó, Kerstin Thomas
First: 2025-11-28T07:04:09+00:00 · Latest: 2025-11-28T07:04:09+00:00
Comments: Accepted for publication at the IJCNLP-AACL workshop on Multimodal Models for Low-Resource Contexts and Social Impact
Abstract
Emotions are a fundamental aspect of artistic expression. Due to their abstract nature, there is a broad spectrum of emotion realization in artworks. These are subject to historical change and their analysis requires expertise in art history. In this article, we investigate which aspects of emotional expression can be detected by current (2025) vision language models (VLMs). We present a case study of three VLMs (Llava-Llama and two Qwen models) in which we ask these models four sets of questions of increasing complexity about artworks (general content, emotional content, expression of emotions, and emotion symbols) and carry out a qualitative expert evaluation. We find that the VLMs recognize the content of the images surprisingly well and often also which emotions they depict and how they are expressed. The models perform best for concrete images but fail for highly abstract or highly symbolic images. Reliable recognition of symbols remains fundamentally difficult. Furthermore, the models continue to exhibit the well-known LLM weakness of providing inconsistent answers to related questions.
中文标题/摘要
标题:视觉语言模型中的艺术作品解读:情绪与情绪符号案例研究
情绪是艺术表达的基本方面。由于情绪的抽象性质,艺术作品中的情绪实现具有广泛的多样性。这些情绪实现随历史变迁而变化,其分析需要艺术史方面的专业知识。本文探讨了当前(2025年)视觉语言模型(VLMs)能够检测情绪表达的哪些方面。我们对三种VLMs(Llava-Llama和两个Qwen模型)进行了案例研究,向这些模型提出四组关于艺术品的、复杂度递增的问题(一般内容、情绪内容、情绪表达和情绪符号),并进行了定性专家评估。我们发现,这些模型对图像内容的识别出人意料地好,通常也能识别出图像所描绘的情绪及其表达方式。模型在具体图像上表现最佳,但在高度抽象或高度象征性的图像上表现不佳。符号的可靠识别仍然存在根本性困难。此外,模型继续表现出众所周知的LLM弱点,即对相关问题提供不一致的答案。
Summary / 总结
This study investigates the ability of current vision language models to detect emotional expression in artworks. Three VLMs (Llava-Llama and two Qwen models) were asked four sets of questions of increasing complexity about artworks, and a qualitative expert evaluation was conducted. The models performed well in recognizing the content and emotions depicted in the images, especially for concrete images, but struggled with highly abstract or symbolic images and the reliable recognition of symbols. The models also showed inconsistency in providing answers to related questions.
研究考察了当前视觉语言模型在识别艺术作品中情感表达方面的能力。通过向三种模型(Llava-Llama 和两个 Qwen 模型)提出四组复杂度递增的问题并进行定性专家评估,研究人员发现这些模型能够很好地识别图像内容和情感,尤其是对于具体图像,但在高度抽象或象征性的图像以及符号的可靠识别方面存在困难。此外,这些模型也表现出语言模型众所周知的弱点:对相关问题给出不一致的答案。
Leveraging Textual Compositional Reasoning for Robust Change Captioning
Authors: Kyu Ri Park, Jiyoung Park, Seong Tae Kim, Hong Joo Lee, Jung Uk Kim
Venue: AAAI 2026
First: 2025-11-28T06:11:23+00:00 · Latest: 2025-11-28T06:11:23+00:00
Comments: Accepted at AAAI 2026
Abstract
Change captioning aims to describe changes between a pair of images. However, existing works rely on visual features alone, which often fail to capture subtle but meaningful changes because they lack the ability to represent explicitly structured information such as object relationships and compositional semantics. To alleviate this, we present CORTEX (COmpositional Reasoning-aware TEXt-guided), a novel framework that integrates complementary textual cues to enhance change understanding. In addition to capturing cues from pixel-level differences, CORTEX utilizes scene-level textual knowledge provided by Vision Language Models (VLMs) to extract richer image-text signals that reveal underlying compositional reasoning. CORTEX consists of three key modules: (i) an Image-level Change Detector that identifies low-level visual differences between paired images, (ii) a Reasoning-aware Text Extraction (RTE) module that uses VLMs to generate compositional reasoning descriptions implicit in visual features, and (iii) an Image-Text Dual Alignment (ITDA) module that aligns visual and textual features for fine-grained relational reasoning. This enables CORTEX to reason over visual and textual features and capture changes that are otherwise ambiguous in visual features alone.
中文标题/摘要
标题:利用文本组合推理实现稳健的变化字幕
变化字幕旨在描述一对图像之间的变化。然而,现有工作仅依赖视觉特征,往往无法捕捉到细微但有意义的变化,因为它们缺乏表示显式结构信息(如对象关系和组合语义)的能力。为了解决这一问题,我们提出了CORTEX(COmpositional Reasoning-aware TEXt-guided)这一新颖框架,该框架结合了互补的文本线索以增强变化理解。除了捕捉像素级差异的线索外,CORTEX还利用视觉语言模型(VLMs)提供的场景级文本知识来提取更丰富的图像文本信号,揭示潜在的组合推理。CORTEX包含三个关键模块:(i) 图像级变化检测器,用于识别配对图像之间的低级视觉差异;(ii) 一种基于VLMs的推理感知文本提取(RTE)模块,用于生成隐含在视觉特征中的组合推理描述;(iii) 图像-文本双对齐(ITDA)模块,用于对齐视觉和文本特征以进行细粒度关系推理。这使CORTEX能够推理视觉和文本特征,并捕捉仅凭视觉特征无法明确的变化。
Summary / 总结
The paper addresses the challenge of change captioning by introducing CORTEX, a framework that leverages textual cues to enhance change understanding. It consists of an Image-level Change Detector, a Reasoning-aware Text Extraction module using Vision Language Models, and an Image-Text Dual Alignment module. The framework captures both pixel-level differences and scene-level textual knowledge, enabling more accurate and detailed change descriptions.
论文通过引入CORTEX框架来解决变化描述的挑战,该框架利用文本线索增强变化理解。它包括图像级变化检测器、使用视觉语言模型的推理感知文本提取模块以及图像-文本双对齐模块。该框架同时捕捉像素级差异和场景级文本知识,从而实现更准确和详细的变化描述。
ABM-LoRA: Activation Boundary Matching for Fast Convergence in Low-Rank Adaptation
Authors: Dongha Lee, Jinhee Park, Minjun Kim, Junseok Kwon
First: 2025-11-24T14:09:42+00:00 · Latest: 2025-11-28T06:10:47+00:00
Comments: 16 pages, 5 figures, under review
Abstract
We propose Activation Boundary Matching for Low-Rank Adaptation (ABM-LoRA), a principled initialization strategy that substantially accelerates the convergence of low-rank adapters. While LoRA offers high parameter efficiency, its random initialization restricts gradient updates to a mismatched tangent space, causing significant information loss and hindering early convergence. Our ABM-LoRA addresses this by aligning the adapter's activation boundaries with those of the pretrained model before downstream training, thereby maximizing the projection of full-parameter gradients into the adapter subspace. This alignment sharply reduces information loss at initialization, yields a lower starting loss, and accelerates convergence. We demonstrate ABM-LoRA's effectiveness across diverse architectures and tasks: language understanding (T5-Base on GLUE), dialogue generation (LLaMA2-7B on WizardLM), and vision recognition (ViT-B/16 on VTAB-1K). On VTAB-1K, it achieves the highest accuracy among all methods, with strong gains on structured reasoning tasks requiring geometric understanding.
中文标题/摘要
标题:ABM-LoRA:低秩适应中的激活边界匹配以实现快速收敛
我们提出了低秩适应中的激活边界匹配(ABM-LoRA),这是一种原理性的初始化策略,显著加速了低秩适配器的收敛速度。尽管LoRA具有高参数效率,但其随机初始化限制了梯度更新到一个不匹配的切空间中,导致大量信息丢失并阻碍了早期收敛。我们的ABM-LoRA通过在下游训练前使适配器的激活边界与预训练模型的激活边界对齐,从而最大化全参数梯度在适配器子空间中的投影。这种对齐显著减少了初始化时的信息丢失,降低了初始损失,并加速了收敛。我们在多种架构和任务上展示了ABM-LoRA的有效性:语言理解(T5-Base在GLUE上),对话生成(LLaMA2-7B在WizardLM上),以及视觉识别(ViT-B/16在VTAB-1K上)。在VTAB-1K上,它在所有方法中达到了最高的准确性,并在需要几何理解的结构化推理任务上取得了显著的提升。
Summary / 总结
ABM-LoRA is a method that initializes low-rank adapters by aligning their activation boundaries with those of the pretrained model, which accelerates convergence and reduces information loss. Experiments across language understanding, dialogue generation, and vision recognition tasks show that ABM-LoRA achieves the highest accuracy on VTAB-1K, especially on structured reasoning tasks requiring geometric understanding.
ABM-LoRA通过使低秩适配器的激活边界与预训练模型的激活边界对齐来初始化,从而加速收敛并减少信息损失。实验表明,ABM-LoRA在语言理解、对话生成和视觉识别等多种任务中表现出色,特别是在VTAB-1K上达到了最高的准确率。
Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective
Authors: Nhat Chung, Taisei Hanyu, Toan Nguyen, Huy Le, Frederick Bumgarner, Duy Minh Ho Nguyen, Khoa Vo, Kashu Yamazaki, Chase Rainwater, Tung Kieu, Anh Nguyen, Ngan Le
Venue: AAAI 2026
First: 2025-11-14T16:56:01+00:00 · Latest: 2025-11-28T05:39:17+00:00
Comments: Accepted at AAAI 2026
Abstract
As embodied agents operate in increasingly complex environments, the ability to perceive, track, and reason about individual object instances over time becomes essential, especially in tasks requiring sequenced interactions with visually similar objects. In these non-Markovian settings, key decision cues are often hidden in object-specific histories rather than the current scene. Without persistent memory of prior interactions (what has been interacted with, where it has been, or how it has changed) visuomotor policies may fail, repeat past actions, or overlook completed ones. To surface this challenge, we introduce LIBERO-Mem, a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame. However, vision-language-action (VLA) models often struggle in such settings, with token scaling quickly becoming intractable even for tasks spanning just a few hundred frames. We propose Embodied-SlotSSM, a slot-centric VLA framework built for temporal scalability. It maintains spatio-temporally consistent slot identities and leverages them through two mechanisms: (1) slot-state-space modeling for reconstructing short-term history, and (2) a relational encoder to align the input tokens with action decoding. Together, these components enable temporally grounded, context-aware action prediction. Experiments show Embodied-SlotSSM's baseline performance on LIBERO-Mem and general tasks, offering a scalable solution for non-Markovian reasoning in object-centric robotic policies.
中文标题/摘要
标题:重新思考机器人操作中记忆状态的进展:以对象为中心的观点
随着具身智能体在日益复杂的环境中操作,随时间感知、跟踪个体对象实例并对其进行推理的能力变得至关重要,特别是在需要与视觉上相似的对象进行顺序交互的任务中。在这些非马尔可夫环境中,关键决策线索往往隐藏在对象特定的历史中,而不是当前场景中。如果没有对先前交互的持久记忆(交互过什么,它曾在哪里,或者它如何变化),视觉运动策略可能会失败,重复过去的动作,或者忽略已完成的动作。为了揭示这一挑战,我们引入了LIBERO-Mem,这是一种非马尔可夫任务套件,用于在对象级别部分可观测性下对机器人操作进行压力测试。它结合了短期和长期的对象跟踪以及按时间排序的子目标,要求超越当前帧进行推理。然而,视觉-语言-动作(VLA)模型在这些环境中往往难以应对,即使对于仅跨越几百帧的任务,令牌数量的增长也很快变得难以处理。我们提出了一种以槽为中心的VLA框架Embodied-SlotSSM,旨在实现时间上的可扩展性。它保持时空一致的槽身份,并通过两种机制利用它们:(1)槽状态空间建模以重构短期历史,(2)关系编码器将输入令牌与动作解码对齐。这些组件共同实现了基于时间的、上下文相关的动作预测。实验展示了 Embodied-SlotSSM 在 LIBERO-Mem 和通用任务上的基线性能,为以对象为中心的机器人策略中的非马尔可夫推理提供了一种可扩展的解决方案。
Summary / 总结
The paper addresses the challenge of robotic manipulation in non-Markovian environments where object-specific histories are crucial for decision-making. It introduces LIBERO-Mem, a task suite that stress-tests robotic manipulation under object-level partial observability, and proposes Embodied-SlotSSM, a slot-centric vision-language-action framework that maintains spatio-temporally consistent slot identities and uses slot-state-space modeling and a relational encoder for temporally grounded action prediction. Experiments report Embodied-SlotSSM's baseline performance on both LIBERO-Mem and general tasks, offering a scalable solution for non-Markovian reasoning in object-centric robotic policies.
该论文针对非马尔可夫环境中物体交互记忆持续性的挑战,引入了LIBERO-Mem任务套件来测试对象级部分可观测条件下的机器人操作。为了解决视觉-语言-动作模型在这些环境中的可扩展性问题,作者提出了Embodied-SlotSSM,这是一种以槽为中心的框架,保持时空一致的槽身份,并使用槽状态空间建模和关系编码器实现基于时间的动作预测。实验给出了Embodied-SlotSSM在LIBERO-Mem和通用任务上的基线性能,为以对象为中心的机器人策略中的非马尔可夫推理提供了一种可扩展的解决方案。