arXiv 论文速递

2026-01-01 03:29
Snapshot: 20260101_0329
Same or Not? Enhancing Visual Perception in Vision-Language Models
Authors: Damiano Marsili, Aditya Mehta, Ryan Y. Lin, Georgia Gkioxari
First: 2025-12-29T16:43:47+00:00 · Latest: 2025-12-29T16:43:47+00:00
Comments: Project webpage: https://glab-caltech.github.io/twin/
Abstract
Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/
中文标题/摘要
标题:同或不同?提升视觉语言模型的视觉感知能力
视觉语言模型(VLMs)在广泛的视觉理解方面表现出色,但仍然较为粗略,存在视觉偏见,并且忽略了一些细微的视觉细节。现有的训练语料库通过强调一般识别(“是猫还是狗?”)而不是精细的感知,强化了这一局限性。为了解决这个问题,我们引入了一个新的训练语料库和任务,旨在增强VLMs的感知能力。TWIN是一个包含561,000个图像对查询的大规模数据集,要求模型判断两个视觉相似的图像是否描绘同一个物体,鼓励关注细微的视觉线索。该数据集涵盖了各种日常物体在不同上下文、视角和外观下的广泛范围。在TWIN上微调VLMs在精细识别方面取得了显著进步,即使在未见过的领域如艺术、动物、植物和地标也是如此。为了量化这些进步,我们引入了FGVQA,这是一个包含12,000个查询的基准套件,重新利用了多个领域中的精细识别和检索数据集。虽然现有的VLMs在FGVQA上表现不佳,但在TWIN上微调后,它们的性能提高了高达19.3%,而不会影响通用VQA基准的性能。最后,我们的TWIN数据集在对象注释方面具有可扩展性,我们的分析表明,规模是性能的关键。我们设想TWIN可以作为开源VLM训练语料库的即插即用添加,推动未来模型感知精度的进步。项目网页:https://glab-caltech.github.io/twin/
Summary / 总结
The paper introduces TWIN, a new dataset of 561,000 image-pair queries designed to enhance the fine-grained perception of vision-language models (VLMs). By training VLMs on TWIN, the models show significant improvements in recognizing subtle visual details, even in unseen domains like art and plants. The authors also introduce FGVQA, a benchmark suite, to measure these improvements, demonstrating that fine-tuning on TWIN can boost performance by up to 19.3% without affecting general VQA tasks. The dataset's scale is crucial for performance, and TWIN is proposed as an addition to VLM training corpora to improve perceptual precision. Project webpage: https://glab-caltech.github.io/twin/
该论文引入了TWIN,这是一个包含561,000对图像查询的新数据集,旨在增强视觉语言模型(VLM)的细粒度感知能力。通过要求模型判断两张相似图像是否描绘同一个物体,TWIN促使模型关注细微的视觉细节。通过在TWIN上微调VLM,其在细粒度识别上的性能得到提升,甚至在未见过的领域中也是如此,FGVQA基准上的改进幅度可达19.3%,且不会影响通用VQA性能。数据集的规模对性能至关重要,TWIN被提议作为VLM训练数据集的补充,以推进感知精度的提升。
Instruction-Following Evaluation of Large Vision-Language Models
Authors: Daiki Shiono, Shumpei Miyawaki, Ryota Tanaka, Jun Suzuki
First: 2025-12-29T16:12:33+00:00 · Latest: 2025-12-29T16:12:33+00:00
Comments: 21 pages, 7 figures
Abstract
Following the initial flourishing of large language models (LLMs), there has been a surge in proposed large vision-language models (LVLMs) that integrate LLMs with vision capabilities. However, it has been observed that LVLMs, after tuning to visual instruction using commonly used training datasets, often fail to exhibit the instruction-following ability that was present in the LLM before integration, leading to results in which they do not follow task instructions as expected. This study quantitatively demonstrates that LVLMs' instruction-following ability declines after fine-tuning and analyzes its underlying causes. In particular, we constructed new training datasets highlighting whether the output format is specified. Then, we investigated how explicitly indicating the output format during fine-tuning affects LVLMs' instruction-following ability. Our quantitative evaluation confirmed that LVLMs' instruction-following ability declines after fine-tuning with commonly used datasets. Furthermore, we found that LVLMs trained with datasets, including instructions on output format, tend to follow instructions more accurately than models that do not. These findings suggest that including samples with instructions on output format during (visual) instruction tuning may help mitigate the decline in instruction-following abilities.
中文标题/摘要
标题:大型视觉语言模型的指令遵循评估
在大型语言模型(LLMs)初期繁荣之后,出现了大量结合了视觉能力的大型视觉语言模型(LVLMs)。然而,观察到在使用常用训练数据集调整视觉指令后,LVLMs 的指令遵循能力往往不如集成前的LLMs,导致它们不能像预期那样遵循任务指令。本研究定量证明了LVLMs在微调后指令遵循能力下降,并分析了其背后的原因。特别地,我们构建了新的训练数据集,突出显示输出格式是否被指定。然后,我们研究了在微调期间明确指示输出格式如何影响LVLMs的指令遵循能力。我们的定量评估证实,使用常用数据集微调后,LVLMs的指令遵循能力下降。此外,我们发现使用包括输出格式指示的指令的数据集训练的LVLMs比没有这些指示的模型更准确地遵循指令。这些发现表明,在(视觉)指令调优过程中包括输出格式指示的样本可能有助于缓解指令遵循能力下降。
Summary / 总结
This study investigates the decline in instruction-following ability of large vision-language models (LVLMs) after fine-tuning using commonly used datasets. The authors constructed new training datasets that specify the output format and found that LVLMs trained with these datasets exhibit better instruction-following abilities compared to those trained with standard datasets. The research highlights the importance of including output format instructions during fine-tuning to improve LVLMs' performance in following task instructions.
研究探讨了大型视觉语言模型(LVLMs)在使用常用数据集进行微调后指令遵循能力下降的问题。通过构建指定输出格式的新训练数据集并评估LVLMs,研究证实了LVLMs在微调后的指令遵循能力下降。研究还发现,包含输出格式指令的训练数据集训练的LVLMs在遵循指令方面表现更好,而没有此类指令的数据集训练的模型则表现较差。
VL-RouterBench: A Benchmark for Vision-Language Model Routing
Authors: Zhehao Huang, Baijiong Lin, Jingyuan Zhang, Jingying Wang, Yuhang Liu, Ning Lu, Tao Li, Xiaolin Huang
First: 2025-12-29T16:01:19+00:00 · Latest: 2025-12-29T16:01:19+00:00
Abstract
Multi-model routing has evolved from an engineering technique into essential infrastructure, yet existing work lacks a systematic, reproducible benchmark for evaluating vision-language models (VLMs). We present VL-RouterBench to assess the overall capability of VLM routing systems systematically. The benchmark is grounded in raw inference and scoring logs from VLMs and constructs quality and cost matrices over sample-model pairs. In scale, VL-RouterBench covers 14 datasets across 3 task groups, totaling 30,540 samples, and includes 15 open-source models and 2 API models, yielding 519,180 sample-model pairs and a total input-output token volume of 34,494,977. The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to enable comparison across router configurations and cost budgets. On this benchmark, we evaluate 10 routing methods and baselines and observe a significant routability gain, while the best current routers still show a clear gap to the ideal Oracle, indicating considerable room for improvement in router architecture through finer visual cues and modeling of textual structure. We will open-source the complete data construction and evaluation toolchain to promote comparability, reproducibility, and practical deployment in multimodal routing research.
中文标题/摘要
标题:VL-RouterBench:视觉-语言模型路由基准
多模型路由已从工程技术演变为必不可少的基础架构,但现有工作缺乏系统且可重复的基准来评估视觉-语言模型(VLMs)。我们提出了VL-RouterBench以系统地评估VLM路由系统的整体能力。基准以视觉-语言模型的原始推理和评分日志为基础,构建了样本-模型对的质量和成本矩阵。在规模上,VL-RouterBench覆盖了3个任务组中的14个数据集,总计30,540个样本,包括15个开源模型和2个API模型,产生了519,180个样本-模型对,总输入输出标记量为34,494,977。评估协议联合测量平均准确率、平均成本和吞吐量,并通过归一化成本和准确率的调和平均值构建排名分数,以在不同的路由配置和成本预算下进行比较。在该基准上,我们评估了10种路由方法和基线,并观察到显著的可路由性提升,而当前最佳的路由器仍与理想的Oracle存在明显差距,表明通过更精细的视觉线索和文本结构建模,路由架构仍有很大的改进空间。我们将开源完整的数据构建和评估工具链,以促进多模态路由研究中的可比性、可重复性和实际部署。
Summary / 总结
VL-RouterBench is designed to evaluate vision-language model routing systems by analyzing raw inference and scoring logs, covering 14 datasets and 17 models (15 open-source and 2 API models). It measures accuracy, cost, and throughput, ranking routers based on a harmonic mean of normalized cost and accuracy. The study finds significant routability gains but notes a clear gap between current routers and the ideal Oracle, suggesting room for improvement in router architecture.
VL-RouterBench 通过分析原始推理和评分日志来评估视觉语言模型路由系统,涵盖了14个数据集和17个模型(15个开源模型和2个API模型)。它衡量准确率、成本和吞吐量,基于标准化准确率和成本的调和平均值进行排名。研究发现显著的路由能力提升,但指出当前路由器与理想的Oracle之间仍有明显差距,表明在路由架构中通过更精细的视觉线索和文本结构建模有改进空间。
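The ranking score described above can be illustrated with a minimal sketch. The abstract only says the score is a harmonic mean of normalized cost and accuracy, so the min-max normalization, the "cheapness" inversion (lower cost scores higher), and the three routers with their numbers are all assumptions made for illustration, not the benchmark's actual protocol.

```python
def normalize(value, lo, hi):
    """Min-max normalize value into [0, 1]; returns 0.0 when the range is empty."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def rank_score(accuracy, cost, cost_min, cost_max):
    """Harmonic mean of accuracy and a 'cheapness' score (1 - normalized cost)."""
    cheapness = 1.0 - normalize(cost, cost_min, cost_max)
    if accuracy + cheapness == 0:
        return 0.0
    return 2 * accuracy * cheapness / (accuracy + cheapness)


# Hypothetical routers: (name, average accuracy in [0, 1], average cost per sample).
routers = [
    ("always_large", 0.80, 10.0),
    ("always_small", 0.60, 1.0),
    ("learned", 0.75, 2.0),
]
costs = [cost for _, _, cost in routers]
scores = {
    name: rank_score(acc, cost, min(costs), max(costs))
    for name, acc, cost in routers
}
ranking = sorted(scores, key=scores.get, reverse=True)

# Sanity check on the sizes quoted in the abstract:
# 30,540 samples times (15 open-source + 2 API) models.
total_pairs = 30_540 * (15 + 2)
```

A harmonic mean punishes imbalance: under this normalization the most expensive router scores zero however accurate it is, which is one reason a router that trades a little accuracy for a large cost saving can rank first.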
Toward Trustworthy Agentic AI: A Multimodal Framework for Preventing Prompt Injection Attacks
Authors: Toqeer Ali Syed, Mishal Ateeq Almutairi, Mahmoud Abdel Moaty
First: 2025-12-29T15:54:33+00:00 · Latest: 2025-12-29T15:54:33+00:00
Comments: Accepted as a conference paper at ICCA 2025, held in Bahrain from 21 to 23 December
Abstract
Powerful autonomous systems, which reason, plan, and converse using and between numerous tools and agents, are made possible by Large Language Models (LLMs), Vision-Language Models (VLMs), and new agentic AI systems, like LangChain and GraphChain. Nevertheless, this agentic environment increases the probability of multimodal prompt injection (PI) attacks, in which concealed or malicious instructions carried in text, pictures, metadata, or agent-to-agent messages may spread throughout the graph and lead to unintended behavior, a breach of policy, or corruption of state. To mitigate these risks, this paper proposes a Cross-Agent Multimodal Provenance-Aware Defense Framework whereby all prompts, whether user-generated or produced by upstream agents, are sanitized and all outputs generated by an LLM are verified independently before being sent to downstream nodes. This framework contains a text sanitizer agent, a visual sanitizer agent, and an output validator agent, all coordinated by a provenance ledger, which keeps metadata of modality, source, and trust level throughout the entire agent network. This architecture makes sure that agent-to-agent communication abides by clear trust frames such that injected instructions are not propagated down LangChain- or GraphChain-style workflows. The experimental assessments show that multimodal injection detection accuracy is significantly enhanced, cross-agent trust leakage is minimized, and agentic execution pathways become stable. The framework, which expands the concept of provenance tracking and validation to multi-agent orchestration, enhances the establishment of secure, understandable, and reliable agentic AI systems.
中文标题/摘要
标题:迈向可信赖的代理AI:预防提示注入攻击的多模态框架
大型语言模型(LLMs)、视觉-语言模型(VLMs)和新的代理AI系统(如LangChain和GraphChain)使得能够使用和在众多工具和代理之间进行推理、规划和对话的强大自主系统成为可能。然而,这种代理环境增加了多模态提示注入(PI)攻击的可能性,其中隐藏或恶意指令可能通过图传播,导致意外行为、政策违规或状态破坏。为了减轻这些风险,本文提出了一种跨代理多模态来源感知防御框架,该框架确保所有提示(无论是用户生成的还是由上游代理生成的)在发送到下游节点之前被净化和验证。该框架包括一个文本净化代理、视觉净化代理和输出验证代理,所有这些代理都由一个来源账本协调,该账本在整个代理网络中保留模态、来源和信任级别的元数据。这种架构确保了代理间的通信遵循清晰的信任框架,防止注入指令在LangChain或GraphChain风格的工作流中传播。实验评估表明,多模态注入检测准确性显著提高,跨代理信任泄露最小化,代理执行路径变得稳定。该框架将来源跟踪和验证的概念扩展到多代理编排,增强了安全、可理解且可靠的代理AI系统的建立。
Summary / 总结
This paper addresses the risk of multimodal prompt injection attacks in agentic AI systems, which can lead to unintended behavior or policy breaches. It proposes a Cross-Agent Multimodal Provenance-Aware Defense Framework that sanitizes prompts and verifies outputs before they are sent to downstream nodes. The framework includes text and visual sanitizers, and an output validator coordinated by a provenance ledger. Experimental results show improved detection accuracy, minimized cross-agent trust leakage, and more stable agentic execution pathways. This framework extends provenance tracking and validation to multi-agent orchestration, enhancing the security and reliability of agentic AI systems.
本文针对多模态提示注入攻击在代理AI系统中的风险,提出了一种跨代理多模态溯源感知防御框架。该框架包括文本和视觉净化器以及输出验证器,并由一个溯源日志协调,以确保信任并防止恶意指令的传播。实验结果表明,检测准确性显著提高,跨代理信任泄露减少,代理执行路径更加稳定。
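To make the provenance-ledger idea above concrete, here is a toy sketch of a ledger that records modality, source, and trust level for each message, with a gate that forwards only trusted messages. Everything in it is hypothetical: the phrase list, the two trust levels, and the function names are illustrative assumptions, and keyword matching stands in for whatever detection the paper's sanitizer agents actually use.

```python
# Hypothetical injection phrases; a real sanitizer would be far more thorough.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "disregard the system prompt",
)


def new_ledger():
    """The ledger is just an append-only list of metadata records."""
    return []


def record(ledger, modality, source, trust):
    """Append one provenance record and return it."""
    entry = {
        "id": len(ledger),
        "modality": modality,  # e.g. "text" or "image"
        "source": source,      # "user" or the name of an upstream agent
        "trust": trust,        # "trusted" or "quarantined" (assumed levels)
    }
    ledger.append(entry)
    return entry


def sanitize_text(ledger, text, source):
    """Toy text sanitizer: flag known injection phrases and log the verdict."""
    lowered = text.lower()
    flagged = any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)
    return record(ledger, "text", source, "quarantined" if flagged else "trusted")


def may_forward(entry):
    """Only messages logged as trusted are passed to downstream agents."""
    return entry["trust"] == "trusted"


ledger = new_ledger()
ok = sanitize_text(ledger, "Summarize this report.", "user")
bad = sanitize_text(ledger, "Please IGNORE previous instructions.", "agent_a")
```

Keeping the verdict in the ledger rather than in the message itself means a downstream agent can check trust without re-running the sanitizer.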
Scaling Laws for Energy Efficiency of Local LLMs
Authors: Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Mikail Okyay, Bakbergen Ryskulov, David Montero, Samuel Mugel, Román Orús
First: 2025-12-18T13:40:33+00:00 · Latest: 2025-12-29T15:54:23+00:00
Abstract
Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware--including laptops, desktops, industrial controllers, and embedded systems--relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven "resolution knee", where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.
中文标题/摘要
标题:局部LLM能效的标度律
在边缘设备上部署局部大型语言模型和视觉-语言模型需要在准确性与受限的计算和能源预算之间进行权衡。尽管图形处理器主导了现代人工智能部署,但大多数消费级硬件——包括笔记本电脑、台式机、工业控制器和嵌入式系统——仍然依赖于中央处理器。尽管如此,仅中央处理器的推理计算法则对局部语言和视觉-语言工作负载的研究仍然相对较少。我们系统地在两个广泛用于局部推理的中央处理器级别上对大型语言和视觉-语言模型进行了基准测试:一台搭载M2芯片的MacBook Pro,代表主流笔记本电脑级部署,以及一个Raspberry Pi 5,代表受限的、低功耗嵌入式设置。我们采用了一种基于连续采样处理器和内存使用情况并结合面积-曲线积分的统一方法,来表征计算负载随输入文本长度的变化情况(对于语言模型)和图像分辨率的变化情况(对于视觉-语言模型)。我们发现了两条经验标度律:(1)语言模型推理的计算成本大约与标记长度成线性关系;(2)视觉-语言模型表现出一种预处理驱动的“分辨率拐点”,其中计算量在内部分辨率限制以上保持恒定,在以下则急剧下降。除了这些定律之外,我们还表明,基于量子启发的压缩可以将处理器和内存使用量最多减少71.9%,将能耗最多减少62%,同时保持或提高语义准确性。这些结果为局部语言和视觉-语言工作负载的单一中央处理器多模态标度提供了系统量化,并指出了模型压缩和输入分辨率预处理作为可持续边缘推理的有效、低成本杠杆。
Summary / 总结
The study aims to explore the energy efficiency of deploying large language models and vision-language models on edge devices, focusing on balancing accuracy with computational and energy constraints. The research benchmarks these models on a MacBook Pro M2 and a Raspberry Pi 5, using a unified methodology to measure computational and memory usage. Key findings include linear scaling of computational cost with input text length for language models and a preprocessing-driven 'resolution knee' for vision-language models, where compute remains constant above a certain resolution and decreases below it. Additionally, quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while maintaining or improving semantic accuracy.
研究旨在探索在边缘设备上部署大型语言模型和视觉-语言模型时的能效问题,重点在于平衡准确性和计算及能源限制。研究在MacBook Pro M2和Raspberry Pi 5上对这些模型进行了基准测试,使用统一的方法来测量计算和内存使用情况。关键发现包括语言模型的计算成本随输入文本长度线性增长,以及视觉-语言模型的预处理驱动的‘分辨率拐点’,其中计算在某一分辨率以上保持不变,在以下则急剧下降。此外,量子启发式压缩可将处理器和内存使用量最多减少71.9%,能量消耗最多减少62%,同时保持或提高语义准确性。
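The measurement approach described above (continuous sampling followed by area-under-curve integration, then a check for linear scaling) can be sketched as follows. The trapezoidal rule, the ordinary least-squares fit, and all sample values are assumptions for illustration; the paper's exact integration and fitting procedure is not given in the abstract.

```python
def area_under_curve(times, usage):
    """Trapezoidal integral of sampled usage over time."""
    total = 0.0
    for i in range(1, len(times)):
        total += 0.5 * (usage[i] + usage[i - 1]) * (times[i] - times[i - 1])
    return total


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical trace: CPU utilisation (%) sampled once per second.
times = [0.0, 1.0, 2.0, 3.0]
cpu = [50.0, 50.0, 50.0, 50.0]
load = area_under_curve(times, cpu)  # percent-seconds

# Hypothetical (token length, integrated load) pairs lying exactly on a line.
tokens = [100, 200, 400]
loads = [210.0, 410.0, 810.0]
slope, intercept = fit_line(tokens, loads)
```

Under an approximately linear law, the fitted slope is the marginal compute cost per token and the intercept is the fixed overhead per request.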
PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation
Authors: Zongsheng Cao, Yangfan He, Anran Liu, Jun Xie, Feng Chen, Zepeng Wang
First: 2025-12-29T15:37:26+00:00 · Latest: 2025-12-29T15:37:26+00:00
Abstract
Recent advances in diffusion models have notably enhanced text-to-image (T2I) generation quality, but they also raise the risk of generating unsafe content. Traditional safety methods like text blacklisting or harmful content classification have significant drawbacks: they can be easily circumvented or require extensive datasets and extra training. To overcome these challenges, we introduce PurifyGen, a novel, training-free approach for safe T2I generation that retains the model's original weights. PurifyGen introduces a dual-stage strategy for prompt purification. First, we evaluate the safety of each token in a prompt by computing its complementary semantic distance, which measures the semantic proximity between the prompt tokens and concept embeddings from predefined toxic and clean lists. This enables fine-grained prompt classification without explicit keyword matching or retraining. Tokens closer to toxic concepts are flagged as risky. Second, for risky prompts, we apply a dual-space transformation: we project toxic-aligned embeddings into the null space of the toxic concept matrix, effectively removing harmful semantic components, and simultaneously align them into the range space of clean concepts. This dual alignment purifies risky prompts by both subtracting unsafe semantics and reinforcing safe ones, while retaining the original intent and coherence. We further define a token-wise strategy to selectively replace only risky token embeddings, ensuring minimal disruption to safe content. PurifyGen offers a plug-and-play solution with theoretical grounding and strong generalization to unseen prompts and models. Extensive testing shows that PurifyGen surpasses current methods in reducing unsafe content across five datasets and competes well with training-dependent approaches. The code is available at https://github.com/AI-Researcher-Team/PurifyGen.
中文标题/摘要
标题:PurifyGen:一种安全文本到图像生成的风险歧视和语义净化模型
扩散模型的最新进展显著提升了文本到图像(T2I)生成的质量,但也增加了生成不安全内容的风险。传统的安全方法如文本黑名单或有害内容分类存在明显缺陷:它们容易被规避或需要大量数据集和额外训练。为克服这些挑战,我们提出了PurifyGen,这是一种无需训练的新颖方法,用于安全的T2I生成,同时保留模型的原始权重。PurifyGen引入了一种双重阶段的提示净化策略。首先,我们通过计算每个提示词与预定义的有毒和干净概念列表的概念嵌入之间的互补语义距离来评估提示的安全性,这衡量了提示词与概念嵌入之间的语义接近度。这使得在无需显式关键词匹配或重新训练的情况下实现细粒度的提示分类。接近有毒概念的词被标记为风险词。其次,对于风险提示,我们应用双重空间变换:将与有毒概念对齐的嵌入投影到有毒概念矩阵的零空间中,从而有效去除有害的语义成分,并同时将它们对齐到干净概念范围空间中。这种双重对齐通过减去不安全的语义并强化安全语义来净化风险提示,同时保留原始意图和连贯性。我们进一步定义了一种词级策略,仅选择性地替换风险词嵌入,确保对安全内容的最小干扰。PurifyGen提供了一种即插即用的解决方案,具有理论依据和强大的泛化能力,适用于未见过的提示和模型。广泛的测试表明,PurifyGen在五个数据集中减少了不安全内容,与依赖训练的方法竞争。代码可参考https://github.com/AI-Researcher-Team/PurifyGen。
Summary / 总结
PurifyGen is a training-free model designed to enhance the safety of text-to-image generation by evaluating and purifying prompts. It uses a dual-stage approach to classify and transform tokens in prompts, ensuring they are safe without altering the original intent. By projecting toxic embeddings into the null space of toxic concepts and aligning them with clean concepts, PurifyGen effectively removes harmful semantic components while preserving the original meaning. Experimental results demonstrate that PurifyGen outperforms existing methods in reducing unsafe content across multiple datasets and is competitive with training-dependent approaches.
PurifyGen 是一个无需训练的模型,用于安全的文本到图像生成,通过双重阶段策略评估和净化提示以降低生成不安全内容的风险。首先,它计算每个词元的语义距离以识别风险词元;其次,它应用双重空间变换去除有害的语义成分并强化安全成分,同时保留原始意图和连贯性。广泛的测试表明,PurifyGen 在五个数据集上优于现有方法,减少了不安全内容的生成,并保持了图像的原始意图和连贯性。
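The null-space projection step can be illustrated in isolation. This sketch assumes the toxic concept matrix has one concept embedding per row, so its null space is the set of vectors orthogonal to every toxic concept; it uses Gram-Schmidt on tiny hand-made vectors. It deliberately omits the complementary semantic distance and the alignment to the range space of clean concepts, and it is not taken from the PurifyGen code.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def remove_toxic_component(embedding, toxic_vectors):
    """Project an embedding onto the orthogonal complement of span(toxic_vectors)."""
    result = list(embedding)
    for b in orthonormal_basis(toxic_vectors):
        coeff = dot(result, b)
        result = [ri - coeff * bi for ri, bi in zip(result, b)]
    return result


# Hypothetical 3-d embeddings: the two toxic concepts span the x-y plane.
toxic = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
token = [3.0, -2.0, 5.0]
purified = remove_toxic_component(token, toxic)
```

After projection the embedding has zero inner product with every toxic concept, while the component outside the toxic subspace (here the third coordinate) is left unchanged.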
PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis
Authors: Shengyi Hua, Jianfeng Wu, Tianle Shen, Kangzhe Hu, Zhongzhen Huang, Shujuan Ni, Zhihong Zhang, Yuan Li, Zhe Wang, Xiaofan Zhang
First: 2025-12-29T15:34:27+00:00 · Latest: 2025-12-29T15:34:27+00:00
Abstract
Recent pathological foundation models have substantially advanced visual representation learning and multimodal interaction. However, most models still rely on a static inference paradigm in which whole-slide images are processed once to produce predictions, without reassessment or targeted evidence acquisition under ambiguous diagnoses. This contrasts with clinical diagnostic workflows that refine hypotheses through repeated slide observations and further examination requests. We propose PathFound, an agentic multimodal model designed to support evidence-seeking inference in pathological diagnosis. PathFound integrates the power of pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning to perform proactive information acquisition and diagnosis refinement by progressing through the initial diagnosis, evidence-seeking, and final decision stages. Across several large multimodal models, adopting this strategy consistently improves diagnostic accuracy, indicating the effectiveness of evidence-seeking workflows in computational pathology. Among these models, PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and demonstrates strong potential to discover subtle details, such as nuclear features and local invasions.
中文标题/摘要
标题:PathFound:一种促进证据寻求的病理性诊断代理多模态模型
近期的病理性基础模型在视觉表示学习和多模态交互方面取得了显著进展。然而,大多数模型仍然依赖于静态推理范式,在这种范式中,全切片图像仅处理一次以生成预测,而不会在模糊诊断下进行重新评估或有针对性的证据获取。这与临床诊断工作流程形成对比,后者通过反复观察切片和进一步的检查请求来细化假设。我们提出PathFound,这是一种设计用于支持病理性诊断中证据寻求推理的代理多模态模型。PathFound 结合了病理视觉基础模型、视觉语言模型以及通过强化学习训练的推理模型的力量,通过初始诊断、证据寻求和最终决策阶段的进展来进行主动信息获取和诊断细化。在多个大型多模态模型中采用这种策略始终提高了诊断准确性,表明在计算病理学中证据寻求工作流程的有效性。在这些模型中,PathFound 在多种临床场景中实现了最先进的诊断性能,并展示了发现细微特征(如核特征和局部侵袭)的强大潜力。
Summary / 总结
PathFound is an agentic multimodal model that enhances pathological diagnosis by integrating visual foundation models, vision-language models, and reinforcement learning-based reasoning models. It supports evidence-seeking inference through multiple stages, including initial diagnosis, evidence-seeking, and final decision-making. This approach improves diagnostic accuracy across various models and clinical scenarios, highlighting the effectiveness of evidence-seeking workflows in computational pathology.
研究动机是通过引入PathFound,一种主动型多模态模型,来解决病理诊断模型中静态推理的局限性。PathFound 结合视觉、语言和推理模型,通过证据寻求阶段逐步优化诊断。关键实验发现表明,该模型在各种场景下提高了诊断准确性,并在识别核特征和局部侵袭等细微细节方面表现出色,超越了现有模型。
AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization
Authors: Binhe Yu, Zhen Wang, Kexin Li, Yuqian Yuan, Wenqiao Zhang, Long Chen, Juncheng Li, Jun Xiao, Yueting Zhuang
First: 2025-12-29T15:26:25+00:00 · Latest: 2025-12-29T15:26:25+00:00
Abstract
Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as subjects missing or conflicts, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions to ensure text alignment. Local decoupling confines each subject's attention to its designated area, which prevents subject conflicts and thus guarantees identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.
中文标题/摘要
标题:AnyMS:基于布局引导和无需训练的多主题定制
多主题定制旨在将多个用户指定的主题合成到一个连贯的图像中。为了解决主题缺失或冲突等问题,最近的工作引入了布局指导以提供明确的空间约束。然而,现有方法仍然难以平衡文本对齐、主题身份保留和布局控制这三个关键目标,而对额外训练的依赖进一步限制了其可扩展性和效率。在本文中,我们提出了AnyMS,这是一种新颖的无需训练的布局引导多主题定制框架。AnyMS 利用三种输入条件:文本提示、主题图像和布局约束,并引入了一种自底向上的双层注意力解耦机制,以在生成过程中协调它们的整合。具体而言,全局解耦将文本和视觉条件之间的跨注意力分离,以确保文本对齐。局部解耦将每个主题的注意力限制在其指定区域内,从而防止主题冲突,从而保证身份保留和布局控制。此外,AnyMS 使用预训练的图像适配器来提取与扩散模型对齐的主题特定特征,从而无需学习主题或调整适配器。大量实验表明,AnyMS 达到了最先进的性能,支持复杂的组合并扩展到更多的主题数量。
Summary / 总结
AnyMS is a training-free framework for layout-guided multi-subject customization that addresses the challenges of text alignment, subject identity preservation, and layout control. It uses a bottom-up dual-level attention decoupling mechanism to integrate text prompts, subject images, and layout constraints, ensuring text alignment through global decoupling and preventing subject conflicts through local decoupling. Experimental results show that AnyMS outperforms existing methods in supporting complex compositions and handling a larger number of subjects.
AnyMS 是一个无需训练的框架,用于指导布局的多主体定制,解决了文本对齐、主体身份保留和布局控制的挑战。它使用自底向上的双层注意力解耦机制来整合文本提示、主体图像和布局约束。全局解耦确保文本对齐,而局部解耦防止主体冲突,从而保留身份和布局控制。使用预训练的图像适配器来提取与扩散模型对齐的主体特定特征,无需进行主体学习或适配器调优。实验表明,AnyMS 在处理复杂组合和扩展到多个主体方面优于现有方法。
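The local decoupling idea, confining each subject's attention to its designated area, can be sketched as a masked softmax. This is a strong simplification under stated assumptions: one attention logit per subject per pixel, rectangular layout boxes given as (top, left, bottom, right) with exclusive ends, and pixels outside every box receiving no subject attention. The function names are hypothetical and do not come from AnyMS.

```python
import math


def subjects_covering(row, col, boxes):
    """Indices of subjects whose layout box contains pixel (row, col)."""
    return [
        i
        for i, (top, left, bottom, right) in enumerate(boxes)
        if top <= row < bottom and left <= col < right
    ]


def local_attention(logits, row, col, boxes):
    """Softmax over subjects, restricted to those whose box covers the pixel.

    logits[i] is the raw attention score from this pixel to subject i.
    Subjects whose box does not contain the pixel get weight 0.
    """
    allowed = subjects_covering(row, col, boxes)
    weights = [0.0] * len(logits)
    if not allowed:
        return weights
    peak = max(logits[i] for i in allowed)  # subtract max for stability
    exps = {i: math.exp(logits[i] - peak) for i in allowed}
    total = sum(exps.values())
    for i, e in exps.items():
        weights[i] = e / total
    return weights


# Hypothetical layout on a 4 x 8 grid: subject 0 on the left, subject 1 on the right.
boxes = [(0, 0, 4, 4), (0, 4, 4, 8)]
left_pixel = local_attention([0.2, 3.0], 1, 2, boxes)
right_pixel = local_attention([0.2, 3.0], 1, 6, boxes)
```

Even though subject 1 has the larger logit everywhere, the left pixel attends only to subject 0, which is the conflict-prevention effect the abstract attributes to local decoupling.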
Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution
Authors: Hexin Zhang, Dong Li, Jie Huang, Bingzhou Wang, Xueyang Fu, Zhengjun Zha
First: 2025-12-29T15:09:20+00:00 · Latest: 2025-12-29T15:09:20+00:00
Abstract
Diffusion models have become a leading paradigm for image super-resolution (SR), but existing methods struggle to guarantee both the high-frequency perceptual quality and the low-frequency structural fidelity of generated images. Although inference-time scaling can theoretically improve this trade-off by allocating more computation, existing strategies remain suboptimal: reward-driven particle optimization often causes perceptual over-smoothing, while optimal-path search tends to lose structural consistency. To overcome these difficulties, we propose Iterative Diffusion Inference-Time Scaling with Adaptive Frequency Steering (IAFS), a training-free framework that jointly leverages iterative refinement and frequency-aware particle fusion. IAFS addresses the challenge of balancing perceptual quality and structural fidelity by progressively refining the generated image through iterative correction of structural deviations. Simultaneously, it ensures effective frequency fusion by adaptively integrating high-frequency perceptual cues with low-frequency structural information, allowing for a more accurate and balanced reconstruction across different image details. Extensive experiments across multiple diffusion-based SR models show that IAFS effectively resolves the perception-fidelity conflict, yielding consistently improved perceptual detail and structural accuracy, and outperforming existing inference-time scaling methods.
中文标题/摘要
标题:迭代推理时缩放与自适应频率导向的图像超分辨率
扩散模型已成为图像超分辨率(SR)的主要范式,但现有方法难以同时保证生成图像的高频感知质量和低频结构保真度。尽管推理时缩放理论上可以通过分配更多计算资源来改善这种权衡,但现有策略仍不尽如人意:基于奖励的粒子优化往往导致感知过度平滑,而最优路径搜索则倾向于失去结构一致性。为克服这些困难,我们提出了一种无需训练的框架——迭代扩散推理时缩放与自适应频率导向(IAFS),该框架联合利用迭代细化和频率感知粒子融合。IAFS通过逐步修正结构偏差来逐步细化生成的图像,以解决感知质量和结构保真度之间的平衡挑战。同时,通过自适应地将高频感知线索与低频结构信息融合,确保有效的频率融合,从而在不同图像细节上实现更准确和平衡的重建。在多个基于扩散的SR模型上的广泛实验表明,IAFS有效解决了感知与保真度的冲突,一致地提高了感知细节和结构准确性,并优于现有推理时缩放方法。
Summary / 总结
The research aims to improve the trade-off between perceptual quality and structural fidelity in image super-resolution using diffusion models. The proposed method, Iterative Diffusion Inference-Time Scaling with Adaptive Frequency Steering (IAFS), iteratively refines the generated image and adaptively integrates high-frequency perceptual cues with low-frequency structural information. Experiments demonstrate that IAFS effectively balances perceptual quality and structural accuracy, outperforming existing methods.
研究旨在通过扩散模型提高图像超分辨率中感知质量和结构保真度之间的权衡。提出的迭代扩散推理时缩放与自适应频率融合方法(IAFS)通过迭代细化生成的图像并自适应地整合高频感知线索与低频结构信息来解决这一问题。实验表明,IAFS能够有效平衡感知质量和结构准确性,优于现有方法。
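The frequency-fusion idea, combining low-frequency structure from one candidate with high-frequency detail from another, can be illustrated on a 1-D signal. This sketch assumes a fixed moving-average filter as the low-pass and a plain sum as the fusion rule; the adaptive weighting and the iterative refinement loop that IAFS describes are not modeled here.

```python
def low_pass(signal, radius=1):
    """Moving-average low-pass filter; indices past the edges are clamped."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [
            signal[min(max(j, 0), n - 1)]
            for j in range(i - radius, i + radius + 1)
        ]
        out.append(sum(window) / len(window))
    return out


def fuse(structure_src, detail_src, radius=1):
    """Low frequencies of structure_src plus the high-frequency residual of detail_src."""
    low_structure = low_pass(structure_src, radius)
    low_detail = low_pass(detail_src, radius)
    return [
        ls + (d - ld)
        for ls, d, ld in zip(low_structure, detail_src, low_detail)
    ]


# Hypothetical candidates: one structurally faithful but smooth, one sharp.
smooth = [1.0, 2.0, 3.0, 4.0, 5.0]
sharp = [1.0, 3.0, 2.0, 5.0, 4.0]
fused = fuse(smooth, sharp)
```

Two properties make the decomposition easy to reason about: fusing a signal with itself returns the signal, and a detail source with no high-frequency content contributes nothing.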
UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?
Authors: Fengjiao Chen, Minhao Jing, Weitao Lu, Yan Feng, Xiaoyu Li, Xuezhi Cao
First: 2025-12-29T14:49:50+00:00 · Latest: 2025-12-29T14:49:50+00:00
Abstract
Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored at large data scale. In this work, we analyze the unified model with a concise structure, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but Only if you generate Semantics, Not Pixels. (2) Generation reveals a superior Data Scaling trend and higher Data Utilization. (3) Autoregression on Input Embedding is effective to capture visual details.
中文标题/摘要
标题:UniHetero:生成能否在大规模数据下增强视觉-语言模型的理解?
视觉-语言大型模型正朝着统一视觉理解与生成任务的方向发展。然而,在大规模数据下,生成是否能增强理解仍是一个未被充分探索的问题。在本工作中,我们分析了在大规模预训练(>200M样本)下具有简洁结构的统一模型UniHetero。我们的主要观察结果是:(1) 生成可以提高理解,但只有生成语义,而不是像素。(2) 生成揭示了更优越的数据扩展趋势和更高的数据利用率。(3) 输入嵌入的自回归有助于捕捉视觉细节。
Summary / 总结
This study explores whether generation can enhance understanding in large-scale vision-language models. Pretraining the UniHetero model on more than 200 million samples, the research finds that generation improves understanding only when the model generates semantics rather than pixels. It also shows that generation has a better data scaling trend and higher data utilization, and that autoregression on input embedding effectively captures visual details.
研究探讨了在大规模数据下生成是否能增强视觉-语言模型的理解能力。使用包含超过2亿样本的UniHetero模型,研究发现生成语义而非像素可以提高理解能力。此外,生成展示了更好的数据扩展趋势和更高的数据利用率,输入嵌入的自回归有效捕捉视觉细节。
TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding
Authors: Zongsheng Cao, Yangfan He, Anran Liu, Feng Chen, Zepeng Wang, Jun Xie
First: 2025-12-29T14:10:22+00:00 · Latest: 2025-12-29T14:10:22+00:00
Abstract
Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: (i) a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and (ii) an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing redundancy while preserving representativeness. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning. The resulting system offers a lightweight, budget-friendly upgrade path and consistently surpasses most leading baselines across established long-video benchmarks such as Video-MME, MLVU, and LongVideoBench, confirming the effectiveness of our model. The code can be found at https://github.com/AI-Researcher-Team/TV-RAG.
中文标题/摘要
标题:TV-RAG:一种时间感知和语义熵加权框架,用于长视频检索与理解
大型视频语言模型(LVLMs)迅速成为多媒体AI研究的焦点。然而,当面对长视频时,这些模型表现不佳:它们的时间窗口狭窄,无法注意到长时间内发生的细微语义变化。此外,主流基于文本的检索管道主要依赖于表面词汇重叠,忽略了视觉、音频和字幕通道之间的丰富时间依赖性。为了解决这些限制,我们提出了TV-RAG,这是一种无需训练的架构,将时间对齐与基于熵的语义相结合,以提高长视频推理能力。该框架贡献了两种主要机制:(i) 一种时间衰减检索模块,将显式的时间偏移注入相似性计算中,从而根据其真实的多媒体上下文对文本查询进行排序;(ii) 一种熵加权关键帧采样器,选择均匀分布、信息密集的帧,减少冗余同时保持代表性。通过将这些时间和语义信号结合起来,TV-RAG 实现了一种双层推理机制,可以无缝地附加到任何 LVLM 而无需重新训练或微调。由此产生的系统提供了一种轻量级、经济实惠的升级路径,并在诸如Video-MME、MLVU和LongVideoBench等公认长视频基准测试中始终超越大多数领先基准,证实了我们模型的有效性。代码可以在https://github.com/AI-Researcher-Team/TV-RAG/找到。
Summary / 总结
TV-RAG is a training-free framework that enhances long-video retrieval and understanding by integrating temporal alignment and entropy-weighted semantics. It introduces a time-decay retrieval module to better rank text queries based on their multimedia context and an entropy-weighted key-frame sampler to select informative frames. Experimental results show that TV-RAG outperforms most leading baselines on long-video benchmarks, demonstrating its effectiveness in handling lengthy videos with fine-grained semantic shifts and temporal dependencies. The code is available at https://github.com/AI-Researcher-Team/TV-RAG.
TV-RAG 是一个无需训练的框架,通过结合时间对齐和熵加权语义来增强长视频的检索和理解。它引入了时间衰减检索模块以更好地根据多媒体上下文对文本查询进行排序,并使用熵加权关键帧采样器选择信息密集的帧。实验结果表明,TV-RAG 在长视频基准测试(如 Video-MME、MLVU 和 LongVideoBench)上优于大多数领先基线,证明了其在处理具有细粒度语义变化和时间依赖性的长视频方面的有效性。代码可在 https://github.com/AI-Researcher-Team/TV-RAG 获取。
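The time-decay retrieval idea described above can be illustrated with a minimal sketch. The abstract only says that temporal offsets are injected into the similarity computation; the exponential decay form, the `decay_rate` parameter, and the function names below are assumptions made for illustration, not the authors' implementation.

```python
import math


def time_decay_scores(similarities, chunk_times, query_time, decay_rate=0.01):
    """Scale each chunk's similarity by exp(-decay_rate * |t - query_time|).

    The exponential form is an assumed choice; the paper may use another decay.
    """
    return [
        sim * math.exp(-decay_rate * abs(t - query_time))
        for sim, t in zip(similarities, chunk_times)
    ]


def rank_chunks(similarities, chunk_times, query_time, decay_rate=0.01):
    """Return chunk indices ordered from highest to lowest decayed score."""
    scores = time_decay_scores(similarities, chunk_times, query_time, decay_rate)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

With `decay_rate=0` the ranking falls back to plain similarity; a larger rate favours chunks (for example subtitle or audio segments) whose timestamps lie close to the moment the query refers to.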
CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models
Authors: Zongsheng Cao, Yangfan He, Anran Liu, Jun Xie, Feng Chen, Zepeng Wang
First: 2025-12-29T13:23:20+00:00 · Latest: 2025-12-29T13:23:20+00:00
Abstract
Large Vision-Language Models (LVLMs) have achieved impressive progress in multi-modal understanding and generation. However, they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. We propose CoFi-Dec, a training-free decoding framework that mitigates hallucinations by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. These responses are then transformed into synthetic images using a text-to-image model, forming multi-level visual hypotheses that enrich grounding cues. To unify the predictions from these multiple visual conditions, we introduce a Wasserstein-based fusion mechanism that aligns their predictive distributions into a geometrically consistent decoding trajectory. This principled fusion reconciles high-level semantic consistency with fine-grained visual grounding, leading to more robust and faithful outputs. Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies. The framework is model-agnostic, requires no additional training, and can be seamlessly applied to a wide range of LVLMs. The implementation is available at https://github.com/AI-Researcher-Team/CoFi-Dec.
中文标题/摘要
标题:CoFi-Dec:通过从粗到细生成反馈的解码框架减少幻觉
大型视觉-语言模型(LVLMs)在多模态理解和生成方面取得了显著进展。然而,它们仍然倾向于生成与视觉输入不一致的幻觉内容,这限制了它们在实际应用中的可靠性。我们提出了**CoFi-Dec**,一种无需训练的解码框架,通过将生成的自反馈与从粗到细的视觉条件相结合来减轻幻觉。受人类视觉过程从全局场景感知到详细检查的启发,CoFi-Dec 首先根据原始图像的粗略和精细视图生成两个中间的文本响应。这些响应然后使用文本到图像模型转换为合成图像,形成多层次的视觉假设,丰富了语义线索。为了统一这些多种视觉条件下的预测,我们引入了一种基于Wasserstein的融合机制,将它们的预测分布对齐到几何一致的解码轨迹。这种原理性的融合实现了高层语义一致性和精细视觉语义线索的统一,从而产生更稳健和忠实的输出。在六个幻觉重点基准上的广泛实验表明,CoFi-Dec 显著减少了实体级和语义级的幻觉,优于现有解码策略。该框架是模型无关的,无需额外训练,并可以无缝应用于各种LVLMs。实现代码可在https://github.com/AI-Researcher-Team/CoFi-Dec 获取。
Summary / 总结
CoFi-Dec is a training-free decoding framework designed to reduce hallucinations in large vision-language models by integrating generative self-feedback with coarse-to-fine visual conditioning. It generates intermediate textual responses based on different levels of image views, transforms them into synthetic images, and uses a Wasserstein-based fusion mechanism to align their predictive distributions, ensuring both semantic consistency and fine-grained visual grounding. Experiments on six benchmarks demonstrate that CoFi-Dec significantly reduces both entity-level and semantic-level hallucinations, outperforming existing strategies.
CoFi-Dec 是一个无需训练的解码框架,通过结合生成自反馈和从粗到细的视觉条件来减少大型视觉语言模型中的幻觉。它基于不同级别的图像视图生成中间文本响应,将其转换为合成图像,并使用 Wasserstein 基础融合机制对预测分布进行对齐,确保语义一致性和精细的视觉定位。实验在六个基准上表明,CoFi-Dec 显著减少了实体级和语义级的幻觉,优于现有策略。
OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving
Authors: Zhenguo Zhang, Haohan Zheng, Yishen Wang, Le Xu, Tianchen Deng, Xuefeng Chen, Qu Chen, Bo Zhang, Wuxiong Huang
First: 2025-12-16T03:19:28+00:00 · Latest: 2025-12-29T12:27:08+00:00
Abstract
The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. Thus we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.
中文标题/摘要
标题:OmniDrive-R1:强化驱动的交织多模态链式思考方法及其在可信视觉语言自动驾驶中的应用
在自动驾驶(AD)等安全关键领域部署视觉语言模型(VLMs)受到可靠性故障的严重阻碍,尤其是对象幻觉。这种故障源于它们依赖于基于文本的链式思考(CoT)推理。虽然现有的多模态CoT方法试图缓解这一问题,但它们存在两个根本缺陷:(1)感知和推理阶段的分离,这妨碍了端到端联合优化,(2)依赖昂贵的密集定位标签。因此,我们提出了OmniDrive-R1,这是一种专为自动驾驶设计的端到端VLM框架,通过交织多模态链式思考(iMCoT)机制统一了感知和推理。我们的核心创新是一种强化驱动的视觉定位能力,使模型能够自主地引导其注意力并“聚焦”在关键区域进行精细分析。这种能力得益于我们纯两阶段强化学习训练管道和Clip-GRPO算法。关键的是,Clip-GRPO引入了一种无需标注的过程导向定位奖励。这种奖励不仅消除了对密集标签的需求,还通过强制实时跨模态一致性来避免外部工具调用的不稳定性。在DriveLMM-o1上的大量实验表明,我们的模型取得了显著改进。与基线Qwen2.5VL-7B相比,OmniDrive-R1的整体推理得分从51.77%提高到80.35%,最终答案准确性从37.81%提高到73.62%。
Summary / 总结
OmniDrive-R1 is an end-to-end VLM framework for autonomous driving that integrates perception and reasoning through an interleaved Multi-modal Chain-of-Thought mechanism. It introduces a reinforcement-driven visual grounding capability, allowing the model to focus on critical regions for fine-grained analysis. Experiments on DriveLMM-o1 show that OmniDrive-R1 significantly improves reasoning scores and answer accuracy compared to the baseline Qwen2.5VL-7B, with improvements from 51.77% to 80.35% in reasoning scores and from 37.81% to 73.62% in answer accuracy.
OmniDrive-R1 是一种端到端的视觉-语言模型框架,用于自动驾驶,通过交错的多模态链式思考机制将感知和推理结合起来。它引入了一种强化驱动的视觉定位能力,使模型能够聚焦于关键区域进行精细分析。DriveLMM-o1 实验显示,与基线 Qwen2.5VL-7B 相比,OmniDrive-R1 在推理得分和最终答案准确性上有了显著提升,推理得分从 51.77% 提高到 80.35%,最终答案准确性从 37.81% 提高到 73.62%。
IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation
Authors: Zeteng Lin, Xingxing Li, Wen You, Xiaoyang Li, Zehan Lu, Yujun Cai, Jing Tang
First: 2025-10-13T03:19:45+00:00 · Latest: 2025-12-29T11:52:31+00:00
Abstract
Existing vision language models (VLMs), including GPT-4 and DALL·E, often struggle to preserve logic, object identity, and style in multimodal image-text generation. This limitation significantly hinders the generalization capability of VLMs in complex image-text input-output scenarios. To address this issue, we propose IUT-Plug, a module grounded in an Image Understanding Tree (IUT), which enhances existing interleaved VLMs through explicit structured reasoning, thereby mitigating context drift in logic, entity identity, and style. The proposed framework operates in two stages. (1) A dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures. (2) A coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency. To evaluate our approach, we construct a novel benchmark based on 3,000 real human-generated question-answer pairs over fine-tuned large models, introducing a dynamic evaluation protocol for quantifying context drift in interleaved VLMs. Experimental results demonstrate that IUT-Plug not only improves accuracy on established benchmarks but also effectively alleviates the three critical forms of context drift across diverse multimodal question answering (QA) scenarios.
中文标题/摘要
标题:IUT-Plug:一种基于图像理解树的插件工具,用于交错图像-文本生成
现有的视觉语言模型(VLMs),包括GPT-4和DALL.E,往往难以在多模态图像-文本生成中保持逻辑、对象身份和风格的一致性。这一限制显著阻碍了VLMs在复杂图像-文本输入输出场景中的泛化能力。为了解决这一问题,我们提出了一种基于图像理解树(IUT)的IUT-Plug模块,通过显式的结构化推理增强现有的交错VLMs,从而减轻逻辑、实体身份和风格中的上下文漂移。该提出的框架分为两个阶段。(1)动态IUT-Plug提取模块将视觉场景解析为分层符号结构。(2)协调的叙述流和图像合成机制确保跨模态一致性。为了评估我们的方法,我们基于3,000个真实的人类生成的问题-答案对构建了一个新的基准,引入了一种动态评估协议,用于量化交错VLMs中的上下文漂移。实验结果表明,IUT-Plug不仅在现有的基准测试中提高了准确性,还在多种多模态问答(QA)场景中有效缓解了三种关键形式的上下文漂移。
Summary / 总结
The paper addresses the limitations of existing vision language models (VLMs) in preserving logic, object identity, and style in image-text generation. To tackle this issue, the authors propose IUT-Plug, a module based on an Image Understanding Tree (IUT) that enhances VLMs through explicit structured reasoning. The framework consists of two stages: a dynamic IUT-Plug extraction module that parses visual scenes into hierarchical symbolic structures, and a coordinated narrative-flow and image synthesis mechanism that ensures cross-modal consistency. The effectiveness of IUT-Plug is evaluated using a novel benchmark with 3,000 real human-generated question-answer pairs, showing improvements in accuracy and alleviation of context drift in various multimodal QA scenarios.
论文针对现有视觉语言模型(VLMs)在图像-文本生成中难以保持逻辑、对象身份和风格的问题,提出了基于图像理解树(IUT)的IUT-Plug模块,通过结构化推理增强VLMs。框架分为两个阶段:将视觉场景解析为层次符号结构,并确保跨模态一致性。实验表明,IUT-Plug不仅在基准测试中提高了准确性,还在多种多模态问答场景中有效减少了上下文漂移。
SpatialMosaic: A Multiview VLM Dataset for Partial Visibility
Authors: Kanghee Lee, Injae Lee, Minseok Kwak, Kwonyoung Ryu, Jungi Hong, Jaesik Park
First: 2025-12-29T10:48:54+00:00 · Latest: 2025-12-29T10:48:54+00:00
Abstract
The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. However, existing approaches often rely on pre-constructed 3D representations or off-the-shelf reconstruction pipelines, which constrain scalability and real-world applicability. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling Vision-Language Models (VLMs) to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios, consisting of 1M QA pairs across 6 tasks. In addition, we present SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning. Extensive experiments demonstrate that our proposed dataset and VQA tasks effectively enhance spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and diverse QA pairs. Code and dataset will be available soon.
中文标题/摘要
标题:SpatialMosaic:多视角VLM数据集以应对部分可见性
多模态大型语言模型(MLLMs)的快速发展为增强三维场景理解和空间推理提供了潜力。然而,现有方法通常依赖于预先构建的三维表示或现成的重建管道,这限制了其可扩展性和现实世界的适用性。近期的研究工作探索了直接从多视角图像中学习空间推理的方法,使视觉语言模型(VLMs)能够在无需显式三维重建的情况下理解三维场景。尽管如此,现实环境中经常出现的关键挑战,如部分可见性、遮挡和低重叠条件,需要从碎片化的视觉线索中进行空间推理,这些挑战仍然未得到充分探索。为了解决这些限制,我们提出了一种可扩展的多视角数据生成和注释管道,构建了真实的空间推理问答对,从而形成了SpatialMosaic,一个包含200万问答对的全面指令调优数据集。我们还引入了SpatialMosaic-Bench,这是一个具有挑战性的基准,用于在现实和具有挑战性的场景中评估多视角空间推理,包含6个任务的100万问答对。此外,我们提出了SpatialMosaicVLM,这是一种混合框架,将3D重建模型作为几何编码器集成到VLMs中,以实现稳健的空间推理。广泛的实验表明,我们提出的数据集和VQA任务在具有挑战性的多视角条件下有效提升了空间推理能力,验证了我们数据生成管道在构建真实和多样化问答对方面的有效性。代码和数据集将很快提供。
Summary / 总结
The paper proposes SpatialMosaic, an instruction-tuning dataset that helps multi-view Vision-Language Models (VLMs) reason spatially under partial visibility. A scalable multi-view data generation and annotation pipeline creates realistic spatial reasoning questions and answers, yielding 2 million QA pairs. A companion benchmark, SpatialMosaic-Bench, contains 1 million QA pairs across 6 tasks for evaluating multi-view spatial reasoning. The paper also presents SpatialMosaicVLM, a hybrid framework that uses 3D reconstruction models as geometry encoders within VLMs. Experiments show that the proposed dataset and VQA tasks enhance spatial reasoning in challenging multi-view conditions, validating the effectiveness of the data generation pipeline.
论文提出了SpatialMosaic数据集,旨在增强视觉-语言模型(VLMs)在部分可见条件下的空间推理能力。它提出了一种多视角数据生成和标注管道,创建了200万对真实的空间推理问题和答案。该数据集还包含SpatialMosaic-Bench基准,包含100万对问题和答案,覆盖6个任务。此外,还提出了结合3D重建模型的混合框架SpatialMosaicVLM,将其作为几何编码器集成到VLM中。实验表明,该数据集和VQA任务在具有挑战性的多视角条件下提高了空间推理能力,验证了所提数据生成管道的有效性。
Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt
Authors: Shangxun Li, Youngjung Uh
First: 2025-12-18T11:55:06+00:00 · Latest: 2025-12-29T10:07:18+00:00
Abstract
Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.
中文标题/摘要
标题:文本嵌入的空间解缠用于单个提示生成主题一致的文本到图像生成
文本到图像的扩散模型在从自然语言描述生成高质量图像方面表现出色,但在多个输出中保持主题一致性方面经常失败,限制了其在视觉叙事中的应用。现有方法依赖于模型微调或图像条件化,这在计算上昂贵且需要针对每个主题进行优化。1Prompt1Story 是一种无需训练的方法,将所有场景描述连接成一个提示并重新缩放标记嵌入,但这种方法遭受语义泄露的问题,即帧间嵌入变得纠缠,导致文本对齐不良。在本文中,我们提出了一种简单而有效的无需训练的方法,从几何学角度出发,通过细化文本嵌入来抑制不需要的语义,从而解决语义纠缠问题。广泛的实验表明,我们的方法在主题一致性和文本对齐方面显著优于现有基线。
Summary / 总结
The research aims to improve subject consistency in text-to-image generation by addressing semantic entanglement. The method refines text embeddings to suppress unwanted semantics from a geometric perspective, using a single prompt for multiple scene descriptions. Experiments show that this approach enhances subject consistency and text alignment compared to existing methods without requiring model fine-tuning or per-subject optimization.
研究旨在通过解决语义缠绕问题来提高文本到图像生成中的主题一致性。方法是从几何角度精炼文本嵌入以抑制不必要的语义,无需进行模型微调或针对每个主题进行优化。实验表明,这种方法在主题一致性和文本对齐方面优于现有方法。
Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control
Authors: Yoonpyo Lee, Kazuma Kobayashi, Sai Puppala, Sajedul Talukder, Seid Koric, Souvik Chakraborty, Syed Bahauddin Alam
First: 2025-12-29T08:26:27+00:00 · Latest: 2025-12-29T08:26:27+00:00
Abstract
The prevailing paradigm in AI for physical systems, scaling general-purpose foundation models toward universal multimodal reasoning, confronts a fundamental barrier at the control interface. Recent benchmarks show that even frontier vision-language models achieve only 50-53% accuracy on basic quantitative physics tasks, behaving as approximate guessers that preserve semantic plausibility while violating physical constraints. This input unfaithfulness is not a scaling deficiency but a structural limitation. Perception-centric architectures optimize parameter-space imitation, whereas safety-critical control demands outcome-space guarantees over executed actions. Here, we present a fundamentally different pathway toward domain-specific foundation models by introducing compact language models operating as Agentic Physical AI, in which policy optimization is driven by physics-based validation rather than perceptual inference. We train a 360-million-parameter model on synthetic reactor control scenarios, scaling the dataset from 10^3 to 10^5 examples. This induces a sharp phase transition absent in general-purpose models. Small-scale systems exhibit high-variance imitation with catastrophic tail risk, while large-scale models undergo variance collapse exceeding 500x reduction, stabilizing execution-level behavior. Despite balanced exposure to four actuation families, the model autonomously rejects approximately 70% of the training distribution and concentrates 95% of runtime execution on a single-bank strategy. Learned representations transfer across distinct physics and continuous input modalities without architectural modification.
中文标题/摘要
标题:面向核反应堆控制的专用领域基础模型的代理物理AI
当前AI在物理系统中的范式,将通用基础模型扩展到多模态通用推理,面临控制接口的基本障碍。最近的基准测试显示,即使是前沿的视觉-语言模型,在基本的定量物理任务上也只能达到50-53%的准确率,它们更像是近似猜测者,保持语义合理性的同时违反物理约束。这种输入不忠实不是扩展不足,而是结构性限制。以感知为中心的架构优化参数空间模仿,而安全关键控制则需要执行动作的结果空间保证。在这里,我们通过引入作为代理物理AI的紧凑语言模型,提出了通往专用领域基础模型的不同路径,在这种模型中,策略优化由基于物理的验证驱动,而不是感知推理。我们训练了一个3.6亿参数的模型,在合成的反应堆控制场景上,将数据集从10^3扩展到10^5个例子。这在通用基础模型中没有出现相变。小型系统表现出高方差模仿,伴随灾难性尾部风险,而大型模型经历方差崩溃,超过500倍的减少,稳定执行级行为。尽管对四种执行家族有均衡的暴露,模型自主拒绝了大约70%的训练分布,并将95%的运行时执行集中在单一银行策略上。学习到的表示在不同的物理和连续输入模态之间转移,无需架构修改。
Summary / 总结
The research aims to address the limitations of general-purpose AI models in controlling physical systems, particularly nuclear reactors, by introducing Agentic Physical AI. This approach uses compact language models that optimize for physics-based validation rather than perceptual imitation. The study trains a 360-million-parameter model on synthetic reactor control scenarios, achieving a significant reduction in variance and stabilizing execution. Despite exposure to various actuation methods, the model focuses on a single strategy, demonstrating robust performance and transferability across different physical conditions and input modalities.
研究旨在通过引入Agentic Physical AI解决通用AI模型在控制物理系统(尤其是核反应堆)方面的局限性。该方法使用紧凑的语言模型,基于物理验证优化策略,而非感知推理。研究通过在合成的反应堆控制场景中训练一个3.6亿参数的模型,实现了显著的方差减少和执行层面行为的稳定。尽管接触了多种执行家族,模型仍集中于单一策略,展示了在不同物理和输入模态下的可迁移性,无需修改架构。
Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization
Authors: Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su
First: 2025-12-29T07:36:36+00:00 · Latest: 2025-12-29T07:36:36+00:00
Abstract
Although Diffusion Transformer (DiT) has emerged as a predominant architecture for image and video generation, its iterative denoising process results in slow inference, which hinders broader applicability and development. Caching-based methods achieve training-free acceleration, while suffering from considerable computational error. Existing methods typically incorporate error correction strategies such as pruning or prediction to mitigate it. However, their fixed caching strategy fails to adapt to the complex error variations during denoising, which limits the full potential of error correction. To tackle this challenge, we propose a novel fidelity-optimization plugin for existing error correction methods via cumulative error minimization, named CEM. CEM predefines the error to characterize the sensitivity of model to acceleration jointly influenced by timesteps and cache intervals. Guided by this prior, we formulate a dynamic programming algorithm with cumulative error approximation for strategy optimization, which achieves the caching error minimization, resulting in a substantial improvement in generation fidelity. CEM is model-agnostic and exhibits strong generalization, which is adaptable to arbitrary acceleration budgets. It can be seamlessly integrated into existing error correction frameworks and quantized models without introducing any additional computational overhead. Extensive experiments conducted on nine generation models and quantized methods across three tasks demonstrate that CEM significantly improves generation fidelity of existing acceleration models, and outperforms the original generation performance on FLUX.1-dev, PixArt-$α$, StableDiffusion1.5 and Hunyuan. The code will be made publicly available.
中文标题/摘要
标题:通过累积误差最小化实现的插即用保真度优化以加速扩散变换器
尽管扩散变换器(DiT)已成为图像和视频生成的主要架构,但其迭代去噪过程导致推理速度缓慢,这阻碍了其更广泛的适用性和发展。基于缓存的方法实现了无需训练的加速,但会遭受显著的计算误差。现有方法通常通过剪枝或预测等错误校正策略来减轻这种误差。然而,它们固定的缓存策略无法适应去噪过程中复杂的误差变化,这限制了错误校正的全部潜力。为应对这一挑战,我们提出了一种新的基于累积误差最小化的保真度优化插件,命名为CEM。CEM 预定义了误差以表征模型对加速的敏感性,该敏感性由时间步和缓存间隔共同影响。在这一先验的指导下,我们提出了一个基于累积误差近似的动态规划算法来进行策略优化,从而实现缓存误差最小化,显著提高了生成保真度。CEM 是模型无关的,并且表现出强大的泛化能力,可以适应任意的加速预算。它可以无缝集成到现有的错误校正框架和量化模型中,而不会引入任何额外的计算开销。在三个任务上对九个生成模型和量化方法进行的广泛实验表明,CEM 显著提高了现有加速模型的生成保真度,并在 FLUX.1-dev、PixArt-$α$、StableDiffusion1.5 和 Hunyuan 上优于原始生成性能。代码将公开发布。
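To make the dynamic-programming step concrete, here is a minimal sketch of choosing which denoising steps to recompute under a fixed budget. It assumes a hypothetical `seg_error(i, j)` function giving the error of one full computation at step `i` whose cached result is reused for steps `i` to `j - 1`, and it assumes segment errors simply add up. The abstract does not specify CEM's actual error prior or recurrence, so this only illustrates the general technique.

```python
def optimal_cache_schedule(seg_error, num_steps, budget):
    """Pick `budget` full-compute steps that minimise the summed segment error.

    seg_error(i, j) is an assumed, user-supplied estimate of the error caused by
    computing at step i and reusing the cache for steps i..j-1.
    """
    if not 1 <= budget <= num_steps:
        raise ValueError("budget must be between 1 and num_steps")
    inf = float("inf")
    # best[k][j]: minimal cumulative error covering steps 0..j-1 with k computes
    best = [[inf] * (num_steps + 1) for _ in range(budget + 1)]
    prev = [[-1] * (num_steps + 1) for _ in range(budget + 1)]
    best[0][0] = 0.0
    for k in range(1, budget + 1):
        for j in range(k, num_steps + 1):
            for i in range(k - 1, j):
                cand = best[k - 1][i] + seg_error(i, j)
                if cand < best[k][j]:
                    best[k][j] = cand
                    prev[k][j] = i
    # Walk the back-pointers to recover the chosen compute steps
    steps, j = [], num_steps
    for k in range(budget, 0, -1):
        j = prev[k][j]
        steps.append(j)
    return best[budget][num_steps], steps[::-1]
```

For example, with an error that grows quadratically in the number of reused steps, the schedule spreads the full computations evenly; a timestep-dependent `seg_error` would instead concentrate them where the model is most sensitive.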
Object-Centric Representation Learning for Enhanced 3D Scene Graph Prediction
Authors: KunHo Heo, GiHyun Kim, SuYeon Kim, MyeongAh Cho
Venue: NeurIPS 2025
First: 2025-10-06T11:33:09+00:00 · Latest: 2025-12-29T07:15:13+00:00
Comments: Accepted by NeurIPS 2025. Code: https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes
Abstract
3D Semantic Scene Graph Prediction aims to detect objects and their semantic relationships in 3D scenes, and has emerged as a crucial technology for robotics and AR/VR applications. While previous research has addressed dataset limitations and explored various approaches including Open-Vocabulary settings, they frequently fail to optimize the representational capacity of object and relationship features, showing excessive reliance on Graph Neural Networks despite insufficient discriminative capability. In this work, we demonstrate through extensive analysis that the quality of object features plays a critical role in determining overall scene graph accuracy. To address this challenge, we design a highly discriminative object feature encoder and employ a contrastive pretraining strategy that decouples object representation learning from the scene graph prediction. This design not only enhances object classification accuracy but also yields direct improvements in relationship prediction. Notably, when plugging in our pretrained encoder into existing frameworks, we observe substantial performance improvements across all evaluation metrics. Additionally, whereas existing approaches have not fully exploited the integration of relationship information, we effectively combine both geometric and semantic features to achieve superior relationship prediction. Comprehensive experiments on the 3DSSG dataset demonstrate that our approach significantly outperforms previous state-of-the-art methods. Our code is publicly available at https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes.
中文标题/摘要
标题:面向对象的表示学习以增强3D场景图预测
3D语义场景图预测旨在检测3D场景中的对象及其语义关系,并已成为机器人技术和AR/VR应用中的关键技术。尽管先前的研究解决了数据集限制并探索了各种方法,包括开放式词汇设置,但它们经常未能优化对象和关系特征的表示能力,过度依赖图神经网络,尽管其区分能力不足。在本工作中,我们通过广泛的分析表明,对象特征的质量对整体场景图准确性起着关键作用。为了解决这一挑战,我们设计了一种高度区分的对象特征编码器,并采用对比预训练策略,将对象表示学习与场景图预测分离。这一设计不仅提高了对象分类准确性,还直接提高了关系预测。值得注意的是,当将我们的预训练编码器插入现有框架时,我们观察到所有评估指标上都取得了显著性能提升。此外,与现有方法未能充分利用关系信息的整合不同,我们有效结合了几何和语义特征,实现了更优的关系预测。在3DSSG数据集上的全面实验表明,我们的方法显著优于先前的最先进方法。我们的代码可在https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes公开获取。
Summary / 总结
This paper addresses the challenge of 3D semantic scene graph prediction by focusing on the quality of object features. It introduces a discriminative object feature encoder and a contrastive pretraining strategy that improves both object and relationship prediction. The approach significantly outperforms previous methods across all evaluation metrics on the 3DSSG dataset.
该研究针对3D语义场景图预测中的对象特征质量不足问题,提出了一种区分性对象特征编码器和对比预训练策略,以提升对象和关系预测。在3DSSG数据集上的实验表明,该方法显著优于之前的方法,并在所有评估指标上取得了显著的性能提升。
ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation
Authors: Shin seong Kim, Minjung Shin, Hyunin Cho, Youngjung Uh
First: 2025-12-29T07:06:57+00:00 · Latest: 2025-12-29T07:06:57+00:00
Abstract
Recent text-to-image diffusion models have significantly improved visual quality and text alignment. However, generating a sequence of images while preserving consistent character identity across diverse scene descriptions remains a challenging task. Existing methods often struggle with a trade-off between maintaining identity consistency and ensuring per-image prompt alignment. In this paper, we introduce a novel framework, ASemConsist, that addresses this challenge through selective text embedding modification, enabling explicit semantic control over character identity without sacrificing prompt alignment. Furthermore, based on our analysis of padding embeddings in FLUX, we propose a semantic control strategy that repurposes padding embeddings as semantic containers. Additionally, we introduce an adaptive feature-sharing strategy that automatically evaluates textual ambiguity and applies constraints only to the ambiguous identity prompt. Finally, we propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric, explicitly capturing performance imbalances between the two metrics. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs. Project page: https://minjung-s.github.io/asemconsist
中文标题/摘要
标题:ASemConsist: 自适应语义特征控制以实现无需训练的身份一致生成
近期的文本到图像扩散模型在视觉质量和文本对齐方面取得了显著进步。然而,在多样场景描述下生成一系列图像并保持角色身份一致仍然是一个具有挑战性的任务。现有方法往往在保持身份一致性与确保单张图像提示对齐之间存在权衡。在本文中,我们提出了一种新颖框架ASemconsist,通过选择性地修改文本嵌入,实现对角色身份的显式语义控制,同时不牺牲提示对齐。此外,基于对FLUX中填充嵌入分析,我们提出了一种语义控制策略,将填充嵌入重新利用为语义容器。我们还引入了一种自适应特征共享策略,自动评估文本的模糊性,并仅对模糊身份提示施加约束。最后,我们提出了一种统一的评估协议,一致性质量分数(CQS),将身份保留和单张图像文本对齐整合为一个综合指标,明确捕捉两个指标之间的性能失衡。我们的框架实现了最先进的性能,有效克服了先前的权衡。
Summary / 总结
The research aims to address the challenge of generating a sequence of images with consistent character identity across diverse scene descriptions while maintaining prompt alignment. The ASemConsist framework uses selective text embedding modification to enable explicit semantic control over character identity without compromising prompt alignment. Key findings include the repurposing of padding embeddings as semantic containers and an adaptive feature-sharing strategy that evaluates textual ambiguity. The proposed Consistency Quality Score (CQS) integrates identity preservation and per-image text alignment into a single metric, demonstrating state-of-the-art performance in overcoming previous trade-offs.
研究旨在解决在多样场景描述下生成一系列具有一致人物身份的图像的同时保持提示对齐的问题。ASemConsist框架通过选择性文本嵌入修改来实现对人物身份的显式语义控制,而不牺牲提示对齐。关键发现包括提高了人物身份一致性和每张图像的文本对齐,以及引入了一致性质量评分(CQS)来全面评估性能。该框架在处理身份一致性和提示对齐之间的权衡方面优于先前的方法。
ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing
Authors: Xingwei Ma, Shiyang Feng, Bo Zhang, Bin Wang
First: 2025-12-29T06:58:46+00:00 · Latest: 2025-12-29T06:58:46+00:00
Abstract
Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from challenges such as inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. Then, the decoder integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.
中文标题/摘要
标题:ViLaCD-R1:一种用于遥感领域语义变化检测的视觉-语言框架
遥感变化检测(RSCD),一个复杂的多图像推理任务,传统上使用基于像素的操作符或编码-解码网络,这些方法难以捕捉高层语义且容易受到非语义干扰的影响。尽管最近的多模态和视觉-语言模型(VLM)方法通过引入文本描述来增强变化区域的语义理解,但它们仍然面临诸如不准确的空间定位、不精确的像素级边界划分和有限的可解释性等挑战。为了解决这些问题,我们提出了一种两阶段框架ViLaCD-R1,该框架包括多图像推理器(MIR)和掩码引导解码器(MGD)。具体而言,VLM 通过监督微调(SFT)和强化学习(RL)在块级双时相推理任务上进行训练,以双时相图像块作为输入并输出粗略的变化掩码。然后,解码器将双时相图像特征与该粗略掩码结合起来,预测精确的二元变化图。在多个RSCD基准上的全面评估表明,ViLaCD-R1 显著提高了真实语义变化的识别和定位,稳健地抑制了非语义变化,并在复杂的真实世界场景中达到了最先进的准确率。
Summary / 总结
The paper addresses the limitations of traditional pixel-based and encoder-decoder approaches in remote sensing change detection by proposing ViLaCD-R1, a two-stage framework that integrates a Multi-Image Reasoner and a Mask-Guided Decoder. The VLM is trained using supervised fine-tuning and reinforcement learning to generate a coarse change mask, which is then refined by the decoder to produce a precise binary change map. Experimental results show that ViLaCD-R1 enhances true semantic change recognition and localization, suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex scenarios.
论文提出了一种名为ViLaCD-R1的两阶段框架,用于解决遥感变化检测中的问题,通过结合多图像推理器和掩码引导解码器。该框架使用通过监督微调和强化学习训练的视觉-语言模型生成粗略的变化掩码,然后通过解码器细化生成精确的二元变化图。实验表明,ViLaCD-R1增强了真正的语义变化识别和定位,抑制了非语义变化,并在复杂场景中达到了最先进的准确性。
Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism
Authors: Siyu Zhang, Ying Chen, Lianlei Shan, Runhe Qiu
First: 2025-12-29T06:51:20+00:00 · Latest: 2025-12-29T06:51:20+00:00
Abstract
Multimodal fusion of remote sensing images serves as a core technology for overcoming the limitations of single-source data and improving the accuracy of surface information extraction, which exhibits significant application value in fields such as environmental monitoring and urban planning. To address the deficiencies of existing methods, including the failure of fixed resolutions to balance efficiency and detail, as well as the lack of semantic hierarchy in single-scale alignment, this study proposes a Vision-language Model (VLM) framework integrated with two key innovations: the Dynamic Resolution Input Strategy (DRIS) and the Multi-scale Vision-language Alignment Mechanism (MS-VLAM). Specifically, the DRIS adopts a coarse-to-fine approach to adaptively allocate computational resources according to the complexity of image content, thereby preserving key fine-grained features while reducing redundant computational overhead. The MS-VLAM constructs a three-tier alignment mechanism covering object, local-region and global levels, which systematically captures cross-modal semantic consistency and alleviates issues of semantic misalignment and granularity imbalance. Experimental results on the RS-GPT4V dataset demonstrate that the proposed framework significantly improves the accuracy of semantic understanding and computational efficiency in tasks including image captioning and cross-modal retrieval. Compared with conventional methods, it achieves superior performance in evaluation metrics such as BLEU-4 and CIDEr for image captioning, as well as R@10 for cross-modal retrieval. This technical framework provides a novel approach for constructing efficient and robust multimodal remote sensing systems, laying a theoretical foundation and offering technical guidance for the engineering application of intelligent remote sensing interpretation.
中文标题/摘要
标题:遥感图像多模态解释:动态分辨率输入策略和多尺度视觉-语言对齐机制
遥感图像的多模态融合是一种克服单一数据源限制、提高地表信息提取准确性的核心技术,在环境监测和城市规划等领域具有显著的应用价值。为解决现有方法存在的固定分辨率难以平衡效率与细节、单尺度对齐缺乏语义层次等缺陷,本研究提出了一种结合两种关键创新的视觉-语言模型(VLM)框架:动态分辨率输入策略(DRIS)和多尺度视觉-语言对齐机制(MS-VLAM)。具体而言,DRIS采用由粗到细的方法,根据图像内容的复杂性适配性地分配计算资源,从而保留关键的细粒度特征,同时减少冗余计算开销。MS-VLAM构建了涵盖对象、局部区域和全局三个层级的对齐机制,系统地捕捉跨模态语义一致性,缓解语义错位和粒度失衡的问题。在RS-GPT4V数据集上的实验结果表明,所提出的框架在图像描述和跨模态检索等任务中显著提高了语义理解和计算效率。与传统方法相比,它在图像描述任务的BLEU-4和CIDEr评估指标以及跨模态检索任务的R@10方面均表现出更优的性能。该技术框架为构建高效稳健的多模态遥感系统提供了新的方法,为智能遥感解释的工程应用奠定了理论基础并提供了技术指导。
Summary / 总结
This study addresses the limitations of fixed resolutions in remote sensing image processing and the lack of semantic hierarchy in single-scale alignment by proposing a Vision-language Model (VLM) framework with Dynamic Resolution Input Strategy (DRIS) and Multi-scale Vision-language Alignment Mechanism (MS-VLAM). The DRIS allocates computational resources adaptively based on image content complexity, preserving fine-grained features while reducing computational overhead. The MS-VLAM constructs a three-tier alignment mechanism to capture cross-modal semantic consistency, addressing semantic misalignment and granularity imbalance. Experimental results show that the proposed framework improves semantic understanding and computational efficiency, outperforming conventional methods in image captioning and cross-modal retrieval tasks.
本研究通过提出结合动态分辨率输入策略(DRIS)和多尺度视觉语言对齐机制(MS-VLAM)的视觉语言模型(VLM)框架,解决了固定分辨率在遥感图像处理中的局限性和单尺度对齐中缺乏语义层次的问题。DRIS根据图像内容的复杂性动态分配计算资源,保留细粒度特征并减少计算开销。MS-VLAM构建了涵盖对象、局部区域和全局三个层级的对齐机制,以捕捉跨模态语义一致性。实验结果表明,该框架在语义理解和计算效率方面均优于传统方法,在图像描述和跨模态检索任务中表现出色。
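A rough illustration of the coarse-to-fine idea behind DRIS, not the authors' method: inspect a cheap low-resolution view first, and spend a higher input resolution only when the content looks complex. The variance measure, the threshold, and the two resolutions (`low`, `high`) are hypothetical choices made for this sketch.

```python
def pick_resolution(coarse_pixels, low=224, high=448, var_threshold=100.0):
    """Choose an input resolution from a coarse grayscale view of the image.

    Intensity variance is used here as a stand-in for "content complexity";
    the abstract does not say which complexity measure DRIS uses.
    """
    mean = sum(coarse_pixels) / len(coarse_pixels)
    variance = sum((p - mean) ** 2 for p in coarse_pixels) / len(coarse_pixels)
    return high if variance > var_threshold else low

smooth_scene = [10] * 8        # e.g. open water: almost no detail
busy_scene = [0, 255] * 4      # e.g. dense urban texture

res_smooth = pick_resolution(smooth_scene)
res_busy = pick_resolution(busy_scene)
```

Under these assumptions the smooth scene stays at the cheaper resolution and only the busy scene is processed at the higher one.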
RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models
Authors: Fan Wei, Runmin Dong, Yushan Lai, Yixiang Yang, Zhaoyang Luo, Jinxiao Zhang, Miao Yang, Shuai Yuan, Jiyao Zhao, Bin Luo, Haohuan Fu
First: 2025-12-29T06:44:06+00:00 · Latest: 2025-12-29T06:44:06+00:00
Abstract
Diffusion-based remote sensing (RS) generative foundation models are crucial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generation modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly selects a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness. Experiments show that, even after pruning 85% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.
中文标题/摘要
标题:RS-Prune:面向高效遥感扩散基础模型的高比例免训练数据剪枝
基于扩散的遥感(RS)生成基础模型对于下游任务至关重要。然而,这些模型依赖大量具有全球代表性的数据,这些数据通常包含冗余、噪声和类别不平衡,降低了训练效率并阻碍了收敛。现有的遥感扩散基础模型通常聚合多个分类数据集或应用简单的去重方法,忽视了生成建模的分布要求以及遥感图像的异质性。为了解决这些限制,我们提出了一种免训练的两阶段数据剪枝方法,该方法能够在高剪枝比例下快速选择高质量子集,使初步基础模型能够快速收敛,并作为生成、下游微调和其他应用的多功能骨干。该方法同时考虑了局部信息内容与全局场景级的多样性和代表性。首先,基于熵的准则高效移除低信息量样本。然后,利用遥感场景分类数据集作为参考基准,我们进行场景感知聚类并采用分层抽样,以提高聚类效果并减少大规模未标记数据上的计算成本。最后,通过平衡聚类级均匀性和样本代表性,该方法能够在高剪枝比例下实现细粒度选择,同时保持整体多样性和代表性。实验表明,即使剪枝了85%的训练数据,我们的方法也能显著提高收敛性和生成质量。此外,使用我们方法训练的扩散基础模型在包括超分辨率和语义图像合成在内的下游任务中始终实现最先进的性能。该数据剪枝范式为开发遥感生成基础模型提供了实用指导。
Summary / 总结
The paper proposes RS-Prune, a training-free, two-stage data pruning method for diffusion-based remote sensing (RS) generative foundation models. It addresses redundancy, noise, and class imbalance in large datasets by first removing low-information samples with an entropy-based criterion and then applying scene-aware clustering with stratified sampling. Experiments show that, even after pruning 85% of the training data, the method significantly improves convergence and generation quality, and models trained on the pruned subset achieve state-of-the-art performance in downstream tasks such as super-resolution and semantic image synthesis.
该论文提出了一种名为RS-Prune的免训练数据剪枝方法,用于基于扩散的遥感(RS)生成基础模型。该方法通过使用基于熵的准则和场景感知聚类来解决大数据集中的冗余和类别不平衡问题。实验表明,即使剪枝掉85%的训练数据,该方法仍能提高收敛性和生成质量,并且使用RS-Prune训练的模型在超分辨率和语义图像合成等下游任务中达到最先进的性能。
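A minimal sketch of what an entropy-based filter for low-information samples could look like. It is an illustration only, not the RS-Prune implementation: images are assumed to be flat lists of grayscale intensities, and the `min_bits` threshold is a hypothetical parameter.

```python
import math
from collections import Counter

def shannon_entropy(pixels):
    """Shannon entropy (in bits) of the intensity histogram of one image."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def drop_low_information(images, min_bits):
    """Keep only images whose histogram entropy reaches min_bits (assumed threshold)."""
    return [img for img in images if shannon_entropy(img) >= min_bits]

flat = [0] * 16              # a single intensity everywhere: 0 bits
textured = list(range(16))   # 16 equally frequent intensities: 4 bits
kept = drop_low_information([flat, textured], min_bits=1.0)
```

Here the constant patch is discarded and the textured one is kept. In the paper, a filter of this kind is only the first stage and is followed by scene-aware clustering.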
GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
Authors: Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong Huang, Hesheng Wang
First: 2025-12-29T03:40:05+00:00 · Latest: 2025-12-29T03:40:05+00:00
Abstract
Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into the LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub: https://github.com/dtc111111/GaussianDWM.
中文标题/摘要
标题:GaussianDWM:基于3D高斯场景表示的统一场景理解和多模态生成世界模型
生成模型的发展推动了驾驶世界模型(DWMs)的快速发展。然而,现有的DWMs缺乏3D场景理解能力,只能根据输入数据生成内容,而无法解释或推理驾驶环境。此外,当前方法使用点云或BEV特征表示3D空间信息,无法准确对齐文本信息与底层3D场景。为解决这些限制,我们提出了一种基于3D高斯场景表示的新型统一DWM框架,该框架能够同时实现3D场景理解和多模态场景生成,并且能够为理解和生成任务提供上下文增强。我们的方法通过将丰富的语言特征嵌入到每个高斯原语中,直接将文本信息与3D场景对齐,从而实现早期模态对齐。此外,我们设计了一种新的任务感知语言引导采样策略,该策略移除了冗余的3D高斯,并将准确且紧凑的3D标记注入到LLM中。此外,我们设计了一种双条件多模态生成模型,其中由我们的视觉语言模型捕获的信息作为高级语言条件与低级图像条件相结合,共同指导多模态生成过程。我们在nuScenes和NuInteract数据集上进行了全面研究,以验证我们框架的有效性。我们的方法达到了最先进的性能。我们将在GitHub上公开发布代码:https://github.com/dtc111111/GaussianDWM。
Summary / 总结
The paper proposes GaussianDWM, a unified 3D Gaussian Driving World Model that integrates 3D scene understanding and multi-modal generation. It uses 3D Gaussian primitives to align textual information with the scene, and introduces a task-aware language-guided sampling strategy to enhance the model's accuracy. Experimental results on nuScenes and NuInteract datasets show that GaussianDWM outperforms existing methods in both 3D scene understanding and multi-modal generation tasks.
论文提出了GaussianDWM,这是一种结合了3D场景理解和多模态生成的统一3D高斯驾驶世界模型。该模型使用3D高斯原语将文本信息与场景对齐,并引入了一种任务感知的语言引导采样策略以提高模型的准确性。该模型在nuScenes和NuInteract数据集上的表现优于现有方法,展示了其在驾驶场景中理解和生成多模态内容的有效性。
How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure
Authors: Paul M. Thompson
First: 2025-12-28T23:16:22+00:00 · Latest: 2025-12-28T23:16:22+00:00
Comments: 13 pages, 2 figures
Abstract
Modern generative and vision-language models (VLMs) are increasingly used in scientific and medical decision support, where predicted probabilities must be both accurate and well calibrated. Despite strong empirical results with moderate data, it remains unclear when such predictions generalize uniformly across inputs, classes, or subpopulations, rather than only on average. This is a critical issue in biomedicine, where rare conditions and specific groups can exhibit large errors even when overall loss is low. We study this question from a finite-sample perspective and ask: under what structural assumptions can generative and VLM-based predictors achieve uniformly accurate and calibrated behavior with practical sample sizes? Rather than analyzing arbitrary parameterizations, we focus on induced families of classifiers obtained by varying prompts or semantic embeddings within a restricted representation space. When model outputs depend smoothly on a low-dimensional semantic representation (an assumption supported by spectral structure in text and joint image-text embeddings), classical uniform convergence tools yield meaningful non-asymptotic guarantees. Our main results give finite-sample uniform convergence bounds for accuracy and calibration functionals of VLM-induced classifiers under Lipschitz stability with respect to prompt embeddings. The implied sample complexity depends on intrinsic/effective dimension, not ambient embedding dimension, and we further derive spectrum-dependent bounds that make explicit how eigenvalue decay governs data requirements. We conclude with implications for data-limited biomedical modeling, including when current dataset sizes can support uniformly reliable predictions and why average calibration metrics may miss worst-case miscalibration.
中文标题/摘要
标题:多少数据才够?生成与视觉语言模型在低维结构下的统一收敛边界
现代生成和视觉语言模型(VLMs)在科学和医学决策支持中越来越广泛使用,其中预测概率必须既准确又校准良好。尽管在中等数据量下有很强的经验结果,但仍然不清楚这些预测在输入、类别或子群体之间是否均匀泛化,而不仅仅是平均泛化——这是一个在生物医学中至关重要的问题,在生物医学中,即使总体损失较低,罕见状况和特定群体也可能出现较大的误差。 我们从有限样本的角度研究了这一问题,并问:在什么结构假设下,生成和VLM基预测器可以在实际样本大小下实现均匀准确和校准的行为?我们不是分析任意参数化,而是集中在通过在受限表示空间内变化提示或语义嵌入所获得的分类器族。当模型输出依赖于低维语义表示的平滑变化——这一假设由文本和联合图像-文本嵌入的谱结构支持——经典的一致收敛工具可以提供有意义的非渐近保证。 我们的主要结果给出了VLM诱导分类器在Lipschitz稳定性相对于提示嵌入下的准确性和校准函数的有限样本一致收敛界。隐含的数据复杂性取决于内在/有效维度,而不是环境嵌入维度,我们进一步推导出依赖于谱的界,明确说明了特征值衰减如何影响数据需求。最后,我们讨论了数据有限的生物医学建模的含义,包括当前数据集大小是否可以支持均匀可靠的预测,以及为什么平均校准指标可能无法捕捉到最坏情况的校准偏差。
Summary / 总结
This paper investigates the conditions under which generative and vision-language models can provide uniformly accurate and calibrated predictions, especially in data-limited settings common in biomedical applications. By focusing on the smooth dependence of model outputs on low-dimensional semantic representations, the authors derive finite-sample uniform convergence bounds for these models. Key findings include sample complexity that depends on intrinsic rather than ambient dimensions, and spectrum-dependent bounds that highlight the role of eigenvalue decay in determining data requirements.
该研究探讨了生成和视觉-语言模型在数据有限的生物医学应用中,如何能够提供统一准确和校准的预测。通过关注模型输出对低维语义表示的平滑依赖,作者推导出了这些模型的有限样本统一收敛界。关键发现包括样本复杂性依赖于内在而非外在维度,并且谱依赖界明确展示了特征值衰减对数据需求的影响。
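A toy illustration of the abstract's closing point that average calibration can hide worst-case miscalibration. The gap used here (mean predicted probability minus observed positive rate) is a deliberately simple stand-in, not the calibration functional analyzed in the paper, and the two groups are made-up data.

```python
def calibration_gap(probs, labels):
    """|mean predicted probability - observed positive rate| for one set of cases."""
    return abs(sum(probs) / len(probs) - sum(labels) / len(labels))

def pooled_and_worst_gap(groups):
    """groups maps a name to (probs, labels); returns (pooled gap, worst per-group gap)."""
    all_probs = [p for probs, _ in groups.values() for p in probs]
    all_labels = [y for _, labels in groups.values() for y in labels]
    pooled = calibration_gap(all_probs, all_labels)
    worst = max(calibration_gap(p, y) for p, y in groups.values())
    return pooled, worst

# Two subpopulations that are miscalibrated in opposite directions.
groups = {
    "group_a": ([0.5] * 4, [1, 1, 1, 1]),  # model is under-confident
    "group_b": ([0.5] * 4, [0, 0, 0, 0]),  # model is over-confident
}
pooled, worst = pooled_and_worst_gap(groups)
```

Pooled over both groups the errors cancel, so the average looks perfectly calibrated while each group is off by 0.5. A uniform guarantee is meant to rule out exactly this situation.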
Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients
Authors: Armin Berger, Manuela Bergau, Helen Schneider, Saad Ahmad, Tom Anglim Lagones, Gianluca Brugnara, Martha Foltyn-Dumitru, Kai Schlamp, Philipp Vollmuth, Rafet Sifa
First: 2025-12-28T21:57:42+00:00 · Latest: 2025-12-28T21:57:42+00:00
Abstract
Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors high-resource models like NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than scale. We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations.
中文标题/摘要
标题:基准成功,临床失败:当强化学习优化基准而非患者
近期用于大型语言模型(LLMs)的强化学习(RL)进展在推理任务上取得了改进,但其在医疗成像领域的资源受限应用仍被严重忽视。我们引入了ChexReason,这是一种通过R1风格方法(SFT后接GRPO)训练的视觉-语言模型,仅使用了2,000个SFT样本、1,000个RL样本和一个A100 GPU。在CheXpert和NIH基准上的评估揭示了一个根本性的矛盾:GRPO恢复了分布内性能(CheXpert上23%的改进,宏F1分数=0.346),但降低了跨数据集的迁移性(NIH上19%的下降)。这与高资源模型如NV-Reason-CXR-3B的表现相似,表明问题源自RL范式而非规模。我们发现了一种泛化悖论,即SFT检查点在优化前对NIH的性能有所提升,表明教师引导的推理捕捉到了更多机构无关的特征。此外,跨模型比较显示结构化推理框架对通用视觉语言模型有益,但对医学预训练模型的增益有限。因此,精心策划的监督微调可能在需要跨多样人群稳健性的临床部署中优于激进的RL方法。
Summary / 总结
This study explores the application of Reinforcement Learning (RL) to medical imaging using a vision-language model, ChexReason, trained with limited resources (2,000 SFT samples, 1,000 RL samples, and a single A100 GPU). GRPO improves in-distribution performance on CheXpert by 23% but reduces cross-dataset transferability, with a 19% drop on NIH, highlighting a fundamental tension between benchmark success and clinical applicability. Because the same pattern appears in high-resource models, the findings suggest the issue stems from the RL paradigm rather than scale, and that curated supervised fine-tuning may be preferable for clinical deployment that requires robustness across diverse populations.
研究旨在通过一个名为ChexReason的视觉-语言模型探索在医学影像中应用强化学习的情况,该模型使用有限的数据集和单个GPU进行训练。尽管在CheXpert基准测试中取得了显著的改进,但在NIH基准测试中的表现却下降了,这表明在分布内性能和跨数据集迁移性之间存在根本性的矛盾。研究指出,这一问题可能是强化学习范式本身的问题,而不是规模不足,因此精心策划的监督微调可能更适合需要跨不同人群稳健性的临床应用。
Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models
Authors: Mingyuan Zhang, Yue Bai, Yifan Wang, Yiyang Huang, Yun Fu
First: 2025-12-28T20:41:22+00:00 · Latest: 2025-12-28T20:41:22+00:00
Abstract
Explorations in fine-tuning Vision-Language Models (VLMs), such as Low-Rank Adaptation (LoRA) from Parameter Efficient Fine-Tuning (PEFT), have made impressive progress. However, most approaches rely on explicit weight updates, overlooking the extensive representational structures already encoded in pre-trained models that remain underutilized. Recent works have demonstrated that Mask Fine-Tuning (MFT) can be a powerful and efficient post-training paradigm for language models. Instead of updating weights, MFT assigns learnable gating scores to each weight, allowing the model to reorganize its internal subnetworks for downstream task adaptation. In this paper, we rethink fine-tuning for VLMs from a structural reparameterization perspective grounded in MFT. We apply MFT to the language and projector components of VLMs with different language backbones and compare against strong PEFT baselines. Experiments show that MFT consistently surpasses LoRA variants and even full fine-tuning, achieving high performance without altering the frozen backbone. Our findings reveal that effective adaptation can emerge not only from updating weights but also from reestablishing connections among the model's existing knowledge. Code available at: https://github.com/Ming-K9/MFT-VLM
中文标题/摘要
标题:重新思考微调:解锁视觉-语言模型中的隐藏能力
参数高效微调(PEFT)中的低秩适应(LoRA)等视觉-语言模型(VLMs)的微调探索取得了显著进展。然而,大多数方法依赖于显式的权重更新,忽视了预训练模型中已编码的、尚未充分利用的广泛表示结构。最近的研究表明,掩码微调(MFT)可以成为语言模型的一种强大且高效的后训练范式。MFT 不更新权重,而是为每个权重分配可学习的门控分数,允许模型重新组织其内部子网络以适应下游任务。在本文中,我们从基于 MFT 的结构重新参数化视角重新思考 VLMs 的微调。我们将 MFT 应用于具有不同语言后端的 VLMs 的语言和投影组件,并与强大的 PEFT 基线进行比较。实验表明,MFT 一致地超越了 LoRA 变体,甚至超过了完整的微调,无需改变冻结的后端即可实现高性能。我们的研究结果表明,有效的适应不仅可以通过更新权重实现,还可以通过重新建立模型现有知识之间的连接来实现。代码可在 https://github.com/Ming-K9/MFT-VLM 获取
Summary / 总结
This paper rethinks fine-tuning for Vision-Language Models (VLMs) by applying Mask Fine-Tuning (MFT), which assigns learnable gating scores to weights instead of updating them. Experiments show that MFT outperforms Parameter Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) and even full fine-tuning, achieving high performance without altering the pre-trained backbone. This suggests that effective adaptation can be achieved by reorganizing the model's internal subnetworks rather than directly updating weights.
本文通过应用Mask Fine-Tuning (MFT)重新思考Vision-Language Models (VLMs)的微调方法,MFT通过赋予可学习的门控分数来重新组织权重而不是更新它们。实验表明,MFT在不改变冻结的主干的情况下,超越了Parameter Efficient Fine-Tuning (PEFT)方法如LoRA和完全微调,实现了高性能。研究揭示了有效的适应可以通过重新组织模型的现有知识来实现,而不仅仅是直接更新权重。
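A minimal sketch of the gating idea described above: the pre-trained weights stay frozen, and a separate score per weight decides whether that weight takes part in the forward pass. Binarizing the scores with a fixed threshold is an assumption made for this illustration; the abstract only states that MFT assigns learnable gating scores to each weight.

```python
def masked_linear(x, weights, scores, threshold=0.0):
    """Compute y = (W * M) x, where M[i][j] = 1 if scores[i][j] > threshold else 0.

    `weights` is never modified; in training, only `scores` would be updated.
    """
    out = []
    for w_row, s_row in zip(weights, scores):
        out.append(sum(w * xi for w, s, xi in zip(w_row, s_row, x) if s > threshold))
    return out

W = [[1.0, 2.0], [3.0, 4.0]]     # frozen pre-trained weights (toy values)
S = [[0.5, -0.5], [0.5, 0.5]]    # gating scores; the negative one switches a weight off
y = masked_linear([1.0, 1.0], W, S)
```

With these toy values the weight 2.0 is gated out, so the first output uses only 1.0 while the second still uses both 3.0 and 4.0. The weight matrix itself is unchanged.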
Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion
Authors: Yi Zhou, Xuechao Zou, Shun Zhang, Kai Li, Shiying Wang, Jingming Chen, Congyan Lang, Tengfei Cao, Pin Tao, Yuanchun Shi
First: 2025-12-28T18:24:19+00:00 · Latest: 2025-12-28T18:24:19+00:00
Comments: 13 pages, 5 figures, 10 tables
Abstract
Semi-supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo-label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that synergistically fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo-label drift. To effectively incorporate these distinct priors, an explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global-local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at https://xavierjiezou.github.io/Co2S/.
中文标题/摘要
标题:通过共引导和共融合实现稳定的半监督遥感分割
半监督遥感(RS)图像语义分割提供了一种缓解全面标注负担的有希望的解决方案,但根本上它面临着伪标签漂移的问题,这是一种在训练过程中由于确认偏差导致错误累积的现象。在本文中,我们提出了一种名为Co2S的稳定半监督RS分割框架,该框架能够协同融合来自视觉语言模型和自监督模型的先验知识。具体而言,我们构建了一个异构双学生架构,包含两个分别基于预训练CLIP和DINOv3初始化的ViT视觉基础模型,以减轻错误累积和伪标签漂移。为了有效结合这些不同的先验知识,我们引入了一种显式-隐式语义共引导机制,利用文本嵌入和可学习查询分别提供显式和隐式的类别级引导,从而共同增强语义一致性。此外,我们还开发了一种全局-局部特征协作融合策略,以有效地融合CLIP捕获的全局上下文信息和DINOv3生成的局部细节,使模型能够生成高度精确的分割结果。在六个流行数据集上的广泛实验表明,所提出的方法在各种分割协议和不同场景中始终表现出优越性,取得领先性能。项目页面可在https://xavierjiezou.github.io/Co2S/访问。
Summary / 总结
This paper addresses the issue of pseudo-label drift in semi-supervised remote sensing image segmentation by proposing Co2S, a framework that integrates priors from vision-language and self-supervised models. It uses a dual-student architecture with CLIP and DINOv3 to reduce error accumulation and introduces a co-guidance mechanism for semantic consistency. Additionally, a feature fusion strategy combines global and local features to enhance segmentation accuracy. Experiments on six datasets show that Co2S outperforms existing methods across various scenarios.
本文提出了一种名为Co2S的框架,通过结合视觉语言模型和自监督模型的先验知识来解决半监督遥感图像分割中的伪标签漂移问题。该框架采用CLIP和DINOv3初始化的双学生架构,减少错误累积。引入了显式-隐式语义协同引导机制和全局-局部特征融合策略,以增强语义一致性并生成高精度分割结果。在六个数据集上的实验表明,Co2S在各种场景下优于现有方法。
An Architecture-Led Hybrid Report on Body Language Detection Project
Authors: Thomson Tong, Diba Darooneh
First: 2025-12-28T18:03:00+00:00 · Latest: 2025-12-28T18:03:00+00:00
Abstract
This report provides an architecture-led analysis of two modern vision-language models (VLMs), Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository [1]. The system samples video frames, prompts a VLM to detect visible people and generate pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates output structure using a predefined schema, and optionally renders an annotated video. We first summarize the shared multimodal foundation (visual tokenization, Transformer attention, and instruction following), then describe each architecture at a level sufficient to justify engineering choices without speculative internals. Finally, we connect model behavior to system constraints: structured outputs can be syntactically valid while semantically incorrect, schema validation is structural (not geometric correctness), person identifiers are frame-local in the current prompting contract, and interactive single-frame analysis returns free-form text rather than schema-enforced JSON. These distinctions are critical for writing defensible claims, designing robust interfaces, and planning evaluation.
中文标题/摘要
标题:基于架构的混合报告:身体语言检测项目
本报告提供了对两种现代视觉-语言模型(VLMs)——Qwen2.5-VL-7B-Instruct 和 Llama-4-Scout-17B-16E-Instruct——的基于架构的分析,并解释了它们的架构特性如何映射到在 BodyLanguageDetection 仓库 [1] 中实现的视频到制品流水线。该系统采样视频帧,提示 VLM 检测可见的人并生成带有提示条件属性(默认为情绪)的像素空间边界框,使用预定义的模式验证输出结构,并可选地生成带有注释的视频。我们首先总结了共享的多模态基础(视觉标记化、Transformer 注意力和指令遵循),然后在足以证明工程选择的水平上描述每个架构,而不进行推测性的内部描述。最后,我们将模型行为与系统约束联系起来:结构化的输出可以是语法有效的,但语义上可能是错误的;模式验证是结构性的(而不是几何正确性);当前的提示合同中的人标识符是帧局部的;交互式单帧分析返回自由格式的文本而不是模式强制的 JSON。这些区别对于撰写有说服力的声明、设计稳健的界面和规划评估至关重要。
Summary / 总结
This report analyzes two modern vision-language models, Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, focusing on their architectural properties and how they are applied in a video-to-artifact pipeline for body language detection. The system processes video frames, uses VLMs to detect people and generate bounding boxes with attributes, validates the output structure, and optionally annotates the video. Key findings include the shared multimodal foundation of visual tokenization, Transformer attention, and instruction following, as well as the distinctions between syntactically valid but semantically incorrect structured outputs, schema validation, and frame-local person identifiers.
该报告分析了两个现代视觉语言模型Qwen2.5-VL-7B-Instruct和Llama-4-Scout-17B-16E-Instruct,重点在于它们的架构特性及其在用于身体语言检测的视频到数据管道中的应用。系统处理视频帧,使用VLM检测人员并生成带有属性的边界框,验证输出结构,并可选地标注视频。主要发现包括视觉标记化、Transformer注意力和指令遵循的共享多模态基础,以及结构有效但语义不正确的输出、结构验证和帧内局部人员标识符之间的区别。
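To make the "structural, not geometric" distinction concrete, the sketch below separates the two checks. The field names (`person_id`, `emotion`, `box`) and the `[x1, y1, x2, y2]` box convention are hypothetical and are not taken from the BodyLanguageDetection repository.

```python
def is_structurally_valid(det):
    """Schema-style check: required keys exist and have the expected types."""
    box = det.get("box")
    return (
        isinstance(det.get("person_id"), int)
        and isinstance(det.get("emotion"), str)
        and isinstance(box, list)
        and len(box) == 4
        and all(isinstance(v, (int, float)) for v in box)
    )

def is_geometrically_plausible(det, width, height):
    """Separate check: the box lies inside the frame and has positive area."""
    x1, y1, x2, y2 = det["box"]
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height

# Well-formed output whose box is inverted and falls outside a 640x480 frame.
det = {"person_id": 0, "emotion": "happy", "box": [500, 40, 120, 900]}
structural_ok = is_structurally_valid(det)
geometric_ok = is_geometrically_plausible(det, width=640, height=480)
```

The detection passes the structural check and fails the geometric one, which is the kind of syntactically valid but semantically incorrect output the report warns about.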
Attention Is All You Need for KV Cache in Diffusion LLMs
Authors: Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen
First: 2025-10-16T17:59:48+00:00 · Latest: 2025-12-28T17:27:09+00:00
Comments: Code at: https://github.com/VILA-Lab/Elastic-Cache
Abstract
This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant MASK tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose Elastic-Cache, a training-free, architecture-agnostic strategy that jointly decides when to refresh (via an attention-aware drift test on the most-attended token) and where to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: 8.7× on GSM8K (256 tokens), and 45.1× on longer sequences, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput (6.8× on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
中文标题/摘要
标题:注意力即是你在扩散大语言模型中所需的一切KV缓存
这项工作研究了如何适应性地重新计算扩散大语言模型(DLMs)的键值(KV)缓存,以最大化预测准确性并最小化解码延迟。先前方法的解码器在每个去噪步骤和层中都重新计算QKV,尽管大多数步骤中KV状态变化不大,尤其是在浅层,导致大量冗余。我们做出了三个观察:(1)远处的MASK标记主要作为长度偏差起作用,并且可以在活动预测窗口之外块状缓存;(2)KV动态随深度增加,表明从较深层开始的选择性刷新是足够的;(3)最关注的标记表现出最小的KV漂移,为其他标记的缓存变化提供保守的下限。基于这些观察,我们提出了Elastic-Cache,这是一种无需训练、架构无关的策略,联合决定何时(通过最关注标记的注意力感知漂移测试)和何地(通过深度感知调度,从选定的层开始重新计算,同时重用浅层缓存和窗口外的MASK缓存)刷新缓存。与固定周期方案不同,Elastic-Cache为扩散大语言模型执行适应性、分层感知的缓存更新,减少冗余计算并加速解码,同时几乎不损失生成质量。在LLaDA-Instruct、LLaDA-1.5和LLaDA-V上的数学推理和代码生成任务实验表明,Elastic-Cache在GSM8K(256个标记)上实现了8.7×的一致加速,在更长序列上实现了45.1×的一致加速,同时保持了比基线更高的准确性。我们的方法在GSM8K上实现了显著更高的吞吐量(6.8×),同时保持了生成质量,使扩散大语言模型的实用部署成为可能。
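The "when" and "where" decisions described above can be sketched as follows. This is a simplified illustration, not the released Elastic-Cache code: the use of cosine similarity as the drift measure, the `min_sim` threshold, the fixed `start_layer`, and the choice of vectors being compared are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def layers_to_refresh(attn_weights, cached, fresh, num_layers, start_layer, min_sim):
    """Return the layers whose KV cache should be recomputed at this step.

    When: probe only the most-attended token; if its state has barely drifted,
    keep every cache. Where: otherwise recompute from start_layer onward and
    keep reusing the shallower layers.
    """
    probe = max(range(len(attn_weights)), key=attn_weights.__getitem__)
    if cosine(cached[probe], fresh[probe]) >= min_sim:
        return []
    return list(range(start_layer, num_layers))

attn = [0.1, 0.7, 0.2]                           # token 1 is the most attended
cached = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
drifted = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # token 1 has changed direction

no_refresh = layers_to_refresh(attn, cached, cached, 4, 2, 0.9)
refresh = layers_to_refresh(attn, cached, drifted, 4, 2, 0.9)
```

When the probe token is unchanged nothing is recomputed. When it drifts, only layers 2 and 3 are refreshed and layers 0 and 1 continue to use their caches.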