arXiv Paper Digest

2025-11-23 03:26
Snapshot: 20251123_0326
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
Authors: Junhao Cheng, Liang Hou, Xin Tao, Jing Liao
First: 2025-11-20T18:59:44+00:00 · Latest: 2025-11-20T18:59:44+00:00
Comments: Project page: https://video-as-answer.github.io/
Abstract
While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective outputs, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.
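Title / abstract (translated from Chinese)
Title: Video-as-Answer: Predicting and Generating the Next Video Event with Joint-GRPO
Although language models have had a major impact in many practical applications, video generation is still mostly confined to entertainment. Inspired by video's distinctive ability to show physical-world information (for example, it is hard to teach someone to tie a tie with text alone), we find that the opportunity to extend video as a new answer modality for Next-Event Prediction (NEP) is underused, and we formalize it as Video-Next-Event Prediction (VNEP). Whereas the conventional NEP task takes a video with a procedural or predictive question as input and predicts the next event in text, VNEP requires a dynamic video response. This shift from telling to showing unlocks more intuitive and personalized answers for procedural learning and creative exploration. Existing models still struggle with the task, however, because it requires understanding multimodal input, instruction-conditioned reasoning, and generating video that is visually and semantically consistent. To address this, we introduce VANS, a model that uses reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. At the core of VANS is our proposed Joint-GRPO, which coordinates the VLM and VDM to operate as a single unit. A shared reward optimizes the VLM to produce captions that are both accurate and easy to visualize, while guiding the VDM to generate videos that are faithful to those captions and to the input visual context. To enable this learning, we build VANS-Data-100K, a dataset dedicated to the VNEP task. Experiments on procedural and predictive benchmarks show that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.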
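The abstract does not spell out how Joint-GRPO computes its updates, so the following is only a minimal sketch of the group-relative advantage step commonly associated with GRPO, under the assumption that several outputs are sampled per input and each receives a scalar reward. The function name, the reward values, and the `eps` constant are illustrative, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (one row per input)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    # eps avoids division by zero when every sample in a group scores the same.
    return (rewards - mean) / (std + eps)

# Hypothetical shared rewards: 2 inputs, 4 sampled outputs each.
rewards = [[0.2, 0.8, 0.5, 0.5],
           [1.0, 1.0, 1.0, 1.0]]
adv = group_relative_advantages(rewards)
print(adv)
```

Outputs that score above their group mean get a positive advantage and those below get a negative one; a group with identical rewards contributes no learning signal.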
Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
Authors: Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han
First: 2025-11-20T18:59:25+00:00 · Latest: 2025-11-20T18:59:25+00:00
Abstract
The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
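Title / abstract (translated from Chinese)
Title: Taming the Long Tail: Efficient Reasoning RL Training with an Adaptive Drafter
The emergence of large language models (LLMs) with strong reasoning capabilities marks a significant milestone and opens new territory in complex problem-solving. However, training these reasoning models with reinforcement learning (RL) runs into a critical efficiency bottleneck: response generation during RL training follows a persistent long-tail distribution in which a few very long responses dominate execution time, wasting resources and raising costs. To address this, we propose TLT, a system that losslessly accelerates reasoning RL training by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging because the workload is dynamic, the target model keeps evolving, and the draft model incurs training overhead. TLT overcomes these obstacles with two cooperating components: (1) the Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation so that it stays aligned with the target model at no extra cost; and (2) the Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects a suitable SD strategy for each input batch. Evaluations show that, compared with state-of-the-art systems, TLT achieves more than 1.7x end-to-end RL training speedup, preserves model accuracy, and produces a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.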
Summary
The paper addresses the efficiency bottleneck in training reasoning models using RL, where long responses dominate execution time. It introduces TLT, which uses adaptive speculative decoding with an Adaptive Drafter to maintain alignment with the target model and an Adaptive Rollout Engine to select suitable strategies. TLT achieves over 1.7x speedup in end-to-end RL training while preserving model accuracy and providing a high-quality draft model for efficient deployment.
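The paper addresses the efficiency bottleneck that long responses cause when reasoning models are trained with RL. It proposes TLT, which mitigates the problem with adaptive speculative decoding and an Adaptive Rollout Engine. TLT achieves more than 1.7x speedup in RL training, preserves model accuracy, and provides a high-quality draft model for efficient deployment.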
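To illustrate why speculative decoding can be lossless, here is a minimal greedy-decoding sketch. It is not TLT's implementation: `target_next` and `draft_next` are hypothetical toy stand-ins for the target and draft models, and a real system would verify all drafted positions in one batched forward pass rather than one call per token.

```python
def target_next(ctx):
    # Hypothetical "target model": a deterministic toy rule over the context.
    return (sum(ctx) * 31 + len(ctx)) % 7

def draft_next(ctx):
    # Hypothetical cheaper "draft model": agrees with the target except
    # when the context length is a multiple of 4.
    tok = target_next(ctx)
    return (tok + 1) % 7 if len(ctx) % 4 == 0 else tok

def greedy_decode(prompt, n_new):
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(prompt, n_new, k=3):
    out = list(prompt)
    limit = len(prompt) + n_new
    while len(out) < limit:
        # 1) The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) The target model checks each proposed position in order.
        for i, tok in enumerate(draft):
            expected = target_next(out + draft[:i])
            if tok != expected:
                out.extend(draft[:i])   # keep the accepted prefix
                out.append(expected)    # replace the first mismatch
                break
        else:
            out.extend(draft)             # every drafted token was accepted
            out.append(target_next(out))  # plus one extra token from the target
    return out[:limit]

prompt = [1, 2, 3]
baseline = greedy_decode(prompt, 12)
accelerated = speculative_decode(prompt, 12, k=3)
print(baseline == accelerated)
```

Every token that ends up in the output is one the target model would have chosen for that prefix, so the draft model affects only speed, never the result.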
Cognitive Foundations for Reasoning and Their Manifestation in LLMs
Authors: Priyanka Kargupta, Shuyue Stella Li, Haocheng Wang, Jinu Lee, Shan Chen, Orevaoghene Ahia, Dean Light, Thomas L. Griffiths, Max Kleiman-Weiner, Jiawei Han, Asli Celikyilmaz, Yulia Tsvetkov
First: 2025-11-20T18:59:00+00:00 · Latest: 2025-11-20T18:59:00+00:00
Comments: 40 pages, 4 tables, 6 figures
Abstract
Large language models solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. We synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning computational constraints, meta-cognitive controls, knowledge representations, and transformation operations, then analyze their behavioral manifestations in reasoning traces. We propose a fine-grained cognitive evaluation framework and conduct the first large-scale analysis of 170K traces from 17 models across text, vision, and audio modalities, alongside 54 human think-aloud traces, which we make publicly available. Our analysis reveals systematic structural differences: humans employ hierarchical nesting and meta-cognitive monitoring while models rely on shallow forward chaining, with divergence most pronounced on ill-structured problems. Meta-analysis of 1,598 LLM reasoning papers reveals the research community concentrates on easily quantifiable behaviors (sequential organization: 55%, decomposition: 60%) while neglecting meta-cognitive controls (self-awareness: 16%, evaluation: 8%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 60% on complex problems. By bridging cognitive science and LLM research, we establish a foundation for developing models that reason through principled cognitive mechanisms rather than brittle spurious reasoning shortcuts or memorization, opening new directions for both improving model capabilities and testing theories of human cognition at scale.
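Title / abstract (translated from Chinese)
Title: Cognitive Foundations for Reasoning and Their Manifestation in Large Language Models
Large language models can solve complex problems yet fail on simpler variants, which suggests that they reach correct outputs through mechanisms fundamentally different from human reasoning. We synthesize cognitive science research into a taxonomy of 28 cognitive elements covering computational constraints, meta-cognitive controls, knowledge representations, and transformation operations, and then analyze how these elements manifest behaviorally in reasoning traces. We propose a fine-grained cognitive evaluation framework and carry out the first large-scale analysis of 170K reasoning traces from 17 models across text, vision, and audio modalities, together with 54 human think-aloud traces; the data are publicly released. Our analysis reveals systematic structural differences: humans use hierarchical nesting and meta-cognitive monitoring, whereas models rely on shallow forward chaining, and the gap is most pronounced on ill-structured problems. A meta-analysis of 1,598 LLM reasoning papers shows that the research community concentrates on easily quantifiable behaviors (sequential organization: 55%, decomposition: 60%) while neglecting the meta-cognitive controls that correlate with success (self-awareness: 16%, evaluation: 8%). Models possess behavioral repertoires associated with success but do not apply them spontaneously. Building on these patterns, we develop reasoning guidance that automatically scaffolds successful structures at test time, improving performance on complex problems by up to 60%. By combining cognitive science with LLM research, we lay a foundation for developing models that reason through principled cognitive mechanisms rather than brittle spurious shortcuts or memorization, and we open new directions for improving model capabilities and for testing theories of human cognition at scale.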
Summary
This study investigates the cognitive foundations of reasoning in large language models (LLMs) by synthesizing cognitive science research into a taxonomy of 28 cognitive elements and analyzing their behavioral manifestations in 170K reasoning traces from 17 models across various modalities. The research reveals that humans use hierarchical nesting and meta-cognitive monitoring, while models rely on shallow forward chaining, especially on ill-structured problems. The study also highlights the research community's focus on easily quantifiable behaviors and neglect of meta-cognitive controls, and proposes a test-time reasoning guidance to improve model performance by up to 60% on complex problems.
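By synthesizing findings from cognitive science, the study builds a taxonomy of 28 cognitive elements and analyzes their behavioral manifestations in 170K reasoning traces from 17 models across modalities. It finds that models rely on shallow forward chaining while humans use hierarchical nesting and meta-cognitive monitoring, a gap that is especially pronounced on ill-structured problems. The study also finds that the research community attends more to easily quantifiable behaviors while neglecting the meta-cognitive controls associated with success, and it develops reasoning guidance that improves model performance on complex problems by up to 60%.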
Evolution Strategies at the Hyperscale
Authors: Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Dylan Cope, Jarek Liesen, Lukas Seier, Theo Wolf, Uljad Berdica, Alexander David Goldie, Aaron Courville, Karin Sevegnani, Shimon Whiteson, Jakob Nicolaus Foerster
First: 2025-11-20T18:56:05+00:00 · Latest: 2025-11-20T18:56:05+00:00
Comments: 48 pages, 12 figures, Website at https://eshyperscale.github.io/
Abstract
We introduce Evolution Guided General Optimization via Low-rank Learning (EGGROLL), an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes for modern large neural network architectures with billions of parameters. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives with excellent scaling potential through parallelisation. Naïve ES becomes prohibitively expensive at scale due to the computational and memory costs associated with generating matrix perturbations $E\in\mathbb{R}^{m\times n}$ and the batched matrix multiplications needed to compute per-member forward passes. EGGROLL overcomes these bottlenecks by generating random matrices $A\in \mathbb{R}^{m\times r},\ B\in \mathbb{R}^{n\times r}$ with $r\ll \min(m,n)$ to form a low-rank matrix perturbation $A B^\top$ that is used in place of the full-rank perturbation $E$. As the overall update is an average across a population of $N$ workers, this still results in a high-rank update but with significant memory and computation savings, reducing the auxiliary storage from $mn$ to $r(m+n)$ per layer and the cost of a forward pass from $\mathcal{O}(mn)$ to $\mathcal{O}(r(m+n))$ when compared to full-rank ES. A theoretical analysis reveals our low-rank update converges to the full-rank update at a fast $\mathcal{O}\left(\frac{1}{r}\right)$ rate. Our experiments show that (1) EGGROLL does not compromise the performance of ES in tabula-rasa RL settings, despite being faster, (2) it is competitive with GRPO as a technique for improving LLM reasoning, and (3) EGGROLL enables stable pre-training of nonlinear recurrent language models that operate purely in integer datatypes.
Summary
EGGROLL is an evolution strategies algorithm designed for large-scale neural network optimization. It addresses the computational and memory costs of traditional ES by using low-rank matrix perturbations, reducing the auxiliary storage and forward pass costs. Experiments show that EGGROLL maintains the performance of ES in reinforcement learning and is competitive with GRPO for improving language model reasoning, while also enabling stable pre-training of nonlinear recurrent language models in integer datatypes.
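EGGROLL is an evolution strategies algorithm that uses low-rank matrix perturbations to optimize large neural networks with billions of parameters, reducing computational and memory costs. It generates random matrices $A$ and $B$ to form a low-rank perturbation $AB^\top$ that replaces the full-rank perturbation $E$. Experiments show that EGGROLL preserves the performance of ES on reinforcement learning tasks, is competitive with GRPO for improving language model reasoning, and enables stable pre-training of nonlinear recurrent language models in integer datatypes.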
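The low-rank identity behind the storage and cost figures in the abstract can be checked numerically. The sketch below is not the EGGROLL code: the sizes, the scale `sigma`, and the scalar `fitness_i` weights are assumptions chosen for illustration. It shows that a linear layer perturbed by $AB^\top$ can be evaluated without ever forming the $m\times n$ perturbation, and that averaging several rank-$r$ perturbations gives an update of rank higher than $r$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, batch = 64, 48, 4, 8
sigma = 0.1  # hypothetical perturbation scale

W = rng.standard_normal((m, n))   # layer weights
x = rng.standard_normal((batch, n))
A = rng.standard_normal((m, r))
B = rng.standard_normal((n, r))

# Reference path: materialize the perturbed weights (m*n extra numbers).
y_dense = x @ (W + sigma * A @ B.T).T

# Low-rank path: (W + sigma*A B^T)^T = W^T + sigma*B A^T, so the perturbation
# term is two thin matmuls costing O(r(m+n)) per input row.
y_lowrank = x @ W.T + sigma * (x @ B) @ A.T

full_storage = m * n            # entries in a full-rank perturbation E
lowrank_storage = r * (m + n)   # entries in A and B together

# Averaging N independent rank-r perturbations (weighted by hypothetical
# fitness scores) yields a matrix whose rank can reach min(N*r, m, n).
N = 8
update = np.zeros((m, n))
for _ in range(N):
    A_i = rng.standard_normal((m, r))
    B_i = rng.standard_normal((n, r))
    fitness_i = rng.standard_normal()
    update += fitness_i * (A_i @ B_i.T) / N
update_rank = np.linalg.matrix_rank(update)

print(np.allclose(y_dense, y_lowrank), full_storage, lowrank_storage, update_rank)
```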
Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding
Authors: Muxin Pu, Mei Kuan Lim, Chun Yong Chong, Chen Change Loy
First: 2025-09-25T14:28:34+00:00 · Latest: 2025-11-20T18:44:59+00:00
Abstract
Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation. Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.
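Title / abstract (translated from Chinese)
Title: Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding
Pre-training has proven highly effective for learning transferable features in sign language understanding (SLU) tasks. Skeleton-based methods have recently drawn growing attention because they handle variation in subjects and backgrounds robustly and are unaffected by appearance or environmental factors. Current SLU methods still face three key limitations: 1) weak semantic grounding, since models usually capture low-level motion patterns from skeletal data but struggle to connect them to linguistic meaning; 2) an imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them in favor of broader context; and 3) inefficient cross-modal learning, because building semantically aligned representations across modalities remains difficult. To address these problems, we propose Sigma, a unified skeleton-based SLU framework comprising: 1) a sign-aware early fusion mechanism that promotes deep interaction between the visual and textual modalities and enriches visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximizes agreement between paired features from different modalities at multiple levels, capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching, and language modeling to promote semantic consistency and generalization. Sigma sets new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation across multiple benchmarks, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.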
Summary
Sigma addresses limitations in sign language understanding by proposing a unified framework that includes a sign-aware early fusion mechanism, hierarchical alignment learning, and a unified pre-training framework combining contrastive learning, text matching, and language modeling. This results in new state-of-the-art performance on various SLU tasks, highlighting the importance of semantically informative pre-training and the effectiveness of skeletal data alone.
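Sigma proposes a unified framework comprising a sign-aware early fusion mechanism, hierarchical alignment learning, and a unified pre-training framework that combines contrastive learning, text matching, and language modeling. It reaches new state-of-the-art performance on several SLU tasks, underscoring the importance of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution.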
Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization
Authors: Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Yingji Zhang, Shuai Zhang, Zeyuan Ding, Jiayu Hu, Haozhe Shan, Junbo Qi, Yan Bai, Dengjie Li, Jiachen Luo, Yidong Wang, Yong Dai, Zenglin Xu, Bin Shen, Qifan Wang, Jian Tang, Xiaozhu Ju
First: 2025-11-20T17:58:04+00:00 · Latest: 2025-11-20T17:58:04+00:00
Abstract
Developing a universal and versatile embodied intelligence system presents two primary challenges: the critical embodied data bottleneck, where real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which are resource-prohibitive. To address these limitations, we introduce Deliberate Practice Policy Optimization (DPPO), a metacognitive "Metaloop" training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This enables automatic weakness identification and targeted resource allocation, specifically designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalised as a unified preference-learning framework. Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6%. We are open-sourcing both the models and code, providing the first systematic framework that alleviates the data and resource bottleneck and enables the community to build versatile embodied agents efficiently.
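Title / abstract (translated from Chinese)
Title: Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization
Developing a general-purpose, versatile embodied intelligence system faces two major challenges: the embodied data bottleneck, in which real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which consume enormous resources. To address these limitations, we propose Deliberate Practice Policy Optimization (DPPO), a metacognitive "Metaloop" training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This makes automatic weakness identification and targeted resource allocation possible and is designed specifically to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalized as a unified preference-learning framework. Empirically, Pelican-VL 1.0, a vision-language embodied model trained with DPPO, improves on the base model by 20.3% and surpasses open-source models at the 100B-parameter scale by 10.6%. We open-source both the models and the code, providing the first systematic framework that eases the data and resource bottleneck and lets the community build versatile embodied agents efficiently.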
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Authors: Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin
First: 2025-11-20T17:48:21+00:00 · Latest: 2025-11-20T17:48:21+00:00
Comments: Project page: https://xuboshen.github.io/TimeViper
Abstract
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
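Title / abstract (translated from Chinese)
Title: TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
We introduce TimeViper, a hybrid vision-language model designed to address the challenges of long video understanding. Processing long videos requires both an efficient model architecture and an effective mechanism for handling extended temporal context. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design we reveal a vision-to-text information aggregation phenomenon, in which information flows progressively from vision tokens to text tokens as LLM depth increases, leaving the vision tokens severely redundant. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding. This design lets TimeViper process hour-long videos of more than 10,000 frames. Extensive experiments on multiple benchmarks show that TimeViper is competitive with state-of-the-art models while extending the number of frames. We further analyze the attention behavior of both Mamba and Transformer layers, offering new insight into the interpretability of hybrid models. This work is an initial step toward developing, interpreting, and compressing hybrid Mamba-Transformer architectures.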
Summary
TimeViper is a hybrid Mamba-Transformer model designed for efficient long video understanding. It combines the efficiency of state-space models with the expressivity of attention mechanisms. The model reveals a vision-to-text information aggregation phenomenon and introduces TransV, a token information transfer module that compresses vision tokens into instruction tokens while preserving multimodal understanding. TimeViper can process hour-long videos with over 10,000 frames and is competitive with state-of-the-art models on multiple benchmarks. The work also provides insights into the attention behaviors of Mamba and Transformer layers.
TimeViper is a hybrid Mamba-Transformer model that combines the efficiency of state-space models with the expressivity of attention mechanisms and targets efficient long video understanding. It reveals a progressive aggregation of visual information into text information and proposes the TransV module, which compresses visual information into instruction information while maintaining multimodal understanding. Experiments show that TimeViper can process hour-long videos and is competitive with state-of-the-art models. The work also analyzes the attention behavior of Mamba and Transformer layers, offering new insight into the interpretability of hybrid models.
Formal Abductive Latent Explanations for Prototype-Based Networks
Authors: Jules Soria, Zakaria Chihani, Julien Girard-Satabin, Alban Grastien, Romain Xu-Darme, Daniela Cancila
Venue: AAAI
First: 2025-11-20T17:42:41+00:00 · Latest: 2025-11-20T17:42:41+00:00
Comments: Accepted at AAAI-26
Abstract
Case-based reasoning networks are machine-learning models that make predictions based on similarity between the input and prototypical parts of training samples, called prototypes. Such models are able to explain each decision by pointing to the prototypes that contributed the most to the final outcome. As the explanation is a core part of the prediction, they are often qualified as "interpretable by design". While promising, we show that such explanations are sometimes misleading, which hampers their usefulness in safety-critical contexts. In particular, several instances may lead to different predictions and yet have the same explanation. Drawing inspiration from the field of formal eXplainable AI (FXAI), we propose Abductive Latent Explanations (ALEs), a formalism to express sufficient conditions on the intermediate (latent) representation of the instance that imply the prediction. Our approach combines the inherent interpretability of case-based reasoning models and the guarantees provided by formal XAI. We propose a solver-free and scalable algorithm for generating ALEs based on three distinct paradigms, compare them, and present the feasibility of our approach on diverse datasets for both standard and fine-grained image classification. The associated code can be found at https://github.com/julsoria/ale
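Title / abstract (translated from Chinese)
Title: Formal Abductive Latent Explanations for Prototype-Based Networks
Case-based reasoning networks are machine-learning models that make predictions from the similarity between the input and prototypical parts of training samples. Such models can explain each decision by pointing to the prototypes that contributed most to the final outcome. Because the explanation is a core part of the prediction, they are often regarded as "interpretable by design". Although they are promising, we find that such explanations are sometimes misleading, which limits their usefulness in safety-critical settings. In particular, several instances may lead to different predictions yet share the same explanation. Drawing on the field of formal eXplainable AI (FXAI), we propose Abductive Latent Explanations (ALEs), a formalism for expressing sufficient conditions on the intermediate (latent) representation of an instance that imply the prediction. Our approach combines the inherent interpretability of case-based reasoning models with the guarantees provided by formal XAI. We propose a solver-free, scalable algorithm for generating ALEs based on three distinct paradigms, compare them, and demonstrate the feasibility of the approach on a range of datasets for both standard and fine-grained image classification. The associated code can be found at https://github.com/julsoria/ale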
Summary
The paper addresses the issue of misleading explanations in case-based reasoning networks, which are often considered interpretable by design. It proposes Abductive Latent Explanations (ALEs) as a formalism to provide more accurate and reliable explanations by focusing on the intermediate latent representation of the input. The method is scalable and solver-free, and it is evaluated on various datasets for image classification, showing the feasibility of the approach in both standard and fine-grained settings.
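The paper targets the problem of potentially misleading explanations in case-based reasoning networks, which are usually regarded as interpretable by design. It proposes Abductive Latent Explanations (ALEs), a formalism that provides more accurate and reliable explanations by focusing on the intermediate latent representation of the input. The method is scalable and solver-free, and it is evaluated on image classification tasks over a range of datasets, showing feasibility in both standard and fine-grained classification.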
On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation
Authors: Liyao Tang, Zhe Chen, Dacheng Tao
Venue: NeurIPS 2025
First: 2025-05-28T15:08:36+00:00 · Latest: 2025-11-20T17:35:54+00:00
Comments: NeurIPS 2025; available at https://github.com/LiyaoTang/GEM
Abstract
The emergence of large-scale pre-trained point cloud models has significantly advanced 3D scene understanding, but adapting these models to specific downstream tasks typically demands full fine-tuning, incurring high computational and storage costs. Parameter-efficient fine-tuning (PEFT) techniques, successful in natural language processing and 2D vision tasks, would underperform when naively applied to 3D point cloud models due to significant geometric and spatial distribution shifts. Existing PEFT methods commonly treat points as orderless tokens, neglecting important local spatial structures and global geometric contexts in 3D modeling. To bridge this gap, we introduce the Geometric Encoding Mixer (GEM), a novel geometry-aware PEFT module specifically designed for 3D point cloud transformers. GEM explicitly integrates fine-grained local positional encodings with a lightweight latent attention mechanism to capture comprehensive global context, thereby effectively addressing the spatial and geometric distribution mismatch. Extensive experiments demonstrate that GEM achieves performance comparable to or sometimes even exceeding full fine-tuning, while only updating 1.6% of the model's parameters, fewer than other PEFT methods. With significantly reduced training time and memory requirements, our approach thus sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models. Code is available at https://github.com/LiyaoTang/GEM.
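Title / abstract (translated from Chinese)
Title: On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation
The emergence of large-scale pre-trained point cloud models has significantly advanced 3D scene understanding, but adapting these models to specific downstream tasks usually requires full fine-tuning, which incurs high computational and storage costs. Parameter-efficient fine-tuning (PEFT) techniques have succeeded in natural language processing and 2D vision tasks, yet they underperform when applied directly to 3D point cloud models because of significant geometric and spatial distribution shifts. Existing PEFT methods typically treat points as orderless tokens and ignore the important local spatial structures and global geometric context in 3D modeling. To close this gap, we propose the Geometric Encoding Mixer (GEM), a novel geometry-aware PEFT module designed specifically for 3D point cloud transformers. GEM explicitly combines fine-grained local positional encodings with a lightweight latent attention mechanism to capture comprehensive global context, thereby addressing the spatial and geometric distribution mismatch. Extensive experiments show that GEM matches and sometimes exceeds full fine-tuning while updating only 1.6% of the model's parameters, fewer than other PEFT methods. With significantly lower training time and memory requirements, our approach sets a new benchmark for efficient, scalable, geometry-aware fine-tuning of large-scale 3D point cloud models. Code is available at https://github.com/LiyaoTang/GEM.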
Summary
This paper addresses the challenge of efficiently fine-tuning large-scale pre-trained point cloud models for 3D scene segmentation. It proposes the Geometric Encoding Mixer (GEM), a novel geometry-aware parameter-efficient fine-tuning module. GEM integrates local positional encodings with a lightweight attention mechanism to capture comprehensive global context, thereby reducing the need for full fine-tuning. Experiments show that GEM achieves performance comparable to full fine-tuning while updating only 1.6% of the model parameters, significantly reducing training time and memory requirements.
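By introducing the Geometric Encoding Mixer (GEM), a novel geometry-aware parameter-efficient fine-tuning module, the paper tackles efficient fine-tuning of large-scale pre-trained point cloud models for 3D scene segmentation. GEM combines local positional encodings with a lightweight attention mechanism to capture comprehensive global context, reducing the need for full fine-tuning. Experiments show that GEM performs on par with full fine-tuning while updating 1.6% of the model parameters, significantly reducing training time and memory requirements.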
vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs
Authors: Minye Shao, Sihan Guo, Xinrun Li, Xingyu Miao, Haoran Duan, Yang Long
Venue: AAAI 2026 Oral Presentation
First: 2025-11-12T18:38:33+00:00 · Latest: 2025-11-20T17:09:57+00:00
Comments: Accepted as an Oral Presentation at AAAI 2026 Main Technical Track (this version is not peer-reviewed; it is the extended version)
Abstract
Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work aims to continuously expand to encompass more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.
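Title / abstract (translated from Chinese)
Title: vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs
Recent advances in context optimization (CoOp) guided by medical semantic priors distilled from large language models (LLMs) offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this setting is challenged by semantic misalignment between LLMs and CLIP variants, which stems from differing training corpora and model architectures; it also scales poorly across continuously evolving families of foundation models. More seriously, pairwise multimodal alignment through conventional Euclidean-space optimization cannot model unified representations or apply localized geometric constraints, which tends to widen modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared hyperspherical manifold and aligns the semantic biases between arbitrary LLMs and CLIP backbones through Unified Semantic Anchors, achieving robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp shows consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work aims to keep expanding to cover more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.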
Summary
vMFCoOp proposes a framework that uses von Mises-Fisher distributions on a shared Hyperspherical Manifold to align semantic biases between LLMs and CLIP backbones, achieving robust biomedical prompting and superior few-shot classification across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming existing methods in accuracy and generalization. This work aims to continuously expand to more downstream applications.
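vMFCoOp proposes a framework that uses von Mises-Fisher distributions on a shared hyperspherical manifold to align semantic biases between LLMs and CLIP backbones. It achieves robust biomedical prompting and superior few-shot classification across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, surpassing existing methods in accuracy and generalization. The work aims to keep expanding to more downstream applications.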
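For readers unfamiliar with von Mises-Fisher distributions, the sketch below fits a vMF mean direction and concentration to unit-normalized embeddings. It is a standard forward estimate using a widely used closed-form approximation for the concentration, not the inverse estimation procedure of vMFCoOp, which the abstract does not describe. The function name `fit_vmf` and the synthetic embeddings are assumptions.

```python
import numpy as np

def fit_vmf(X):
    """Estimate the vMF mean direction mu and concentration kappa from rows of X."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # project onto the unit sphere
    d = X.shape[1]
    resultant = X.sum(axis=0)
    R_bar = np.linalg.norm(resultant) / X.shape[0]    # mean resultant length in (0, 1)
    mu = resultant / np.linalg.norm(resultant)
    # Closed-form approximation to the maximum-likelihood concentration.
    kappa = R_bar * (d - R_bar**2) / (1.0 - R_bar**2)
    return mu, kappa

rng = np.random.default_rng(0)
d = 16
center = np.zeros(d)
center[0] = 1.0

# Hypothetical embeddings: one tightly clustered set, one widely spread set.
tight = center + 0.05 * rng.standard_normal((200, d))
loose = center + 0.5 * rng.standard_normal((200, d))

mu_tight, kappa_tight = fit_vmf(tight)
mu_loose, kappa_loose = fit_vmf(loose)
print(kappa_tight > kappa_loose)
```

A larger `kappa` means the embeddings concentrate more sharply around `mu` on the sphere, which is the property that makes vMF a natural model for normalized CLIP-style features.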
DiffuSyn Bench: Evaluating Vision-Language Models on Real-World Complexities with Diffusion-Generated Synthetic Benchmarks
Authors: Haokun Zhou, Yipeng Hong
First: 2024-06-06T19:50:33+00:00 · Latest: 2025-11-20T16:45:08+00:00
Abstract
This study assesses the ability of Large Vision-Language Models (LVLMs) to differentiate between AI-generated and human-generated images. It introduces a new automated benchmark construction method for this evaluation. The experiment compared common LVLMs with human participants using a mixed dataset of AI and human-created images. Results showed that LVLMs could distinguish between the image types to some extent but exhibited a rightward bias, and performed significantly worse than humans. To build on these findings, we developed an automated benchmark construction process using AI. This process involved topic retrieval, narrative script generation, error embedding, and image generation, creating a diverse set of text-image pairs with intentional errors. We validated our method by constructing two comparable benchmarks. This study highlights the strengths and weaknesses of LVLMs in real-world understanding and advances benchmark construction techniques, providing a scalable and automatic approach for AI model evaluation.
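Title / abstract (translated from Chinese)
Title: DiffuSyn Bench: Evaluating Large Vision-Language Models on Real-World Complexities with Diffusion-Generated Synthetic Benchmarks
This study assesses the ability of Large Vision-Language Models (LVLMs) to tell AI-generated images from human-generated ones. It introduces a new automated benchmark construction method for this evaluation. The experiment compared common LVLMs with human participants on a mixed dataset of AI-created and human-created images. The results show that LVLMs can distinguish the image types to some extent but exhibit a rightward bias and perform significantly worse than humans. To take this further, we developed an automated, AI-driven benchmark construction process. The process comprises topic retrieval, narrative script generation, error embedding, and image generation, and it produces a diverse set of text-image pairs containing intentional errors. We validated the method by constructing two comparable benchmarks. The study highlights the strengths and weaknesses of LVLMs in real-world understanding and advances benchmark construction techniques, providing a scalable, automated approach to AI model evaluation.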
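Summary
This study evaluates how well Large Vision-Language Models (LVLMs) can distinguish between AI-generated and human-generated images, introducing a new automated benchmark construction method. LVLMs could differentiate the image types to some extent but showed a rightward bias and performed worse than humans. The research developed an automated process involving topic retrieval, narrative script generation, error embedding, and image generation to create diverse text-image pairs with intentional errors, validating the method by constructing two benchmarks. This study highlights LVLMs' strengths and weaknesses and advances benchmark construction techniques.
The study assesses the ability of Large Vision-Language Models (LVLMs) to distinguish AI-generated from human-generated images and introduces a new automated benchmark construction method. Using a mixed dataset, the experiment compared LVLMs with human participants and found that although LVLMs can distinguish the image types, they show a rightward bias and perform far worse than humans. The study also developed an automated benchmark construction process covering topic retrieval, narrative script generation, error embedding, and image generation, which creates diverse text-image pairs containing intentional errors, and it validated the method by constructing two comparable benchmarks.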
Contrastive vision-language learning with paraphrasing and negation
Authors: Kwun Ho Ngan, Saman Sadeghi Afgeh, Joe Townsend, Artur d'Avila Garcez
First: 2025-11-20T16:41:36+00:00 · Latest: 2025-11-20T16:41:36+00:00
Abstract
Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP's performance while considerably increasing the distances to negated captions. On the CC-Neg benchmark, using an original-over-negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP's performance is generally better than that of the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks, where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.
中文标题/摘要
标题：结合改写与否定的对比视觉-语言学习
对比视觉-语言模型仍然是图像和文本检索的主导方法。对比语言-图像预训练（CLIP）通过对比方式训练两个神经网络，使其在共享的潜在空间中对齐图像和文本嵌入。最近对CLIP在否定或改写文本上的评估结果好坏参半，因为否定会通过最小的词汇变化极大地改变意义，而改写则可以创造出意图相同但非常不同的文本表达。这为提高评估结果和视觉-语言模型的对齐带来了重大挑战。为应对这一挑战，本文评估了改写和否定的结合，提出了一种同时考虑改写和否定的新CLIP对比损失函数，并将由原始、改写和否定文本描述组成的LLM生成训练三元组应用于CLIP类模型的训练。该方法称为SemCLIP，能够将改写描述拉近原始图像嵌入，同时在嵌入空间中将否定描述推得更远。实验证明，SemCLIP能够保持CLIP的性能，同时显著增加与否定描述之间的距离。在CC-Neg基准上，使用“原始优于否定”的图像检索准确率指标，SemCLIP将准确率从68.1%提高到78.1%。尽管在Sugarcrepe++基准上与CLIP相比结果好坏参半，但SemCLIP的性能通常优于使用否定描述训练的模型。这种对否定的鲁棒性扩展到了下游零样本分类任务，其中在Sugarcrepe++上预训练的SemCLIP在所有测试的下游任务中均优于CLIP。这些结果表明，SemCLIP可以实现对语义变换的显著鲁棒性。
Summary / 总结
This paper addresses the challenge of evaluating and improving the performance of vision-language models on negated and paraphrased text. It introduces SemCLIP, which accounts for both paraphrasing and negation in a new contrastive loss and uses LLM-generated triples of original, paraphrased, and negated captions to train CLIP-like models. SemCLIP improves original-over-negation image-retrieval accuracy from 68.1% to 78.1% on the CC-Neg benchmark. On Sugarcrepe++, results against CLIP are mixed, though SemCLIP generally outperforms models trained with negated captions. On downstream zero-shot classification, SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested tasks, demonstrating robustness to semantic transformations.
本文通过结合同义替换和否定来解决对比视觉-语言模型在处理否定或同义替换文本时的性能提升问题,引入了SemCLIP,该方法使用新的对比损失函数和由LLM生成的训练三元组来对齐图像和文本嵌入。SemCLIP在CC-Neg基准上的图像检索准确率从68.1%提高到78.1%,并且在下游零样本分类任务中的表现优于CLIP,尤其是在处理否定文本时表现出更好的效果。
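The idea behind the SemCLIP entry above (pull paraphrases toward the image embedding, push negated captions away) can be illustrated with a toy loss. This is a minimal sketch under stated assumptions, not the loss from the paper: the function names, the margin formulation, and the `margin` value are hypothetical and chosen only to show the pull/push structure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_triplet_loss(image, original, paraphrase, negation, margin=0.2):
    """Toy pull/push loss over one (original, paraphrase, negation) caption triple.

    Hypothetical formulation for illustration only; `margin` is an assumed
    hyperparameter, not a value taken from the paper.
    """
    # Pull term: the paraphrased caption should sit close to the image embedding.
    pull = 1.0 - cosine(image, paraphrase)
    # Push term: the original caption should beat the negated caption
    # by at least `margin` in similarity to the image.
    gap = cosine(image, original) - cosine(image, negation)
    push = max(0.0, margin - gap)
    return pull + push
```

With a negated caption orthogonal to the image embedding the push term vanishes, while a negated caption that is indistinguishable from the original is penalised by the full margin.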
Automatically Detecting Online Deceptive Patterns
Authors: Asmit Nayak, Shirley Zhang, Yash Wani, Rishabh Khandelwal, Kassem Fawaz
First: 2024-11-11T23:49:02+00:00 · Latest: 2025-11-20T16:26:59+00:00
Abstract
Deceptive patterns in digital interfaces manipulate users into making unintended decisions, exploiting cognitive biases and psychological vulnerabilities. These patterns have become ubiquitous on various digital platforms. While efforts to mitigate deceptive patterns have emerged from legal and technical perspectives, a significant gap remains in creating usable and scalable solutions. We introduce our AutoBot framework to address this gap and help web stakeholders navigate and mitigate online deceptive patterns. AutoBot accurately identifies and localizes deceptive patterns from a screenshot of a website without relying on the underlying HTML code. AutoBot employs a two-stage pipeline. First, it leverages the capabilities of specialized vision models to analyze website screenshots, identify interactive elements, and extract textual features. Next, using a large language model, AutoBot interprets the context surrounding these elements to determine the presence of deceptive patterns. We also use AutoBot to create a synthetic dataset to distill knowledge from 'teacher' LLMs to smaller language models. Through extensive evaluation, we demonstrate AutoBot's effectiveness in detecting deceptive patterns on the web, achieving an F1-score of 0.93 and underscoring its potential as an essential tool for mitigating online deceptive patterns. We implement AutoBot across three downstream applications targeting different web stakeholders: (1) a local browser extension providing users with real-time feedback, (2) a Lighthouse audit to inform developers of potential deceptive patterns on their sites, and (3) a measurement tool designed for researchers and regulators.
中文标题/摘要
标题:自动检测在线欺骗模式
数字界面中的欺骗模式操纵用户做出非预期的决策,利用认知偏差和心理脆弱性。这些模式在各种数字平台上变得无处不在。尽管从法律和技术角度已经出现了减轻欺骗模式的努力,但在创建实用和可扩展的解决方案方面仍存在显著差距。我们介绍了我们的AutoBot框架来填补这一差距,并帮助网络相关方导航和减轻在线欺骗模式。AutoBot无需依赖底层HTML代码即可从网站截图中准确识别和定位欺骗模式。AutoBot采用两阶段管道,利用专门视觉模型的能力来分析网站截图、识别交互元素并提取文本特征。接下来,使用大型语言模型,AutoBot理解这些元素周围的上下文以确定是否存在欺骗模式。我们还使用AutoBot创建了一个合成数据集,从“教师”大语言模型中提炼知识传递给较小的语言模型。通过广泛的评估,我们展示了AutoBot在检测网络上的欺骗模式方面的有效性,检测欺骗模式的F1分数达到0.93,突显了其作为减轻在线欺骗模式工具的潜力。我们将在三个针对不同网络相关方的下游应用中实现AutoBot:(1)一个本地浏览器扩展程序为用户提供实时反馈,(2)一个Lighthouse审核以告知开发人员其网站上潜在的欺骗模式,以及(3)一个测量工具设计用于研究人员和监管机构。
Summary / 总结
The research aims to address the issue of deceptive patterns in digital interfaces that manipulate users, by introducing AutoBot, a framework that automatically detects and localizes these patterns from website screenshots. AutoBot uses a two-stage pipeline involving specialized vision models and a large language model to identify and understand the context of interactive elements, achieving an F1-score of 0.93 in detecting deceptive patterns. This framework is implemented in three applications to assist users, developers, and researchers in mitigating online deception.
研究旨在解决数字界面中的欺骗模式问题,这些模式会操纵用户。AutoBot框架被引入以自动从网站截图中检测和定位这些模式。它使用两阶段管道,包括专门的视觉模型和大型语言模型来识别和理解欺骗元素的上下文。AutoBot在检测欺骗模式方面的F1分数达到0.93,使其成为缓解在线欺骗的重要工具。它被应用于三个下游应用程序:用户浏览器扩展、开发者Lighthouse审核以及研究人员和监管者的测量工具。
ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation
Authors: Carlos Boned Riera, David Romero Sanchez, Oriol Ramos Terrades
First: 2025-11-20T16:19:41+00:00 · Latest: 2025-11-20T16:19:41+00:00
Abstract
In recent years, increasingly large models have achieved outstanding performance across CV tasks. However, these models demand substantial computational resources and storage, and their growing complexity limits our understanding of how they make decisions. Most of these architectures rely on the attention mechanism within Transformer-based designs. Building upon the connection between residual neural networks and ordinary differential equations (ODEs), we introduce ODE-ViT, a Vision Transformer reformulated as an ODE system that satisfies the conditions for well-posed and stable dynamics. Experiments on CIFAR-10 and CIFAR-100 demonstrate that ODE-ViT achieves stable, interpretable, and competitive performance with up to one order of magnitude fewer parameters, surpassing prior ODE-based Transformer approaches in classification tasks. We further propose a plug-and-play teacher-student framework in which a discrete ViT guides the continuous trajectory of ODE-ViT by treating the intermediate representations of the teacher as solutions of the ODE. This strategy improves performance by more than 10% compared to training a free ODE-ViT from scratch.
中文标题/摘要
标题:ODE-ViT:从ViT作为常微分方程的泛化得到的即插即用注意力层
近年来,越来越大的模型在CV任务中取得了出色的表现。然而,这些模型需要大量的计算资源和存储空间,并且其日益复杂的结构限制了我们对其决策过程的理解。大多数这些架构依赖于基于Transformer的设计中的注意力机制。基于残差神经网络与常微分方程(ODEs)之间的联系,我们提出了ODE-ViT,这是一种作为ODE系统的视觉Transformer,满足了良好定义和稳定动力学的条件。在CIFAR-10和CIFAR-100上的实验表明,ODE-ViT在参数量减少一个数量级的情况下实现了稳定、可解释且竞争力的表现,超越了先前的基于ODE的Transformer方法。我们进一步提出了一种即插即用的教师-学生框架,在该框架中,离散的ViT通过将教师的中间表示视为ODE的解来引导ODE-ViT的连续轨迹。这种方法相比从头训练一个自由的ODE-ViT提高了超过10%的性能。
Summary / 总结
The research aims to address the computational challenges of large models in computer vision tasks by reformulating Vision Transformers (ViTs) as ordinary differential equations (ODEs). The method, ODE-ViT, achieves stable and interpretable performance with significantly fewer parameters than traditional ViTs, outperforming previous ODE-based Transformer approaches. Additionally, a teacher-student framework enhances ODE-ViT's performance by guiding its continuous trajectory with discrete ViT representations, improving results by over 10% compared to training ODE-ViT from scratch.
研究旨在通过将视觉变换器（ViTs）重新表述为常微分方程（ODEs），解决大型模型在计算机视觉任务中的计算和可解释性挑战。方法ODE-ViT在CIFAR-10和CIFAR-100数据集上的分类任务中实现了稳定且具有竞争力的表现，参数量最多可减少一个数量级。此外，即插即用的教师-学生框架将离散ViT的中间表示视为ODE的解，以此引导ODE-ViT的连续轨迹，进一步提升了性能，相比从零开始训练ODE-ViT，性能提高了超过10%。
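The connection between residual networks and ODEs that the ODE-ViT entry builds on can be shown in a few lines: a residual update x + f(x) is one forward-Euler step of dx/dt = f(x) with step size 1. The sketch below is only an illustration of that connection, assuming a generic vector field `f`. It does not reproduce the ODE-ViT attention layer or its stability conditions, and `euler_integrate` is a hypothetical helper name.

```python
import math


def euler_integrate(f, x0, t_end=1.0, steps=12):
    """Integrate dx/dt = f(x) from t=0 to t=t_end with forward Euler.

    Each iteration computes x <- x + h * f(x). With h = 1 this is exactly the
    residual update x + f(x) of a residual block, so a stack of `steps` blocks
    that share f can be read as a discretised ODE trajectory.
    """
    h = t_end / steps
    x = list(x0)
    for _ in range(steps):
        dx = f(x)
        x = [xi + h * di for xi, di in zip(x, dx)]
    return x


# Usage: a linear decay field dx/dt = -x, whose exact solution is x0 * exp(-t).
decay = lambda x: [-xi for xi in x]
approx = euler_integrate(decay, [1.0], t_end=1.0, steps=1000)
exact = math.exp(-1.0)
```

Increasing `steps` shrinks the step size and brings the discrete trajectory closer to the continuous solution, which is the sense in which a deeper weight-shared network approximates the ODE more finely.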
Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities
Authors: Fan Yang, Quanting Xie, Atsunori Moteki, Shoichi Masui, Shan Jiang, Kanji Uchino, Yonatan Bisk, Graham Neubig
Venue: WACV 2026
First: 2025-11-18T22:07:30+00:00 · Latest: 2025-11-20T16:11:12+00:00
Comments: accepted to WACV 2026
Abstract
Periodic human activities with implicit workflows are common in manufacturing, sports, and daily life. While short-term periodic activities -- characterized by simple structures and high-contrast patterns -- have been widely studied, long-term periodic workflows with low-contrast patterns remain largely underexplored. To bridge this gap, we introduce the first benchmark comprising 580 multimodal human activity sequences featuring long-term periodic workflows. The benchmark supports three evaluation tasks aligned with real-world applications: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. We also propose a lightweight, training-free baseline for modeling diverse periodic workflow patterns. Experiments show that: (i) our benchmark presents significant challenges to both unsupervised periodic detection methods and zero-shot approaches based on powerful large language models (LLMs); (ii) our baseline outperforms competing methods by a substantial margin in all evaluation tasks; and (iii) in real-world applications, our baseline demonstrates deployment advantages on par with traditional supervised workflow detection approaches, eliminating the need for annotation and retraining. Our project page is https://sites.google.com/view/periodicworkflow.
中文标题/摘要
标题:无监督发现人类活动中长期时空周期性工作流
在制造、体育和日常生活中,具有隐含工作流的周期性人类活动很常见。虽然短期周期性活动因其简单的结构和高对比度模式而被广泛研究,但具有低对比度模式的长期周期性工作流仍然很大程度上未被探索。为了解决这一问题,我们引入了首个包含580个多模态人类活动序列的基准,这些序列展示了长期周期性工作流。该基准支持三项与实际应用对齐的评估任务:无监督周期性工作流检测、任务完成跟踪和程序异常检测。我们还提出了一种轻量级、无需训练的基础模型,用于建模多样的周期性工作流模式。实验表明:(i) 我们的基准对无监督周期性检测方法和基于强大大型语言模型(LLMs)的零样本方法提出了重大挑战;(ii) 我们的基础模型在所有评估任务中均显著优于竞争方法;(iii) 在实际应用中,我们的基础模型在部署方面与传统的监督工作流检测方法相当,无需标注和重新训练。我们的项目页面为 https://sites.google.com/view/periodicworkflow。
Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
Authors: Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang
First: 2025-08-08T16:13:28+00:00 · Latest: 2025-11-20T15:51:07+00:00
Comments: 16 pages; Previously this version appeared as arXiv:2510.15430 which was submitted as a new work by accident
Abstract
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
中文标题/摘要
标题:在大型视觉-语言模型中学习检测未知越狱攻击
尽管进行了广泛的对齐努力,大型视觉-语言模型(LVLMs)仍然容易受到越狱攻击的影响,这带来了严重的安全风险。为了解决这一问题,现有的检测方法要么学习特定攻击的参数,这妨碍了对未见过攻击的泛化,要么依赖于经验主义的原则,这限制了准确性和效率。为了克服这些限制,我们提出了学习检测(LoD),这是一种通用框架,通过将重点从特定攻击的学习转移到特定任务的学习,准确地检测未知的越狱攻击。该框架包括一个多模态安全概念激活向量模块,用于安全导向的表示学习,以及一个安全模式自编码器模块,用于无监督攻击分类。广泛的实验表明,我们的方法在多种未知攻击上的检测AUROC始终更高,同时提高了效率。代码可在https://anonymous.4open.science/r/Learning-to-Detect-51CB获取。
Summary / 总结
The paper addresses the vulnerability of Large Vision-Language Models (LVLMs) to jailbreak attacks, proposing a new framework called Learning to Detect (LoD) to improve detection of unknown attacks. LoD focuses on task-specific learning rather than attack-specific parameters, using a Multi-modal Safety Concept Activation Vector module for representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Experiments demonstrate that LoD achieves higher detection AUROC on various unknown attacks while enhancing efficiency.
论文针对大型视觉-语言模型(LVLM)对越狱攻击的脆弱性,提出了一种新的框架Learning to Detect (LoD),以提高对未知攻击的检测能力。LoD 侧重于任务特定的学习而非特定攻击的学习,使用多模态安全概念激活向量模块进行表示学习,以及安全模式自动编码器模块进行无监督攻击分类。实验表明,LoD 在各种未知攻击上的检测 AUROC 更高,同时提高了效率。
LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs
Authors: Doriand Petit, Steve Bourgeois, Vincent Gay-Bellile, Florian Chabot, Loïc Barthe
Venue: AAAI
First: 2025-11-20T15:22:22+00:00 · Latest: 2025-11-20T15:22:22+00:00
Comments: Accepted at AAAI'26
Abstract
Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLMs). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLMs using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.
中文标题/摘要
标题:LLaVA$^3$: 以立体派画家的方式表示3D场景以增强VLM的3D场景理解能力
由于可用于训练的3D数据有限,而2D数据集却非常丰富,因此开发能够理解3D场景的多模态语言模型仍然具有挑战性。作为替代方案,我们提出了LLaVA$^3$(发音为LLaVA-立方体),这是一种利用多视角2D图像而不进行微调的新方法,以增强VLM的3D场景理解能力。受立体派画家将3D对象的多个视角整合到一幅画中的启发,我们提出通过每个对象的全向视觉表示来描述3D场景,这些表示是从场景的中间多视角3D重建中获得的。在3D VQA和3D语言定位的广泛实验中,我们的方法优于基于2D的VLM解决方案。
Summary / 总结
The research aims to enhance the 3D scene understanding of vision-language models (VLMs) by leveraging multi-view 2D images without fine-tuning, inspired by Cubist painting techniques. LLaVA$^3$ represents each object in a 3D scene using omnidirectional visual representations derived from an intermediate multi-view 3D reconstruction, which is then fed into the VLM. Experimental results on 3D VQA and 3D language grounding demonstrate that this method surpasses previous 2D-based VLM solutions in terms of performance.
研究旨在通过利用多视角2D图像而不进行微调,增强视觉语言模型(VLM)的3D场景理解能力,灵感来源于立体派绘画技巧。LLaVA$^3$从多个视角表示每个3D物体,并将其整合为单个全景视觉表示供VLM使用。该方法在3D视觉问答和3D语言定位任务中显著优于基于2D的VLM解决方案。
VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
Authors: Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao
First: 2025-11-20T15:16:09+00:00 · Latest: 2025-11-20T15:16:09+00:00
Abstract
Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance, and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
中文标题/摘要
标题：VLA-Pruner: 时间感知的双层视觉标记剪枝方法以实现高效的视觉-语言-行动推理
视觉-语言-行动（VLA）模型在具身人工智能方面展现了巨大的潜力，然而处理连续视觉流的高昂计算成本严重限制了其实时部署。标记剪枝（保留显著的视觉标记并丢弃冗余的标记）已成为加速视觉-语言模型（VLMs）的有效方法，为高效的VLA提供了解决方案。然而，这些针对VLM的标记剪枝方法仅基于语义显著性指标（例如预填充注意力）选择标记，而忽视了VLA固有的双系统本质，即高层次语义理解和低层次行动执行。因此，这些方法在保留标记时偏向语义线索，丢弃了用于生成行动的关键信息，显著降低了VLA的性能。为解决这一问题，我们提出了一种名为VLA-Pruner的通用即插即用VLA专用标记剪枝方法，该方法与VLA模型的双系统本质相一致，并利用机器人操作中的时间连续性。具体而言，VLA-Pruner采用双层重要性标准来保留视觉标记：视觉-语言预填充注意力用于语义层面的相关性，通过时间平滑估计的动作解码注意力用于行动层面的重要性。基于此标准，VLA-Pruner提出了一种新颖的双层标记选择策略，在给定的计算预算下，自适应地保留一组紧凑且信息丰富的视觉标记，以支持语义理解和行动执行。实验表明，VLA-Pruner在多种VLA架构和不同机器人任务中均实现了最先进的性能。
Summary / 总结
VLA-Pruner is a token pruning method designed for efficient Vision-Language-Action (VLA) models. It addresses the issue of heavy computational cost by retaining salient visual tokens and discarding redundant ones, while considering both semantic and action aspects. VLA-Pruner uses a dual-level importance criterion, combining vision-language prefill attention and action decode attention estimated via temporal smoothing, to select tokens. Experimental results demonstrate that VLA-Pruner outperforms existing methods across various VLA architectures and robotic tasks, achieving state-of-the-art performance.
VLA-Pruner 是一种针对高效视觉-语言-行动 (VLA) 模型的 token 剪枝方法。它通过保留关键视觉 token 并丢弃冗余 token 来解决计算成本高的问题,同时考虑语义和行动两个方面。VLA-Pruner 使用双重重要性标准,结合视觉-语言预填充注意力和通过时间平滑估计的动作解码注意力,来选择 token。实验结果表明,VLA-Pruner 在多种 VLA 架构和机器人任务中表现出色,达到最先进的性能。
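The dual-level criterion described in the VLA-Pruner entry can be sketched as two per-token scores and a budgeted selection. This is a hedged illustration only: the exponential moving average used for temporal smoothing, the even budget split, and the names `smooth_action_attention` and `select_tokens` are assumptions made for the example, not the selection strategy from the paper.

```python
def smooth_action_attention(previous_estimate, latest_attention, decay=0.8):
    """Estimate action-level importance by exponentially smoothing past
    action-decode attention (assumed smoothing rule; `decay` is hypothetical)."""
    return [decay * p + (1.0 - decay) * a
            for p, a in zip(previous_estimate, latest_attention)]


def select_tokens(prefill_attention, action_estimate, budget):
    """Keep `budget` visual tokens using both importance levels.

    Half of the budget goes to the most semantically salient tokens, the rest
    to the tokens with the highest estimated action importance. If the two
    sets overlap, the free slots are filled by the larger of the two scores.
    """
    n = len(prefill_attention)
    by_semantic = sorted(range(n), key=lambda i: -prefill_attention[i])
    by_action = sorted(range(n), key=lambda i: -action_estimate[i])
    half = budget // 2
    keep = set(by_semantic[:half]) | set(by_action[:budget - half])
    by_either = sorted(range(n),
                       key=lambda i: -max(prefill_attention[i], action_estimate[i]))
    for i in by_either:
        if len(keep) >= budget:
            break
        keep.add(i)
    return sorted(keep)
```

Selecting by semantic salience alone would keep only the tokens ranked by `prefill_attention`, so a token that matters mainly for action decoding would be dropped. Reserving part of the budget for the action-level score is the point of the dual-level design.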
Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation
Authors: Jin Wang, Bingfeng Zhang, Jian Pang, Mengyu Liu, Honglong Chen, Weifeng Liu
First: 2025-11-20T15:04:53+00:00 · Latest: 2025-11-20T15:04:53+00:00
Abstract
Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples through a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot provide accurate guidance for segmenting untrained classes. In this paper, we argue that the references from support images may not be essential; the key to the support role is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture that uses language descriptions of the target's inherent properties to build a robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs) and then builds refined visual-text prior guidance using multi-modal matching. Meanwhile, because the text-vision modal shift makes it hard for attribute text to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) module to achieve cross-modal interaction between attribute texts and visual features. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves new state-of-the-art performance. The code will be released.
中文标题/摘要
标题:超越视觉线索:利用通用语义支持少量样本分割
少量样本分割(FSS)旨在通过元学习范式在有限支持样本的指导下对新类别进行分割。现有方法主要从支持图像中挖掘参考信息作为元指导。然而,由于视觉表示内的类别变化,从支持图像中提取的元信息无法为未训练类别提供准确的分割指导。在本文中,我们认为支持图像中的参考信息可能不是必需的,支持的关键在于为已训练和未训练类别提供无偏的元指导。我们随后引入了基于语言驱动属性泛化的(LDAG)架构,利用目标属性语言描述来构建稳健的支持策略。具体而言,为了获得无偏的支持表示,我们设计了一个多属性增强(MaE)模块,该模块通过大型语言模型(LLMs)生成目标类的多个详细属性描述,并利用多模态匹配构建精细的视觉-文本先验指导。同时,由于文本-视觉模态转换,属性文本难以促进视觉特征表示,我们设计了多模态属性对齐(MaA)以实现属性文本与视觉特征之间的跨模态交互。实验表明,我们提出的方法在现有方法上取得了明显的优势,并达到了新的最佳性能。代码将开源。
Summary / 总结
This paper addresses the challenge of few-shot segmentation by proposing a Language-Driven Attribute Generalization (LDAG) architecture. It leverages general semantics rather than visual cues from support images to provide unbiased meta guidance for both trained and untrained classes. Key findings include the introduction of a Multi-attribute Enhancement (MaE) module that generates detailed attribute descriptions using Large Language Models (LLMs) and a Multi-modal Attribute Alignment (MaA) module to enhance cross-modal interaction. Experimental results demonstrate that the proposed method surpasses existing approaches and sets a new state-of-the-art performance.
本文提出了一种语言驱动的属性泛化(LDAG)架构,通过一般语义而非支持图像中的视觉线索来提供无偏的元指导。该方法使用一个属性增强(MaE)模块通过大型语言模型生成目标类的详细属性描述,并使用多模态属性对齐(MaA)模块增强属性文本与视觉特征之间的跨模态交互。实验表明,该方法显著优于现有方法,并达到了新的最佳性能。
TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models
Authors: Li Zhang, Zhongxuan Han, XiaoHua Feng, Jiaming Zhang, Yuyuan Li, Linbo Jiang, Jianan Lin, Chaochao Chen
Venue: AAAI 2026
First: 2025-11-20T14:45:59+00:00 · Latest: 2025-11-20T14:45:59+00:00
Comments: Accepted by AAAI 2026
Abstract
Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning. Existing adaptation algorithms are typically trained iteratively, which incurs significant communication costs and increases the susceptibility to potential attacks. Motivated by one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive. However, current one-shot approaches face certain challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) lack of specialized adaptation strategies to systematically handle the severe data heterogeneity; and (3) the need for additional training resources on the clients or the server. To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations. In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity. Our method is training-free, not relying on additional training resources on either the client or server side. Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.
中文标题/摘要
标题:TOFA:无需训练的一次性联邦适应框架用于视觉-语言模型
通过本地客户端与中央服务器之间的协作交互,对预训练的视觉-语言模型(VLMs)进行高效且轻量级的下游任务适应,是联邦学习中迅速兴起的研究领域。现有的适应算法通常需要迭代训练,这会带来显著的通信成本并增加潜在攻击的风险。受减少客户端与服务器间交互次数的一次性联邦训练技术的启发,开发一种轻量级的一次性联邦VLM适应方法以缓解这些问题特别具有吸引力。然而,当前的一次性方法在联邦环境中适应VLMs时面临一些挑战:(1)未能充分利用VLMs中丰富的跨模态信息;(2)缺乏专门的适应策略来系统地处理严重的数据异质性;(3)需要额外的训练资源。为解决这些问题,我们提出了一种新的无需训练的一次性联邦适应框架,名为TOFA。为了充分利用预训练VLMs中的可泛化跨模态特征,TOFA采用视觉和文本两条管道来提取任务相关的表示。在视觉管道中,层次贝叶斯模型学习个性化、类特定的原型分布。对于文本管道,TOFA评估并全局对齐生成的本地文本提示以增强鲁棒性。还引入了一种自适应权重校准机制,以结合两种模态的预测,平衡个性化和鲁棒性以处理数据异质性。我们的方法是无需训练的,既不依赖于客户端也不依赖于服务器的额外训练资源。在各种联邦设置下的9个数据集上进行的广泛实验表明了所提TOFA方法的有效性。
Summary / 总结
TOFA is a training-free one-shot federated adaptation framework for Vision-Language Models (VLMs) that addresses the challenges of insufficient multimodal information exploitation, data heterogeneity, and the need for additional training resources. It uses both visual and textual pipelines to extract task-relevant representations and introduces an adaptive weight calibration mechanism to combine predictions from both modalities, enhancing personalization and robustness. Experiments across nine datasets show the effectiveness of TOFA in federated settings.
TOFA 是一种无需训练的一次性联邦适应框架,用于视觉-语言模型(VLMs),旨在解决多模态信息不足、数据异质性和需要额外训练资源的问题。它使用视觉和文本管道提取任务相关的表示,采用分层贝叶斯模型学习个性化类特定的原型分布,并对生成的本地文本提示进行全局对齐以增强鲁棒性。该方法引入了自适应权重校准机制,以平衡个性化和鲁棒性。实验表明,TOFA 在九个不同联邦设置的数据集上表现出色,无需额外的训练资源。
One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Authors: Ezzeldin Shereen, Dan Ristea, Shae McFadden, Burak Hasircioglu, Vasilios Mavroudis, Chris Hicks
First: 2025-04-02T21:08:33+00:00 · Latest: 2025-11-20T14:21:05+00:00
Abstract
Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets and a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show that VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, yet demonstrates robustness to black-box attacks in the universal setting.
中文标题/摘要
标题:一图足以:单张图像对视觉文档增强生成的投毒攻击
检索增强生成(RAG)对于通过事实知识库(KB)抑制大型语言模型(LLMs)的幻觉至关重要。尽管PDF文档是知识的重要来源,但基于文本的RAG管道无法有效捕捉其丰富的多模态信息。相比之下,视觉文档RAG(VD-RAG)使用文档页面的截图作为KB,已被证明可达到最先进的效果。然而,通过引入图像模态,VD-RAG为对手提供了新的攻击向量,通过向KB注入恶意文档来破坏系统。在本文中,我们展示了VD-RAG在检索和生成方面都容易受到投毒攻击。我们定义了两种攻击目标,并证明只需向KB注入一张对抗性图像即可实现这两种目标。首先,我们介绍了一种针对一个或一组查询的定向攻击,其目标是传播有针对性的虚假信息。其次,我们提出了一种通用攻击,对于任何潜在的用户查询,都会影响响应以导致VD-RAG系统的拒绝服务。我们在白盒和黑盒假设下研究了这两种攻击目标,采用多目标梯度优化方法以及提示最先进的生成模型。使用两个视觉文档数据集、一组多样化的最先进的检索器(嵌入模型)和生成器(视觉语言模型),我们展示了VD-RAG在定向和通用设置下都容易受到投毒攻击,但在通用设置下对黑盒攻击具有鲁棒性。
Summary / 总结
This paper investigates the vulnerability of visual document retrieval-augmented generation (VD-RAG) systems to poisoning attacks. Motivated by the need to protect against malicious documents in the knowledge base, the authors demonstrate that a single adversarial image can achieve both targeted and universal attacks. The targeted attack aims to spread disinformation for specific queries, while the universal attack causes denial-of-service for any query. The study employs a multi-objective gradient-based optimization approach and shows that VD-RAG is vulnerable to these attacks, though it remains robust to black-box attacks in the universal setting.
本文研究了视觉文档检索增强生成(VD-RAG)系统对抗投毒攻击的脆弱性。出于保护知识库免受恶意文档侵害的需要,作者证明了一个单一的对抗性图像可以用来传播有针对性的虚假信息或导致服务中断。他们采用多目标梯度优化方法,并展示了在白盒和黑盒条件下,VD-RAG 对于有针对性和通用攻击都是易受攻击的,尽管在通用攻击的黑盒条件下表现出一定的鲁棒性。
Unsupervised Graph Neural Network Framework for Balanced Multipatterning in Advanced Electronic Design Automation Layouts
Authors: Abdelrahman Helaly, Nourhan Sakr, Kareem Madkour, Ilhami Torunoglu
First: 2025-11-20T13:57:50+00:00 · Latest: 2025-11-20T13:57:50+00:00
Comments: manuscript under review
Abstract
Multipatterning is an essential decomposition strategy in electronic design automation (EDA) that overcomes lithographic limitations when printing dense circuit layouts. Although heuristic-based backtracking and SAT solvers can address these challenges, they often struggle to simultaneously handle both complex constraints and secondary objectives. In this study, we present a hybrid workflow that casts multipatterning as a variant of a constrained graph coloring problem with the primary objective of minimizing feature violations and a secondary objective of balancing the number of features on each mask. Our pipeline integrates two main components: (1) a GNN-based agent, trained in an unsupervised manner to generate initial color predictions, which are refined by (2) refinement strategies (a GNN-based heuristic and simulated annealing) that together enhance solution quality and balance. Experimental evaluations on both proprietary datasets and publicly available open-source layouts demonstrate complete conflict-free decomposition and consistent color balancing. The proposed framework provides a reproducible, data-efficient and deployable baseline for scalable layout decomposition in EDA workflows.
中文标题/摘要
标题:无监督图神经网络框架在先进电子设计自动化布局中的平衡多图案化
多图案化是电子设计自动化(EDA)中克服光刻限制的一种基本分解策略,用于打印密集电路布局。尽管基于启发式的回溯和SAT求解器可以解决这些挑战,但它们通常难以同时处理复杂的约束条件和次要目标。在本研究中,我们提出了一种混合工作流,将多图案化视为一种受限图着色问题的变体,主要目标是减少特征冲突,次要目标是平衡每个掩膜上的特征数量。我们的流水线整合了两个主要组件:(1) 一种基于GNN的代理,以无监督方式训练以生成初始颜色预测,然后由(2) 改进策略(基于GNN的启发式和模拟退火)共同提高解决方案质量和平衡。在专有数据集和公开可用的开源布局中的实验评估表明,完全冲突自由分解和一致的颜色平衡。所提出的框架为EDA工作流中的可扩展布局分解提供了一个可重复、数据高效且可部署的基础。
Summary / 总结
This study addresses the challenge of multipatterning in EDA by formulating it as a constrained graph coloring problem. The proposed hybrid workflow includes an unsupervised GNN-based agent for initial color prediction and refinement strategies like a GNN-based heuristic and simulated annealing to enhance solution quality and balance. The experiments show complete conflict-free decomposition and consistent color balancing in both proprietary and open source layouts, providing a scalable baseline for EDA workflows.
该研究通过将多图案化问题表述为约束图着色问题来解决EDA中的挑战。研究提出了一种混合工作流,使用基于GNN的代理进行初始预测,并结合GNN启发式和模拟退火等精炼策略来提高解决方案的质量和平衡。实验结果表明,在专有数据集和开源布局中均实现了完全无冲突分解和一致的颜色平衡,为EDA工作流中的可扩展布局分解提供了可重复、高效的基础框架。
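The two objectives in the multipatterning entry (no conflicting neighbours on the same mask, and evenly loaded masks) can be written as a single cost over a coloring. The sketch below assumes a plain edge-list graph and a hypothetical `balance_weight` that keeps balance secondary to conflicts. It shows only the objective a refinement step such as simulated annealing could minimise, not the GNN agent or the paper's actual formulation.

```python
def count_conflicts(edges, colors):
    """Primary objective: number of conflict edges whose two features
    were assigned to the same mask (color)."""
    return sum(1 for u, v in edges if colors[u] == colors[v])


def imbalance(colors, num_masks):
    """Secondary objective: gap between the fullest and emptiest mask."""
    counts = [0] * num_masks
    for c in colors:
        counts[c] += 1
    return max(counts) - min(counts)


def cost(edges, colors, num_masks, balance_weight=0.1):
    """Combined cost; `balance_weight` < 1 is an assumed weighting that makes
    a single conflict outweigh a small imbalance."""
    return count_conflicts(edges, colors) + balance_weight * imbalance(colors, num_masks)


# Usage: four features on a cycle, decomposed onto two masks.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
good = [0, 1, 0, 1]   # alternating masks: conflict-free and balanced
bad = [0, 0, 1, 1]    # balanced, but two neighbouring pairs share a mask
```

The `bad` coloring is a local minimum for single-feature recoloring, since changing any one feature leaves two conflicts and worsens the balance. That is one reason a stochastic refinement such as simulated annealing is a natural fit for this cost.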
BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms
Authors: Yunlong Hou, Fengzhuo Zhang, Cunxiao Du, Xuan Zhang, Jiachun Pan, Tianyu Pang, Chao Du, Vincent Y. F. Tan, Zhuoran Yang
Venue: ICML
First: 2025-05-21T05:56:31+00:00 · Latest: 2025-11-20T13:34:42+00:00
Comments: 35 pages, 4 figures, accepted to ICML, typos and affiliations are corrected
Abstract
Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
中文标题/摘要
标题:BanditSpec:通过bandit算法实现自适应推测性解码
推测性解码已成为一种流行的方法,用于加速大型语言模型(LLMs)的推理,同时保持其卓越的文本生成性能。先前的方法要么采用固定不变的推测性解码配置,不考虑前缀标记,要么通过离线或在线训练草稿模型来与上下文对齐。本文提出了一种无需训练的在线学习框架,以在生成文本时自适应地选择推测性解码超参数配置。我们首先将此超参数选择问题形式化为一个多臂赌博机问题,并提供了一个通用的推测性解码框架BanditSpec。此外,我们设计并分析了两种基于bandit的超参数选择算法UCBSpec和EXP3Spec,以一种新的量度——停止时间后悔量度。我们分别在随机奖励和对抗奖励设置下对这种后悔进行了上界估计。通过推导出一种信息论不可能性结果,表明UCBSpec的后悔性能在通用常数范围内是最佳的。最后,通过LLaMA3和Qwen2的广泛实证实验表明,与现有方法相比,我们的算法是有效的,且吞吐量接近模拟的真实生活LLM服务场景中的最优超参数。
Summary / 总结
This paper introduces BanditSpec, an adaptive speculative decoding framework that dynamically selects hyperparameters during text generation, addressing the limitations of fixed configurations and offline training. It formulates the hyperparameter selection as a Multi-Armed Bandit problem and proposes UCBSpec and EXP3Spec algorithms, which are analyzed in terms of stopping time regret. The experiments show that these algorithms outperform existing methods and achieve throughput close to the optimal hyperparameter setting in various LLM serving scenarios.
该论文提出了BanditSpec,这是一种在文本生成过程中动态选择超参数的自适应推测解码框架,无需进行训练。它将超参数选择问题形式化为一个多臂老虎机问题,并提出了两种算法UCBSpec和EXP3Spec,从停止时间后悔的角度进行了分析。实验表明,这些算法在各种LLM服务场景中优于现有方法,并且吞吐量接近于模拟的真实场景中的最优超参数设置。
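To illustrate the bandit framing in the entry above, here is a minimal sketch of the textbook UCB1 rule choosing among candidate configurations. It is not the paper's UCBSpec algorithm, whose details the abstract does not give. The arms, the Bernoulli reward model, and the names `ucb1_select` and `run` are assumptions made for the example; in a real speculative decoding setting the reward would presumably come from observed decoding progress, such as the number of accepted draft tokens.

```python
import math
import random


def ucb1_select(counts, sums, t):
    """Pick an arm by the textbook UCB1 rule; arms never tried go first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run(mean_rewards, steps, seed=0):
    """Simulate UCB1 on Bernoulli arms and return how often each arm was pulled."""
    rng = random.Random(seed)
    counts = [0] * len(mean_rewards)
    sums = [0.0] * len(mean_rewards)
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, sums, t)
        # Assumed reward model: 1.0 with the arm's success probability, else 0.0.
        reward = 1.0 if rng.random() < mean_rewards[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


# Three hypothetical configurations with different expected rewards.
counts = run([0.2, 0.5, 0.8], steps=2000)
print(counts)
```

The exploration bonus shrinks as an arm is pulled more often, so the selector concentrates on the best configuration while still occasionally re-checking the others.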
SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning
Authors: Wei Xia, Zhi-Hong Deng
First: 2025-11-20T13:00:04+00:00 · Latest: 2025-11-20T13:00:04+00:00
Abstract
With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address this challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intent without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions: helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.
中文标题/摘要
标题:SDA:基于转向驱动的分布对齐方法,无需微调的开放LLM
随着大型语言模型(LLMs)的迅速发展,它们在实际应用中的部署变得越来越普遍。LLMs 被期望在各种任务、用户偏好和实际场景中提供稳健的性能。然而,随着需求的增长,确保LLMs 生成与人类意图一致的响应仍然是一个基础性的挑战。特别是在推理过程中有效且高效地对齐模型行为,而无需昂贵的重新训练或广泛的监督,既是关键要求也是非平凡的技术挑战。为了解决这一挑战,我们提出了SDA(Steering-Driven Distribution Alignment),一种无需训练且模型无关的对齐框架,适用于开源LLMs。SDA 根据用户定义的对齐指令动态重新分配模型输出概率,增强模型行为与人类意图之间的对齐,而无需微调。该方法轻量级、资源高效,并且兼容多种开源LLMs。它可以在推理过程中独立运行,也可以与基于训练的对齐策略集成。此外,SDA 支持个性化偏好对齐,允许灵活控制模型响应行为。实验证明,SDA 在三个关键对齐维度(帮助性、无害性和诚实性)上,对8个不同规模和来源的开源LLMs 的对齐性能进行了持续改进。具体而言,SDA 在测试模型中的平均帮助性提高了64.4%,诚实性提高了30%,无害性提高了11.5%,表明其在不同模型和应用场景中的有效性和普适性。
Summary / 总结
SDA (Steering-Driven Distribution Alignment) is a training-free and model-agnostic framework designed to align open-source large language models (LLMs) with human intent during inference. It dynamically redistributes model output probabilities based on user-defined instructions, enhancing alignment without fine-tuning. SDA improves alignment performance across 8 open-source LLMs, achieving average gains of 64.4% in helpfulness, 30% in honesty, and 11.5% in harmlessness, demonstrating its effectiveness and generalization across diverse models and scenarios.
SDA 是一个无需训练且模型通用的框架,用于在推理过程中使大型语言模型(LLM)与人类意图对齐。它基于用户定义的指令重新分配模型输出概率,而不进行微调。SDA 在帮助性、诚实性和无害性方面分别提高了 64.4%、30% 和 11.5%,表明其在不同模型和应用场景中的有效性和普适性。
NaTex: Seamless Texture Generation as Latent Color Diffusion
Authors: Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Xin Yang, Xin Huang, Jingwei Huang, Xiangyu Yue, Chunchao Guo
First: 2025-11-20T12:47:22+00:00 · Latest: 2025-11-20T12:47:22+00:00
Comments: Technical Report
Abstract
We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-aware color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE-DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment. Moreover, NaTex exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.
中文标题/摘要
标题:NaTex:无缝纹理生成的潜在颜色扩散
我们提出了NaTex,这是一种原生的纹理生成框架,可以直接在3D空间中预测纹理颜色。与依赖于几何条件多视图扩散模型(MVDs)合成的2D多视角图像烘焙的方法相比,NaTex避免了MVD管道中的几个固有限制。这些限制包括处理需要修补的遮挡区域的困难、在边界处实现精确的网格-纹理对齐以及在内容和颜色强度方面保持跨视图的一致性和连贯性。NaTex通过将纹理视为密集的颜色点云来解决这些问题,提出了一种新颖的范式。基于这一理念,我们提出了潜在颜色扩散,它包括一个几何感知的颜色点云VAE和一个多控制扩散变换器(DiT),完全从头开始使用3D数据进行训练,用于纹理重建和生成。为了实现精确对齐,我们引入了原生几何控制,通过位置嵌入和几何潜在变量直接条件化DiT。我们共同设计了VAE-DiT架构,其中几何潜在变量通过一个与颜色VAE紧密耦合的专用几何分支提取,提供细粒度的表面指导,保持与纹理的强烈对应关系。通过这些设计,NaTex展示了强大的性能,显著优于先前的方法在纹理连贯性和对齐方面的表现。此外,NaTex还展示了强大的泛化能力,无论是训练免费还是简单调整,都可以应用于各种下游应用,例如材料生成、纹理细化、部分分割和纹理化。
Summary / 总结
NaTex is a texture generation framework that directly predicts texture colors in 3D space, avoiding the limitations of previous Multi-View Diffusion models. It uses latent color diffusion, combining a geometry-aware color point cloud VAE and a multi-control diffusion transformer (DiT) to achieve precise alignment and coherence. NaTex outperforms previous methods in texture coherence and alignment and shows strong generalization capabilities for various applications like material generation and texture refinement.
NaTex 是一种直接在 3D 空间预测纹理颜色的生成框架,避免了基于 2D 多视图图像的先前方法的局限性。它使用潜色扩散,结合几何感知的颜色点云 VAE 和多控制扩散变换器(DiT),以实现精确对齐和一致性。NaTex 在纹理一致性和对齐方面优于先前的方法,并且在材料生成、纹理细化等各个下游应用中表现出强大的泛化能力。
VisPlay: Self-Evolving Vision-Language Models from Images
Authors: Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
First: 2025-11-19T17:55:15+00:00 · Latest: 2025-11-20T12:43:54+00:00
Abstract
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model to two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
中文标题/摘要
标题:VisPlay:从图像中自我进化的视觉-语言模型
强化学习(RL)为在复杂推理任务上改进视觉-语言模型(VLMs)提供了一个原则性的框架。然而,现有的RL方法通常依赖于人工标注的标签或特定任务的启发式方法来定义可验证的奖励,这两种方法都成本高昂且难以扩展。我们引入了VisPlay,这是一种自我进化的RL框架,使VLMs能够利用大量未标注的图像数据自主提高其推理能力。从一个基础VLM开始,VisPlay将模型分配为两个相互作用的角色:一个图像条件下的提问者,它能够提出具有挑战性但可回答的视觉问题;以及一个多模态推理器,它生成银级回答。这些角色通过组相对策略优化(GRPO)联合训练,该方法结合了多样性和难度奖励,以平衡生成问题的复杂性与银级回答的质量。VisPlay在两个模型家族中高效扩展。当在Qwen2.5-VL和MiMo-VL上训练时,VisPlay在八个基准测试中,包括MM-Vet和MMMU,实现了视觉推理、组合泛化和幻觉减少的一致改进,展示了自我进化的多模态智能的可扩展路径。项目页面可在https://bruno686.github.io/VisPlay/获取
Summary / 总结
VisPlay is a self-evolving reinforcement learning framework for Vision-Language Models (VLMs) that uses large amounts of unlabeled image data to improve reasoning abilities. It assigns the model to two roles: an Image-Conditioned Questioner and a Multimodal Reasoner, which are trained together using Group Relative Policy Optimization (GRPO) to balance question complexity and answer quality. VisPlay consistently improves visual reasoning and compositional generalization and reduces hallucination across multiple benchmarks, showing a scalable approach to self-evolving multimodal intelligence.
VisPlay 是一个自进化的视觉-语言模型(VLMs)强化学习框架,利用大量未标注的图像数据自主提升推理能力。它将模型分配为两个角色:图像条件下的问题提出者,负责提出具有挑战性的问题,以及多模态推理者,负责生成答案。通过组相对策略优化(GRPO)联合训练,该框架平衡了问题的复杂性和答案的质量,展示了在八个基准测试中(包括MM-Vet和MMMU)一致的视觉推理、组合泛化和幻觉减少改进,表明了一条自进化的多模态智能的可扩展路径。
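The group-relative idea behind GRPO, named in the entry above, can be sketched as follows: each sampled response is scored against the other responses drawn for the same prompt, rather than against a learned value function. This shows only the standard advantage normalization. The reward values and the name `group_relative_advantages` are assumptions for illustration, and VisPlay's diversity and difficulty rewards are not specified in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one question.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Responses that beat their group average get a positive advantage and are reinforced; when every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.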
Human Motion Unlearning
Authors: Edoardo De Matteis, Matteo Migliarini, Alessio Sampieri, Indro Spinelli, Fabio Galasso
First: 2025-03-24T13:46:27+00:00 · Latest: 2025-11-20T11:40:12+00:00
Abstract
We introduce the task of human motion unlearning to prevent the synthesis of toxic animations while preserving the general text-to-motion generative performance. Unlearning toxic motions is challenging as they can be generated from explicit text prompts and from implicit toxic combinations of safe motions (e.g., "kicking" is "loading and swinging a leg"). We propose the first motion unlearning benchmark by filtering toxic motions from the large, recent text-to-motion datasets HumanML3D and Motion-X. We propose baselines by adapting state-of-the-art image unlearning techniques to process spatio-temporal signals. Finally, we propose a novel motion unlearning model based on Latent Code Replacement, which we dub LCR. LCR is training-free and suitable for the discrete latent spaces of state-of-the-art text-to-motion diffusion models. LCR is simple and consistently outperforms baselines qualitatively and quantitatively. Project page: https://www.pinlab.org/hmu.
中文标题/摘要
标题:人类运动反学习
我们提出了人类运动反学习的任务,以防止合成有毒动画,同时保持文本到运动生成性能的一般表现。反学习有毒运动具有挑战性,因为这些运动可以从显式的文本提示或从安全运动的隐式有毒组合中生成(例如,“踢腿”是“加载并摆动腿部”)。我们通过从大型和最近的文本到运动数据集HumanML3D和Motion-X中过滤有毒运动,提出了第一个运动反学习基准。我们提出了基线方法,通过将最先进的图像反学习技术适应处理时空信号。最后,我们提出了一种基于潜在代码替换的新运动反学习模型,我们称之为LCR。LCR无需训练且适用于最先进的文本到运动扩散模型的离散潜在空间。LCR简单且在定性和定量上都优于基线方法。项目页面:https://www.pinlab.org/hmu.
Summary / 总结
The research introduces the task of human motion unlearning to prevent the synthesis of toxic animations while maintaining the generative performance of text-to-motion models. The study proposes a benchmark by filtering toxic motions from large datasets and adapts image unlearning techniques to process spatio-temporal signals. A novel motion unlearning model called LCR is proposed, which uses Latent Code Replacement and outperforms baselines both qualitatively and quantitatively without requiring training.
研究旨在防止在文本到运动生成中合成有毒动画,同时保持整体性能。研究引入了一个通过过滤现有数据集中的有毒动作来构建的运动遗忘基准。它提出了一种名为Latent Code Replacement (LCR)的新模型,该模型无需训练即可有效地遗忘有毒动作,并在定性和定量上均优于基线模型。
LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging
Authors: Seungeon Lee, Soumi Das, Manish Gupta, Krishna P. Gummadi
First: 2025-11-10T14:13:10+00:00 · Latest: 2025-11-20T11:08:55+00:00
Abstract
Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models. However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs to improve performance on diverse tasks, but usually require labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through LoRA adapters to identify the most relevant adapters and determine their contributions on the fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks by a margin of up to 3.6% while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.
中文标题/摘要
标题:LoRA随行:实例级动态LoRA选择与合并
低秩适应(LoRA)已成为一种参数高效的方法,用于微调大型语言模型。然而,传统的LoRA适配器通常仅针对单一任务进行训练,限制了它们在实际应用场景中的适用性,这些场景中的输入可能涉及多样且不可预测的领域。在推理时,现有方法结合多个LoRA以提高在多种任务上的性能,但通常需要标记数据或额外的任务特定训练,这在大规模应用时成本高昂。在本工作中,我们引入了LoRA随行(LoGo),这是一种无需训练的框架,可以在实例级别动态选择和合并适配器,而无需任何额外要求。LoGo利用单次通过LoRA适配器的前向传递提取的信号,以识别最相关的适配器并在运行时确定它们的贡献。在5个NLP基准、27个数据集和3个模型家族上,LoGo在某些任务上的性能优于基于训练的基线,最多可提高3.6%,同时在其他任务上保持竞争力并保持推理吞吐量,突显了其有效性和实用性。
Taming Uncertainty via Automation: Observing, Analyzing, and Optimizing Agentic AI Systems
Authors: Dany Moshkovich, Sergey Zeltyn
First: 2025-07-15T12:54:43+00:00 · Latest: 2025-11-20T10:41:15+00:00
Abstract
Large Language Models (LLMs) are increasingly deployed within agentic systems - collections of interacting, LLM-powered agents that execute complex, adaptive workflows using memory, tools, and dynamic planning. While enabling powerful new capabilities, these systems also introduce unique forms of uncertainty stemming from probabilistic reasoning, evolving memory states, and fluid execution paths. Traditional software observability and operations practices fall short in addressing these challenges. This paper presents our vision of AgentOps: a comprehensive framework for observing, analyzing, optimizing, and automating operation of agentic AI systems. We identify distinct needs across four key roles - developers, testers, site reliability engineers (SREs), and business users - each of whom engages with the system at different points in its lifecycle. We present the AgentOps Automation Pipeline, a six-stage process encompassing behavior observation, metric collection, issue detection, root cause analysis, optimized recommendations, and runtime automation. Throughout, we emphasize the critical role of automation in managing uncertainty and enabling self-improving AI systems - not by eliminating uncertainty, but by taming it to ensure safe, adaptive, and effective operation.
中文标题/摘要
标题:通过自动化驯服不确定性:观察、分析和优化代理型AI系统
大型语言模型(LLMs)越来越多地部署在代理型系统中——这些系统是由相互作用的、由LLM驱动的代理组成的集合,它们使用记忆、工具和动态规划执行复杂的、适应性的工作流。虽然这些系统提供了强大的新能力,但它们也引入了源自概率推理、不断变化的记忆状态和灵活执行路径的独特形式的不确定性。传统的软件可观测性和运维实践在应对这些挑战方面力有不逮。 本文提出了我们的AgentOps愿景:一个全面的框架,用于观察、分析、优化和自动化代理型AI系统的操作。我们确定了四个关键角色——开发人员、测试人员、站点可靠性工程师(SRE)和业务用户——他们分别在系统生命周期的不同阶段与系统互动。我们介绍了AgentOps自动化管道,这是一个六阶段过程,包括行为观察、指标收集、问题检测、根本原因分析、优化建议和运行时自动化。在整个过程中,我们强调了自动化在管理不确定性、使AI系统能够自我改进方面的重要作用——不是通过消除不确定性,而是通过驯服它来确保安全、适应性和有效的操作。
Summary / 总结
This paper addresses the challenges of uncertainty in agentic systems powered by Large Language Models (LLMs), which involve interacting agents executing complex workflows. It introduces AgentOps, a framework for observing, analyzing, and automating the operation of these systems. The AgentOps Automation Pipeline includes six stages: behavior observation, metric collection, issue detection, root cause analysis, optimized recommendations, and runtime automation. The key finding is the importance of automation in managing uncertainty to ensure safe and effective operation of agentic AI systems.
本文探讨了由大型语言模型(LLMs)驱动的交互式代理系统中不确定性带来的挑战,这些系统执行复杂的流程工作。它引入了AgentOps框架,用于观察、分析和自动化这些系统的操作。AgentOps自动化管道包括六个阶段:行为观察、指标收集、问题检测、根本原因分析、优化建议和运行时自动化。关键发现是自动化在管理不确定性以确保代理AI系统的安全和有效操作中的重要性。
FlipVQA-Miner: Cross-Page Visual Question-Answer Mining from Textbooks
Authors: Zhen Hao Wong, Jingwen Deng, Hao Liang, Runming He, Chengyu Shen, Wentao Zhang
First: 2025-11-20T10:38:00+00:00 · Latest: 2025-11-20T10:38:00+00:00
Abstract
The development of Large Language Models (LLMs) increasingly depends on high-quality supervised data, yet existing instruction-tuning and RL datasets remain costly to curate and often rely on synthetic samples that introduce hallucination and limited diversity. At the same time, textbooks and exercise materials contain abundant, high-quality human-authored Question-Answer (QA) content that remains underexploited due to the difficulty of transforming raw PDFs into AI-ready supervision. Although modern OCR and vision-language models can accurately parse document structure, their outputs lack the semantic alignment required for training. We propose an automated pipeline that extracts well-formed QA and visual-QA (VQA) pairs from educational documents by combining layout-aware OCR with LLM-based semantic parsing. Experiments across diverse document types show that the method produces accurate, aligned, and low-noise QA/VQA pairs. This approach enables scalable use of real-world educational content and provides a practical alternative to synthetic data generation for improving reasoning-oriented LLM training. All code and data-processing pipelines are open-sourced at https://github.com/OpenDCAI/DataFlow.
中文标题/摘要
标题:FlipVQA-Miner: 跨页面视觉问答挖掘来自教科书
大型语言模型(LLMs)的发展越来越依赖高质量的监督数据,但现有的指令调优和强化学习数据集的收集成本高昂,且往往依赖于合成样本,这会引入幻觉并限制多样性。同时,教科书和练习材料中包含大量高质量的人工编写的问题-答案(QA)内容,但由于将原始PDF转换为AI可利用的监督数据的难度,这些内容尚未得到充分利用。尽管现代OCR和跨模态模型能够准确解析文档结构,但它们的输出缺乏用于训练所需的语义对齐。我们提出了一种自动化流水线,通过结合布局感知OCR与基于LLM的语义解析,从教育文档中提取结构良好且格式正确的QA和视觉问答(VQA)对。跨不同类型的文档进行的实验表明,该方法生成了准确、对齐且低噪声的QA/VQA对。该方法使实际教育内容的规模化利用成为可能,并为提高以推理为导向的LLM训练提供了实用的替代方案。所有代码和数据处理流水线均开源于https://github.com/OpenDCAI/DataFlow。