Scaling Group Inference for Diverse and High-Quality Generation
Authors: Gaurav Parmar, Or Patashnik, Daniil Ostashev, Kuan-Chieh Wang, Kfir Aberman, Srinivasa Narasimhan, Jun-Yan Zhu
First: 2025-08-21T17:59:57+00:00 · Latest: 2025-08-21T17:59:57+00:00
Comments: Project website: https://www.cs.cmu.edu/~group-inference, GitHub:
https://github.com/GaParmar/group-inference
Abstract
Generative models typically sample outputs independently, and recent
inference-time guidance and scaling algorithms focus on improving the quality
of individual samples. However, in real-world applications, users are often
presented with a set of multiple images (e.g., 4-8) for each prompt, where
independent sampling tends to lead to redundant results, limiting user choices
and hindering idea exploration. In this work, we introduce a scalable group
inference method that improves both the diversity and quality of a group of
samples. We formulate group inference as a quadratic integer assignment
problem: candidate outputs are modeled as graph nodes, and a subset is selected
to optimize sample quality (unary term) while maximizing group diversity
(binary term). To substantially improve runtime efficiency, we progressively
prune the candidate set using intermediate predictions, allowing our method to
scale up to large candidate sets. Extensive experiments show that our method
significantly improves group diversity and quality compared to independent
sampling baselines and recent inference algorithms. Our framework generalizes
across a wide range of tasks, including text-to-image, image-to-image, image
prompting, and video generation, enabling generative models to treat multiple
outputs as cohesive groups rather than independent samples.
中文标题/摘要
标题:规模化群体推理以实现多样化和高质量生成
生成模型通常独立采样输出,而近期的推理时引导与扩展算法主要关注提升单个样本的质量。然而在实际应用中,用户常需针对每个提示获取多张图像(如4-8张),独立采样易导致结果冗余,限制用户选择并阻碍创意探索。本研究提出一种可扩展的群体推理方法,同步提升样本组的多样性与质量。我们将群体推理构建为二次整数分配问题:候选输出建模为图节点,通过选择子集优化样本质量(一元项)并最大化群体多样性(二元项)。为显著提升运行效率,采用中间预测逐步剪枝候选集,使方法能扩展至大规模候选集。大量实验表明,相较于独立采样基线及近期推理算法,本方法显著提升了群体多样性与质量。该框架可泛化至文本生成图像、图像到图像转换、图像提示及视频生成等广泛任务,使生成模型将多输出视为有机整体而非独立样本。
Summary / 总结
Generative models often produce redundant outputs when sampling multiple images per prompt, limiting user choice and exploration. To address this, the authors formulate group inference as a quadratic integer assignment problem, modeling candidate outputs as graph nodes and selecting a subset that optimizes both quality (unary term) and diversity (binary term). They improve efficiency through progressive candidate pruning, enabling scalability to large sets. Experiments demonstrate significant improvements in group diversity and quality over independent sampling and recent inference methods across text-to-image, image-to-image, image prompting, and video generation tasks.
生成模型在每次提示下采样多张图像时往往产生冗余输出,限制了用户选择和创意探索。为此,研究者将群体推理构建为二次整数分配问题,将候选输出建模为图节点,并选择同时优化质量(一元项)和多样性(二元项)的子集。通过渐进式候选剪枝提高效率,实现大规模扩展。实验表明,该方法在文本到图像、图像到图像、图像提示和视频生成任务中,相比独立采样和近期推理方法,显著提升了群体多样性和质量。
CineScale: Free Lunch in High-Resolution Cinematic Visual Generation
Authors: Haonan Qiu, Ning Yu, Ziqi Huang, Paul Debevec, Ziwei Liu
Venue: ICCV 2025
First: 2025-08-21T17:59:57+00:00 · Latest: 2025-08-21T17:59:57+00:00
Comments: CineScale is an extended work of FreeScale (ICCV 2025). Project Page:
https://eyeline-labs.github.io/CineScale/, Code Repo:
https://github.com/Eyeline-Labs/CineScale
Abstract
Visual diffusion models achieve remarkable progress, yet they are typically
trained at limited resolutions due to the lack of high-resolution data and
constrained computation resources, hampering their ability to generate
high-fidelity images or videos at higher resolutions. Recent efforts have
explored tuning-free strategies to unlock the untapped potential of
pre-trained models for higher-resolution visual generation. However, these
methods are still prone to producing low-quality visual content with repetitive
patterns. The key obstacle lies in the inevitable increase in high-frequency
information when the model generates visual content exceeding its training
resolution, leading to undesirable repetitive patterns deriving from the
accumulated errors. In this work, we propose CineScale, a novel inference
paradigm to enable higher-resolution visual generation. To tackle the various
issues introduced by the two types of video generation architectures, we
propose dedicated variants tailored to each. Unlike existing baseline methods
that are confined to high-resolution T2I and T2V generation, CineScale broadens
the scope by enabling high-resolution I2V and V2V synthesis, built atop
state-of-the-art open-source video generation frameworks. Extensive experiments
validate the superiority of our paradigm in extending the capabilities of
higher-resolution visual generation for both image and video models.
Remarkably, our approach enables 8k image generation without any fine-tuning,
and achieves 4k video generation with only minimal LoRA fine-tuning. Generated
video samples are available at our website:
https://eyeline-labs.github.io/CineScale/.
中文标题/摘要
标题:CineScale:高分辨率电影级视觉生成的免费午餐
视觉扩散模型虽取得显著进展,但因缺乏高分辨率数据及计算资源受限,通常仅在有限分辨率下训练,制约了其生成高保真高分辨率图像或视频的能力。近期研究探索了无需调参的策略以挖掘预训练模型在高分辨率视觉生成中的潜力,但这些方法仍易产生带有重复模式的低质量内容。核心障碍在于当模型生成超出训练分辨率的视觉内容时,高频信息不可避免地增加,导致误差累积产生不良重复模式。本研究提出CineScale——一种实现更高分辨率视觉生成的新型推理范式。针对两类视频生成架构的不同问题,我们设计了专用变体方案。与现有局限于高分辨率文生图(T2I)和文生视频(T2V)的基线方法不同,CineScale基于顶尖开源视频生成框架,进一步实现了高分辨率图生视频(I2V)和视频生视频(V2V)的合成。大量实验验证了该范式在扩展图像与视频模型高分辨率生成能力方面的优越性。值得注意的是,本方法无需微调即可实现8K图像生成,并通过极少量LoRA微调达成4K视频生成。生成视频样本请访问:https://eyeline-labs.github.io/CineScale/。
Summary / 总结
Motivated by the limitations of visual diffusion models in generating high-resolution content due to training constraints and repetitive artifacts at higher resolutions, this work introduces CineScale, a novel inference paradigm. The method proposes dedicated variants for the two types of video generation architectures to counter the growth of high-frequency information that causes repetitive patterns beyond the training resolution. Experimental results show that CineScale extends higher-resolution generation for both image and video models, covering text-to-image, text-to-video, image-to-video, and video-to-video generation, and notably enables 8k image generation without fine-tuning and 4k video generation with minimal LoRA fine-tuning.
本研究旨在解决使用预训练扩散模型生成高分辨率电影视觉内容的挑战,这些模型通常受限于训练分辨率和计算资源。作者提出了CineScale,一种新颖的推理范式,用于缓解生成超出训练分辨率内容时出现的重复模式和高频误差问题。该方法为不同视频生成架构设计了定制化变体,无需微调即可实现高分辨率的文本到视频、图像到视频和视频到视频合成。实验结果表明,CineScale无需微调即可实现8k图像生成,仅需最小LoRA微调即可实现4k视频生成,显著优于现有基线方法。
Visual Autoregressive Modeling for Instruction-Guided Image Editing
Authors: Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, Tao Mei
First: 2025-08-21T17:59:32+00:00 · Latest: 2025-08-21T17:59:32+00:00
Comments: Source codes and models are available at
https://github.com/HiDream-ai/VAREdit
Abstract
Recent advances in diffusion models have brought remarkable visual fidelity
to instruction-guided image editing. However, their global denoising process
inherently entangles the edited region with the entire image context, leading
to unintended spurious modifications and compromised adherence to editing
instructions. In contrast, autoregressive models offer a distinct paradigm by
formulating image synthesis as a sequential process over discrete visual
tokens. Their causal and compositional mechanism naturally circumvents the
adherence challenges of diffusion-based methods. In this paper, we present
VAREdit, a visual autoregressive (VAR) framework that reframes image editing as
a next-scale prediction problem. Conditioned on source image features and text
instructions, VAREdit generates multi-scale target features to achieve precise
edits. A core challenge in this paradigm is how to effectively condition the
source image tokens. We observe that finest-scale source features cannot
effectively guide the prediction of coarser target features. To bridge this
gap, we introduce a Scale-Aligned Reference (SAR) module, which injects
scale-matched conditioning information into the first self-attention layer.
VAREdit demonstrates significant advancements in both editing adherence and
efficiency. On standard benchmarks, it outperforms leading diffusion-based
methods with a GPT-Balance score more than 30% higher. Moreover, it completes a
512×512 edit in 1.2 seconds, making it 2.2× faster than the
similarly sized UltraEdit. The models are available at
https://github.com/HiDream-ai/VAREdit.
中文标题/摘要
标题:指令引导图像编辑的视觉自回归建模
扩散模型的最新进展为指令引导图像编辑带来了卓越的视觉保真度,但其全局去噪过程本质上将编辑区域与整个图像上下文纠缠,导致意外的伪修改和编辑指令遵循度的妥协。相比之下,自回归模型通过将图像合成构建为离散视觉标记的序列过程,提供了一种独特范式。其因果组合机制天然规避了基于扩散方法的遵循难题。本文提出VAREdit——一种将图像编辑重构为下一尺度预测问题的视觉自回归(VAR)框架。通过源图像特征和文本指令的条件化,VAREdit生成多尺度目标特征以实现精确编辑。该范式的核心挑战在于如何有效条件化源图像标记。我们发现最精细尺度的源特征无法有效指导较粗糙目标特征的预测。为弥合此差距,我们引入了尺度对齐参考(SAR)模块,将尺度匹配的条件信息注入首个自注意力层。VAREdit在编辑遵循度和效率方面均取得显著进展,在标准基准测试中,其GPT平衡分数比领先的扩散方法高出30%以上,且完成512×512编辑仅需1.2秒,比同等规模的UltraEdit快2.2倍。模型详见https://github.com/HiDream-ai/VAREdit。
Summary / 总结
Motivated by the limitations of diffusion models in instruction-guided image editing, which often produce unintended changes due to global denoising, this paper introduces VAREdit, a visual autoregressive framework that treats editing as a sequential, next-scale prediction task. The method conditions on source image features and text instructions, employing a Scale-Aligned Reference module to align multi-scale conditioning for improved guidance. Experimental results show that VAREdit achieves over 30% higher GPT-Balance score than diffusion-based approaches and completes 512x512 image edits in 1.2 seconds, offering both superior adherence to instructions and faster performance.
本研究针对扩散模型在指令引导图像编辑中存在的全局去噪导致意外修改和指令遵循不佳的问题,提出了VAREdit视觉自回归框架,将图像编辑转化为基于离散视觉标记的序列化多尺度预测任务。为解决跨尺度条件化挑战,作者设计了尺度对齐参考模块来实现源图像与目标特征的尺度匹配。实验结果表明,VAREdit在标准基准上比扩散方法获得30%以上的GPT-Balance分数提升,并以1.2秒完成512×512编辑,速度达到同类方法的2.2倍。
SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass
Authors: Yanxu Meng, Haoning Wu, Ya Zhang, Weidi Xie
First: 2025-08-21T17:59:16+00:00 · Latest: 2025-08-21T17:59:16+00:00
Comments: Technical Report; Project Page: https://mengmouxu.github.io/SceneGen
Abstract
3D content generation has recently attracted significant research interest
due to its applications in VR/AR and embodied AI. In this work, we address the
challenging task of synthesizing multiple 3D assets within a single scene
image. Concretely, our contributions are fourfold: (i) we present SceneGen, a
novel framework that takes a scene image and corresponding object masks as
input, simultaneously producing multiple 3D assets with geometry and texture.
Notably, SceneGen operates with no need for optimization or asset retrieval;
(ii) we introduce a novel feature aggregation module that integrates local and
global scene information from visual and geometric encoders within the feature
extraction module. Coupled with a position head, this enables the generation of
3D assets and their relative spatial positions in a single feedforward pass;
(iii) we demonstrate SceneGen's direct extensibility to multi-image input
scenarios. Despite being trained solely on single-image inputs, our
architectural design enables improved generation performance with multi-image
inputs; and (iv) extensive quantitative and qualitative evaluations confirm the
efficiency and robust generation abilities of our approach. We believe this
paradigm offers a novel solution for high-quality 3D content generation,
potentially advancing its practical applications in downstream tasks. The code
and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
中文标题/摘要
标题:场景生成器:单次前馈传递实现单图像3D场景生成
3D内容生成因其在VR/AR和具身智能领域的应用近期引发广泛研究关注。本研究致力于解决从单张场景图像合成多个3D资产这一挑战性任务。具体贡献包括:(i)提出SceneGen新型框架,通过输入场景图像及对应物体掩码,同步生成带几何结构与纹理的多重3D资产,无需优化过程或资产检索;(ii)设计新型特征聚合模块,在特征提取阶段整合视觉与几何编码器的局部与全局场景信息,结合位置预测头实现单次前馈生成3D资产及其相对空间位置;(iii)证明框架可直接扩展至多图像输入场景,尽管仅使用单图像训练,架构设计仍能提升多图像输入的生成性能;(iv)通过大量定量与定性实验验证方法的高效性与强健生成能力。该范式为高质量3D内容生成提供了创新解决方案,有望推动下游任务的实际应用。代码与模型将公开于:https://mengmouxu.github.io/SceneGen
Summary / 总结
This research addresses the challenge of generating multiple 3D assets from a single scene image, motivated by the growing demand for 3D content in VR/AR and embodied AI applications. The authors propose SceneGen, a framework that processes an input image and object masks to simultaneously produce 3D geometries and textures in one feedforward pass without optimization or retrieval. Key innovations include a feature aggregation module that fuses local and global visual and geometric cues, along with a position head to infer spatial relationships. Quantitative and qualitative evaluations confirm that SceneGen generates 3D scenes efficiently and robustly and that, despite single-image training, its generation performance improves with multi-image inputs.
本研究针对从单张场景图像生成多个3D资产的挑战,其动机源于VR/AR和具身AI应用中对3D内容日益增长的需求。作者提出SceneGen框架,通过处理输入图像和物体掩码,无需优化或检索即可同时生成3D几何和纹理。核心创新包括融合局部与全局视觉几何特征的特征聚合模块,以及通过位置头在单次前向传播中推断空间关系。实验结果表明该方法具有高效稳健的生成能力,尽管仅使用单图像训练,但在多图像输入时仍表现出更好的性能。
ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
Authors: Jinhyung Park, Javier Romero, Shunsuke Saito, Fabian Prada, Takaaki Shiratori, Yichen Xu, Federica Bogo, Shoou-I Yu, Kris Kitani, Rawal Khirodkar
Venue: ICCV 2025
First: 2025-08-21T17:58:56+00:00 · Latest: 2025-08-21T17:58:56+00:00
Comments: ICCV 2025; Website: https://jindapark.github.io/projects/atlas/
Abstract
Parametric body models offer an expressive 3D representation of humans across a
wide range of poses, shapes, and facial expressions, typically derived by
learning a basis over registered 3D meshes. However, existing human mesh
modeling approaches struggle to capture detailed variations across diverse body
poses and shapes, largely due to limited training data diversity and
restrictive modeling assumptions. Moreover, the common paradigm first optimizes
the external body surface using a linear basis, then regresses internal
skeletal joints from surface vertices. This approach introduces problematic
dependencies between internal skeleton and outer soft tissue, limiting direct
control over body height and bone lengths. To address these issues, we present
ATLAS, a high-fidelity body model learned from 600k high-resolution scans
captured using 240 synchronized cameras. Unlike previous methods, we explicitly
decouple the shape and skeleton bases by grounding our mesh representation in
the human skeleton. This decoupling enables enhanced shape expressivity,
fine-grained customization of body attributes, and keypoint fitting independent
of external soft-tissue characteristics. ATLAS outperforms existing methods by
fitting unseen subjects in diverse poses more accurately, and quantitative
evaluations show that our non-linear pose correctives more effectively capture
complex poses compared to linear models.
中文标题/摘要
标题:ATLAS:解耦骨骼与形态参数以实现富有表现力的参数化人体建模
参数化人体模型通过基于配准三维网格学习基向量,实现了跨姿态、体型和面部表情的丰富三维表达。然而,现有方法因训练数据多样性不足和建模假设限制,难以捕捉多样体态下的细节变化。传统范式先通过线性基优化体表,再从表面顶点回归内部骨骼关节点,导致骨骼与软组织间存在不良依赖,限制了直接控制身高和骨长的能力。为此,我们提出ATLAS——一个基于240台同步相机采集的60万次高分辨率扫描构建的高保真人体模型。该方法通过将网格表征锚定在人体骨骼上,显式解耦形态与骨骼基向量,从而增强形态表现力、实现细粒度身体属性定制,并支持独立于软组织特征的关键点拟合。定量评估表明,ATLAS能更精准地拟合未知对象的多样姿态,其非线性姿态校正比线性模型更能有效捕捉复杂姿态。
Summary / 总结
Existing parametric human models often fail to capture detailed shape and pose variations due to limited data diversity and the coupling of skeletal and surface parameters, which restricts control over attributes like height and bone length. To address this, ATLAS introduces a high-fidelity model trained on 600k high-resolution scans, explicitly decoupling shape and skeleton by grounding the mesh representation in the human skeleton. Experimental results demonstrate that ATLAS fits unseen subjects in diverse poses more accurately than prior methods, with its non-linear pose correctives outperforming linear models in capturing complex poses.
现有参数化人体模型由于数据多样性有限以及骨骼与表面参数的耦合,难以捕捉细节形状和姿态变化,且限制了身高、骨长等属性的控制。为此,ATLAS基于60万高分辨率扫描数据提出高保真模型,通过将网格表示锚定在人体骨骼上,显式解耦形状与骨骼参数。实验结果表明,ATLAS对未知对象在不同姿态下的拟合精度优于现有方法,其非线性姿态校正比线性模型更有效捕捉复杂姿态。
Discovering Hidden Algebraic Structures via Transformers with Rank-Aware Beam GRPO
Authors: Jaeha Lee, Gio Huh, Ning Su, Tony Yue YU
First: 2025-08-21T17:58:50+00:00 · Latest: 2025-08-21T17:58:50+00:00
Abstract
Recent efforts have extended the capabilities of transformers in logical
reasoning and symbolic computations. In this work, we investigate their
capacity for non-linear latent pattern discovery in the context of functional
decomposition, focusing on the challenging algebraic task of multivariate
polynomial decomposition. This problem, with widespread applications in science
and engineering, has been proven NP-hard and demands both precision and
insight. Our contributions are threefold: First, we develop a synthetic data
generation pipeline providing fine-grained control over problem complexity.
Second, we train transformer models via supervised learning and evaluate them
across four key dimensions involving scaling behavior and generalizability.
Third, we propose Beam Grouped Relative Policy Optimization (BGRPO), a
rank-aware reinforcement learning method suitable for hard algebraic problems.
Finetuning with BGRPO improves accuracy while reducing beam width by up to
half, resulting in approximately 75% lower inference compute. Additionally, our
model demonstrates competitive performance in polynomial simplification,
outperforming Mathematica in various cases.
中文标题/摘要
标题:通过具有秩感知束GRPO的Transformer发现隐藏代数结构
近期研究扩展了Transformer在逻辑推理和符号计算方面的能力。本文探讨了其在函数分解背景下进行非线性潜在模式发现的能力,重点关注多元多项式分解这一具有挑战性的代数任务。该问题在科学与工程领域应用广泛,已被证明是NP难问题,需要精确性与洞察力。我们的贡献有三:首先开发了能精细控制问题复杂度的合成数据生成流程;其次通过监督学习训练Transformer模型,并在涉及扩展行为和泛化能力的四个关键维度进行评估;第三提出了束分组相对策略优化(BGRPO),这是一种适用于困难代数问题的秩感知强化学习方法。使用BGRPO进行微调可在将束宽减少多达一半的同时提升准确率,推理计算量降低约75%。此外,我们的模型在多项式简化任务中展现出竞争优势,在多类案例中表现优于Mathematica。
Summary / 总结
This research aims to enhance transformers' capability for discovering latent algebraic structures, specifically addressing the NP-hard problem of multivariate polynomial decomposition, which has broad applications in science and engineering. The method involves generating synthetic data with controlled complexity, training transformers via supervised learning, and introducing Beam Grouped Relative Policy Optimization (BGRPO), a rank-aware reinforcement learning technique tailored for hard algebraic tasks. Experimental results show that finetuning with BGRPO improves accuracy while reducing beam width by up to half, cutting inference compute by approximately 75%, and that the model is competitive in polynomial simplification, outperforming Mathematica in various cases.
本研究旨在提升Transformer模型在发现隐藏代数结构方面的能力,特别是针对科学和工程中广泛应用的NP难问题——多元多项式分解。作者开发了一个合成数据生成流程以控制问题复杂度,通过监督学习训练Transformer模型,并提出了适用于困难代数问题的排名感知强化学习方法——Beam分组相对策略优化(BGRPO)。实验结果表明,使用BGRPO进行微调在提高精度的同时将束宽度减少多达一半,推理计算量降低约75%,且在多项式简化任务中模型表现优于Mathematica。
Distributed Detection of Adversarial Attacks in Multi-Agent Reinforcement Learning with Continuous Action Space
Authors: Kiarash Kazari, Ezzeldin Shereen, György Dán
First: 2025-08-21T17:58:36+00:00 · Latest: 2025-08-21T17:58:36+00:00
Comments: Accepted for publication at ECAI 2025
Abstract
We address the problem of detecting adversarial attacks against cooperative
multi-agent reinforcement learning with continuous action space. We propose a
decentralized detector that relies solely on the local observations of the
agents and makes use of a statistical characterization of the normal behavior
of observable agents. The proposed detector utilizes deep neural networks to
approximate the normal behavior of agents as parametric multivariate Gaussian
distributions. Based on the predicted density functions, we define a normality
score and provide a characterization of its mean and variance. This
characterization allows us to employ a two-sided CUSUM procedure for detecting
deviations of the normality score from its mean, serving as a detector of
anomalous behavior in real-time. We evaluate our scheme on various multi-agent
PettingZoo benchmarks against different state-of-the-art attack methods, and
our results demonstrate the effectiveness of our method in detecting impactful
adversarial attacks. Particularly, it outperforms the discrete counterpart by
achieving AUC-ROC scores of over 0.95 against the most impactful attacks in all
evaluated environments.
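
The detection step described above, a two-sided CUSUM procedure applied to a normality score, can be sketched as follows. This is a generic textbook CUSUM under stated assumptions, not the authors' implementation: the function name, the `drift` and `threshold` parameters, and the toy score sequence are all hypothetical, and the scores are assumed to have already been computed from the predicted Gaussian densities.

```python
def two_sided_cusum(scores, mean, drift, threshold):
    """Return the first index where either cumulative statistic exceeds threshold, else None.

    scores:    sequence of normality scores observed over time
    mean:      expected score under normal (unattacked) behavior
    drift:     slack subtracted at each step so small fluctuations do not accumulate
    threshold: alarm level for the cumulative statistics
    """
    upper = 0.0  # accumulates evidence that the score has shifted upward
    lower = 0.0  # accumulates evidence that the score has shifted downward
    for t, s in enumerate(scores):
        upper = max(0.0, upper + (s - mean) - drift)
        lower = max(0.0, lower - (s - mean) - drift)
        if upper > threshold or lower > threshold:
            return t
    return None


# Hypothetical scores: the first six steps fluctuate around the mean,
# then the score drops sharply, as it might under an attack.
normal = [0.1, -0.2, 0.05, -0.1, 0.15, 0.0]
attacked = normal + [-1.0, -1.2, -0.9, -1.1]

print(two_sided_cusum(normal, mean=0.0, drift=0.25, threshold=2.0))
print(two_sided_cusum(attacked, mean=0.0, drift=0.25, threshold=2.0))
```

Tracking both statistics is what makes the procedure two-sided: a deviation of the score above or below its mean can raise an alarm. In practice `drift` and `threshold` would be set from the score's mean and variance, which is the role the abstract gives to the characterization of those two quantities.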
中文标题/摘要
标题:连续动作空间多智能体强化学习中对抗攻击的分布式检测
本文研究针对连续动作空间协作式多智能体强化学习的对抗攻击检测问题。提出一种仅依赖智能体局部观测的分布式检测器,利用可观测智能体正常行为的统计特征。该检测器通过深度神经网络将智能体正常行为近似为参数化多元高斯分布。基于预测密度函数定义正态性评分并解析其均值与方差特征,进而采用双端CUSUM算法实时检测正态性评分偏离均值的情况,作为异常行为检测器。在多种多智能体PettingZoo基准测试中针对不同前沿攻击方法进行评估,结果表明本方法能有效检测高影响力对抗攻击,尤其在所有测试环境中对最具影响力攻击的AUC-ROC分数超过0.95,性能显著优于离散对应方案。
Summary / 总结
This research is motivated by the need to detect adversarial attacks in cooperative multi-agent reinforcement learning systems with continuous action spaces. The method proposes a decentralized detector that uses local observations and models normal agent behavior via deep neural networks approximating parametric multivariate Gaussian distributions. A normality score is defined from these distributions, and a two-sided CUSUM procedure detects deviations in real-time. Experimental evaluation on PettingZoo benchmarks against state-of-the-art attacks shows the method effectively detects impactful adversarial attacks, achieving AUC-ROC scores over 0.95 against the most impactful attacks in all evaluated environments and outperforming its discrete counterpart.
本研究旨在检测连续动作空间下协作多智能体强化学习系统中的对抗攻击。方法提出了一种分散式检测器,利用局部观测数据,通过深度神经网络将智能体正常行为建模为参数化多元高斯分布。基于预测密度函数定义正态性评分,并采用双侧CUSUM程序实时检测偏离。在PettingZoo基准测试中针对先进攻击方法的实验表明,该方法能有效检测高影响力对抗攻击,在所有测试环境中AUC-ROC分数均超过0.95,性能优于离散对应方法。
Intern-S1: A Scientific Multimodal Foundation Model
Authors: Lei Bai, Zhongrui Cai, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, Ning Ding, Nanqin Dong, Peijie Dong, Shihan Dou, Sinan Du, Haodong Duan, Caihua Fan, Ben Gao, Changjiang Gao, Jianfei Gao, Songyang Gao, Yang Gao, Zhangwei Gao, Jiaye Ge, Qiming Ge, Lixin Gu, Yuzhe Gu, Aijia Guo, Qipeng Guo, Xu Guo, Conghui He, Junjun He, Yili Hong, Siyuan Hou, Caiyu Hu, Hanglei Hu, Jucheng Hu, Ming Hu, Zhouqi Hua, Haian Huang, Junhao Huang, Xu Huang, Zixian Huang, Zhe Jiang, Lingkai Kong, Linyang Li, Peiji Li, Pengze Li, Shuaibin Li, Tianbin Li, Wei Li, Yuqiang Li, Dahua Lin, Junyao Lin, Tianyi Lin, Zhishan Lin, Hongwei Liu, Jiangning Liu, Jiyao Liu, Junnan Liu, Kai Liu, Kaiwen Liu, Kuikun Liu, Shichun Liu, Shudong Liu, Wei Liu, Xinyao Liu, Yuhong Liu, Zhan Liu, Yinquan Lu, Haijun Lv, Hongxia Lv, Huijie Lv, Qidang Lv, Ying Lv, Chengqi Lyu, Chenglong Ma, Jianpeng Ma, Ren Ma, Runmin Ma, Runyuan Ma, Xinzhu Ma, Yichuan Ma, Zihan Ma, Sixuan Mi, Junzhi Ning, Wenchang Ning, Xinle Pang, Jiahui Peng, Runyu Peng, Yu Qiao, Jiantao Qiu, Xiaoye Qu, Yuan Qu, Yuchen Ren, Fukai Shang, Wenqi Shao, Junhao Shen, Shuaike Shen, Chunfeng Song, Demin Song, Diping Song, Chenlin Su, Weijie Su, Weigao Sun, Yu Sun, Qian Tan, Cheng Tang, Huanze Tang, Kexian Tang, Shixiang Tang, Jian Tong, Aoran Wang, Bin Wang, Dong Wang, Lintao Wang, Rui Wang, Weiyun Wang, Wenhai Wang, Yi Wang, Ziyi Wang, Ling-I Wu, Wen Wu, Yue Wu, Zijian Wu, Linchen Xiao, Shuhao Xing, Chao Xu, Huihui Xu, Jun Xu, Ruiliang Xu, Wanghan Xu, GanLin Yang, Yuming Yang, Haochen Ye, Jin Ye, Shenglong Ye, Jia Yu, Jiashuo Yu, Jing Yu, Fei Yuan, Bo Zhang, Chao Zhang, Chen Zhang, Hongjie Zhang, Jin Zhang, Qiaosheng Zhang, Qiuyinzhe Zhang, Songyang Zhang, Taolin Zhang, Wenlong Zhang, Wenwei Zhang, Yechen Zhang, Ziyang Zhang, Haiteng Zhao, Qian Zhao, Xiangyu Zhao, Xiangyu Zhao, Bowen Zhou, 
Dongzhan Zhou, Peiheng Zhou, Yuhao Zhou, Yunhua Zhou, Dongsheng Zhu, Lin Zhu, Yicheng Zou
First: 2025-08-21T17:58:00+00:00 · Latest: 2025-08-21T17:58:00+00:00
Abstract
In recent years, a plethora of open-source foundation models have emerged,
achieving remarkable progress in some widely followed fields, with performance
being quite close to that of closed-source models. However, in high-value but
more challenging professional scientific fields, either the fields still rely
on expert models, or the progress of general foundation models lags
significantly behind that in popular areas. This progress is far from sufficient
for transforming scientific research and leaves a substantial gap between
open-source and closed-source models in these scientific domains. To
mitigate this gap and explore a step further toward Artificial General
Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped
with general understanding and reasoning capabilities and the expertise to analyze
data from multiple scientific modalities. Intern-S1 is a multimodal Mixture-of-Experts (MoE)
model with 28 billion activated parameters and 241 billion total parameters,
continually pre-trained on 5T tokens, including over 2.5T tokens from
scientific domains. In the post-training stage, Intern-S1 undergoes offline and
then online reinforcement learning (RL) in InternBootCamp, where we propose
Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks
simultaneously. Through integrated innovations in algorithms, data, and
training systems, Intern-S1 achieved top-tier performance in online RL
training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates
competitive performance on general reasoning tasks among open-source models and
significantly outperforms open-source models in scientific domains, surpassing
closed-source state-of-the-art models in professional tasks such as molecular
synthesis planning, reaction condition prediction, and predicting thermodynamic
stabilities for crystals. Our models are available at
https://huggingface.co/internlm/Intern-S1.
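
To put the quoted model and data sizes in proportion, the short calculation below derives two ratios from the figures in the abstract. The variable names are illustrative, and the science-token share is only a lower bound because the abstract says "over 2.5T".

```python
# Figures quoted in the abstract (billions of parameters, trillions of tokens).
ACTIVATED_PARAMS_B = 28
TOTAL_PARAMS_B = 241
PRETRAIN_TOKENS_T = 5.0
SCIENCE_TOKENS_T = 2.5  # stated as "over 2.5T", so treated here as a lower bound

# Share of the parameters that is active for a given input in the MoE model.
activated_fraction = ACTIVATED_PARAMS_B / TOTAL_PARAMS_B

# Minimum share of continual pre-training tokens drawn from scientific domains.
science_fraction_lower_bound = SCIENCE_TOKENS_T / PRETRAIN_TOKENS_T

print(f"activated share of parameters: {activated_fraction:.1%}")
print(f"science share of tokens (at least): {science_fraction_lower_bound:.0%}")
```

Roughly one parameter in nine is active per input, and at least half of the continual pre-training tokens come from scientific domains.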
中文标题/摘要
标题:Intern-S1:科学多模态基础模型
近年来,大量开源基础模型涌现,在部分广受关注的领域取得显著进展,性能已十分接近闭源模型。然而,在高价值但更具挑战性的科学专业领域,这些领域仍依赖专家模型,或通用基础模型的进展远落后于热门领域,远不足以变革科学研究,且开源模型与闭源模型在这些科学领域存在巨大差距。为缩小这一差距并探索迈向通用人工智能(AGI)的下一步,我们推出Intern-S1——一个具备通用理解与推理能力,并能分析多科学模态数据的专业通才模型。Intern-S1是多模态混合专家(MoE)模型,拥有280亿激活参数和2410亿总参数,基于5T token(其中包含超过2.5T科学领域token)进行持续预训练。在后训练阶段,该模型通过InternBootCamp先后进行离线和在线强化学习(RL),我们提出混合奖励机制(MoR)以协同千余项任务的RL训练。通过算法、数据和训练系统的集成创新,Intern-S1在在线RL训练中达到顶尖性能。在综合评估基准上,该模型在开源模型中展现通用推理任务的竞争力,并在科学领域显著优于开源模型,在分子合成规划、反应条件预测、晶体热力学稳定性预测等专业任务中超越闭源最先进模型。模型详见:https://huggingface.co/internlm/Intern-S1。
Summary / 总结
Motivated by the significant performance gap between open-source and closed-source foundation models in scientific domains, this work introduces Intern-S1, a multimodal Mixture-of-Experts model, to advance capabilities in professional scientific tasks and contribute to AGI development. The method involves continual pre-training on a large-scale scientific corpus (over 2.5T tokens) and a novel post-training stage using offline and online reinforcement learning with a proposed Mixture-of-Rewards approach to synergize training across more than 1000 diverse tasks. Experimental results show that Intern-S1 achieves top-tier performance in online RL training, demonstrates competitive reasoning on general tasks among open-source models, and significantly outperforms existing open-source models in scientific domains—even surpassing closed-source state-of-the-art models in specialized tasks like molecular synthesis planning, reaction condition prediction, and crystal stability prediction.
该研究旨在解决开源与闭源基础模型在科学领域存在的显著性能差距,以推动科学研究转型和通用人工智能发展。方法上提出了Intern-S1,一个多模态混合专家模型,拥有280亿激活参数,基于包含科学数据的5万亿token进行持续预训练,并通过离线与在线强化学习及混合奖励机制在1000多个任务上协同优化。实验结果表明,Intern-S1在通用推理任务上达到开源模型竞争水平,在科学领域显著优于开源模型,并在分子合成规划、晶体稳定性预测等专业任务上超越了闭源最先进模型。
Waver: Wave Your Way to Lifelike Video Generation
Authors: Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Zehuan Yuan, Bingyue Peng
First: 2025-08-21T17:56:10+00:00 · Latest: 2025-08-21T17:56:10+00:00
Abstract
We present Waver, a high-performance foundation model for unified image and
video generation. Waver can directly generate videos with durations ranging
from 5 to 10 seconds at a native resolution of 720p, which are subsequently
upscaled to 1080p. The model simultaneously supports text-to-video (T2V),
image-to-video (I2V), and text-to-image (T2I) generation within a single,
integrated framework. We introduce a Hybrid Stream DiT architecture to enhance
modality alignment and accelerate training convergence. To ensure training data
quality, we establish a comprehensive data curation pipeline and manually
annotate and train an MLLM-based video quality model to filter for the
highest-quality samples. Furthermore, we provide detailed training and
inference recipes to facilitate the generation of high-quality videos. Building
on these contributions, Waver excels at capturing complex motion, achieving
superior motion amplitude and temporal consistency in video synthesis. Notably,
it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial
Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming
existing open-source models and matching or surpassing state-of-the-art
commercial solutions. We hope this technical report will help the community
more efficiently train high-quality video generation models and accelerate
progress in video generation technologies. Official page:
https://github.com/FoundationVision/Waver.
中文标题/摘要
标题:Waver:以波动方式实现逼真视频生成
我们推出Waver,一个用于统一图像与视频生成的高性能基础模型。该模型可直接生成本地分辨率为720p、时长5至10秒的视频,并支持上采样至1080p。通过单一集成框架同步支持文本生成视频(T2V)、图像生成视频(I2V)及文本生成图像(T2I)功能。我们采用混合流DiT架构以增强模态对齐并加速训练收敛。为确保训练数据质量,建立了完整的数据筛选流程,并人工标注训练基于MLLM的视频质量评估模型来筛选最优样本。此外,提供详细的训练与推理方案以促进高质量视频生成。基于这些创新,Waver在捕捉复杂运动、实现卓越运动幅度与时间一致性方面表现突出,在Artificial Analysis平台的T2V和I2V排行榜均位列前三(数据截至2025年7月30日北京时间10时),持续超越现有开源模型并媲美或领先业界商业方案。本技术报告旨在助力社区更高效训练高质量视频生成模型,加速视频技术发展。官方页面:https://github.com/FoundationVision/Waver。
Summary / 总结
This research aims to develop a unified foundation model for high-quality image and video generation, addressing the need for a single framework capable of text-to-video, image-to-video, and text-to-image tasks. The method introduces a Hybrid Stream DiT architecture to improve modality alignment and training convergence, alongside a comprehensive data curation pipeline that uses a manually annotated MLLM-based video quality model to filter training samples. Experimental results demonstrate that Waver generates 5-10 second videos at 720p (upscaled to 1080p) with complex motion, superior motion amplitude, and temporal consistency, ranking among the top 3 on T2V and I2V leaderboards and outperforming open-source models while matching or surpassing commercial solutions.
本研究旨在开发一个统一的图像与视频生成基础模型,以解决单一框架生成长时长、高分辨率且具有复杂运动视频的需求。方法上引入了混合流DiT架构以增强模态对齐并加速训练收敛,同时建立了全面的数据筛选流程,利用基于MLLM的视频质量模型过滤训练样本。实验结果表明,Waver能够生成5至10秒、原生720p(可升级至1080p)的视频,具有卓越的运动幅度和时间一致性,在T2V和I2V排行榜中均位列前三,性能匹配或超越了当前最先进的商业解决方案。
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Authors: Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song
First: 2025-08-21T17:55:54+00:00 · Latest: 2025-08-21T17:55:54+00:00
Abstract
Tool calling has emerged as a critical capability for AI agents to interact
with the real world and solve complex tasks. While the Model Context Protocol
(MCP) provides a powerful standardized framework for tool integration, there is
a significant gap in benchmarking how well AI agents can effectively solve
multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In
this work, we present LiveMCP-101, a benchmark of 101 carefully curated
real-world queries, refined through iterative LLM rewriting and manual review,
that require coordinated use of multiple MCP tools including web search, file
operations, mathematical reasoning, and data analysis. Moreover, we introduce a
novel evaluation approach that leverages ground-truth execution plans rather
than raw API outputs, better reflecting the evolving nature of real-world
environments. Experiments show that even frontier LLMs achieve a success rate
below 60%, highlighting major challenges in tool orchestration. Detailed
ablations and error analysis further reveal distinct failure modes and
inefficiencies in token usage, pointing to concrete directions for advancing
current models. LiveMCP-101 sets a rigorous standard for evaluating real-world
agent capabilities, advancing toward autonomous AI systems that reliably
execute complex tasks through tool use.
中文标题/摘要
标题:LiveMCP-101:对支持MCP的智能体在挑战性查询下的压力测试与诊断
工具调用已成为AI智能体与现实世界交互并解决复杂任务的关键能力。虽然模型上下文协议(MCP)为工具集成提供了强大的标准化框架,但在基准测试AI智能体如何在真实动态场景中有效利用多样化MCP工具解决多步骤任务方面存在显著空白。本研究推出LiveMCP-101基准测试,包含101个精心筛选的真实查询(经过迭代式LLM重写和人工审核),需要协调使用包括网络搜索、文件操作、数学推理和数据分析在内的多种MCP工具。此外,我们引入了一种新颖的评估方法,该方法基于真实执行计划而非原始API输出,更好地反映了现实环境的动态特性。实验表明,即使前沿LLMs的成功率也低于60%,突显了工具协调方面的重大挑战。详细的消融实验和错误分析进一步揭示了不同的故障模式和令牌使用效率低下问题,为改进现有模型指明了具体方向。LiveMCP-101为评估真实场景下的智能体能力设立了严格标准,推动通过工具使用可靠执行复杂任务的自主AI系统发展。
Summary / 总结
The research addresses the lack of realistic benchmarks for evaluating AI agents' ability to solve multi-step tasks using diverse tools via the Model Context Protocol (MCP). The authors introduce LiveMCP-101, a benchmark of 101 real-world queries that require coordinated use of multiple MCP tools, and propose an evaluation method based on ground-truth execution plans rather than raw API outputs. Experimental results show that even frontier LLMs achieve a success rate below 60%, and ablations and error analysis reveal distinct failure modes in tool orchestration as well as inefficient token usage.
该研究针对当前缺乏评估AI代理通过模型上下文协议(MCP)使用多样化工具解决多步骤任务能力的现实基准的问题。作者提出了LiveMCP-101基准,包含101个需要协调使用网络搜索、文件操作和数据分析等多种MCP工具的真实查询,并采用基于真实执行计划的新评估方法以更好反映动态环境。实验结果表明,即使最先进的LLM成功率也低于60%,详细错误分析揭示了工具协调和token使用效率方面的重大挑战,为模型改进提供了具体方向。
Language-Guided Tuning: Enhancing Numeric Optimization with Textual Feedback
Authors: Yuxing Lu, Yucheng Hu, Nan Sun, Xukai Zhao
First: 2025-08-21T17:55:07+00:00 · Latest: 2025-08-21T17:55:07+00:00
Comments: 9 pages, 4 figures, 4 tables
Abstract
Configuration optimization remains a critical bottleneck in machine learning,
requiring coordinated tuning across model architecture, training strategy,
feature engineering, and hyperparameters. Traditional approaches treat these
dimensions independently and lack interpretability, while recent automated
methods struggle with dynamic adaptability and semantic reasoning about
optimization decisions. We introduce Language-Guided Tuning (LGT), a novel
framework that employs multi-agent Large Language Models to intelligently
optimize configurations through natural language reasoning. We apply textual
gradients - qualitative feedback signals that complement numerical optimization
by providing semantic understanding of training dynamics and configuration
interdependencies. LGT coordinates three specialized agents: an Advisor that
proposes configuration changes, an Evaluator that assesses progress, and an
Optimizer that refines the decision-making process, creating a self-improving
feedback loop. Through comprehensive evaluation on six diverse datasets, LGT
demonstrates substantial improvements over traditional optimization methods,
achieving performance gains while maintaining high interpretability.
中文标题/摘要
标题:语言引导调优:通过文本反馈增强数值优化
配置优化仍是机器学习中的关键瓶颈,需在模型架构、训练策略、特征工程和超参数等方面进行协调调优。传统方法独立处理这些维度且缺乏可解释性,而近期自动化方法难以实现动态适应性及对优化决策的语义推理。我们提出语言引导调优(LGT)这一新颖框架,利用多智能体大语言模型通过自然语言推理智能优化配置。该方法采用文本梯度——通过提供对训练动态和配置相互依赖关系的语义理解来补充数值优化的定性反馈信号。LGT协调三个专业智能体:提出配置变更建议的顾问、评估进展的评估器,以及优化决策过程的优化器,形成自我改进的反馈循环。在六个多样化数据集上的综合评估表明,LGT相较传统优化方法实现显著提升,在保持高可解释性的同时获得性能增益。
Summary / 总结
The research addresses the limitations of traditional and automated configuration optimization methods in machine learning, which often lack interpretability, dynamic adaptability, and semantic reasoning. The authors propose Language-Guided Tuning (LGT), a framework that uses multi-agent Large Language Models to optimize configurations through natural language reasoning, incorporating textual gradients as qualitative feedback to understand training dynamics and interdependencies. LGT employs three specialized agents—Advisor, Evaluator, and Optimizer—to form a self-improving feedback loop. Experimental results on six diverse datasets show that LGT significantly outperforms traditional optimization methods, achieving higher performance while maintaining strong interpretability.
该研究针对机器学习中传统和自动化配置优化方法在可解释性、动态适应性和语义推理方面的不足,提出了语言引导调优(LGT)框架,利用多智能体大语言模型通过自然语言推理优化配置,并引入文本梯度作为定性反馈以理解训练动态和配置相互依赖关系。LGT协调三个专门智能体——建议者、评估者和优化者——形成自我改进的反馈循环。在六个不同数据集上的实验结果表明,LGT显著优于传统优化方法,在保持高可解释性的同时实现了性能提升。
Neural Robot Dynamics
Authors: Jie Xu, Eric Heiden, Iretiayo Akinola, Dieter Fox, Miles Macklin, Yashraj Narang
First: 2025-08-21T17:54:41+00:00 · Latest: 2025-08-21T17:54:41+00:00
Abstract
Accurate and efficient simulation of modern robots remains challenging due to
their high degrees of freedom and intricate mechanisms. Neural simulators have
emerged as a promising alternative to traditional analytical simulators,
capable of efficiently predicting complex dynamics and adapting to real-world
data; however, existing neural simulators typically require
application-specific training and fail to generalize to novel tasks and/or
environments, primarily due to inadequate representations of the global state.
In this work, we address the problem of learning generalizable neural
simulators for robots that are structured as articulated rigid bodies. We
propose NeRD (Neural Robot Dynamics), learned robot-specific dynamics models
for predicting future states for articulated rigid bodies under contact
constraints. NeRD uniquely replaces the low-level dynamics and contact solvers
in an analytical simulator and employs a robot-centric and spatially-invariant
simulation state representation. We integrate the learned NeRD models as an
interchangeable backend solver within a state-of-the-art robotics simulator. We
conduct extensive experiments to show that the NeRD simulators are stable and
accurate over a thousand simulation steps; generalize across tasks and
environment configurations; enable policy learning exclusively in a neural
engine; and, unlike most classical simulators, can be fine-tuned from
real-world data to bridge the gap between simulation and reality.
中文标题/摘要
标题:神经机器人动力学
现代机器人因其高自由度和复杂机械结构,实现精确高效的仿真仍具挑战。神经仿真器作为传统解析仿真器的有前景替代方案,能高效预测复杂动力学并适配真实数据;然而现有神经仿真器通常需针对特定应用训练,且因全局状态表征不足难以泛化至新任务或环境。本研究针对铰接式刚体机器人,提出学习可泛化神经仿真器的方法。我们推出NeRD(神经机器人动力学)——通过学习的机器人专属动力学模型,预测接触约束下铰接式刚体的未来状态。NeRD创新性地取代解析仿真器中的底层动力学与接触求解器,采用以机器人为中心且空间不变的仿真状态表征。我们将学习到的NeRD模型作为可互换后端求解器集成至先进机器人仿真器中。大量实验表明:NeRD仿真器在千步仿真中保持稳定精确;跨任务和环境配置泛化能力强;支持纯神经引擎策略学习;且与多数经典仿真器不同,可通过真实数据微调以弥合仿真与现实差距。
Summary / 总结
Accurate and efficient simulation of modern robots remains challenging because of their high degrees of freedom and intricate mechanisms. This work introduces NeRD, learned robot-specific dynamics models that replace the low-level dynamics and contact solvers of an analytical simulator and use a robot-centric, spatially-invariant state representation to predict the future states of articulated rigid bodies under contact constraints. Experiments show that NeRD simulators remain stable and accurate over a thousand simulation steps, generalize across tasks and environment configurations, support policy learning entirely in a neural engine, and can be fine-tuned on real-world data to narrow the sim-to-real gap.
高自由度机器人的精确高效仿真因复杂动力学和接触约束而具有挑战性。本研究提出NeRD(神经机器人动力学),一种学习型机器人专用动力学模型,采用以机器人为中心且空间不变的仿真状态表示,替代传统解析仿真器中的底层求解器。实验结果表明,NeRD仿真器在长期仿真中保持稳定性和准确性,能够跨任务和环境泛化,支持完全在神经引擎中进行策略学习,并可通过真实数据微调以缩小仿真与现实的差距。
Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis
Authors: Yufeng Zhao, Junnan Liu, Hongwei Liu, Dongsheng Zhu, Yuan Shen, Songyang Zhang, Kai Chen
First: 2025-08-21T17:50:24+00:00 · Latest: 2025-08-21T17:50:24+00:00
Comments: Preprint, work in progress
Abstract
Large Language Models (LLMs) have made significant strides in reasoning tasks
through methods like chain-of-thought (CoT) reasoning. However, they often fall
short in tasks requiring precise computations. Tool-Integrated Reasoning (TIR)
has emerged as a solution by incorporating external tools into the reasoning
process. Nevertheless, how well TIR generalizes in improving the reasoning
ability of LLMs is still unclear. Additionally, whether TIR improves a
model's reasoning behavior and helps it think remains to be studied. We
introduce ReasonZoo, a comprehensive benchmark encompassing nine diverse
reasoning categories, to evaluate the effectiveness of TIR across various
domains. Additionally, we propose two novel metrics, Performance-Aware Cost
(PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning
efficiency. Our empirical evaluation demonstrates that TIR-enabled models
consistently outperform their non-TIR counterparts in both mathematical and
non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as
evidenced by improved PAC and AUC-PCC, indicating reduced overthinking and more
streamlined reasoning. These findings underscore the domain-general benefits of
TIR and its potential to advance LLM capabilities in complex reasoning tasks.
中文标题/摘要
标题:剖析工具集成推理:一项实证研究与分析
大型语言模型(LLMs)通过思维链(CoT)推理等方法在推理任务上取得显著进展,但在需要精确计算的任务中常显不足。工具集成推理(TIR)通过将外部工具融入推理过程应运而生。然而,TIR在提升LLM推理能力方面的泛化性尚不明确,其是否改善了模型的推理行为及辅助模型思考仍有待研究。我们推出ReasonZoo——一个涵盖九大推理类别的综合基准,以评估TIR在不同领域的有效性。同时提出性能感知成本(PAC)和性能-成本曲线下面积(AUC-PCC)两项新指标来评估推理效率。实证评估表明,启用TIR的模型在数学和非数学任务中均持续优于非TIR模型。此外,TIR通过提升PAC和AUC-PCC值证明了其提高推理效率的能力,表现为减少过度思考并优化推理流程。这些发现凸显了TIR的领域通用优势及其在推进LLM复杂推理任务能力方面的潜力。
Summary / 总结
This study investigates the generalization and behavioral impact of Tool-Integrated Reasoning (TIR) in large language models, motivated by the limitations of methods like chain-of-thought in tasks requiring precise computation. The authors introduce ReasonZoo, a benchmark covering nine reasoning categories, and propose two efficiency metrics—Performance-Aware Cost and Area Under the Performance-Cost Curve—to evaluate TIR. Experimental results show that TIR-enabled models outperform non-TIR models across mathematical and non-mathematical tasks, with improved efficiency metrics indicating reduced overthinking and more streamlined reasoning processes.
本研究探讨了工具集成推理(TIR)在大语言模型中的泛化能力和行为影响,动机在于思维链推理在精确计算任务中的不足。作者提出了涵盖九个推理类别的基准测试ReasonZoo,并设计了性能感知成本和性能-成本曲线下面积两个效率指标来评估TIR。实验结果表明,启用TIR的模型在数学和非数学任务中均优于非TIR模型,效率提升表现为减少过度思考并简化推理过程。
"Does the cafe entrance look accessible? Where is the door?" Towards Geospatial AI Agents for Visual Inquiries
Authors: Jon E. Froehlich, Jared Hwang, Zeyu Wang, John S. O'Meara, Xia Su, William Huang, Yang Zhang, Alex Fiannaca, Philip Nelson, Shaun Kane
Venue: ICCV
First: 2025-08-21T17:49:52+00:00 · Latest: 2025-08-21T17:49:52+00:00
Comments: Accepted to the ICCV'25 Workshop "Vision Foundation Models and
Generative AI for Accessibility: Challenges and Opportunities"
Abstract
Interactive digital maps have revolutionized how people travel and learn
about the world; however, they rely on pre-existing structured data in GIS
databases (e.g., road networks, POI indices), limiting their ability to address
geo-visual questions related to what the world looks like. We introduce our
vision for Geo-Visual Agents--multimodal AI agents capable of understanding and
responding to nuanced visual-spatial inquiries about the world by analyzing
large-scale repositories of geospatial images, including streetscapes (e.g.,
Google Street View), place-based photos (e.g., TripAdvisor, Yelp), and aerial
imagery (e.g., satellite photos) combined with traditional GIS data sources. We
define our vision, describe sensing and interaction approaches, provide three
exemplars, and enumerate key challenges and opportunities for future work.
中文标题/摘要
标题:“咖啡馆入口看起来是否便利?门在哪里?”——探索面向视觉查询的地理空间AI智能体
交互式数字地图彻底改变了人们的出行与认知世界的方式,但其依赖GIS数据库中预存的结构化数据(如道路网络、POI索引),限制了处理涉及世界外观的地理视觉问题的能力。我们提出了地理视觉智能体的构想——这是一种多模态AI智能体,能够通过分析大规模地理空间图像库(包括街景图像如Google Street View、基于地点的照片如TripAdvisor和Yelp、航空影像如卫星照片)并结合传统GIS数据源,来理解并回应关于世界的精细视觉空间查询。我们阐述了这一构想,描述了感知与交互方法,提供了三个范例,并列举了未来工作的关键挑战与机遇。
Summary / 总结
Interactive digital maps are constrained by reliance on pre-existing structured GIS data, limiting their ability to answer nuanced visual-spatial questions about real-world appearances. To address this, the authors propose Geo-Visual Agents—multimodal AI systems that analyze diverse geospatial imagery (such as street views, place photos, and aerial imagery) alongside traditional GIS data to interpret and respond to visual inquiries. The paper outlines the vision, sensing and interaction methods, presents three exemplars, and discusses key challenges and future opportunities for such agents.
交互式数字地图受限于对现有结构化GIS数据的依赖,难以回答有关现实世界场景的细致视觉空间问题。为此,作者提出地理视觉智能体——一种多模态人工智能系统,通过分析多样化的地理空间图像(如街景、地点照片和航拍影像)并结合传统GIS数据,以理解和响应视觉查询。论文阐述了其愿景、感知与交互方法,并提供了三个示例,同时指出了未来研究的关键挑战与机遇。
Fine-grained Multi-class Nuclei Segmentation with Molecular-empowered All-in-SAM Model
Authors: Xueyuan Li, Can Cui, Ruining Deng, Yucheng Tang, Quan Liu, Tianyuan Yao, Shunxing Bao, Naweed Chowdhury, Haichun Yang, Yuankai Huo
First: 2025-08-21T17:49:21+00:00 · Latest: 2025-08-21T17:49:21+00:00
Comments: 25 pages, 3 figures, accepted by Journal of Medical Imaging
Abstract
Purpose: Recent developments in computational pathology have been driven by
advances in Vision Foundation Models, particularly the Segment Anything Model
(SAM). This model facilitates nuclei segmentation through two primary methods:
prompt-based zero-shot segmentation and the use of cell-specific SAM models for
direct segmentation. These approaches enable effective segmentation across a
range of nuclei and cells. However, general vision foundation models often face
challenges with fine-grained semantic segmentation, such as identifying
specific nuclei subtypes or particular cells. Approach: In this paper, we
propose the molecular-empowered All-in-SAM Model to advance computational
pathology by leveraging the capabilities of vision foundation models. This
model incorporates a full-stack approach, focusing on: (1) annotation: engaging
lay annotators through molecular-empowered learning to reduce the need for
detailed pixel-level annotations, (2) learning: adapting the SAM model to
emphasize specific semantics with a SAM adapter while retaining its strong
generalizability, and (3) refinement: enhancing segmentation accuracy by
integrating Molecular-Oriented Corrective Learning (MOCL). Results: Experimental results
from both in-house and public datasets show that the All-in-SAM model
significantly improves cell classification performance, even when faced with
varying annotation quality. Conclusions: Our approach not only reduces the
workload for annotators but also extends the accessibility of precise
biomedical image analysis to resource-limited settings, thereby advancing
medical diagnostics and automating pathology image analysis.
中文标题/摘要
标题:分子赋能全栈SAM模型实现细粒度多类细胞核分割
目的:计算病理学的最新进展得益于视觉基础模型(尤其是Segment Anything Model/SAM)的突破。该模型通过两种主要方式实现细胞核分割:基于提示的零样本分割和采用细胞特异性SAM模型进行直接分割。这些方法能有效分割多种细胞核与细胞,但通用视觉基础模型在细粒度语义分割(如识别特定细胞核亚型或特殊细胞)方面仍存在挑战。方法:本文提出分子赋能的全栈SAM模型,通过整合三大核心策略推进计算病理学发展:(1)标注环节——通过分子赋能学习动员非专业标注者参与,降低像素级精细标注需求;(2)学习环节——通过SAM适配器调整模型以聚焦特定语义,充分利用其强泛化能力;(3)优化环节——融合分子导向校正学习(MOCL)提升分割精度。结果:在自建与公开数据集上的实验表明,全栈SAM模型在不同标注质量条件下均显著提升细胞分类性能。结论:该方法不仅减轻标注者工作量,更将精准生物医学图像分析推广至资源有限场景,推动医学诊断与病理图像分析自动化进程。
Summary / 总结
The research addresses the limitations of general vision foundation models such as SAM in fine-grained semantic segmentation of nuclei subtypes, which is critical for computational pathology. The proposed molecular-empowered All-in-SAM Model takes a full-stack approach: engaging lay annotators through molecular-empowered learning to reduce reliance on pixel-level annotations, adapting SAM to specific semantics via a SAM adapter, and refining segmentation through Molecular-Oriented Corrective Learning (MOCL). Experiments on in-house and public datasets show significant improvements in cell classification performance even with varying annotation quality, which reduces annotator workload and makes precise analysis more accessible in resource-limited settings.
本研究针对计算病理学中通用视觉基础模型(如SAM)在细粒度细胞核亚型语义分割方面的局限性,提出了分子赋能的全能SAM模型。该方法采用全栈策略,结合标注参与学习减少像素级标注需求、通过SAM适配器实现语义适应,并利用分子导向校正学习进行精细化改进。在内部和公共数据集上的实验表明,该模型在不同标注质量下均显著提升了细胞分类性能。
End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
Authors: Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang, Weidi Xie
First: 2025-08-21T17:42:47+00:00 · Latest: 2025-08-21T17:42:47+00:00
Comments: 35 pages, 5 figures, 3 tables
Abstract
Accurate diagnosis with medical large language models is hindered by
knowledge gaps and hallucinations. Retrieval and tool-augmented methods help,
but their impact is limited by weak use of external knowledge and poor
feedback-reasoning traceability. To address these challenges, we introduce
Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement
learning (RL) that enables steerable, traceable retrieval-augmented reasoning for
medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical
retrieval corpus comprising patient records and reliable medical knowledge
sources to support retrieval-aware reasoning across diagnostic scenarios. More
crucially, we frame the LLM as the core agent and the retrieval corpus as its
environment, using tailored rewards on format, retrieval, reasoning structure,
and diagnostic accuracy, thereby evolving the agentic RAG policy from
large-scale data through RL.
Experiments demonstrate that our end-to-end agentic RL training framework
consistently outperforms prompt-engineering and training-free RAG approaches
across multiple data centers. After training, Deep-DxSearch achieves
substantial gains in diagnostic accuracy, surpassing strong diagnostic
baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks
for both common and rare disease diagnosis under in-distribution and
out-of-distribution settings. Moreover, ablation studies on reward design and
retrieval corpus components confirm their critical roles, underscoring the
uniqueness and effectiveness of our approach compared with traditional
implementations. Finally, case studies and interpretability analyses highlight
improvements in Deep-DxSearch's diagnostic policy, providing deeper insight
into its performance gains and supporting clinicians in delivering more
reliable and precise preliminary diagnoses. See
https://github.com/MAGIC-AI4Med/Deep-DxSearch.
中文标题/摘要
标题:端到端可追溯诊断推理的智能体RAG系统训练
医学大语言模型的精准诊断受限于知识断层与幻觉现象。检索与工具增强方法虽有效,但外部知识利用不足及反馈推理可追溯性差制约其效果。为此,我们推出Deep-DxSearch——基于强化学习(RL)端到端训练的智能体RAG系统,实现可引导的检索增强式医学诊断推理。该系统首先构建包含病历与可靠医学知识源的大规模检索语料库,支持跨诊断场景的检索感知推理。更关键的是,我们将LLM设为核心智能体,检索语料库作为环境,通过格式、检索、推理结构和诊断准确性的定制化奖励,利用RL从大规模数据中演化智能体RAG策略。实验表明,该端到端RL训练框架在多个数据中心持续优于提示工程和无训练RAG方法。训练后,Deep-DxSearch在分布内外场景下对常见与罕见疾病的诊断准确率显著提升,超越GPT-4o、DeepSeek-R1等强基线及医学专用框架。奖励设计与检索语料库的消融研究证实其关键作用,凸显相较于传统方案的独特优势。案例研究与可解释性分析揭示了诊断策略的改进,为性能提升提供深层洞见,辅助临床医生提供更可靠精准的初步诊断。详见https://github.com/MAGIC-AI4Med/Deep-DxSearch。
Summary / 总结
This research addresses the limitations of medical large language models in diagnostic reasoning, particularly knowledge gaps, hallucinations, and poor traceability in retrieval-augmented methods. The authors propose Deep-DxSearch, an end-to-end agentic RAG system trained with reinforcement learning, which frames the LLM as an agent interacting with a large-scale medical retrieval corpus and uses tailored rewards for format, retrieval, reasoning structure, and diagnostic accuracy. Experimental results show that the system outperforms prompt-engineering and training-free RAG approaches across multiple data centers, achieving substantial gains in diagnostic accuracy over strong baselines like GPT-4o and DeepSeek-R1 for both common and rare diseases in in-distribution and out-of-distribution settings, with ablation studies confirming the critical roles of reward design and retrieval corpus components.
本研究针对医疗大语言模型在诊断推理中的知识缺口、幻觉问题以及检索增强方法可追溯性差的局限性,提出了Deep-DxSearch系统,采用端到端强化学习训练代理式检索增强生成框架,将LLM作为核心代理与构建的大规模医疗检索语料库交互,并通过格式、检索、推理结构和诊断准确性等多维度奖励进行优化。实验结果表明,该系统在多个数据中心均优于提示工程和无训练RAG方法,诊断准确性超过GPT-4o和DeepSeek-R1等基线模型,在分布内和分布外场景下对常见和罕见疾病均表现更优,消融研究验证了奖励设计和检索语料库的关键作用。
Probability Density from Latent Diffusion Models for Out-of-Distribution Detection
Authors: Joonas Järve, Karl Kaspar Haavel, Meelis Kull
First: 2025-08-21T17:27:35+00:00 · Latest: 2025-08-21T17:27:35+00:00
Comments: ECAI 2025
Abstract
Despite rapid advances in AI, safety remains the main bottleneck to deploying
machine-learning systems. A critical safety component is out-of-distribution
detection: given an input, decide whether it comes from the same distribution
as the training data. In generative models, the most natural OOD score is the
data likelihood. Actually, under the assumption of uniformly distributed OOD
data, the likelihood is even the optimal OOD detector, as we show in this work.
However, earlier work reported that likelihood often fails in practice, raising
doubts about its usefulness. We explore whether, in practice, the
representation space also suffers from the inability to learn good density
estimation for OOD detection, or if it is merely a problem of the pixel space
typically used in generative models. To test this, we trained a Variational
Diffusion Model not on images, but on the representation space of a pre-trained
ResNet-18 to assess the performance of our likelihood-based detector in
comparison to state-of-the-art methods from the OpenOOD suite.
中文标题/摘要
标题:基于隐扩散模型的概率密度用于分布外检测
尽管人工智能快速发展,安全性仍是部署机器学习系统的主要瓶颈。关键的安全组件是分布外检测:给定输入,判断其是否来自与训练数据相同的分布。在生成模型中,最自然的OOD评分是数据似然。实际上,在OOD数据均匀分布的假设下,似然甚至是最优的OOD检测器,正如本研究所证明。然而,早期研究指出似然在实践中常失效,引发对其有效性的质疑。我们探讨在实践中,表征空间是否也因无法学习良好的密度估计而影响OOD检测,抑或这只是生成模型中通常使用的像素空间问题。为验证此点,我们训练了一个变分扩散模型,并非基于图像,而是基于预训练ResNet-18的表征空间,以评估基于似然的检测器性能,并与OpenOOD套件中的先进方法进行比较。
Summary / 总结
This work addresses the critical safety challenge of out-of-distribution (OOD) detection in machine learning systems, motivated by the need to reliably identify inputs that deviate from the training distribution. The authors show that data likelihood is the optimal OOD detector under the assumption of uniformly distributed OOD data, then ask whether the practical failures of likelihood reported in earlier work also arise in representation space or are specific to the pixel space typically used by generative models. To test this, they train a Variational Diffusion Model on the representation space of a pre-trained ResNet-18 rather than on images, and compare the resulting likelihood-based detector against state-of-the-art methods from the OpenOOD suite.
本研究针对机器学习系统中可靠分布外检测的关键安全需求展开。作者从理论上证明,在均匀分布外数据假设下,数据似然是最优的检测器,反驳了先前关于其实际应用失败的报道。为探究性能不佳是源于像素空间限制而非密度估计本身的问题,他们在预训练ResNet-18的表征空间(而非原始图像)上训练变分扩散模型。通过OpenOOD基准测试的实验表明,潜在空间的似然检测方法相比当前最先进的分布外检测方法具有竞争力或更优性能。
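The optimality claim in the abstract above can be motivated by a standard Neyman-Pearson argument. The following is a short sketch under assumed notation, not the paper's own proof: $p_{\mathrm{in}}$ is the in-distribution density, $p_{\mathrm{out}}$ the OOD density, $V$ the volume of the bounded domain on which OOD data is assumed uniform, and $\tau$ a decision threshold.

```latex
\[
\Lambda(x) \;=\; \frac{p_{\mathrm{in}}(x)}{p_{\mathrm{out}}(x)}
\;=\; V \, p_{\mathrm{in}}(x)
\qquad \text{when } p_{\mathrm{out}}(x) = \frac{1}{V},
\]
\[
\Lambda(x) \ge \tau
\;\iff\;
p_{\mathrm{in}}(x) \ge \frac{\tau}{V}.
\]
```

Since the likelihood-ratio test is the most powerful test at any fixed false-positive rate, and under a uniform OOD density the ratio is a constant multiple of $p_{\mathrm{in}}(x)$, thresholding the likelihood gives the same decisions as the likelihood-ratio test.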
Exploring the Landscape of Non-Equilibrium Memories with Neural Cellular Automata
Authors: Ethan Lake, Ehsan Pajouheshgar
First: 2025-08-21T17:09:07+00:00 · Latest: 2025-08-21T17:09:07+00:00
Comments: 4+9 pages
Abstract
We investigate the landscape of many-body memories: families of local
non-equilibrium dynamics that retain information about their initial conditions
for thermodynamically long time scales, even in the presence of arbitrary
perturbations. In two dimensions, the only well-studied memory is Toom's rule.
Using a combination of rigorous proofs and machine learning methods, we show
that the landscape of 2D memories is in fact quite vast. We discover memories
that correct errors in ways qualitatively distinct from Toom's rule, have
ordered phases stabilized by fluctuations, and preserve information only in the
presence of noise. Taken together, our results show that physical systems can
perform robust information storage in many distinct ways, and demonstrate that
the physics of many-body memories is richer than previously realized.
Interactive visualizations of the dynamics studied in this work are available
at https://memorynca.github.io/2D.
中文标题/摘要
标题:利用神经元胞自动机探索非平衡记忆景观
我们研究多体记忆的景观:一类局域非平衡动力学体系,能在热力学长时间尺度下保留初始条件信息,即使面对任意扰动。在二维体系中,目前唯一被深入研究的记忆是图姆规则。通过结合严格证明与机器学习方法,我们揭示二维记忆景观实际上极为广阔。发现了以与图姆规则质不同的方式纠错的记忆、通过涨落稳定的有序相,以及仅在噪声存在时保存信息的记忆。综合表明,物理系统能以多种不同方式实现鲁棒信息存储,多体记忆的物理内涵比既往认知更为丰富。本研究涉及的动力学交互可视化详见https://memorynca.github.io/2D。
Summary / 总结
This research explores the diversity of non-equilibrium many-body memories in two dimensions, motivated by the fact that Toom's rule is the only well-studied example. The method combines rigorous proofs with machine learning techniques to identify and analyze local dynamics that retain information about their initial conditions over thermodynamically long time scales despite arbitrary perturbations. The findings reveal a vast landscape of memories, including ones that correct errors in ways qualitatively distinct from Toom's rule, have ordered phases stabilized by fluctuations, or preserve information only in the presence of noise, showing that the physics of robust information storage is richer than previously realized.
本研究旨在拓展对非平衡多体记忆的理解,这些局部动力学能够在扰动下长时间保持初始信息。研究结合严格数学证明与机器学习方法,系统探索二维记忆系统,超越了著名的图姆规则。关键实验发现揭示了众多先前未知的记忆机制,包括以独特方式纠错、涨落稳定的有序相以及噪声依赖的信息保存,共同表明物理系统中鲁棒信息存储的景观远比以往认知更为丰富。
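For reference, Toom's rule, which the abstract above names as the only well-studied 2D memory, is commonly stated as a majority vote among a cell, its northern neighbor, and its eastern neighbor. Below is a minimal sketch of that rule. The assumptions are ±1 spins, periodic boundaries, synchronous updates, and independent flip noise; the function name `toom_step`, the grid size, and the noise model are illustrative choices and are not taken from the paper.

```python
import numpy as np

def toom_step(spins, noise=0.0, rng=None):
    """One synchronous update of Toom's north-east-center majority rule.

    spins: 2D array of +1/-1 values on a periodic grid. Row 0 is treated as
    the top row (an assumed convention), so the northern neighbor of (i, j)
    is (i-1, j) and the eastern neighbor is (i, j+1).
    noise: probability of flipping each cell after the majority vote.
    """
    north = np.roll(spins, 1, axis=0)   # north[i, j] == spins[i-1, j]
    east = np.roll(spins, -1, axis=1)   # east[i, j] == spins[i, j+1]
    # The sum of three +/-1 values is never zero, so its sign is the majority.
    new = np.sign(spins + north + east).astype(spins.dtype)
    if noise > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        flips = rng.random(spins.shape) < noise
        new = np.where(flips, -new, new)
    return new

# Usage: a square island of minus spins in a plus background is eroded
# from its north-east corner when no noise is applied.
grid = np.ones((16, 16), dtype=int)
grid[4:8, 4:8] = -1
for _ in range(16):
    grid = toom_step(grid)
```

In the noiseless run above, the island shrinks one anti-diagonal per step, which is the error-correcting behavior the rule is known for.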
EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-Commerce Models
Authors: Xinyi Ling, Hanwen Du, Zhihui Zhu, Xia Ning
First: 2025-08-21T17:01:12+00:00 · Latest: 2025-08-21T17:01:12+00:00
Abstract
E-commerce platforms are rich in multimodal data, featuring a variety of
images that depict product details. However, this raises an important question:
do these images always enhance product understanding, or can they sometimes
introduce redundancy or degrade performance? Existing datasets are limited in
both scale and design, making it difficult to systematically examine this
question. To this end, we introduce EcomMMMU, an e-commerce multimodal
multitask understanding dataset with 406,190 samples and 8,989,510 images.
EcomMMMU is comprised of multi-image visual-language data designed with 8
essential tasks and a specialized VSS subset to benchmark the capability of
multimodal large language models (MLLMs) to effectively utilize visual content.
Analysis on EcomMMMU reveals that product images do not consistently improve
performance and can, in some cases, degrade it. This indicates that MLLMs may
struggle to effectively leverage rich visual content for e-commerce tasks.
Building on these insights, we propose SUMEI, a data-driven method that
strategically utilizes multiple images via predicting visual utilities before
using them for downstream tasks. Comprehensive experiments demonstrate the
effectiveness and robustness of SUMEI. The data and code are available through
https://anonymous.4open.science/r/submission25.
中文标题/摘要
标题:EcomMMMU:战略利用视觉信息构建稳健的多模态电商模型
电商平台富含多模态数据,其中包含大量展示产品细节的图像。然而这引出一个重要问题:这些图像是否总能增强产品理解,抑或有时会带来冗余甚至降低性能?现有数据集在规模和设计上均存在局限,难以系统研究该问题。为此,我们推出EcomMMMU——一个包含406,190个样本和8,989,510张图像的电商多模态多任务理解数据集。该数据集由多图像视觉-语言数据构成,设计包含8项核心任务和专门的VSS子集,用于评估多模态大语言模型(MLLMs)有效利用视觉内容的能力。EcomMMMU分析表明,产品图像并非总能提升性能,有时反而会降低表现,说明MLLMs在利用丰富视觉内容执行电商任务时可能存在困难。基于此,我们提出SUMEI方法,通过预测视觉效用值来战略性地调用多图像资源,再将其用于下游任务。全面实验证明了SUMEI的有效性和鲁棒性。数据与代码可通过https://anonymous.4open.science/r/submission25获取。
Summary / 总结
Motivated by the need to understand whether product images consistently enhance or potentially hinder e-commerce product understanding, this study introduces EcomMMMU, a large-scale multimodal dataset with 406,190 samples and 8.99 million images across 8 tasks. The authors propose SUMEI, a method that strategically selects and utilizes images by predicting their visual utility before applying them to downstream tasks. Experimental results show that indiscriminate use of images can degrade model performance, while SUMEI significantly improves both effectiveness and robustness in multimodal e-commerce tasks.
本研究旨在探究电商产品图像是否始终提升多模态模型性能,或可能引入冗余甚至降低效果,为此构建了大规模数据集EcomMMMU,包含406,190个样本和899万张图像,涵盖8项核心任务。作者提出了SUMEI方法,通过预测视觉效用值来策略性筛选和使用图像。实验结果表明,现有多模态模型难以有效利用丰富视觉内容,而SUMEI显著提升了模型鲁棒性和任务表现。
Tutorial on the Probabilistic Unification of Estimation Theory, Machine Learning, and Generative AI
Authors: Mohammed Elmusrati
First: 2025-08-21T16:57:33+00:00 · Latest: 2025-08-21T16:57:33+00:00
Abstract
Extracting meaning from uncertain, noisy data is a fundamental problem across
time series analysis, pattern recognition, and language modeling. This survey
presents a unified mathematical framework that connects classical estimation
theory, statistical inference, and modern machine learning, including deep
learning and large language models. By analyzing how techniques such as maximum
likelihood estimation, Bayesian inference, and attention mechanisms address
uncertainty, the paper illustrates that many AI methods are rooted in shared
probabilistic principles. Through illustrative scenarios including system
identification, image classification, and language generation, we show how
increasingly complex models build upon these foundations to tackle practical
challenges like overfitting, data sparsity, and interpretability. In other
words, the work demonstrates that maximum likelihood, MAP estimation, Bayesian
classification, and deep learning all represent different facets of a shared
goal: inferring hidden causes from noisy and/or biased observations. It serves
as both a theoretical synthesis and a practical guide for students and
researchers navigating the evolving landscape of machine learning.
中文标题/摘要
标题:概率统一估计理论、机器学习与生成式AI教程
从不确定的噪声数据中提取意义是时间序列分析、模式识别和语言建模中的核心问题。本综述提出了一个统一的数学框架,将经典估计理论、统计推断与现代机器学习(包括深度学习和大型语言模型)联系起来。通过分析最大似然估计、贝叶斯推断和注意力机制等技术如何处理不确定性,本文阐明了许多人工智能方法都植根于共享的概率原理。通过系统辨识、图像分类和语言生成等示例场景,我们展示了日益复杂的模型如何基于这些基础应对过拟合、数据稀疏性和可解释性等实际挑战。换言之,这项工作证明最大似然估计、MAP估计、贝叶斯分类和深度学习都代表了共同目标的不同侧面:从噪声和/或有偏观察中推断隐藏原因。它既是理论综合,也是为学生和研究者导航机器学习发展脉络的实用指南。
Summary / 总结
This tutorial aims to unify diverse approaches for extracting meaning from uncertain, noisy data across fields such as time series analysis, pattern recognition, and language modeling. It presents a mathematical framework connecting classical estimation theory, statistical inference, and modern machine learning by analyzing how maximum likelihood estimation, Bayesian inference, and attention mechanisms address uncertainty through shared probabilistic principles. Illustrative scenarios in system identification, image classification, and language generation show how increasingly complex models build on these foundations to tackle practical challenges including overfitting, data sparsity, and interpretability, and that maximum likelihood, MAP estimation, Bayesian classification, and deep learning all represent facets of inferring hidden causes from noisy and/or biased observations.
本研究旨在从时间序列分析、模式识别和语言建模等不同领域的噪声和不确定数据中提取有意义的信息。论文提出了一个统一的概率框架,将经典估计理论、统计推断和现代机器学习技术(包括深度学习和大型语言模型)联系起来。通过系统辨识、图像分类和语言生成等示例场景,作者证明了最大似然估计、贝叶斯推断和注意力机制等方法都基于共同的概率原理,通过从观测数据推断隐藏原因来解决过拟合、数据稀疏性和可解释性等核心挑战。
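As a small illustration of the shared probabilistic view described above, the sketch below contrasts maximum likelihood and MAP estimates of a Gaussian mean. It assumes observations with known variance `sigma2` and a Gaussian prior with mean `mu0` and variance `tau2`; the function names and the toy data are hypothetical and do not come from the tutorial.

```python
import numpy as np

def gaussian_mean_mle(x):
    """Maximum likelihood estimate of a Gaussian mean: the sample mean."""
    return float(np.mean(x))

def gaussian_mean_map(x, sigma2, mu0, tau2):
    """MAP estimate of the mean for x_i ~ N(mu, sigma2) with prior mu ~ N(mu0, tau2).

    The posterior is Gaussian, so its mode equals its mean: a
    precision-weighted average of the prior mean and the data.
    """
    n = len(x)
    precision = 1.0 / tau2 + n / sigma2
    return float((mu0 / tau2 + np.sum(x) / sigma2) / precision)

x = np.array([1.0, 2.0, 3.0])
mle = gaussian_mean_mle(x)
map_est = gaussian_mean_map(x, sigma2=1.0, mu0=0.0, tau2=1.0)
# A very broad prior carries almost no information, so MAP approaches MLE.
map_flat = gaussian_mean_map(x, sigma2=1.0, mu0=0.0, tau2=1e12)
```

With these toy values the MLE is the sample mean 2.0, while the prior centered at 0 pulls the MAP estimate down to 1.5; widening the prior removes that pull.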
WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception
Authors: Zhiheng Liu, Xueqing Deng, Shoufa Chen, Angtian Wang, Qiushan Guo, Mingfei Han, Zeyue Xue, Mengzhao Chen, Ping Luo, Linjie Yang
First: 2025-08-21T16:57:33+00:00 · Latest: 2025-08-21T16:57:33+00:00
Comments: Project page: https://johanan528.github.io/worldweaver_web/
Abstract
Generative video modeling has made significant strides, yet ensuring
structural and temporal consistency over long sequences remains a challenge.
Current methods predominantly rely on RGB signals, leading to accumulated
errors in object structure and motion over extended durations. To address these
issues, we introduce WorldWeaver, a robust framework for long video generation
that jointly models RGB frames and perceptual conditions within a unified
long-horizon modeling scheme. Our training framework offers three key
advantages. First, by jointly predicting perceptual conditions and color
information from a unified representation, it significantly enhances temporal
consistency and motion dynamics. Second, by leveraging depth cues, which we
observe to be more resistant to drift than RGB, we construct a memory bank that
preserves clearer contextual information, improving quality in long-horizon
video generation. Third, we employ segmented noise scheduling for training
prediction groups, which further mitigates drift and reduces computational
cost. Extensive experiments on both diffusion- and rectified flow-based models
demonstrate the effectiveness of WorldWeaver in reducing temporal drift and
improving the fidelity of generated videos.
中文标题/摘要
标题:WorldWeaver:通过丰富感知生成长时域视频世界
生成式视频建模已取得显著进展,但确保长序列的结构与时间一致性仍是挑战。现有方法主要依赖RGB信号,导致物体结构和运动在长时间跨度中误差累积。为此,我们推出WorldWeaver——一个鲁棒的长视频生成框架,其在统一的长时域建模方案中联合建模RGB帧与感知条件。我们的训练框架具备三大优势:首先,通过从统一表征联合预测感知条件与色彩信息,显著增强时间一致性与运动动态;其次,利用比RGB更抗漂移的深度线索构建记忆库,保留更清晰的上下文信息以提升长时域视频生成质量;第三,采用分段噪声调度训练预测组,进一步抑制漂移并降低计算成本。基于扩散模型和整流流模型的广泛实验验证了WorldWeaver在减少时间漂移和提升生成视频保真度方面的有效性。
Summary / 总结
Motivated by the challenge of maintaining structural and temporal consistency in long video generation, where current RGB-based methods suffer from accumulated errors, this paper introduces WorldWeaver, a framework that jointly models RGB frames and perceptual conditions like depth within a unified long-horizon scheme. The method leverages depth cues for a memory bank to preserve context, uses joint prediction of perceptual and color information, and employs segmented noise scheduling to reduce drift and computational cost. Experimental results on diffusion and rectified flow models show that WorldWeaver effectively reduces temporal drift and improves video fidelity in long sequences.
针对长视频生成中结构与时序一致性难以维持的问题,现有基于RGB的方法在长序列中易出现误差累积,本文提出了WorldWeaver框架,通过统一建模RGB帧与深度等感知条件来提升长时域生成效果。方法利用深度线索构建记忆库以保持上下文清晰性,联合预测感知条件与颜色信息,并采用分段噪声调度减少漂移和计算成本。在扩散和整流流模型上的实验表明,WorldWeaver有效降低了时序漂移,提高了生成长视频的保真度。
StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding
Authors: Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, Mengye Ren
First: 2025-08-21T16:56:29+00:00 · Latest: 2025-08-21T16:56:29+00:00
Comments: 15 pages, 3 figures
Abstract
Multimodal large language models (MLLMs) have made significant progress in
visual-language reasoning, but their ability to efficiently handle long videos
remains limited. Despite recent advances in long-context MLLMs, storing and
attending to the key-value (KV) cache for long visual contexts incurs
substantial memory and computational overhead. Existing visual compression
methods require either encoding the entire visual context before compression or
having access to the questions in advance, which is impractical for long video
understanding and multi-turn conversational settings. In this work, we propose
StreamMem, a query-agnostic KV cache memory mechanism for streaming video
understanding. Specifically, StreamMem encodes new video frames in a streaming
manner, compressing the KV cache using attention scores between visual tokens
and generic query tokens, while maintaining a fixed-size KV memory to enable
efficient question answering (QA) in memory-constrained, long-video scenarios.
Evaluation on three long video understanding and two streaming video question
answering benchmarks shows that StreamMem achieves state-of-the-art performance
in query-agnostic KV cache compression and is competitive with query-aware
compression approaches.
中文标题/摘要
标题:StreamMem:面向流式视频理解的查询无关KV缓存内存机制
多模态大语言模型(MLLMs)在视觉语言推理方面取得显著进展,但处理长视频的能力仍受限。尽管长上下文MLLMs有所进步,但存储和处理长视觉上下文的关键值(KV)缓存仍带来巨大内存和计算开销。现有视觉压缩方法需在压缩前编码整个视觉上下文或预先获取问题,这不适用于长视频理解和多轮对话场景。本文提出StreamMem,一种用于流式视频理解的查询无关KV缓存内存机制。它通过流式编码新视频帧,利用视觉标记与通用查询标记间的注意力分数压缩KV缓存,同时维持固定大小的KV内存,以在内存受限的长视频场景中实现高效问答。在三个长视频理解和两个流式视频问答基准测试中,StreamMem在查询无关KV缓存压缩方面达到最先进性能,并与查询感知压缩方法竞争。
Summary / 总结
The motivation for this work stems from the limitations of multimodal large language models (MLLMs) in efficiently processing long videos due to the substantial memory and computational overhead of storing and attending to key-value (KV) cache. Existing visual compression methods are impractical as they require encoding the entire visual context beforehand or advance access to questions. To address this, the authors propose StreamMem, a query-agnostic KV cache memory mechanism that encodes new video frames in a streaming manner, compressing the KV cache using attention scores between visual tokens and generic query tokens while maintaining a fixed-size memory. Experimental results on three long video understanding and two streaming video question answering benchmarks demonstrate that StreamMem achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware approaches.
本研究动机源于多模态大语言模型(MLLMs)在处理长视频时因存储和关注键值(KV)缓存而产生的高内存和计算开销,现有视觉压缩方法需预先编码整个视觉上下文或提前获取问题,不适用于长视频理解和多轮对话场景。为此,提出StreamMem,一种查询无关的KV缓存内存机制,以流式方式编码新视频帧,利用视觉标记与通用查询标记间的注意力分数压缩KV缓存,同时维持固定大小内存以实现高效问答。在三个长视频理解和两个流式视频问答基准测试中,StreamMem在查询无关KV缓存压缩上达到最先进性能,并与查询感知方法竞争。
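The fixed-size memory idea in the StreamMem entry above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the generic query tokens are built or how entries are merged, so the single `proxy_query` vector, the plain top-k eviction rule, and the names `attention_scores`, `compress_kv`, and `stream_frames` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def attention_scores(query: Sequence[float], keys: Sequence[Vector]) -> List[float]:
    """Softmax over scaled dot products between one query and every key."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compress_kv(
    keys: List[Vector], values: List[str], proxy_query: Sequence[float], budget: int
) -> Tuple[List[Vector], List[str]]:
    """Keep the `budget` entries the proxy query attends to most (assumed rule)."""
    if len(keys) <= budget:
        return keys, values
    scores = attention_scores(proxy_query, keys)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])  # restore temporal order of the survivors
    return [keys[i] for i in keep], [values[i] for i in keep]


def stream_frames(
    frames: Sequence[Tuple[List[Vector], List[str]]],
    proxy_query: Sequence[float],
    budget: int,
) -> Tuple[List[Vector], List[str]]:
    """Append each frame's KV entries, then shrink the memory back to `budget`."""
    mem_k: List[Vector] = []
    mem_v: List[str] = []
    for frame_k, frame_v in frames:
        mem_k, mem_v = compress_kv(mem_k + frame_k, mem_v + frame_v, proxy_query, budget)
    return mem_k, mem_v
```

Because compression runs after every frame, the memory never exceeds `budget` entries regardless of video length, and no question is needed at encoding time, which is the query-agnostic property the abstract emphasizes.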
Stemming -- The Evolution and Current State with a Focus on Bangla
Authors: Abhijit Paul, Mashiat Amin Farin, Sharif Md. Abdullah, Ahmedul Kabir, Zarif Masud, Shebuti Rayana
First: 2025-08-21T16:54:24+00:00 · Latest: 2025-08-21T16:54:24+00:00
Abstract
Bangla, the seventh most widely spoken language worldwide with 300 million
native speakers, faces digital under-representation due to limited resources
and lack of annotated datasets. Stemming, a critical preprocessing step in
language analysis, is essential for low-resource, highly-inflectional languages
like Bangla, because it can reduce the complexity of algorithms and models by
significantly reducing the number of words the algorithm needs to consider.
This paper conducts a comprehensive survey of stemming approaches, emphasizing
the importance of handling morphological variants effectively. While exploring
the landscape of Bangla stemming, it becomes evident that there is a
significant gap in the existing literature. The paper highlights the
discontinuity from previous research and the scarcity of accessible
implementations for replication. Furthermore, it critiques the evaluation
methodologies, stressing the need for more relevant metrics. The paper also
acknowledges the challenges posed by Bangla's rich morphology and diverse
dialects. To address these challenges, the paper suggests directions
for Bangla stemmer development. It concludes by advocating for robust Bangla
stemmers and continued research in the field to enhance language analysis and
processing.
中文标题/摘要
标题:词干提取——以孟加拉语为重点的演变与现状
孟加拉语作为全球第七大语言,拥有三亿母语者,但因资源有限和标注数据集匮乏而面临数字代表性不足的问题。词干提取作为语言分析的关键预处理步骤,对孟加拉语这类低资源、高屈折性语言至关重要,它能显著减少算法需处理的词汇量,从而降低算法与模型的复杂度。本文系统综述了词干提取方法,强调有效处理形态变体的重要性。在考察孟加拉语词干提取现状时,发现现有文献存在明显空白。研究指出前人工作的断层现象及可复现实现的稀缺性,并批判现有评估方法,呼吁采用更相关的度量标准。针对孟加拉语丰富的形态结构和方言多样性,本文承认其带来的挑战,提出孟加拉语词干提取器的发展方向,最后倡导开发强健的词干提取工具并持续推进该领域研究,以提升语言分析与处理能力。
Summary / 总结
Motivated by the digital under-representation of Bangla, a highly-inflectional language with 300 million speakers, this paper surveys stemming approaches to reduce algorithmic complexity in language processing. The method involves a comprehensive review of existing techniques, highlighting gaps in literature, scarce implementations, and inadequate evaluation metrics. Key findings emphasize the need for robust stemmers tailored to Bangla's morphology and dialects, advocating for continued research to improve language analysis.
本文针对拥有3亿母语者的孟加拉语在数字资源中的代表性不足问题,调查词干提取方法以降低语言处理算法的复杂性。方法包括全面回顾现有技术,指出文献中的空白、可复现实现的稀缺以及评估指标的不完善。主要发现强调需要针对孟加拉语的形态和方言开发健壮的词干提取器,并呼吁持续研究以提升语言分析能力。
End-to-End Analysis of Charge Stability Diagrams with Transformers
Authors: Rahul Marchand, Lucas Schorling, Cornelius Carlsson, Jonas Schuff, Barnaby van Straaten, Taylor L. Patti, Federico Fedele, Joshua Ziegler, Parth Girdhar, Pranav Vaidhyanathan, Natalia Ares
First: 2025-08-21T16:54:22+00:00 · Latest: 2025-08-21T16:54:22+00:00
Comments: 8 pages, 2 figures, RM and LS contributed equally
Abstract
Transformer models and end-to-end learning frameworks are rapidly
revolutionizing the field of artificial intelligence. In this work, we apply
object detection transformers to analyze charge stability diagrams in
semiconductor quantum dot arrays, a key task for achieving scalability with
spin-based quantum computing. Specifically, our model identifies triple points
and their connectivity, which is crucial for virtual gate calibration, charge
state initialization, drift correction, and pulse sequencing. We show that it
surpasses convolutional neural networks in performance on three different spin
qubit architectures, all without the need for retraining. In contrast to
existing approaches, our method significantly reduces complexity and runtime,
while enhancing generalizability. The results highlight the potential of
transformer-based end-to-end learning frameworks as a foundation for a
scalable, device- and architecture-agnostic tool for control and tuning of
quantum dot devices.
中文标题/摘要
标题:基于Transformer的电荷稳定性图端到端分析
Transformer模型与端到端学习框架正迅速革新人工智能领域。本研究将目标检测Transformer应用于半导体量子点阵列中的电荷稳定性图分析——这是实现自旋量子计算可扩展性的关键任务。具体而言,我们的模型能识别三重点及其连通性,这对虚拟门校准、电荷态初始化、漂移校正和脉冲序列制定至关重要。实验表明,在三种不同自旋量子比特架构上,其性能均超越卷积神经网络,且无需重新训练。与现有方法相比,本方法显著降低了复杂度和运行时间,同时增强了泛化能力。研究成果凸显了基于Transformer的端到端学习框架作为可扩展、设备与架构无关的量子点器件控制调谐工具的潜力。
Summary / 总结
The research aims to automate the analysis of charge stability diagrams in semiconductor quantum dot arrays, which is essential for scaling spin-based quantum computing. The method employs an object detection transformer to identify triple points and their connectivity in an end-to-end learning framework, eliminating the need for manual feature extraction. Experimental results demonstrate that the model outperforms convolutional neural networks across three distinct spin qubit architectures without retraining, reducing complexity and runtime while improving generalizability.
该研究旨在自动化分析半导体量子点阵列中的电荷稳定性图,这是扩展自旋量子计算的关键步骤。方法采用目标检测Transformer模型,通过端到端学习框架识别三重点和它们的连接性。实验结果表明,该模型在三种不同的自旋量子比特架构上均优于卷积神经网络,无需重新训练,同时降低了复杂性和运行时间,并提高了泛化能力。
Position Bias Mitigates Position Bias: Mitigate Position Bias Through Inter-Position Knowledge Distillation
Authors: Yifei Wang, Feng Xiong, Yong Wang, Linjing Li, Xiangxiang Chu, Daniel Dajun Zeng
First: 2025-08-21T16:54:04+00:00 · Latest: 2025-08-21T16:54:04+00:00
Comments: EMNLP2025 Main
Abstract
Positional bias (PB), manifesting as non-uniform sensitivity across different
contextual locations, significantly impairs long-context comprehension and
processing capabilities. While prior work seeks to mitigate PB through
modifying the architectures causing its emergence, significant PB still
persists. To address PB effectively, we introduce Pos2Distill, a
position to position knowledge distillation framework. Pos2Distill transfers
the superior capabilities from advantageous positions to less favorable ones,
thereby reducing the huge performance gaps. The conceptual principle is to
leverage the inherent, position-induced disparity to counteract the PB itself.
We identify distinct manifestations of PB under retrieval and
reasoning paradigms, thereby designing two specialized
instantiations: Pos2Distill-R¹ and
Pos2Distill-R², respectively, both grounded in this
core principle. By employing the Pos2Distill approach, we achieve enhanced
uniformity and significant performance gains across all contextual positions in
long-context retrieval and reasoning tasks. Crucially, both specialized systems
exhibit strong cross-task generalization mutually, while achieving superior
performance on their respective tasks.
中文标题/摘要
标题:位置偏差缓解位置偏差:通过位置间知识蒸馏缓解位置偏差
位置偏差(PB)表现为不同上下文位置的非均匀敏感性,严重损害长上下文理解与处理能力。尽管先前研究通过修改引发该偏差的架构来缓解PB,但显著的PB仍然存在。为有效解决PB,我们提出Pos2Distill,一种位置到位置的知识蒸馏框架。该框架将优势位置的卓越能力迁移至劣势位置,从而缩小巨大性能差距。其核心原理是利用固有的位置诱导差异来抵消PB本身。我们识别了PB在检索与推理范式下的不同表现,据此设计出两个基于该核心原理的专用实例:Pos2Distill-R¹和Pos2Distill-R²。通过Pos2Distill方法,我们在长上下文检索和推理任务的所有位置实现了更高的均匀性及显著性能提升。关键的是,两个专用系统在各自任务取得优异性能的同时,展现出强大的跨任务相互泛化能力。
Summary / 总结
Positional bias, which causes uneven performance across different contextual positions, significantly hinders long-context understanding and processing. To address this, the authors propose Pos2Distill, a knowledge distillation framework that transfers capabilities from high-performing positions to weaker ones, leveraging position-induced disparities to counteract the bias itself. They design two task-specific instantiations, Pos2Distill-R¹ for retrieval and Pos2Distill-R² for reasoning, both of which yield improved uniformity and performance across all positions, while also demonstrating strong cross-task generalization.
位置偏差导致不同上下文位置的性能不均,显著影响长文本的理解和处理能力。为解决这一问题,作者提出Pos2Distill,一种通过位置间知识蒸馏的框架,利用位置差异本身来抵消偏差,将优势位置的能力迁移到弱势位置。他们针对检索和推理任务分别设计了Pos2Distill-R¹和Pos2Distill-R²两个实例,实验表明该方法提高了长文本任务中所有位置的性能均匀性,并展现出强大的跨任务泛化能力。
Communication Efficient LLM Pre-training with SparseLoCo
Authors: Amir Sarfi, Benjamin Thérien, Joel Lidin, Eugene Belilovsky
First: 2025-08-21T16:48:19+00:00 · Latest: 2025-08-21T16:48:19+00:00
Comments: 15 pages, 9 tables, 2 figures
Abstract
Communication-efficient distributed training algorithms have received
considerable interest recently due to their benefits for training Large
Language Models (LLMs) in bandwidth-constrained settings, such as across data
centers and over the internet. Despite reducing communication frequency, these
methods still typically require communicating a full copy of the model's
gradients, resulting in a communication bottleneck even for cross-datacenter
links. Furthermore, they can slightly degrade performance compared to a naive
AdamW DDP baseline. While quantization and error feedback are often applied to
reduce the pseudo-gradient's size, in the context of LLM pre-training, existing
approaches have been unable to additionally leverage sparsification and have
obtained limited quantization. In this work, we introduce SparseLoCo, a
communication-efficient training algorithm for LLMs that effectively leverages
Top-k sparsification and quantization to reach extreme compression ratios of up
to 1-3% sparsity and 2-bit quantization while outperforming full-precision
DiLoCo. Our key observations are that outer momentum can be locally
approximated by an error feedback combined with aggressive sparsity and that
sparse aggregation can actually improve model performance. We empirically
demonstrate in a range of communication-constrained LLM training settings that
SparseLoCo provides significant benefits in both performance and communication
cost.
中文标题/摘要
标题:基于SparseLoCo的高效通信大语言模型预训练
近年来,通信高效的分布式训练算法因能在带宽受限环境(如跨数据中心和互联网)中优化大语言模型(LLM)训练而备受关注。尽管这些方法降低了通信频率,但仍需传输完整的模型梯度副本,导致即使跨数据中心链路也存在通信瓶颈。此外,与朴素的AdamW DDP基线相比,其性能可能略有下降。虽然常采用量化和误差反馈来减小伪梯度大小,但在LLM预训练中,现有方法未能额外利用稀疏化技术且量化效果有限。本研究提出SparseLoCo——一种面向LLM的通信高效训练算法,通过Top-k稀疏化与量化技术实现极端压缩比(稀疏度达1-3%,量化至2位),同时性能超越全精度DiLoCo。关键发现包括:外部动量可通过误差反馈结合激进稀疏化进行本地近似,且稀疏聚合能实际提升模型性能。我们在多种通信受限的LLM训练场景中实证表明,SparseLoCo在性能和通信成本方面均带来显著提升。
Summary / 总结
This research addresses the communication bottleneck in distributed training of large language models (LLMs) across bandwidth-constrained environments, where existing methods still require full gradient communication and often degrade performance. The proposed method, SparseLoCo, combines Top-k sparsification and quantization to achieve extreme compression ratios of 1-3% sparsity and 2-bit quantization, leveraging outer momentum approximation via error feedback and sparse aggregation. Experimental results across various communication-constrained LLM training settings show that SparseLoCo outperforms full-precision DiLoCo in both model performance and communication efficiency.
该研究旨在解决大型语言模型(LLM)在带宽受限环境(如跨数据中心)中进行分布式训练时的通信瓶颈问题,现有方法仍需传输完整梯度且常导致性能下降。提出的方法SparseLoCo结合Top-k稀疏化和量化,实现1-3%稀疏度和2位量化的极端压缩比,通过误差反馈和稀疏聚合近似外部动量。实验结果表明,在多种通信受限的LLM训练场景中,SparseLoCo在性能和通信效率上均优于全精度DiLoCo,表现出显著优势。
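The two compression ingredients named in the SparseLoCo entry above, Top-k sparsification and error feedback, can be sketched generically as follows. This is a textbook-style compressor, not the SparseLoCo algorithm itself: quantization, the outer optimizer, and the aggregation step are omitted, and the function name `top_k_with_error_feedback` is hypothetical.

```python
from typing import List, Sequence, Tuple


def top_k_with_error_feedback(
    update: Sequence[float], residual: Sequence[float], k: int
) -> Tuple[List[float], List[float]]:
    """Send only the k largest-magnitude entries; remember the rest for later.

    `residual` holds what previous rounds dropped. It is added back before
    selection, so small entries accumulate until they are large enough to send.
    """
    corrected = [u + r for u, r in zip(update, residual)]
    ranked = sorted(range(len(corrected)), key=lambda i: abs(corrected[i]), reverse=True)
    keep = set(ranked[:k])
    sparse = [c if i in keep else 0.0 for i, c in enumerate(corrected)]
    new_residual = [c - s for c, s in zip(corrected, sparse)]
    return sparse, new_residual


# Two communication rounds on a 4-parameter toy update with k = 1.
residual = [0.0, 0.0, 0.0, 0.0]
message_1, residual = top_k_with_error_feedback([0.5, -2.0, 0.25, 1.0], residual, k=1)
message_2, residual = top_k_with_error_feedback([0.5, 0.0, 0.25, 0.25], residual, k=1)
```

In the second round, the entry that was dropped in the first round (index 3) is transmitted because its carried-over residual pushes it to the top, which is the behavior that lets aggressive sparsity avoid permanently discarding information.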
Investigation of D-Wave quantum annealing for training Restricted Boltzmann Machines and mitigating catastrophic forgetting
Authors: Abdelmoula El-Yazizi, Yaroslav Koshka
First: 2025-08-21T16:26:58+00:00 · Latest: 2025-08-21T16:26:58+00:00
Comments: 26 pages, 5 figures
Abstract
Modest statistical differences between the sampling performances of the
D-Wave quantum annealer (QA) and the classical Markov Chain Monte Carlo (MCMC),
when applied to Restricted Boltzmann Machines (RBMs), are explored to explain,
and possibly address, the absence of significant and consistent improvements in
RBM trainability when the D-Wave sampling was used in previous investigations.
A novel hybrid sampling approach, combining the classical and the QA
contributions, is investigated as a promising way to benefit from the modest
differences between the two sampling methods. No improvements in the RBM
training are achieved in this work, thereby suggesting that the differences
between the QA-based and MCMC sampling, mainly found in the medium-to-low
probability regions of the distribution, which are less important for the
quality of the sample, are insufficient to benefit the training. Difficulties
in achieving sufficiently high quality of embedding RBMs into the lattice of
the newer generation of D-Wave hardware could be further complicating the task.
On the other hand, the ability to generate samples of sufficient variety from
lower-probability parts of the distribution has a potential to benefit other
machine learning applications, such as the mitigation of catastrophic
forgetting (CF) during incremental learning. The feasibility of using
QA-generated patterns of desirable classes for CF mitigation by the generative
replay is demonstrated in this work for the first time. While the efficiency of
the CF mitigation using the D-Wave QA was comparable to that of the classical
mitigation, both the speed of generating a large number of distinct desirable
patterns and the potential for further improvement make this approach promising
for a variety of challenging machine learning applications.
中文标题/摘要
标题:D-Wave量子退火在训练受限玻尔兹曼机及缓解灾难性遗忘中的研究
本研究探讨了D-Wave量子退火器(QA)与经典马尔可夫链蒙特卡洛(MCMC)在受限玻尔兹曼机(RBM)采样性能上的细微统计差异,以解释先前研究中使用D-Wave采样时未能显著提升RBM可训练性的现象。提出了一种结合经典方法与量子退火贡献的新型混合采样策略,但实验表明:由于QA与MCMC的差异主要体现在概率分布中低概率区域(对采样质量影响较小),这种差异不足以改善RBM训练。新一代D-Wave硬件在嵌入RBM时难以保证高质量的问题可能加剧了这一挑战。然而,从低概率区域生成多样化样本的能力对其他机器学习应用具有潜力,例如增量学习中的灾难性遗忘(CF)缓解。本研究首次证明了通过生成式回放使用QA生成目标类别模式来实现CF缓解的可行性。虽然D-Wave量子退火在CF缓解效率上与经典方法相当,但其快速生成大量不同目标模式的能力以及进一步优化的潜力,使其在各种挑战性机器学习应用中展现出前景。
Summary / 总结
This study investigates the potential of D-Wave quantum annealing (QA) for training Restricted Boltzmann Machines (RBMs) and mitigating catastrophic forgetting. The motivation stems from previous observations that QA sampling did not significantly improve RBM training compared to classical Markov Chain Monte Carlo (MCMC) methods. A hybrid sampling approach combining QA and MCMC was explored, but no training improvements were found, as differences between the samplers occurred mainly in low-probability regions less critical for training quality. However, QA-generated samples demonstrated effectiveness in mitigating catastrophic forgetting via generative replay, achieving comparable performance to classical methods while offering advantages in speed and pattern diversity for incremental learning tasks.
本研究探讨了D-Wave量子退火器(QA)在训练受限玻尔兹曼机(RBM)时相比经典马尔可夫链蒙特卡洛(MCMC)采样的潜在改进,动机源于先前研究中观察到的性能差异有限。作者提出并测试了一种结合QA与MCMC的混合采样方法,但未实现训练提升,认为原因是QA的采样差异主要出现在对训练质量影响较小的低概率区域。然而,他们首次证明QA生成的样本可通过生成式回放有效缓解增量学习中的灾难性遗忘,其效率与经典方法相当,同时在生成多样样本的速度和可扩展性方面展现出对多种机器学习应用的潜力。
Conditionally adaptive augmented Lagrangian method for physics-informed learning of forward and inverse problems using artificial neural networks
Authors: Qifeng Hu, Shamsulhaq Basir, Inanc Senocak
First: 2025-08-21T16:22:40+00:00 · Latest: 2025-08-21T16:22:40+00:00
Comments: 37 pages, 23 figures
Abstract
We present several advances to the physics and equality constrained
artificial neural networks (PECANN) framework that substantially improve its
capability to learn solutions of canonical partial differential equations
(PDEs). First, we generalize the augmented Lagrangian method (ALM) to support
multiple independent penalty parameters, enabling simultaneous enforcement of
heterogeneous constraints. Second, we reformulate pointwise constraint
enforcement and Lagrange multipliers as expectations over constraint terms,
reducing memory overhead and permitting efficient mini-batch training. Third,
to address PDEs with oscillatory, multi-scale features, we incorporate Fourier
feature mappings and show that a single mapping suffices where multiple
mappings or more costly architectures were required in related methods. Fourth,
we introduce a time-windowing strategy for long-time evolution in which the
terminal state of each window is enforced as an initial-condition constraint
for the next, ensuring continuity without discrete time models. Crucially, we
propose a conditionally adaptive penalty update (CAPU) strategy for ALM, which
preserves the principle that larger constraint violations incur stronger
penalties. CAPU accelerates the growth of Lagrange multipliers for selectively
challenging constraints, enhancing constraint enforcement during training. We
demonstrate the effectiveness of PECANN-CAPU on problems including the
transonic rarefaction problem, reversible advection of a passive scalar by a vortex,
high-wavenumber Helmholtz and Poisson equations, and inverse identification of
spatially varying heat sources. Comparisons with established methods and recent
Kolmogorov-Arnold network approaches show that PECANN-CAPU achieves competitive
accuracy across all cases. Collectively, these advances improve PECANN's
robustness, efficiency, and applicability to demanding problems in scientific
computing.
中文标题/摘要
标题:条件自适应增广拉格朗日方法在物理信息神经网络正逆问题学习中的应用
我们对物理与等式约束人工神经网络(PECANN)框架提出了多项改进,显著提升了其学习典型偏微分方程(PDE)解的能力。首先,将增广拉格朗日方法(ALM)推广至支持多个独立惩罚参数,实现异构约束的同时执行。其次,将逐点约束执行和拉格朗日乘子重构为约束项的期望计算,降低内存开销并支持高效小批量训练。第三,针对具有振荡多尺度特征的PDE,引入傅里叶特征映射,证明单一映射即可满足相关方法中需多个映射或更高成本架构的需求。第四,提出长时间演化的时间窗口策略,通过将每个窗口的终止状态作为下一窗口的初始条件约束,确保连续性而无需离散时间模型。关键创新是提出ALM的条件自适应惩罚更新(CAPU)策略,保持较大约束违反对应较强惩罚的原则,通过选择性加速挑战性约束的拉格朗日乘子增长来增强训练中的约束执行。通过在跨音速稀疏化问题、涡旋被动可逆平流、高波数亥姆霍兹与泊松方程、空间变化热源反演识别等问题上的实验,证明PECANN-CAPU在所有案例中均达到竞争优势精度。这些进展共同提升了PECANN在科学计算复杂问题中的鲁棒性、效率及适用性。
Summary / 总结
This research aims to enhance the physics and equality constrained artificial neural networks (PECANN) framework to better solve partial differential equations (PDEs) for both forward and inverse problems. The method introduces a conditionally adaptive penalty update (CAPU) strategy within the augmented Lagrangian method, which supports multiple penalty parameters and reformulates constraints as expectations to reduce memory usage and enable mini-batch training. It also incorporates Fourier feature mappings for multi-scale problems and a time-windowing approach for long-time evolution. Experimental results on problems such as transonic rarefaction, high-wavenumber Helmholtz equations, and inverse heat source identification demonstrate that PECANN-CAPU achieves competitive accuracy compared to established and recent methods, improving robustness and efficiency in scientific computing applications.
本研究旨在通过改进约束实施和可扩展性,增强物理信息神经网络框架PECANN求解偏微分方程的能力。方法采用条件自适应惩罚更新策略的增广拉格朗日法,支持多个独立惩罚参数,将约束重构为期望以减少内存占用,引入傅里叶特征映射处理多尺度问题,并使用时窗策略进行长时间演化。在跨音速稀疏化、高波数亥姆霍兹/泊松方程和逆热源识别等实验表明,PECANN-CAPU相比现有方法和近期Kolmogorov-Arnold网络取得了竞争性精度,在各类正反问题中表现出更强的鲁棒性和效率。
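For readers unfamiliar with the augmented Lagrangian method that the PECANN entry above builds on, a minimal sketch of the classical scheme with one penalty parameter per constraint may help. The toy problem, the fixed penalties in `mus`, and the function name are assumptions chosen for illustration; the conditionally adaptive penalty update (CAPU) and the neural-network setting described in the abstract are not reproduced here.

```python
from typing import List, Sequence, Tuple


def augmented_lagrangian_demo(
    mus: Sequence[float] = (10.0, 5.0),
    outer_steps: int = 30,
    inner_steps: int = 300,
    lr: float = 0.05,
) -> Tuple[Tuple[float, float], List[float]]:
    """Minimize (x - 2)^2 + (y - 2)^2 subject to x - 1 = 0 and y - 0.5 = 0.

    Each constraint c_i has its own multiplier lam_i and penalty mu_i, and the
    augmented objective is f + sum_i (lam_i * c_i + 0.5 * mu_i * c_i ** 2).
    """
    x, y = 0.0, 0.0
    lams = [0.0, 0.0]
    for _ in range(outer_steps):
        # Inner loop: gradient descent on the augmented objective.
        for _ in range(inner_steps):
            c1, c2 = x - 1.0, y - 0.5
            grad_x = 2.0 * (x - 2.0) + lams[0] + mus[0] * c1
            grad_y = 2.0 * (y - 2.0) + lams[1] + mus[1] * c2
            x -= lr * grad_x
            y -= lr * grad_y
        # Outer loop: multipliers grow in proportion to the remaining violation.
        lams[0] += mus[0] * (x - 1.0)
        lams[1] += mus[1] * (y - 0.5)
    return (x, y), lams


solution, multipliers = augmented_lagrangian_demo()
```

The multiplier update makes larger constraint violations incur stronger corrections, which is the principle the abstract says CAPU preserves while adapting the penalties themselves.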
Effect Identification and Unit Categorization in the Multi-Score Regression Discontinuity Design with Application to LED Manufacturing
Authors: Philipp Alexander Schwarz, Oliver Schacht, Sven Klaassen, Johannes Oberpriller, Martin Spindler
First: 2025-08-21T16:17:15+00:00 · Latest: 2025-08-21T16:17:15+00:00
Abstract
The RDD (regression discontinuity design) is a widely used framework for
identification and estimation of causal effects at a cutoff of a single running
variable. Practical settings, in particular those encountered in production
systems, often involve decision-making defined by multiple thresholds and
criteria. Common MRD (multi-score RDD) approaches transform these to a
one-dimensional design in order to apply existing identification and estimation results.
However, this practice can introduce non-compliant behavior. We develop
theoretical tools to identify and reduce some of this "fuzziness" when
estimating the cutoff-effect on compliers of sub-rules. We provide a sound
definition and categorization of unit behavior types for multi-dimensional
cutoff-rules, extending existing categorizations. We identify conditions for
the existence and identification of the cutoff-effect on compliers in multiple
dimensions, and specify when identification remains stable after excluding
never-takers and always-takers. Further, we investigate how decomposing
cutoff-rules into simpler parts alters the unit behavior. This allows
identification and removal of non-compliant units, potentially improving
estimates. We validate our framework on simulated and real-world data from
opto-electronic semiconductor manufacturing. Our empirical results demonstrate
the usability for refining production policies. In particular, we show that our
approach decreases the estimation variance, highlighting the practical value of
the MRD framework in manufacturing.
中文标题/摘要
标题:多分数回归断点设计中的效应识别与单元分类及其在LED制造中的应用
回归断点设计(RDD)是广泛用于单一运行变量临界点处因果效应识别与估计的框架。实际场景,尤其是生产系统中的决策,常涉及多阈值和多标准。常见多分数RDD(MRD)方法将其转化为一维设计以应用识别与估计结果,但可能引入非合规行为。我们开发理论工具以识别和减少子规则合规者临界效应估计中的部分'模糊性',提出多维临界规则下单元行为类型的明确定义与分类体系。我们确立了多维环境下合规者临界效应存在与识别的条件,阐明排除永不采纳者和总是采纳者后识别保持稳定的情形,并研究分解临界规则对单元行为的影响,从而实现非合规单元的识别与移除以改进估计。通过光电半导体制造的模拟和实际数据验证,我们的实证结果展示了优化生产策略的实用性,特别证明了该方法能降低估计方差,凸显MRD框架在制造业中的实践价值。
Summary / 总结
Motivated by the limitations of standard regression discontinuity designs (RDD) in handling multi-threshold decision-making common in production systems, this paper develops a theoretical framework for multi-score RDD (MRD) to reduce non-compliant behavior introduced by dimension reduction. The method extends unit categorization to multiple dimensions, identifies conditions for causal effect identification on compliers, and examines how decomposing complex rules affects unit behavior to allow removal of non-compliant units. Experimental validation on simulated and real-world LED manufacturing data shows that the approach decreases estimation variance and refines production policies, demonstrating practical utility.
本研究针对生产系统中常见的多阈值决策问题,指出传统断点回归设计(RDD)的局限性,开发了多分数断点回归(MRD)理论框架以减少非合规行为并改进因果效应估计。方法扩展了多维断点规则下的单位分类,确定了合规者效应识别的条件,并通过分解复杂规则来排除非合规单位。在模拟和真实光电半导体制造数据上的实验表明,该方法降低了估计方差,优化了生产策略,证明了其在制造业中的实际应用价值。
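The single-score design that the entry above takes as its starting point can be sketched as a sharp regression discontinuity estimate: fit a line on each side of the cutoff and read off the jump. The sketch below uses a uniform kernel, a hand-picked bandwidth, and noise-free synthetic data, all of which are assumptions for illustration; it does not implement the multi-score categorization or complier analysis developed in the paper.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sharp_rdd_effect(
    scores: Sequence[float], outcomes: Sequence[float], cutoff: float, bandwidth: float
) -> float:
    """Difference between the right and left fitted outcomes at the cutoff."""
    pairs = list(zip(scores, outcomes))
    # Center scores at the cutoff so each intercept is the fitted value there.
    left = [(s - cutoff, y) for s, y in pairs if cutoff - bandwidth <= s < cutoff]
    right = [(s - cutoff, y) for s, y in pairs if cutoff <= s <= cutoff + bandwidth]
    left_at_cutoff, _ = fit_line([p[0] for p in left], [p[1] for p in left])
    right_at_cutoff, _ = fit_line([p[0] for p in right], [p[1] for p in right])
    return right_at_cutoff - left_at_cutoff


# Noise-free toy data: units with score >= 0 are treated and gain 3.0.
scores: List[float] = [i / 10 for i in range(-10, 11)]
outcomes: List[float] = [1.0 + 2.0 * s + (3.0 if s >= 0 else 0.0) for s in scores]
estimate = sharp_rdd_effect(scores, outcomes, cutoff=0.0, bandwidth=0.5)
```

With several scores and thresholds, collapsing everything to one such running variable is what the abstract warns can introduce non-compliant units, which motivates the categorization the authors propose.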
GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning
Authors: Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran
First: 2025-08-21T16:13:49+00:00 · Latest: 2025-08-21T16:13:49+00:00
Comments: 23 pages, 9 tables, 3 figures
Abstract
GRAFT is a structured multimodal benchmark for evaluating models on
instruction-following, visual reasoning, and visual-textual alignment tasks. It
features programmatically generated charts and synthetically rendered tables,
created with Python visualization libraries to ensure control over data
semantics, structure, and clarity. Each GRAFT instance pairs a chart or table
image with a systematically generated, multi-step analytical question based
solely on visual content. Answers are provided in structured formats such as
JSON or YAML, supporting consistent evaluation of both reasoning and output
format. The benchmark introduces a taxonomy of reasoning types including
comparison, trend identification, ranking, aggregation, proportion estimation,
and anomaly detection to enable comprehensive assessment. Reference answers
follow strict factual and formatting guidelines for precise, aspect-based
evaluation. GRAFT offers a unified, scalable framework for fine-grained
benchmarking of multimodal models on visually grounded, structured reasoning
tasks, setting a new evaluation standard in this field.
中文标题/摘要
标题:GRAFT:图表与表格文本对齐推理——结构化指令遵循与视觉推理的基准测试集
GRAFT是一个结构化多模态基准测试集,用于评估模型在指令遵循、视觉推理及视觉-文本对齐任务上的表现。该基准采用Python可视化库程序化生成图表和合成渲染表格,确保对数据语义、结构与清晰度的精确控制。每个GRAFT实例将图表或表格图像与基于视觉内容系统生成的多步骤分析问题配对,答案以JSON或YAML等结构化格式提供,支持对推理过程和输出格式的一致性评估。基准引入了比较、趋势识别、排序、聚合、比例估算和异常检测等推理类型分类,以实现全面评估。参考答案遵循严格的事实性和格式规范,支持基于维度的精确评估。GRAFT为多模态模型在视觉基础结构化推理任务上提供了统一、可扩展的细粒度评估框架,树立了该领域新的评估标准。
Summary / 总结
The research introduces GRAFT, a benchmark designed to address the need for evaluating multimodal models on structured instruction following and visual reasoning tasks involving charts and tables. The method involves programmatically generating chart and table images using Python visualization libraries, pairing each with a multi-step analytical question derived solely from visual content, and providing answers in structured formats like JSON or YAML to ensure consistent evaluation. Experimental results demonstrate that GRAFT enables comprehensive assessment across reasoning types such as comparison, trend identification, ranking, aggregation, proportion estimation, and anomaly detection, offering a scalable framework for fine-grained benchmarking in visually grounded, structured reasoning.
该研究提出了GRAFT基准,旨在解决评估多模态模型在结构化指令遵循和视觉推理任务上的需求。方法包括使用Python可视化库程序化生成图表和表格,将每个视觉内容与多步骤分析问题配对,并以JSON或YAML等结构化格式提供答案。实验结果表明,GRAFT能够全面评估比较、趋势识别和异常检测等推理类型,为视觉基础任务的细粒度模型性能评估提供了一个可扩展的框架。