arXiv Paper Digest

2025-08-22 12:23
Snapshot: 20250822_1223
Scaling Group Inference for Diverse and High-Quality Generation
Authors: Gaurav Parmar, Or Patashnik, Daniil Ostashev, Kuan-Chieh Wang, Kfir Aberman, Srinivasa Narasimhan, Jun-Yan Zhu
Venue: www
First: 2025-08-21T17:59:57+00:00 · Latest: 2025-08-21T17:59:57+00:00
Comments: Project website: https://www.cs.cmu.edu/~group-inference, GitHub: https://github.com/GaParmar/group-inference
Abstract
Generative models typically sample outputs independently, and recent inference-time guidance and scaling algorithms focus on improving the quality of individual samples. However, in real-world applications, users are often presented with a set of multiple images (e.g., 4-8) for each prompt, where independent sampling tends to lead to redundant results, limiting user choices and hindering idea exploration. In this work, we introduce a scalable group inference method that improves both the diversity and quality of a group of samples. We formulate group inference as a quadratic integer assignment problem: candidate outputs are modeled as graph nodes, and a subset is selected to optimize sample quality (unary term) while maximizing group diversity (binary term). To substantially improve runtime efficiency, we progressively prune the candidate set using intermediate predictions, allowing our method to scale up to large candidate sets. Extensive experiments show that our method significantly improves group diversity and quality compared to independent sampling baselines and recent inference algorithms. Our framework generalizes across a wide range of tasks, including text-to-image, image-to-image, image prompting, and video generation, enabling generative models to treat multiple outputs as cohesive groups rather than independent samples.
中文标题/摘要
标题:规模化群体推理:实现多样化与高质量生成
生成模型通常独立采样输出,近期推理时引导与扩展算法主要关注提升单一样本质量。然而在实际应用中,用户常需获取每组提示对应的多个图像(如4-8张),独立采样易导致结果冗余,限制用户选择并阻碍创意探索。本研究提出可扩展的群体推理方法,同步提升样本组的多样性与质量。我们将群体推理构建为二次整数分配问题:候选输出建模为图节点,通过选择子集优化样本质量(一元项)同时最大化群体多样性(二元项)。为显著提升运行效率,采用中间预测逐步剪枝候选集,使方法能扩展至大规模候选集。大量实验表明,相比独立采样基线及近期推理算法,本方法显著提升群体多样性与质量。该框架可泛化至多种任务,包括文本到图像、图像到图像、图像提示及视频生成,使生成模型将多输出视为 cohesive 群体而非独立样本。
Summary / 总结
The motivation is to address the redundancy in independently sampled outputs from generative models, which limits user choice and idea exploration when presented with multiple results per prompt. The method formulates group inference as a quadratic integer assignment problem, selecting a subset of candidate outputs to optimize both quality and diversity, and introduces progressive candidate pruning for scalability. Experimental results demonstrate significant improvements in group diversity and quality across text-to-image, image-to-image, image prompting, and video generation tasks compared to independent sampling and recent inference algorithms.
该研究的动机是解决生成模型独立采样输出导致的冗余问题,这限制了用户在每次提示下获得多个结果时的选择和探索。方法将群体推理构建为二次整数分配问题,通过选择候选输出子集来优化质量和多样性,并采用渐进剪枝以提高可扩展性。实验结果表明,在文本到图像、图像到图像和视频生成等任务中,该方法显著提升了群体多样性和质量,优于独立采样和近期推理算法。
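The subset-selection objective described in the abstract above (a unary quality term plus a pairwise diversity term) can be illustrated with a small sketch. This is not the authors' solver: it uses brute-force enumeration, which is only feasible for tiny candidate sets, and it omits the progressive pruning step. The function names, the weight `lam`, and the toy quality and distance values are all assumptions made for illustration.

```python
from itertools import combinations

def group_score(subset, quality, distance, lam=1.0):
    """Sum of unary quality terms plus lam times the pairwise diversity terms."""
    unary = sum(quality[i] for i in subset)
    binary = sum(distance[i][j] for i, j in combinations(subset, 2))
    return unary + lam * binary

def select_group(quality, distance, k, lam=1.0):
    """Exhaustively pick the size-k subset with the highest group score.

    Hypothetical helper: brute force stands in for a real quadratic
    integer programming solver and scales poorly with candidate count.
    """
    n = len(quality)
    return max(combinations(range(n), k),
               key=lambda s: group_score(s, quality, distance, lam))

# Toy data (made up): candidates 0 and 1 are near-duplicates of high quality,
# candidate 2 is slightly lower quality but clearly different from both.
quality = [0.9, 0.88, 0.7, 0.3]
distance = [
    [0.0, 0.05, 0.8, 0.9],
    [0.05, 0.0, 0.8, 0.9],
    [0.8, 0.8, 0.0, 0.5],
    [0.9, 0.9, 0.5, 0.0],
]

diverse_pick = select_group(quality, distance, k=2, lam=1.0)
quality_only_pick = select_group(quality, distance, k=2, lam=0.0)
```

With the diversity term switched off (`lam=0.0`), the selection falls back to the two highest-quality candidates, which here are the near-duplicates; with it switched on, one of them is replaced by the more distinct candidate.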
CineScale: Free Lunch in High-Resolution Cinematic Visual Generation
Authors: Haonan Qiu, Ning Yu, Ziqi Huang, Paul Debevec, Ziwei Liu
Venue: ICCV 2025
First: 2025-08-21T17:59:57+00:00 · Latest: 2025-08-21T17:59:57+00:00
Comments: CineScale is an extended work of FreeScale (ICCV 2025). Project Page: https://eyeline-labs.github.io/CineScale/, Code Repo: https://github.com/Eyeline-Labs/CineScale
Abstract
Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from the accumulated errors. In this work, we propose CineScale, a novel inference paradigm to enable higher-resolution visual generation. To tackle the various issues introduced by the two types of video generation architectures, we propose dedicated variants tailored to each. Unlike existing baseline methods that are confined to high-resolution T2I and T2V generation, CineScale broadens the scope by enabling high-resolution I2V and V2V synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Remarkably, our approach enables 8k image generation without any fine-tuning, and achieves 4k video generation with only minimal LoRA fine-tuning. Generated video samples are available at our website: https://eyeline-labs.github.io/CineScale/.
中文标题/摘要
标题:CineScale:高分辨率电影级视觉生成中的免费午餐
视觉扩散模型虽取得显著进展,但因缺乏高分辨率数据及计算资源受限,通常仅在有限分辨率下训练,制约了其生成高保真高分辨率图像或视频的能力。近期研究探索了无需调参的策略以释放预训练模型在高分辨率视觉生成方面的潜力,但这些方法仍易产生带有重复模式的低质量内容。关键障碍在于当模型生成超出训练分辨率的视觉内容时,高频信息不可避免地增加,导致误差累积产生不良重复模式。本研究提出CineScale——一种实现更高分辨率视觉生成的新型推理范式。针对两类视频生成架构的不同问题,我们设计了专用变体方案。与现有局限于高分辨率文生图(T2I)和文生视频(T2V)的基线方法不同,CineScale基于顶尖开源视频生成框架,进一步实现了高分辨率图生视频(I2V)和视频生视频(V2V)合成。大量实验验证了我们的范式在扩展图像与视频模型高分辨率生成能力方面的优越性。值得注意的是,该方法无需微调即可实现8K图像生成,仅需极少量LoRA微调即可达成4K视频生成。生成视频样本请访问:https://eyeline-labs.github.io/CineScale/。
Summary / 总结
Motivated by the limitations of visual diffusion models in generating high-resolution content due to training constraints and repetitive artifacts, this paper introduces CineScale, a novel inference paradigm designed to enable higher-resolution visual generation without fine-tuning. The method proposes tailored variants for different video generation architectures to address accumulated high-frequency errors, extending capabilities beyond text-to-image and text-to-video to include image-to-video and video-to-video synthesis. Experimental results demonstrate superior performance, achieving 8k image generation without fine-tuning and 4k video generation with minimal LoRA fine-tuning, validated on state-of-the-art open-source frameworks.
本研究提出了CineScale,一种新颖的推理范式,旨在无需额外训练即可实现高分辨率电影级视觉生成,解决了现有模型在超出训练分辨率时因高频信息增加而产生重复模式和误差的问题。该方法针对不同视频生成架构提出了定制化变体,有效处理高频信息以生成更高质量的视觉内容。实验结果表明,CineScale实现了最先进的性能,无需微调即可生成8k图像,仅需少量LoRA微调即可生成4k视频,显著扩展了现有模型在高分辨率视觉合成方面的能力。
Visual Autoregressive Modeling for Instruction-Guided Image Editing
Authors: Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, Tao Mei
First: 2025-08-21T17:59:32+00:00 · Latest: 2025-08-21T17:59:32+00:00
Comments: Source codes and models are available at https://github.com/HiDream-ai/VAREdit
Abstract
Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods, achieving a 30%+ higher GPT-Balance score. Moreover, it completes a 512×512 edit in 1.2 seconds, making it 2.2× faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.
中文标题/摘要
标题:视觉自回归建模在指令引导图像编辑中的应用
扩散模型的最新进展为指令引导图像编辑带来了显著的视觉保真度,但其全局去噪过程本质上将编辑区域与整个图像上下文纠缠,导致意外的伪修改并削弱对编辑指令的遵循。相比之下,自回归模型通过将图像合成构建为离散视觉标记的序列过程,提供了一种独特范式。其因果组合机制天然规避了基于扩散方法的遵循难题。本文提出VAREdit——一种将图像编辑重构为下一尺度预测问题的视觉自回归(VAR)框架。通过源图像特征和文本指令的条件化,VAREdit生成多尺度目标特征以实现精确编辑。该范式的核心挑战在于如何有效条件化源图像标记。我们发现最精细尺度的源特征无法有效指导较粗糙目标特征的预测。为弥合此差距,我们引入了尺度对齐参考(SAR)模块,将尺度匹配的条件信息注入首个自注意力层。VAREdit在编辑遵循度和效率方面均取得显著进展,在标准基准测试中,其GPT平衡分数比领先的扩散方法高出30%以上,且完成512×512编辑仅需1.2秒,比同等规模的UltraEdit快2.2倍。
Summary / 总结
Motivated by the limitations of diffusion models in instruction-guided image editing, which often cause unintended changes due to global denoising, this paper introduces VAREdit, a visual autoregressive framework that treats editing as a sequential next-scale prediction task. The method conditions on source image features and text instructions, generating multi-scale target features through a novel Scale-Aligned Reference module to ensure scale-matched conditioning. Experimental results show VAREdit achieves over 30% higher GPT-Balance score than leading diffusion methods and completes 512x512 edits in 1.2 seconds, making it both more precise and efficient.
本文的动机是扩散模型在指令引导的图像编辑中存在全局去噪导致的意外修改问题,因此提出了VAREdit,一种将编辑视为序列化多尺度预测的视觉自回归框架。该方法基于源图像特征和文本指令生成目标特征,并通过尺度对齐参考模块解决源与目标尺度不匹配的调节挑战。实验结果表明,VAREdit在标准基准上比领先的扩散方法GPT-Balance分数高30%以上,并在1.2秒内完成512x512编辑,效率提升2.2倍。
SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass
Authors: Yanxu Meng, Haoning Wu, Ya Zhang, Weidi Xie
First: 2025-08-21T17:59:16+00:00 · Latest: 2025-08-21T17:59:16+00:00
Comments: Technical Report; Project Page: https://mengmouxu.github.io/SceneGen
Abstract
3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
中文标题/摘要
标题:SceneGen:单次前馈传递实现单图像3D场景生成
3D内容生成因其在VR/AR和具身智能领域的应用近期备受关注。本研究致力于解决从单张场景图像中合成多个3D资产的挑战性任务。具体贡献包括:(i)提出SceneGen框架,通过输入场景图像及物体掩码,同步生成带几何与纹理的多重3D资产,无需优化过程或资产检索;(ii)设计新型特征聚合模块,在特征提取阶段融合视觉与几何编码器的局部与全局场景信息,结合位置预测头实现单次前馈生成3D资产及其相对空间位置;(iii)展示框架对多图像输入的直接扩展能力——尽管仅使用单图像训练,架构设计支持多图像输入提升生成质量;(iv)通过大量定量与定性实验验证方法的高效性与强健生成能力。该范式为高质量3D内容生成提供了创新解决方案,有望推动下游任务的实际应用。代码与模型将开源于:https://mengmouxu.github.io/SceneGen
Summary / 总结
Motivated by the growing demand for 3D content in VR/AR and embodied AI, this paper introduces SceneGen, a framework that generates multiple 3D assets with geometry and texture from a single scene image and object masks in one feedforward pass, eliminating the need for optimization or retrieval. The method integrates a feature aggregation module that combines local and global scene information from visual and geometric encoders, along with a position head to determine spatial relationships. Experimental results demonstrate its efficiency, robust generation quality, and extensibility to multi-image inputs despite single-image training, showing strong quantitative and qualitative performance.
本研究针对VR/AR和具身AI中对3D内容日益增长的需求,提出了SceneGen框架,能够从单张场景图像和物体掩码中一次性前向生成多个带有几何和纹理的3D资产,无需优化或检索。该方法采用新颖的特征聚合模块,整合了视觉和几何编码器的局部与全局场景信息,并通过位置头确定空间关系。实验结果表明,该方法高效、生成质量稳健,且尽管仅使用单图像训练,仍可扩展至多图像输入,展现了其在实用3D内容生成中的潜力。
ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
Authors: Jinhyung Park, Javier Romero, Shunsuke Saito, Fabian Prada, Takaaki Shiratori, Yichen Xu, Federica Bogo, Shoou-I Yu, Kris Kitani, Rawal Khirodkar
Venue: ICCV 2025
First: 2025-08-21T17:58:56+00:00 · Latest: 2025-08-21T17:58:56+00:00
Comments: ICCV 2025; Website: https://jindapark.github.io/projects/atlas/
Abstract
Parametric body models offer expressive 3D representation of humans across a wide range of poses, shapes, and facial expressions, typically derived by learning a basis over registered 3D meshes. However, existing human mesh modeling approaches struggle to capture detailed variations across diverse body poses and shapes, largely due to limited training data diversity and restrictive modeling assumptions. Moreover, the common paradigm first optimizes the external body surface using a linear basis, then regresses internal skeletal joints from surface vertices. This approach introduces problematic dependencies between internal skeleton and outer soft tissue, limiting direct control over body height and bone lengths. To address these issues, we present ATLAS, a high-fidelity body model learned from 600k high-resolution scans captured using 240 synchronized cameras. Unlike previous methods, we explicitly decouple the shape and skeleton bases by grounding our mesh representation in the human skeleton. This decoupling enables enhanced shape expressivity, fine-grained customization of body attributes, and keypoint fitting independent of external soft-tissue characteristics. ATLAS outperforms existing methods by fitting unseen subjects in diverse poses more accurately, and quantitative evaluations show that our non-linear pose correctives more effectively capture complex poses compared to linear models.
中文标题/摘要
标题:ATLAS:解耦骨骼与形态参数以实现富有表现力的参数化人体建模
参数化人体模型通过基于配准三维网格学习基向量,能够广泛表达不同姿态、体型和面部表情的三维人体表征。然而,现有方法因训练数据多样性不足和建模假设限制,难以捕捉多样体态下的细节变化。传统范式先通过线性基优化体表,再从表面顶点回归内部骨骼关节点,导致骨骼与软组织间存在不良依赖,限制了直接控制身高和骨长的能力。为此,我们提出ATLAS——基于240台同步相机采集的60万高分辨率扫描数据构建的高保真人体模型。该方法通过将网格表征锚定于人体骨骼,显式解耦形态与骨骼基向量,从而增强形态表现力、实现细粒度身体属性定制,以及独立于软组织特征的关键点拟合。定量评估表明,ATLAS能更精准地拟合未知对象的多样姿态,其非线性姿态校正比线性模型更能有效捕捉复杂姿态。
Summary / 总结
The motivation behind ATLAS is to overcome limitations in existing parametric human models, which struggle with detailed variations across poses and shapes due to data constraints and modeling assumptions, and which entangle skeletal and surface parameters. The method involves learning a high-fidelity body model from 600k high-resolution scans by explicitly decoupling shape and skeleton bases, grounding the mesh representation in the human skeleton to enhance expressivity and control. Experimental results show that ATLAS fits unseen subjects in diverse poses more accurately than prior methods, with its non-linear pose correctives better capturing complex poses compared to linear models.
针对现有参数化人体模型在捕捉细节形状变化以及骨骼与表面参数耦合问题上的局限性,ATLAS提出了一种新颖方法,通过将网格表示基于人体骨骼来显式解耦形状和骨骼基。该方法利用60万高分辨率扫描进行训练,增强了形状表现力,并允许独立于软组织的细粒度定制和关键点拟合。实验结果表明,ATLAS在不同姿态下对未见过的受试者拟合效果优于现有方法,定量评估显示其非线性姿态校正比线性模型更有效地捕捉复杂姿态。
Discovering Hidden Algebraic Structures via Transformers with Rank-Aware Beam GRPO
Authors: Jaeha Lee, Gio Huh, Ning Su, Tony Yue YU
First: 2025-08-21T17:58:50+00:00 · Latest: 2025-08-21T17:58:50+00:00
Abstract
Recent efforts have extended the capabilities of transformers in logical reasoning and symbolic computations. In this work, we investigate their capacity for non-linear latent pattern discovery in the context of functional decomposition, focusing on the challenging algebraic task of multivariate polynomial decomposition. This problem, with widespread applications in science and engineering, is proved to be NP-hard, and demands both precision and insight. Our contributions are threefold: First, we develop a synthetic data generation pipeline providing fine-grained control over problem complexity. Second, we train transformer models via supervised learning and evaluate them across four key dimensions involving scaling behavior and generalizability. Third, we propose Beam Grouped Relative Policy Optimization (BGRPO), a rank-aware reinforcement learning method suitable for hard algebraic problems. Finetuning with BGRPO improves accuracy while reducing beam width by up to half, resulting in approximately 75% lower inference compute. Additionally, our model demonstrates competitive performance in polynomial simplification, outperforming Mathematica in various cases.
中文标题/摘要
标题:通过具有秩感知束GRPO的Transformer发现隐藏代数结构
近期研究扩展了Transformer在逻辑推理和符号计算方面的能力。本文探讨了其在函数分解背景下发现非线性潜在模式的能力,重点关注多元多项式分解这一具有挑战性的代数任务。该问题在科学与工程领域应用广泛,被证明是NP难问题,需要精确性与洞察力。我们的贡献有三:首先开发了能精细控制问题复杂度的合成数据生成流程;其次通过监督学习训练Transformer模型,并在涉及扩展行为和泛化能力的四个关键维度进行评估;第三提出了束分组相对策略优化(BGRPO),这是一种适用于困难代数问题的秩感知强化学习方法。使用BGRPO进行微调可在将束宽减少多达一半的同时提升准确率,推理计算量降低约75%。此外,我们的模型在多项式简化任务中展现出竞争优势,在多类案例中表现优于Mathematica。
Summary / 总结
This paper aims to enhance transformer models for discovering latent algebraic structures, specifically targeting the NP-hard problem of multivariate polynomial decomposition, which has significant applications in science and engineering. The authors develop a synthetic data generation pipeline for controlled complexity, train transformers via supervised learning, and introduce Beam Grouped Relative Policy Optimization (BGRPO), a rank-aware reinforcement learning method tailored for hard algebraic tasks. Experimental results show that finetuning with BGRPO improves accuracy while reducing beam width by up to half, cutting inference compute by approximately 75%, and the model outperforms Mathematica in polynomial simplification in various cases.
本文旨在提升Transformer模型在发现非线性潜在模式方面的能力,特别针对科学和工程中广泛应用的NP难问题——多元多项式分解。作者开发了一个合成数据生成流程以精细控制问题复杂度,通过监督学习训练Transformer模型,并提出了适用于困难代数任务的秩感知强化学习方法Beam Grouped Relative Policy Optimization(BGRPO)。实验结果表明,使用BGRPO进行微调可在将束宽减少多达一半的同时提高准确性,推理计算量降低约75%,且在多项式简化任务中表现优于Mathematica。
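The abstract above does not spell out how BGRPO scores beam candidates, so the following is only a rough sketch of the general idea of combining a rank-dependent reward with the group-relative advantage normalization used in GRPO-style methods. The reward rule (`decay ** rank` for a correct candidate) and all function names are hypothetical and should not be read as the authors' formulation.

```python
from statistics import mean, pstdev

def rank_aware_rewards(correct, decay=0.5):
    """Hypothetical reward: a correct candidate at beam rank r earns decay**r,
    an incorrect candidate earns 0. Rank 0 is the top of the beam."""
    return [decay ** rank if ok else 0.0 for rank, ok in enumerate(correct)]

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: subtract the group mean and
    divide by the group standard deviation (population form assumed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One beam of four candidates; made-up correctness flags in rank order.
beam_correct = [False, True, False, True]
rewards = rank_aware_rewards(beam_correct)
advantages = group_relative_advantages(rewards)
```

Under this assumed reward, a correct answer that appears higher in the beam receives a larger advantage than a correct answer further down, which is one plausible way to encourage the model to surface correct decompositions at narrower beam widths.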
Distributed Detection of Adversarial Attacks in Multi-Agent Reinforcement Learning with Continuous Action Space
Authors: Kiarash Kazari, Ezzeldin Shereen, György Dán
First: 2025-08-21T17:58:36+00:00 · Latest: 2025-08-21T17:58:36+00:00
Comments: Accepted for publication at ECAI 2025
Abstract
We address the problem of detecting adversarial attacks against cooperative multi-agent reinforcement learning with continuous action space. We propose a decentralized detector that relies solely on the local observations of the agents and makes use of a statistical characterization of the normal behavior of observable agents. The proposed detector utilizes deep neural networks to approximate the normal behavior of agents as parametric multivariate Gaussian distributions. Based on the predicted density functions, we define a normality score and provide a characterization of its mean and variance. This characterization allows us to employ a two-sided CUSUM procedure for detecting deviations of the normality score from its mean, serving as a detector of anomalous behavior in real-time. We evaluate our scheme on various multi-agent PettingZoo benchmarks against different state-of-the-art attack methods, and our results demonstrate the effectiveness of our method in detecting impactful adversarial attacks. Particularly, it outperforms the discrete counterpart by achieving AUC-ROC scores of over 0.95 against the most impactful attacks in all evaluated environments.
中文标题/摘要
标题:连续动作空间多智能体强化学习中对抗攻击的分布式检测
本文研究连续动作空间下协作多智能体强化学习系统遭受对抗攻击的检测问题。我们提出一种仅依赖智能体局部观测的分布式检测器,通过统计建模智能体正常行为特征,利用深度神经网络将智能体行为近似为参数化多元高斯分布。基于预测密度函数定义正态性评分并解析其均值与方差特性,进而采用双端CUSUM算法实时检测评分偏离均值的异常行为。在PettingZoo多智能体测试环境中针对多种先进攻击方法的实验表明,本方案能有效检测关键对抗攻击,尤其在所有测试场景中对最具影响力攻击的AUC-ROC分数均超过0.95,显著优于离散动作空间方案。
Summary / 总结
This paper addresses the challenge of detecting adversarial attacks in cooperative multi-agent reinforcement learning with continuous action spaces, motivated by the need for robust decentralized security. The method involves using deep neural networks to model each agent's normal behavior as parametric multivariate Gaussian distributions, then employing a two-sided CUSUM procedure to detect deviations from expected behavior based on a defined normality score. Experimental results on various PettingZoo benchmarks demonstrate high effectiveness, with AUC-ROC scores exceeding 0.95 against impactful attacks, outperforming discrete action space counterparts.
本文针对连续动作空间下协作多智能体强化学习中的对抗攻击检测问题,旨在实现去中心化的鲁棒安全防护。方法上提出了一种分散式检测器,利用深度神经网络将智能体的正常行为建模为参数化多元高斯分布,并通过双侧CUSUM过程实时监测正态性评分的偏差以检测异常。在多种多智能体PettingZoo基准测试中针对先进攻击方法的实验结果表明,该方法在所有评估环境中对最具影响力攻击的AUC-ROC分数均超过0.95,性能优于离散动作空间的对应方案。
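The detection pipeline described above (a Gaussian density model of normal behavior, a normality score, and a two-sided CUSUM test) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's detector: the diagonal covariance, the use of the log-density as the score, the `drift` and `threshold` values, and the toy score sequence are all choices made here for illustration. In the paper the Gaussian parameters come from a neural network, which is omitted.

```python
import math

def gaussian_log_density(action, mu, sigma):
    """Log-density of an action under a diagonal Gaussian (an assumption;
    the abstract only says 'parametric multivariate Gaussian')."""
    return sum(
        -0.5 * math.log(2 * math.pi * s * s) - (a - m) ** 2 / (2 * s * s)
        for a, m, s in zip(action, mu, sigma)
    )

def two_sided_cusum(scores, mean, std, drift=0.5, threshold=5.0):
    """Return the index of the first alarm, or None if no alarm is raised.

    Scores are standardized with the given mean and std, then two
    one-sided CUSUM statistics track upward and downward shifts.
    """
    g_pos = g_neg = 0.0
    for t, s in enumerate(scores):
        z = (s - mean) / std
        g_pos = max(0.0, g_pos + z - drift)
        g_neg = max(0.0, g_neg - z - drift)
        if g_pos > threshold or g_neg > threshold:
            return t
    return None

# Made-up standardized scores: nominal behavior, then a sustained drop
# such as an attacked agent might produce.
normal_scores = [0.1, -0.2, 0.0, 0.3, -0.1]
attacked_scores = normal_scores + [-2.0] * 5

no_alarm = two_sided_cusum(normal_scores, mean=0.0, std=1.0)
alarm_index = two_sided_cusum(attacked_scores, mean=0.0, std=1.0)
```

The `drift` parameter sets how large a shift must be before it accumulates, and `threshold` trades detection delay against false alarms; both would need tuning per environment.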
Intern-S1: A Scientific Multimodal Foundation Model
Authors: Lei Bai, Zhongrui Cai, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, Ning Ding, Nanqin Dong, Peijie Dong, Shihan Dou, Sinan Du, Haodong Duan, Caihua Fan, Ben Gao, Changjiang Gao, Jianfei Gao, Songyang Gao, Yang Gao, Zhangwei Gao, Jiaye Ge, Qiming Ge, Lixin Gu, Yuzhe Gu, Aijia Guo, Qipeng Guo, Xu Guo, Conghui He, Junjun He, Yili Hong, Siyuan Hou, Caiyu Hu, Hanglei Hu, Jucheng Hu, Ming Hu, Zhouqi Hua, Haian Huang, Junhao Huang, Xu Huang, Zixian Huang, Zhe Jiang, Lingkai Kong, Linyang Li, Peiji Li, Pengze Li, Shuaibin Li, Tianbin Li, Wei Li, Yuqiang Li, Dahua Lin, Junyao Lin, Tianyi Lin, Zhishan Lin, Hongwei Liu, Jiangning Liu, Jiyao Liu, Junnan Liu, Kai Liu, Kaiwen Liu, Kuikun Liu, Shichun Liu, Shudong Liu, Wei Liu, Xinyao Liu, Yuhong Liu, Zhan Liu, Yinquan Lu, Haijun Lv, Hongxia Lv, Huijie Lv, Qidang Lv, Ying Lv, Chengqi Lyu, Chenglong Ma, Jianpeng Ma, Ren Ma, Runmin Ma, Runyuan Ma, Xinzhu Ma, Yichuan Ma, Zihan Ma, Sixuan Mi, Junzhi Ning, Wenchang Ning, Xinle Pang, Jiahui Peng, Runyu Peng, Yu Qiao, Jiantao Qiu, Xiaoye Qu, Yuan Qu, Yuchen Ren, Fukai Shang, Wenqi Shao, Junhao Shen, Shuaike Shen, Chunfeng Song, Demin Song, Diping Song, Chenlin Su, Weijie Su, Weigao Sun, Yu Sun, Qian Tan, Cheng Tang, Huanze Tang, Kexian Tang, Shixiang Tang, Jian Tong, Aoran Wang, Bin Wang, Dong Wang, Lintao Wang, Rui Wang, Weiyun Wang, Wenhai Wang, Yi Wang, Ziyi Wang, Ling-I Wu, Wen Wu, Yue Wu, Zijian Wu, Linchen Xiao, Shuhao Xing, Chao Xu, Huihui Xu, Jun Xu, Ruiliang Xu, Wanghan Xu, GanLin Yang, Yuming Yang, Haochen Ye, Jin Ye, Shenglong Ye, Jia Yu, Jiashuo Yu, Jing Yu, Fei Yuan, Bo Zhang, Chao Zhang, Chen Zhang, Hongjie Zhang, Jin Zhang, Qiaosheng Zhang, Qiuyinzhe Zhang, Songyang Zhang, Taolin Zhang, Wenlong Zhang, Wenwei Zhang, Yechen Zhang, Ziyang Zhang, Haiteng Zhao, Qian Zhao, Xiangyu Zhao, Xiangyu Zhao, Bowen Zhou, 
Dongzhan Zhou, Peiheng Zhou, Yuhao Zhou, Yunhua Zhou, Dongsheng Zhu, Lin Zhu, Yicheng Zou
First: 2025-08-21T17:58:00+00:00 · Latest: 2025-08-21T17:58:00+00:00
Abstract
In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance being quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared to those in popular areas, far from sufficient for transforming scientific research and leaving a substantial gap between open-source models and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities with expertise to analyze multiple science modal data. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks, such as molecular synthesis planning, reaction condition prediction, and predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.
中文标题/摘要
标题:Intern-S1:科学多模态基础模型
近年来,众多开源基础模型涌现,在部分广受关注的领域取得显著进展,性能已十分接近闭源模型。然而,在高价值但更具挑战性的科学专业领域,这些领域仍依赖专家模型,或通用基础模型的进展远落后于热门领域,远不足以变革科学研究,且开源模型与闭源模型在这些科学领域存在巨大差距。为缩小这一差距并探索迈向通用人工智能(AGI)的进一步步伐,我们推出了Intern-S1,这是一个具备通用理解与推理能力、专长于分析多科学模态数据的专业通才模型。Intern-S1是一个多模态混合专家(MoE)模型,拥有280亿激活参数和2410亿总参数,基于5T token(其中包含超过2.5T科学领域token)进行持续预训练。在后训练阶段,Intern-S1在InternBootCamp中经历离线及在线强化学习(RL),我们提出混合奖励(MoR)方法以协同推进超过1000项任务的RL训练。通过算法、数据和训练系统的集成创新,Intern-S1在在线RL训练中达到顶尖性能。在综合评估基准测试中,Intern-S1在开源模型中展现出通用推理任务的竞争优势,并在科学领域显著超越开源模型,在分子合成规划、反应条件预测、晶体热力学稳定性预测等专业任务中超越闭源最先进模型。模型详见:https://huggingface.co/internlm/Intern-S1。
Summary / 总结
The motivation behind Intern-S1 is to address the performance gap between open-source and closed-source foundation models in scientific domains, where general models lag significantly, hindering scientific research transformation and progress toward AGI. The method involves building a multimodal Mixture-of-Experts model with 28 billion activated parameters, pre-trained on 5T tokens including substantial scientific data, and enhanced through offline and online reinforcement learning with a proposed Mixture-of-Rewards approach to handle over 1000 tasks simultaneously. Experimental results show that Intern-S1 achieves competitive performance on general reasoning tasks among open-source models and significantly outperforms them in scientific domains, even surpassing state-of-the-art closed-source models in specialized tasks like molecular synthesis planning and crystal stability prediction.
Intern-S1的动机是解决开源与闭源基础模型在科学领域的性能差距问题,这些领域通用模型常落后于专家系统,阻碍科研转型和通用人工智能发展。方法上构建了280亿激活参数的多模态混合专家模型,基于5万亿token(其中超2.5万亿来自科学数据)持续预训练,并通过离线与在线强化学习及混合奖励机制在1000多个任务上微调。实验结果表明,Intern-S1在通用推理任务上达到开源模型领先水平,在科学领域显著优于开源模型,甚至在分子合成规划、晶体稳定性预测等专业任务上超越了最先进的闭源模型。
Waver: Wave Your Way to Lifelike Video Generation
Authors: Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Zehuan Yuan, Bingyue Peng
First: 2025-08-21T17:56:10+00:00 · Latest: 2025-08-21T17:56:10+00:00
Abstract
We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.
中文标题/摘要
标题:Waver:以波动方式实现逼真视频生成
我们推出Waver,一个用于统一图像与视频生成的高性能基础模型。该模型能直接生成长度5至10秒、原生分辨率720p的视频,并支持上采样至1080p。在单一集成框架内,Waver同步支持文本生成视频(T2V)、图像生成视频(I2V)及文本生成图像(T2I)功能。通过引入混合流式DiT架构,我们增强了模态对齐能力并加速了训练收敛。为确保训练数据质量,建立了全流程数据筛选机制,并人工标注训练基于MLLM的视频质量评估模型以筛选最优样本。此外,我们提供详尽的训练与推理方案以促进高质量视频生成。基于这些创新,Waver在捕捉复杂运动方面表现卓越,实现了优异的运动幅度与时间一致性。值得注意的是,截至2025年7月30日北京时间10时,该模型在Artificial Analysis平台的T2V和I2V排行榜均位列前三,持续超越现有开源模型,媲美或领先最先进商业解决方案。本技术报告旨在助力社区更高效训练高质量视频生成模型,加速视频技术发展。官方页面:https://github.com/FoundationVision/Waver。
Summary / 总结
Motivated by the need for high-quality, unified video and image generation, Waver introduces a Hybrid Stream DiT architecture to enhance modality alignment and training efficiency, supported by a rigorous data curation pipeline and an MLLM-based quality filter. The method enables direct generation of 5-10 second videos at 720p, upscaled to 1080p, and supports text-to-video, image-to-video, and text-to-image tasks in a single framework. Experimental results show Waver excels in complex motion capture and temporal consistency, ranking Top 3 on T2V and I2V leaderboards, outperforming open-source models and matching state-of-the-art commercial solutions.
Waver旨在开发一个统一且高质量的视频生成模型,其动机是解决现有方法在模态对齐和训练效率上的不足。该方法采用混合流DiT架构,结合严格的数据筛选流程和基于MLLM的质量过滤模型,支持文本到视频、图像到视频和文本到图像的单框架生成,可原生生成5-10秒720p视频并上采样至1080p。实验结果表明,Waver在复杂运动和时间一致性上表现优异,在T2V和I2V排行榜中位列前三,超越开源模型并媲美最先进的商业解决方案。
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Authors: Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song
First: 2025-08-21T17:55:54+00:00 · Latest: 2025-08-21T17:55:54+00:00
Abstract
Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
中文标题/摘要
标题:LiveMCP-101:对支持MCP的智能体在挑战性查询下的压力测试与诊断
工具调用已成为AI智能体与现实世界交互并解决复杂任务的关键能力。虽然模型上下文协议(MCP)为工具集成提供了强大的标准化框架,但在基准测试AI智能体如何在真实动态场景中有效使用多样化MCP工具解决多步骤任务方面存在显著空白。本研究推出LiveMCP-101基准测试,包含101个精心筛选的真实世界查询,通过迭代式LLM重写和人工审核优化,要求协调使用包括网络搜索、文件操作、数学推理和数据分析在内的多种MCP工具。此外,我们引入了一种新颖的评估方法,利用真实执行计划而非原始API输出,更好地反映现实环境的动态特性。实验表明,即使前沿LLMs的成功率也低于60%,突显了工具协调方面的重大挑战。详细的消融实验和错误分析进一步揭示了不同的故障模式和令牌使用效率低下问题,为推进现有模型指明了具体方向。LiveMCP-101为评估真实世界智能体能力设立了严格标准,推动通过工具使用可靠执行复杂任务的自主AI系统发展。
Summary / 总结
Motivated by the need to benchmark AI agents' real-world tool-use capabilities beyond simple API calls, this work introduces LiveMCP-101, a benchmark of 101 complex queries requiring multi-step tool orchestration via the Model Context Protocol. The method involves curating realistic queries through LLM rewriting and manual refinement, and evaluating agents using ground-truth execution plans rather than raw API outputs to better reflect dynamic environments. Experimental results show that even top-performing LLMs achieve below 60% success, with detailed error analysis revealing persistent challenges in tool coordination and token inefficiency.
LiveMCP-101 的动机是解决缺乏基准测试来评估 AI 代理在现实动态场景中使用多样化 MCP 工具解决多步骤任务的能力。该方法包括创建一个包含 101 个需要协调工具使用的真实世界查询的基准,通过 LLM 重写和人工审查进行精炼,并引入一种基于真实执行计划而非原始 API 输出的新颖评估方法。主要实验结果表明,即使是最先进的 LLM 成功率也低于 60%,详细的错误分析揭示了不同的失败模式和令牌使用效率低下,突显了工具编排方面的挑战。