Scaling Group Inference for Diverse and High-Quality Generation
Authors: Gaurav Parmar, Or Patashnik, Daniil Ostashev, Kuan-Chieh Wang, Kfir Aberman, Srinivasa Narasimhan, Jun-Yan Zhu
Venue: www
First: 2025-08-21T17:59:57+00:00 · Latest: 2025-08-21T17:59:57+00:00
Comments: Project website: https://www.cs.cmu.edu/~group-inference, GitHub: https://github.com/GaParmar/group-inference
Abstract
Generative models typically sample outputs independently, and recent
inference-time guidance and scaling algorithms focus on improving the quality
of individual samples. However, in real-world applications, users are often
presented with a set of multiple images (e.g., 4-8) for each prompt, where
independent sampling tends to lead to redundant results, limiting user choices
and hindering idea exploration. In this work, we introduce a scalable group
inference method that improves both the diversity and quality of a group of
samples. We formulate group inference as a quadratic integer assignment
problem: candidate outputs are modeled as graph nodes, and a subset is selected
to optimize sample quality (unary term) while maximizing group diversity
(binary term). To substantially improve runtime efficiency, we progressively
prune the candidate set using intermediate predictions, allowing our method to
scale up to large candidate sets. Extensive experiments show that our method
significantly improves group diversity and quality compared to independent
sampling baselines and recent inference algorithms. Our framework generalizes
across a wide range of tasks, including text-to-image, image-to-image, image
prompting, and video generation, enabling generative models to treat multiple
outputs as cohesive groups rather than independent samples.
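The selection objective described in the abstract above (a unary quality term plus a pairwise diversity term over a chosen subset) can be illustrated with a minimal sketch. This is not the authors' solver: it brute-forces every subset, which only works for tiny candidate sets, and it omits the progressive pruning step. The names `group_score`, `select_group`, the weight `lam`, and the toy scores are all assumptions made for illustration.

```python
from itertools import combinations


def group_score(subset, quality, distance, lam=1.0):
    """Sum of per-sample quality plus lam-weighted pairwise diversity."""
    unary = sum(quality[i] for i in subset)
    binary = sum(distance[i][j] for i, j in combinations(subset, 2))
    return unary + lam * binary


def select_group(quality, distance, k, lam=1.0):
    """Exhaustively pick the size-k subset with the highest group score."""
    n = len(quality)
    return max(
        combinations(range(n), k),
        key=lambda s: group_score(s, quality, distance, lam),
    )


# Toy data: candidates 0 and 1 are near-duplicates with the highest quality.
quality = [0.9, 0.85, 0.7, 0.5]
distance = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.8, 0.9],
    [0.8, 0.8, 0.0, 0.7],
    [0.9, 0.9, 0.7, 0.0],
]

best = select_group(quality, distance, k=2)                       # quality + diversity
top_quality_only = select_group(quality, distance, k=2, lam=0.0)  # quality alone
print(best, top_quality_only)
```

With the diversity term switched off, the two near-duplicate candidates win on quality alone; with it switched on, the selection trades a little quality for a more distinct pair.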
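Chinese Title / Abstract (English translation)
Title: Scaling Group Inference: Achieving Diverse and High-Quality Generation
Generative models typically sample outputs independently, and recent inference-time guidance and scaling algorithms focus mainly on improving the quality of individual samples. In real applications, however, users often need multiple images (e.g., 4-8) per prompt, and independent sampling tends to produce redundant results, limiting user choice and hindering creative exploration. This work proposes a scalable group inference method that improves the diversity and quality of a group of samples together. We formulate group inference as a quadratic integer assignment problem: candidate outputs are modeled as graph nodes, and a subset is selected to optimize sample quality (unary term) while maximizing group diversity (binary term). To substantially improve runtime efficiency, the candidate set is progressively pruned using intermediate predictions, allowing the method to scale to large candidate sets. Extensive experiments show that, compared with independent sampling baselines and recent inference algorithms, the method significantly improves group diversity and quality. The framework generalizes to a wide range of tasks, including text-to-image, image-to-image, image prompting, and video generation, enabling generative models to treat multiple outputs as a cohesive whole rather than as independent samples.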
Summary / 总结
The motivation is to address the redundancy in independently sampled outputs from generative models, which limits user choice and exploration when presented with multiple results per prompt. The method formulates group inference as a quadratic integer assignment problem, selecting a subset of candidate outputs to optimize both quality and diversity, and uses progressive pruning for scalability. Experimental results demonstrate significant improvements in group diversity and quality across text-to-image, image-to-image, image prompting, and video generation tasks compared to baseline methods.
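This work addresses the redundancy caused by independent sampling in generative models, which limits users' choice and exploration when each prompt returns multiple results. The method formulates group inference as a quadratic integer assignment problem, selecting a subset of candidate outputs to optimize quality and diversity, and uses progressive pruning to improve scalability. Experiments show that the method significantly improves group diversity and quality over baselines on tasks including text-to-image, image-to-image, and video generation.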
ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
Authors: Jinhyung Park, Javier Romero, Shunsuke Saito, Fabian Prada, Takaaki Shiratori, Yichen Xu, Federica Bogo, Shoou-I Yu, Kris Kitani, Rawal Khirodkar
Venue: ICCV 2025
First: 2025-08-21T17:58:56+00:00 · Latest: 2025-08-21T17:58:56+00:00
Comments: ICCV 2025; Website: https://jindapark.github.io/projects/atlas/
Abstract
Parametric body models offer expressive 3D representation of humans across a
wide range of poses, shapes, and facial expressions, typically derived by
learning a basis over registered 3D meshes. However, existing human mesh
modeling approaches struggle to capture detailed variations across diverse body
poses and shapes, largely due to limited training data diversity and
restrictive modeling assumptions. Moreover, the common paradigm first optimizes
the external body surface using a linear basis, then regresses internal
skeletal joints from surface vertices. This approach introduces problematic
dependencies between internal skeleton and outer soft tissue, limiting direct
control over body height and bone lengths. To address these issues, we present
ATLAS, a high-fidelity body model learned from 600k high-resolution scans
captured using 240 synchronized cameras. Unlike previous methods, we explicitly
decouple the shape and skeleton bases by grounding our mesh representation in
the human skeleton. This decoupling enables enhanced shape expressivity,
fine-grained customization of body attributes, and keypoint fitting independent
of external soft-tissue characteristics. ATLAS outperforms existing methods by
fitting unseen subjects in diverse poses more accurately, and quantitative
evaluations show that our non-linear pose correctives more effectively capture
complex poses compared to linear models.
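The decoupling idea in the abstract above can be made concrete with a toy sketch. It is not the ATLAS model: it uses a vertical chain of bones and a single sideways "soft-tissue" radius per joint, and every name and number here (`joints_from_bones`, `surface_from_skeleton`, the lengths and radii) is a hypothetical chosen for illustration. The point is only that joints are computed from skeleton parameters alone, so shape changes cannot move them.

```python
def joints_from_bones(bone_lengths):
    """Stack bones vertically; joint heights depend only on skeleton parameters."""
    heights = [0.0]
    for length in bone_lengths:
        heights.append(heights[-1] + length)
    return heights


def surface_from_skeleton(bone_lengths, radii):
    """Place one surface point per joint, offset sideways by a soft-tissue radius."""
    heights = joints_from_bones(bone_lengths)
    return [(radius, height) for radius, height in zip(radii, heights)]


bones = [0.45, 0.45, 0.60]        # hypothetical bone lengths in metres
slim = [0.05, 0.07, 0.12, 0.10]   # hypothetical per-joint radii
broad = [0.07, 0.10, 0.18, 0.14]

# Changing shape parameters leaves every joint where it was.
joints_slim = [h for _, h in surface_from_skeleton(bones, slim)]
joints_broad = [h for _, h in surface_from_skeleton(bones, broad)]

# Changing one bone length changes height without touching the radii.
taller = surface_from_skeleton([0.50, 0.45, 0.60], slim)
print(joints_slim == joints_broad, taller[-1][1] > joints_slim[-1])
```

In the surface-first paradigm the abstract criticizes, the dependency runs the other way: joints are regressed from surface vertices, so a change in soft tissue can shift the skeleton.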
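Chinese Title / Abstract (English translation)
Title: ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
By learning a basis over registered 3D meshes, parametric human body models provide expressive 3D representations of humans across many poses, body shapes, and facial expressions. However, because of limited training data diversity and restrictive modeling assumptions, existing methods struggle to capture detailed variations across poses and body shapes. The conventional paradigm first optimizes the body surface with a linear basis and then regresses internal skeletal joints from surface vertices, which creates an undesirable dependency between the skeleton and soft tissue and limits direct control over body height and bone lengths. To address this, we propose ATLAS, a high-fidelity body model built from 600k high-resolution scans captured with 240 synchronized cameras. By anchoring the mesh representation to the human skeleton, the method explicitly decouples the shape and skeleton bases, which enhances shape expressivity, enables fine-grained customization of body attributes, and supports keypoint fitting independent of soft-tissue characteristics. Quantitative evaluations show that ATLAS fits unseen subjects in diverse poses more accurately, and that its non-linear pose correctives capture complex poses more effectively than linear models.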
Summary / 总结
The motivation behind ATLAS is to overcome limitations in existing parametric human models, which struggle with detailed variations across poses and shapes due to data constraints and modeling assumptions, and to decouple skeletal and shape parameters for better control. The method involves learning a high-fidelity body model from 600k high-resolution scans using 240 cameras, explicitly grounding the mesh representation in the human skeleton to separate shape and skeleton bases, enabling enhanced expressivity and customization. Experimental results show that ATLAS fits unseen subjects in diverse poses more accurately than existing methods, with quantitative evaluations confirming that its non-linear pose correctives capture complex poses more effectively than linear models.
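ATLAS is motivated by the limitations of existing parametric human models in capturing detailed variations in pose and shape: these models are constrained by their data and modeling assumptions, and their skeletal and surface parameters are entangled. The method learns a high-fidelity body model from a large collection of 3D scans and explicitly decouples the shape and skeleton bases by grounding the mesh in the human skeleton, improving expressivity and control. Experiments show that ATLAS fits unseen subjects in diverse poses more accurately, and that its non-linear pose correctives capture complex poses more effectively than linear models.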
Intern-S1: A Scientific Multimodal Foundation Model
Authors: Lei Bai, Zhongrui Cai, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, Ning Ding, Nanqin Dong, Peijie Dong, Shihan Dou, Sinan Du, Haodong Duan, Caihua Fan, Ben Gao, Changjiang Gao, Jianfei Gao, Songyang Gao, Yang Gao, Zhangwei Gao, Jiaye Ge, Qiming Ge, Lixin Gu, Yuzhe Gu, Aijia Guo, Qipeng Guo, Xu Guo, Conghui He, Junjun He, Yili Hong, Siyuan Hou, Caiyu Hu, Hanglei Hu, Jucheng Hu, Ming Hu, Zhouqi Hua, Haian Huang, Junhao Huang, Xu Huang, Zixian Huang, Zhe Jiang, Lingkai Kong, Linyang Li, Peiji Li, Pengze Li, Shuaibin Li, Tianbin Li, Wei Li, Yuqiang Li, Dahua Lin, Junyao Lin, Tianyi Lin, Zhishan Lin, Hongwei Liu, Jiangning Liu, Jiyao Liu, Junnan Liu, Kai Liu, Kaiwen Liu, Kuikun Liu, Shichun Liu, Shudong Liu, Wei Liu, Xinyao Liu, Yuhong Liu, Zhan Liu, Yinquan Lu, Haijun Lv, Hongxia Lv, Huijie Lv, Qidang Lv, Ying Lv, Chengqi Lyu, Chenglong Ma, Jianpeng Ma, Ren Ma, Runmin Ma, Runyuan Ma, Xinzhu Ma, Yichuan Ma, Zihan Ma, Sixuan Mi, Junzhi Ning, Wenchang Ning, Xinle Pang, Jiahui Peng, Runyu Peng, Yu Qiao, Jiantao Qiu, Xiaoye Qu, Yuan Qu, Yuchen Ren, Fukai Shang, Wenqi Shao, Junhao Shen, Shuaike Shen, Chunfeng Song, Demin Song, Diping Song, Chenlin Su, Weijie Su, Weigao Sun, Yu Sun, Qian Tan, Cheng Tang, Huanze Tang, Kexian Tang, Shixiang Tang, Jian Tong, Aoran Wang, Bin Wang, Dong Wang, Lintao Wang, Rui Wang, Weiyun Wang, Wenhai Wang, Yi Wang, Ziyi Wang, Ling-I Wu, Wen Wu, Yue Wu, Zijian Wu, Linchen Xiao, Shuhao Xing, Chao Xu, Huihui Xu, Jun Xu, Ruiliang Xu, Wanghan Xu, GanLin Yang, Yuming Yang, Haochen Ye, Jin Ye, Shenglong Ye, Jia Yu, Jiashuo Yu, Jing Yu, Fei Yuan, Bo Zhang, Chao Zhang, Chen Zhang, Hongjie Zhang, Jin Zhang, Qiaosheng Zhang, Qiuyinzhe Zhang, Songyang Zhang, Taolin Zhang, Wenlong Zhang, Wenwei Zhang, Yechen Zhang, Ziyang Zhang, Haiteng Zhao, Qian Zhao, Xiangyu Zhao, Xiangyu Zhao, Bowen Zhou, 
Dongzhan Zhou, Peiheng Zhou, Yuhao Zhou, Yunhua Zhou, Dongsheng Zhu, Lin Zhu, Yicheng Zou
First: 2025-08-21T17:58:00+00:00 · Latest: 2025-08-21T17:58:00+00:00
Abstract
In recent years, a plethora of open-source foundation models have emerged,
achieving remarkable progress in several widely followed fields, with
performance close to that of closed-source models. However, in high-value but
more challenging scientific fields, either the field still relies on expert
models, or the progress of general foundation models lags significantly behind
that in popular areas. This is far from sufficient for transforming scientific
research and leaves a substantial gap between open-source and closed-source
models in these scientific domains. To mitigate this gap and take a further
step toward Artificial General Intelligence (AGI), we introduce Intern-S1, a
specialized generalist that combines general understanding and reasoning
capabilities with the expertise to analyze data from multiple scientific
modalities. Intern-S1 is a multimodal Mixture-of-Experts (MoE)
model with 28 billion activated parameters and 241 billion total parameters,
continually pre-trained on 5T tokens, including over 2.5T tokens from
scientific domains. In the post-training stage, Intern-S1 undergoes offline and
then online reinforcement learning (RL) in InternBootCamp, where we propose
Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks
simultaneously. Through integrated innovations in algorithms, data, and
training systems, Intern-S1 achieved top-tier performance in online RL
training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates
competitive performance on general reasoning tasks among open-source models and
significantly outperforms open-source models in scientific domains, surpassing
closed-source state-of-the-art models in professional tasks such as molecular
synthesis planning, reaction condition prediction, and thermodynamic stability
prediction for crystals. Our models are available at
https://huggingface.co/internlm/Intern-S1.
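The parameter and token counts quoted in the abstract above imply two simple ratios, worked out in the short sketch below. Only the four figures come from the abstract; the variable names are illustrative, and "over 2.5T" is treated as a lower bound.

```python
# Figures quoted in the abstract.
activated_params = 28e9      # parameters activated by the MoE routing
total_params = 241e9         # total parameters
pretrain_tokens = 5e12       # continual pre-training tokens
science_tokens_min = 2.5e12  # "over 2.5T" scientific tokens, used as a lower bound

active_fraction = activated_params / total_params
science_share_min = science_tokens_min / pretrain_tokens

print(f"{active_fraction:.1%} of parameters are activated")
print(f"at least {science_share_min:.0%} of pre-training tokens are scientific")
```

So roughly one parameter in nine is activated, and at least half of the continual pre-training data comes from scientific domains.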
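Chinese Title / Abstract (English translation)
Title: Intern-S1: A Scientific Multimodal Foundation Model
In recent years, open-source foundation models have emerged in large numbers and made remarkable progress in several widely followed fields, with performance very close to that of closed-source models. However, in high-value but more challenging scientific fields, either the field still relies on expert models, or the progress of general foundation models lags far behind that in popular areas. This is far from sufficient to transform scientific research, and a large gap remains between open-source and closed-source models in these scientific domains. To narrow this gap and take a further step toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist model with general understanding and reasoning capabilities that can also analyze multimodal scientific data. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, the model undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to coordinate RL training across more than a thousand tasks. Through integrated innovations in algorithms, data, and training systems, Intern-S1 reaches top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 is competitive among open-source models on general reasoning tasks and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models on professional tasks such as molecular synthesis planning, reaction condition prediction, and prediction of thermodynamic stability for crystals. Models are available at https://huggingface.co/internlm/Intern-S1.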
Summary / 总结
Motivated by the significant performance gap between open-source and closed-source models in scientific domains, this work introduces Intern-S1, a multimodal Mixture-of-Experts foundation model designed to enhance general reasoning and specialized scientific analysis. The method involves continual pre-training on a large-scale scientific corpus and employs a novel Mixture-of-Rewards reinforcement learning approach during post-training to handle over 1000 tasks simultaneously. Experimental results show that Intern-S1 achieves competitive performance on general reasoning tasks and significantly outperforms existing open-source models in scientific applications, even surpassing state-of-the-art closed-source models in specialized tasks like molecular synthesis planning and crystal stability prediction.
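To address the significant performance gap between open-source and closed-source models in scientific domains, this work presents Intern-S1, a multimodal Mixture-of-Experts foundation model with 28 billion activated parameters, continually pre-trained on 5T tokens (including a large amount of scientific data) and optimized with offline and online reinforcement learning using Mixture-of-Rewards. The method integrates innovations in algorithms, data, and systems to support both general reasoning and specialized scientific analysis. Experiments show that Intern-S1 is competitive on general reasoning tasks, significantly outperforms existing open-source models in scientific domains, and even surpasses state-of-the-art closed-source models on professional tasks such as molecular synthesis planning and crystal stability prediction.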
Waver: Wave Your Way to Lifelike Video Generation
Authors: Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Zehuan Yuan, Bingyue Peng
First: 2025-08-21T17:56:10+00:00 · Latest: 2025-08-21T17:56:10+00:00
Abstract
We present Waver, a high-performance foundation model for unified image and
video generation. Waver can directly generate videos with durations ranging
from 5 to 10 seconds at a native resolution of 720p, which are subsequently
upscaled to 1080p. The model simultaneously supports text-to-video (T2V),
image-to-video (I2V), and text-to-image (T2I) generation within a single,
integrated framework. We introduce a Hybrid Stream DiT architecture to enhance
modality alignment and accelerate training convergence. To ensure training data
quality, we establish a comprehensive data curation pipeline and manually
annotate and train an MLLM-based video quality model to filter for the
highest-quality samples. Furthermore, we provide detailed training and
inference recipes to facilitate the generation of high-quality videos. Building
on these contributions, Waver excels at capturing complex motion, achieving
superior motion amplitude and temporal consistency in video synthesis. Notably,
it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial
Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming
existing open-source models and matching or surpassing state-of-the-art
commercial solutions. We hope this technical report will help the community
more efficiently train high-quality video generation models and accelerate
progress in video generation technologies. Official page:
https://github.com/FoundationVision/Waver.
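The 720p-to-1080p upscaling and the 5 to 10 second durations mentioned in the abstract above translate into a few concrete numbers, sketched below. The abstract gives neither frame dimensions nor a frame rate, so the 16:9 frame sizes and the 24 fps value are assumptions used only to make the arithmetic tangible.

```python
# Assumed 16:9 frame sizes; the abstract only says "720p" and "1080p".
native = (1280, 720)
upscaled = (1920, 1080)

linear_scale = upscaled[1] / native[1]
pixel_ratio = (upscaled[0] * upscaled[1]) / (native[0] * native[1])

# Hypothetical frame rate; the abstract does not state one.
fps = 24
frames = [seconds * fps for seconds in (5, 10)]

print(linear_scale, pixel_ratio, frames)
```

Under these assumptions the upscaler enlarges each side by 1.5x, which means producing 2.25x as many pixels per frame as the native generation.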
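Chinese Title / Abstract (English translation)
Title: Waver: Wave Your Way to Lifelike Video Generation
We present Waver, a high-performance foundation model for unified image and video generation. The model can directly generate videos 5 to 10 seconds long at a native resolution of 720p, which are subsequently upscaled to 1080p. It supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single integrated framework. A Hybrid Stream DiT architecture is introduced to enhance modality alignment and accelerate training convergence. To ensure training data quality, we build a comprehensive data curation pipeline and manually annotate data to train an MLLM-based video quality model that filters for the best samples. In addition, detailed training and inference recipes are provided to facilitate high-quality video generation. Building on these contributions, Waver excels at capturing complex motion and achieves superior motion amplitude and temporal consistency. It ranks in the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. This technical report aims to help the community train high-quality video generation models more efficiently and to accelerate progress in video technology. Official page: https://github.com/FoundationVision/Waver.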
Summary / 总结
Motivated by the need for high-quality, unified video and image generation, Waver introduces a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training, alongside a rigorous data curation pipeline using an MLLM-based video quality model to filter training samples. The method supports text-to-video, image-to-video, and text-to-image generation in a single framework, producing 5-10 second videos at 720p native resolution, upscaled to 1080p. Experimentally, Waver excels in complex motion capture and temporal consistency, ranking Top 3 on T2V and I2V leaderboards, outperforming open-source models and matching or surpassing commercial solutions.
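Waver aims at high-quality, unified video and image generation, motivated by the shortcomings of existing models in complex motion and temporal consistency. The method adopts a Hybrid Stream DiT architecture to enhance modality alignment and training efficiency, combined with a rigorous data curation pipeline and an MLLM-based quality filtering model. Experiments show that the model directly generates 5-10 second videos at 720p, upscaled to 1080p, supports text-to-video, image-to-video, and text-to-image tasks, and ranks in the Top 3 on the T2V and I2V leaderboards, outperforming open-source models and matching commercial solutions.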
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Authors: Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song
First: 2025-08-21T17:55:54+00:00 · Latest: 2025-08-21T17:55:54+00:00
Abstract
Tool calling has emerged as a critical capability for AI agents to interact
with the real world and solve complex tasks. While the Model Context Protocol
(MCP) provides a powerful standardized framework for tool integration, there is
a significant gap in benchmarking how well AI agents can effectively solve
multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In
this work, we present LiveMCP-101, a benchmark of 101 carefully curated
real-world queries, refined through iterative LLM rewriting and manual review,
that require coordinated use of multiple MCP tools including web search, file
operations, mathematical reasoning, and data analysis. Moreover, we introduce a
novel evaluation approach that leverages ground-truth execution plans rather
than raw API outputs, better reflecting the evolving nature of real-world
environments. Experiments show that even frontier LLMs achieve a success rate
below 60%, highlighting major challenges in tool orchestration. Detailed
ablations and error analysis further reveal distinct failure modes and
inefficiencies in token usage, pointing to concrete directions for advancing
current models. LiveMCP-101 sets a rigorous standard for evaluating real-world
agent capabilities, advancing toward autonomous AI systems that reliably
execute complex tasks through tool use.
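The motivation for evaluating against execution plans rather than stored API outputs, as described in the abstract above, can be shown with a toy sketch. It is not the benchmark's evaluation code: the tools, the task, the agent, and the exact-match comparison are all hypothetical stand-ins, and the actual benchmark may judge answers differently. The sketch only shows why a snapshot of a live tool's output goes stale while a re-executed plan does not.

```python
def run_plan(plan, tools):
    """Execute (tool_name, argument) steps; a None argument reuses the previous result."""
    result = None
    for tool_name, argument in plan:
        result = tools[tool_name](argument if argument is not None else result)
    return result


def success_rate(tasks, agent, tools):
    """Fraction of tasks where the agent matches a freshly executed reference plan."""
    passed = 0
    for task in tasks:
        reference = run_plan(task["plan"], tools)  # computed at evaluation time
        if agent(task["query"], tools) == reference:
            passed += 1
    return passed / len(tasks)


# Hypothetical live environment: a price that can change between runs.
prices = {"AAA": 10.0}
tools = {
    "lookup": lambda symbol: prices[symbol],
    "double": lambda value: value * 2,
}
tasks = [{"query": "double the price of AAA",
          "plan": [("lookup", "AAA"), ("double", None)]}]


def agent(query, tools):
    """Toy agent that solves the single task correctly using the live tools."""
    return tools["double"](tools["lookup"]("AAA"))


recorded_output = run_plan(tasks[0]["plan"], tools)  # snapshot taken at authoring time
prices["AAA"] = 12.5                                 # the environment changes later

matches_snapshot = agent(tasks[0]["query"], tools) == recorded_output
rate = success_rate(tasks, agent, tools)
print(matches_snapshot, rate)
```

A correct agent fails against the stale snapshot but passes when the reference is regenerated from the plan, which is the property the abstract attributes to plan-based evaluation.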
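Chinese Title / Abstract (English translation)
Title: LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Tool calling has become a key capability for AI agents to interact with the real world and solve complex tasks. Although the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how effectively AI agents solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. This work presents the LiveMCP-101 benchmark, which contains 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. We also introduce a novel evaluation approach that uses ground-truth execution plans rather than raw API outputs, better reflecting the dynamic nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficient token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities and advances the development of autonomous AI systems that reliably execute complex tasks through tool use.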
Summary / 总结
Motivated by the need to benchmark AI agents' real-world tool-use capabilities beyond simple API calls, this work introduces LiveMCP-101, a benchmark of 101 challenging queries requiring coordinated use of multiple MCP tools like web search, file operations, and data analysis. The method involves curating realistic queries through LLM rewriting and manual review, and evaluating agents using ground-truth execution plans to better reflect dynamic environments. Experimental results show that even top-performing LLMs achieve below 60% success, with detailed error analysis revealing failures in tool orchestration and token inefficiency, highlighting key challenges for future model improvement.
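This work aims to evaluate the real-world tool-use capabilities of AI agents and proposes the LiveMCP-101 benchmark, which contains 101 queries requiring coordinated use of multiple MCP tools (such as web search and data analysis) and evaluates agents against ground-truth execution plans rather than raw API outputs. The queries are refined through iterative LLM rewriting and manual review to reflect dynamic scenarios. Experiments show that even top LLMs achieve a success rate below 60%, revealing challenges in tool orchestration and inefficient token usage and pointing to directions for model improvement.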