arXiv Paper Digest

Snapshot: 20260423_0426

ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Authors: Zhengwentai Sun, Keru Zheng, Chenghong Li, Hongjie Liao, Xihe Yang, Heyuan Li, Yihao Zhi, Shuliang Ning, Shuguang Cui, Xiaoguang Han

First: 2026-04-21T17:47:26+00:00 · Latest: 2026-04-21T17:47:26+00:00

Abstract

Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.

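Title/Abstract (translated from Chinese)

Title: ReImagine: Rethinking High-Quality Human Video Generation via Image-First Synthesis

Human video generation remains challenging because it is difficult to jointly model human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods usually handle these factors separately, which leads to limited controllability or reduced visual quality. We revisit the problem from an image-first perspective: high-quality human appearance is learned through image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, and we introduce a training-free temporal refinement stage based on a pretrained video diffusion model. Our method generates high-quality, temporally consistent videos under different poses and viewpoints. We also release a canonical human dataset and an auxiliary model for human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.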

Summary

This work aims to improve the quality and controllability of human video generation, which requires jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. It takes an image-first approach: a pretrained image backbone supplies high-quality appearance, SMPL-X provides motion guidance, and a training-free refinement stage based on a pretrained video diffusion model enforces temporal consistency. The method produces high-quality, temporally consistent videos across diverse poses and viewpoints, and the authors release a canonical human dataset and an auxiliary model for compositional human image synthesis.

InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement

Authors: Nikita Kister, Pradyumna YM, István Sárándi, Jiayi Wang, Anna Khoreva, Gerard Pons-Moll

First: 2026-04-21T16:53:18+00:00 · Latest: 2026-04-21T16:53:18+00:00