arXiv Paper Digest

Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Authors: Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica

First: 2025-05-24T21:30:29+00:00 · Latest: 2025-10-07T17:25:03+00:00


Abstract

Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively. Our code is open-sourced at https://github.com/svg-project/Sparse-VideoGen.
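To make the idea of semantic-aware permutation concrete, here is a minimal pure-Python sketch of clustering token vectors with k-means and then reordering them so that each cluster occupies a contiguous range. It is an illustration only, not the SVG2 implementation: the function names, the farthest-point initialization, the iteration count, and the toy 2-D "tokens" are all assumptions made for this example, and the real system runs customized GPU kernels on attention queries and keys.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point whose nearest existing centroid is farthest away.
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    centroids = init_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each token goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def semantic_permutation(labels):
    """Token indices reordered so that tokens of one cluster are contiguous."""
    # sorted() is stable, so original order is preserved within a cluster.
    return sorted(range(len(labels)), key=lambda i: labels[i])


# Toy example: two interleaved groups of 2-D "token" vectors.
tokens = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
labels, centroids = kmeans(tokens, k=2)
perm = semantic_permutation(labels)
permuted_tokens = [tokens[i] for i in perm]
```

In this toy input the two groups are interleaved; after the permutation the similar vectors sit next to each other, which is the "densified layout" the abstract refers to. An attention kernel could then process whole contiguous clusters instead of scattered tokens.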

Summary

This paper tackles the latency of video generation with Diffusion Transformers (DiTs) by proposing SVG2, a training-free sparse-attention framework built on semantic-aware permutation. Tokens are clustered and reordered by semantic similarity using k-means, which makes critical-token identification more accurate and lays critical tokens out contiguously, so less computation is wasted. Combined with top-p dynamic budget control and customized kernel implementations, SVG2 achieves up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively.
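The top-p budget control mentioned above can be illustrated with a generic top-p selection: rather than keeping a fixed number of clusters, keep the fewest clusters whose normalized scores add up to at least p. The sketch below is an assumption-laden illustration, not the paper's code: the function name `top_p_select`, the use of a softmax over per-cluster scores, and the example numbers are all hypothetical, and the source does not say how SVG2 scores clusters.

```python
import math


def top_p_select(scores, p=0.9):
    """Return indices of the fewest items whose softmax mass reaches p.

    `scores` are unnormalized per-cluster scores (hypothetical here);
    indices are returned in descending order of probability.
    """
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit clusters from most to least probable, accumulating mass.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return selected


# Example: scores whose softmax is roughly 0.05, 0.5, 0.15, 0.3.
example_scores = [math.log(0.05), math.log(0.5), math.log(0.15), math.log(0.3)]
kept_loose = top_p_select(example_scores, p=0.9)   # needs three clusters
kept_tight = top_p_select(example_scores, p=0.6)   # two clusters suffice
```

The budget adapts to the score distribution: a sharply peaked distribution reaches the threshold with few clusters, while a flat one keeps more of them.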
