arXiv Paper Digest

Snapshot: 20260516_0436

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

Authors: Yifan Wang, Tong He

First: 2026-05-14T17:58:26+00:00 · Latest: 2026-05-14T17:58:26+00:00

Comments: Project page: https://yyfz.github.io/warp-as-history/

Abstract

Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.

Summary

Warp-as-History is a simple interface that lets a frozen video generation model follow camera trajectories without post-training on large-scale camera-annotated videos. It constructs camera-warped pseudo-history from past observations, feeds it through the model's visual-history pathway, aligns its positional encoding with the target frames being denoised, and removes warped-history tokens that lack valid source observations. This reveals a zero-shot camera-following capability, which lightweight offline LoRA finetuning on a single camera-annotated video further improves in camera adherence, visual quality, and motion dynamics.
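The visible-token selection step mentioned above can be illustrated with a minimal sketch. It assumes a per-pixel validity mask for a warped frame is already available and that a token is kept when enough of its patch has a valid source observation. The function name `select_visible_tokens`, the patch size, and the `min_coverage` threshold are hypothetical; the abstract does not state the actual selection rule.

```python
import numpy as np

def select_visible_tokens(valid_mask: np.ndarray, patch: int,
                          min_coverage: float = 0.5) -> np.ndarray:
    """Return a boolean (H // patch, W // patch) grid marking tokens to keep.

    valid_mask: (H, W) boolean array, True where the warped pixel has a
    valid source observation. The coverage threshold is an assumption.
    """
    h, w = valid_mask.shape
    if h % patch or w % patch:
        raise ValueError("mask size must be divisible by patch size")
    # Group pixels into patch x patch blocks and measure the valid fraction per block.
    blocks = valid_mask.reshape(h // patch, patch, w // patch, patch)
    coverage = blocks.astype(np.float32).mean(axis=(1, 3))
    return coverage >= min_coverage

# Toy example: a 4x4 warped frame whose right half has no source observation.
mask = np.zeros((4, 4), dtype=bool)
mask[:, :2] = True
keep = select_visible_tokens(mask, patch=2)
print(keep)
```

In this toy case only the two left-hand tokens survive; the dropped tokens would simply not be passed to the model as history.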

Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

Authors: Kam Man Wu, Haolin Yang, Qingyu Chen, Yihu Tang, Jingye Chen, Qifeng Chen

First: 2026-05-14T17:55:11+00:00 · Latest: 2026-05-14T17:55:11+00:00

Comments: 22 pages, 10 figures. Code is available at https://github.com/YangHaolin0526/SynLayers

Abstract

Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors. However, both strategies face fundamental challenges in scalability. In this work, we investigate whether pure synthetic layered data can improve graphic design decomposition. We make the assumption that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on CLD baseline, which is a state-of-the-art layer decomposition framework. Based on the baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision language models, and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.
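To make the notion of a "flattened" output concrete, the following minimal sketch composites a stack of RGBA layers into a single RGB canvas with the standard "over" operator. Decomposition is the inverse problem: recovering such layers from the flattened result. The function name `flatten_layers` and the toy layers are illustrative assumptions and are not taken from SynLayers or CLD.

```python
import numpy as np

def flatten_layers(layers):
    """Composite RGBA layers (ordered back to front) into one RGB canvas.

    Each layer is an (H, W, 4) float array with values in [0, 1].
    Uses the standard 'over' operator on a black base canvas.
    """
    h, w, _ = layers[0].shape
    canvas = np.zeros((h, w, 3), dtype=np.float32)
    for layer in layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        # Blend the current layer over everything composited so far.
        canvas = alpha * rgb + (1.0 - alpha) * canvas
    return canvas

# Toy design: an opaque blue background and a foreground with one opaque red pixel.
background = np.zeros((2, 2, 4), dtype=np.float32)
background[..., 2] = 1.0  # blue channel
background[..., 3] = 1.0  # fully opaque
foreground = np.zeros((2, 2, 4), dtype=np.float32)
foreground[0, 0] = [1.0, 0.0, 0.0, 1.0]  # red, fully opaque
flat = flatten_layers([background, foreground])
print(flat[0, 0], flat[1, 1])
```

Once flattened, the canvas no longer records which pixels belonged to which layer, which is why post-generation editing of such outputs is hard.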

Do-Undo Bench: Reversibility for Action Understanding in Image Generation

Authors: Shweta Mahajan, Shreya Kadambi, Hoang Le, Rajeev Yasarla, Apratim Bhattacharyya, Munawar Hayat, Fatih Porikli

First: 2025-12-15T18:03:42+00:00 · Latest: 2026-05-14T17:13:30+00:00

Comments: Project page: https://s-mahajan.github.io/Do-Undo-Bench/

Abstract

We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our training hypothesis requires models to simulate the outcome of a real-world action and then reverse it to the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.

Summary

This work introduces the Do-Undo task and benchmark, which evaluate whether vision-language models can understand and generate plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing, the benchmark requires a model to simulate the outcome of an action and then reverse it to the original state, testing genuine cause-and-effect understanding. Experiments show that current models struggle with action reversibility, indicating that action understanding in multimodal systems needs improvement.

On the Cultural Anachronism and Temporal Reasoning in Vision Language Models

Authors: Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen

First: 2026-05-14T16:58:16+00:00 · Latest: 2026-05-14T16:58:16+00:00

Comments: Project Page: https://khushboo0012.github.io/tab-vlm-webpage/

Abstract

Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.

LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection

Authors: Mitchell Piehl, Muchao Ye

First: 2026-05-14T16:48:03+00:00 · Latest: 2026-05-14T16:48:03+00:00