arXiv Paper Digest

Snapshot: 20260517_0425

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

Authors: Yifan Wang, Tong He

First: 2026-05-14T17:58:26+00:00 · Latest: 2026-05-14T17:58:26+00:00

Comments: Project page: https://yyfz.github.io/warp-as-history/

Abstract

Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.

Summary

Warp-as-History proposes a simple method to enable a frozen video generation model to follow camera trajectories without training or test-time optimization. It transforms camera-induced warps into camera-warped pseudo-history, aligning positional encoding with target frames and removing invalid tokens. This method demonstrates zero-shot capability and can be further improved with lightweight offline LoRA finetuning on a single camera-annotated video, enhancing camera adherence, visual quality, and motion dynamics on unseen videos.

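Warp-as-History proposes a simple way to make a frozen video generation model follow camera trajectories without any training or test-time optimization. The method converts camera-induced warps into camera-warped pseudo-history, aligns its positional encoding with the target frames, and removes invalid tokens. It exhibits zero-shot capability and can be improved further by lightweight offline LoRA finetuning on a single camera-annotated video, which raises camera adherence, visual quality, and motion dynamics on unseen videos.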
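The two ingredients named in the abstract, a camera-induced warp and the removal of tokens without valid source observations, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how the warp is computed, so the sketch assumes a pinhole camera with known per-pixel depth, and the names `warp_with_visibility`, `visible_token_mask`, `patch`, and `min_fraction` are hypothetical.

```python
import numpy as np

def warp_with_visibility(image, depth, K, R, t):
    """Forward-warp an (H, W, C) image into a target camera.

    Assumes a pinhole camera with intrinsics K and a source-to-target
    pose (R, t). Returns the warped image and a boolean map of target
    pixels that received at least one source pixel.
    """
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project every source pixel to 3D, then move it into the target frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    pts_t = R @ pts + t.reshape(3, 1)
    z = pts_t[2]
    proj = K @ pts_t
    z_safe = np.where(z > 1e-6, z, 1.0)
    u = np.rint(proj[0] / z_safe).astype(int)
    v = np.rint(proj[1] / z_safe).astype(int)
    ok = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    idx = np.flatnonzero(ok)
    u_ok, v_ok, z_ok = u[idx], v[idx], z[idx]
    # Depth buffer: keep only the nearest source point landing on each target pixel.
    zbuf = np.full((H, W), np.inf)
    np.minimum.at(zbuf, (v_ok, u_ok), z_ok)
    nearest = z_ok <= zbuf[v_ok, u_ok]
    warped = np.zeros_like(image)
    flat = image.reshape(-1, image.shape[-1])
    warped[v_ok[nearest], u_ok[nearest]] = flat[idx[nearest]]
    return warped, np.isfinite(zbuf)

def visible_token_mask(valid, patch, min_fraction=0.5):
    """Mark a patch token as kept when enough of its pixels were observed."""
    H, W = valid.shape
    gh, gw = H // patch, W // patch
    blocks = valid[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch)
    return blocks.mean(axis=(1, 3)) >= min_fraction
```

In this sketch, tokens whose mask entry is `False` would be dropped before the warped frame is fed to the model as history. The threshold of 0.5 is an arbitrary illustrative choice.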

Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

Authors: Kam Man Wu, Haolin Yang, Qingyu Chen, Yihu Tang, Jingye Chen, Qifeng Chen

First: 2026-05-14T17:55:11+00:00 · Latest: 2026-05-14T17:55:11+00:00

Comments: 22 pages, 10 figures. Code is available at https://github.com/YangHaolin0526/SynLayers

Abstract

Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors. However, both strategies face fundamental challenges in scalability. In this work, we investigate whether pure synthetic layered data can improve graphic design decomposition. We make the assumption that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on CLD baseline, which is a state-of-the-art layer decomposition framework. Based on the baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision language models, and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.

Summary

This study investigates whether purely synthetic layered data can improve graphic design decomposition. Building on CLD, a state-of-the-art layer decomposition framework, the researchers construct a synthetic dataset called SynLayers and find that training on it alone can outperform non-scalable alternatives such as PrismLayersPro. Performance improves consistently as the training set grows, with gains beginning to saturate at around 50K samples. Synthetic data also allows balanced control over layer-count distributions, avoiding the imbalance commonly observed in real-world datasets.

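The study examines whether synthetic layered data can improve graphic design decomposition. It constructs the synthetic dataset SynLayers, uses vision language models to generate textual supervision and to predict bounding boxes, and shows that training on purely synthetic data can outperform alternatives such as PrismLayersPro, with gains that grow with data scale and begin to saturate at around 50K samples. It also finds that synthetic data makes balanced layer-count distributions achievable, which is difficult with real-world datasets.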
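To make the notions of a "flattened" image and a balanced layer-count distribution concrete, here is a minimal NumPy sketch. It is only an illustration under stated assumptions, not the SynLayers pipeline: it assumes layers are RGBA float arrays in [0, 1] ordered back to front and uses the standard "over" compositing operator. The function names `composite_layers` and `balanced_layer_counts` are hypothetical.

```python
import numpy as np

def composite_layers(layers):
    """Flatten RGBA layers (back to front) into one canvas.

    Uses the standard 'over' operator. The returned RGB channels are
    premultiplied by the accumulated alpha.
    """
    H, W, _ = layers[0].shape
    rgb = np.zeros((H, W, 3))
    alpha = np.zeros((H, W, 1))
    for layer in layers:
        a = layer[..., 3:4]
        rgb = layer[..., :3] * a + rgb * (1.0 - a)
        alpha = a + alpha * (1.0 - a)
    return np.concatenate([rgb, alpha], axis=-1)

def balanced_layer_counts(n_samples, min_layers, max_layers, seed=0):
    """Assign a layer count to each sample so every count is used
    (nearly) equally often, then shuffle the assignment."""
    rng = np.random.default_rng(seed)
    counts = np.resize(np.arange(min_layers, max_layers + 1), n_samples)
    return rng.permutation(counts)
```

A decomposition model would receive the output of something like `composite_layers` and be asked to recover the individual layers. Drawing layer counts as in `balanced_layer_counts` is one simple way to avoid an imbalanced layer-count distribution.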

Do-Undo Bench: Reversibility for Action Understanding in Image Generation

Authors: Shweta Mahajan, Shreya Kadambi, Hoang Le, Rajeev Yasarla, Apratim Bhattacharyya, Munawar Hayat, Fatih Porikli

First: 2025-12-15T18:03:42+00:00 · Latest: 2026-05-14T17:13:30+00:00

Comments: Project page: https://s-mahajan.github.io/Do-Undo-Bench/

Abstract

We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our training hypothesis requires models to simulate the outcome of a real-world action and then reverse it to the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.

Summary

This work introduces Do-Undo Bench, a benchmark for evaluating whether models can generate, and then reverse, plausible scene transformations driven by real-world actions. Unlike previous methods, which rely on prompt-based image generation and editing for action-conditioned image manipulation, Do-Undo requires a model to simulate the outcome of a real-world action and then reverse it to the original state, testing genuine cause-and-effect understanding rather than stylistic or semantic edits. The benchmark curates a high-quality set of reversible actions from real-world scenarios to enable robust action grounding. Experiments show that current models struggle with action reversibility.

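The work proposes the Do-Undo task and benchmark to evaluate how well vision-language models understand real-world actions and generate plausible scene transformations from them. Unlike earlier methods that rely on prompts for image generation and editing, the benchmark requires a model to simulate an action and then reverse it back to the original state, testing genuine causal understanding. Experiments show that current models have difficulty reversing actions, indicating that action understanding in multimodal systems needs to improve.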
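The forward-reverse requirement can be pictured as a round-trip check: apply the action, apply its inverse, and compare the result with the starting image. The sketch below is a hypothetical illustration only. The abstract does not state which metrics the benchmark uses, so mean squared error stands in for a real image-similarity measure, and `do_undo_error`, `toy_editor`, and the prompts are invented for the example.

```python
from typing import Callable, Tuple
import numpy as np

# Hypothetical interface: an editor maps (image, instruction) to a new image.
Editor = Callable[[np.ndarray, str], np.ndarray]

def do_undo_error(edit: Editor, image: np.ndarray,
                  do_prompt: str, undo_prompt: str) -> Tuple[float, float]:
    """Return (change, residual) for one do/undo round trip.

    change   : how much the 'do' step altered the image (should be > 0)
    residual : how far the restored image is from the original (ideally 0)
    Mean squared error is a stand-in metric chosen for this sketch.
    """
    after = edit(image, do_prompt)
    restored = edit(after, undo_prompt)
    change = float(np.mean((after - image) ** 2))
    residual = float(np.mean((restored - image) ** 2))
    return change, residual

def toy_editor(image: np.ndarray, prompt: str) -> np.ndarray:
    """Toy stand-in for a generative model: edits the top two rows."""
    out = image.copy()
    if prompt == "open the drawer":
        out[:2] += 1.0
    elif prompt == "close the drawer":
        out[:2] -= 1.0
    return out
```

Reporting `change` alongside `residual` matters in a check like this: a model that ignores both instructions would trivially achieve zero residual without demonstrating any action understanding.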

On the Cultural Anachronism and Temporal Reasoning in Vision Language Models

Authors: Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen

First: 2026-05-14T16:58:16+00:00 · Latest: 2026-05-14T16:58:16+00:00

Comments: Project Page: https://khushboo0012.github.io/tab-vlm-webpage/

Abstract

Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.

Summary

This work addresses the issue of cultural anachronism in Vision-Language Models (VLMs) when interpreting historical artifacts. It introduces the Temporal Anachronism Benchmark for VLMs (TAB-VLM) to evaluate temporal reasoning, using 600 questions on 1,600 Indian cultural artifacts. Evaluations of ten state-of-the-art models show significant deficiencies, with the best model achieving only 58.7% accuracy, indicating a critical limitation in VLMs for accurately interpreting cultural heritage materials, especially for non-Western cultures. The benchmark provides a foundation for improving temporal cognition in multimodal AI systems.

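This study focuses on the cultural anachronism that vision language models (VLMs) exhibit when interpreting historical artifacts. It introduces the Temporal Anachronism Benchmark (TAB-VLM), which uses 600 questions across six categories covering 1,600 Indian cultural artifacts. An evaluation of ten state-of-the-art models reveals significant shortcomings: the best model reaches only 58.7% accuracy, indicating a major limitation of VLMs in accurately interpreting cultural heritage materials, especially from non-Western visual cultures. The benchmark offers a new standard for improving temporal cognition in multimodal AI systems.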
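A benchmark organized into question categories is typically scored both per category and overall. The following sketch shows that bookkeeping in plain Python. It is a generic illustration, not the TAB-VLM evaluation code, and the record layout `(category, predicted, gold)` is an assumption.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Compute per-category and overall accuracy.

    `records` is an iterable of (category, predicted, gold) tuples;
    this layout is assumed for the sketch. Raises ZeroDivisionError
    if `records` is empty.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for category, predicted, gold in records:
        totals[category] += 1
        hits[category] += int(predicted == gold)
    per_category = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_category, overall
```

Note that the overall figure here is pooled over all questions, so it equals the mean of the per-category accuracies only when every category has the same number of questions.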

LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection

Authors: Mitchell Piehl, Muchao Ye

First: 2026-05-14T16:48:03+00:00 · Latest: 2026-05-14T16:48:03+00:00