arXiv Paper Digest

Snapshot: 20260518_0409

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

Authors: Yifan Wang, Tong He

First: 2026-05-14T17:58:26+00:00 · Latest: 2026-05-14T17:58:26+00:00

Comments: Project page: https://yyfz.github.io/warp-as-history/

Abstract

Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.

Summary

Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories.

Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

Authors: Kam Man Wu, Haolin Yang, Qingyu Chen, Yihu Tang, Jingye Chen, Qifeng Chen

First: 2026-05-14T17:55:11+00:00 · Latest: 2026-05-14T17:55:11+00:00

Comments: 22 pages, 10 figures. Code is available at https://github.com/YangHaolin0526/SynLayers

Abstract

Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors. However, both strategies face fundamental challenges in scalability. In this work, we investigate whether purely synthetic layered data can improve graphic design decomposition. We assume that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on the CLD baseline, a state-of-the-art layer decomposition framework. Based on the baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision language models, and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.

Summary

Recent advances in image generation have made it easy to produce high-quality images.

Do-Undo Bench: Reversibility for Action Understanding in Image Generation

Authors: Shweta Mahajan, Shreya Kadambi, Hoang Le, Rajeev Yasarla, Apratim Bhattacharyya, Munawar Hayat, Fatih Porikli

First: 2025-12-15T18:03:42+00:00 · Latest: 2026-05-14T17:13:30+00:00

Comments: Project page: https://s-mahajan.github.io/Do-Undo-Bench/

Abstract

We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our training hypothesis requires models to simulate the outcome of a real-world action and then reverse it to the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.

Summary

We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions.

On the Cultural Anachronism and Temporal Reasoning in Vision Language Models

Authors: Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen

First: 2026-05-14T16:58:16+00:00 · Latest: 2026-05-14T16:58:16+00:00

Comments: Project Page: https://khushboo0012.github.io/tab-vlm-webpage/

Abstract

Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available on our project page.

Summary

Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms.

LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection

Authors: Mitchell Piehl, Muchao Ye

First: 2026-05-14T16:48:03+00:00 · Latest: 2026-05-14T16:48:03+00:00