arXiv 论文速递

Snapshot: 20260524_0411

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Authors: Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Jing Shi, Ming-Hsuan Yang, Zhixin Shu

Venue: ICML 2026

First: 2026-05-21T17:59:36+00:00 · Latest: 2026-05-21T17:59:36+00:00

Comments: ICML 2026. Project page: https://motimotion.github.io/

Abstract

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.

Summary / 总结

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete.

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

Authors: Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu

Venue: CVPR 2026

First: 2026-05-21T17:58:26+00:00 · Latest: 2026-05-21T17:58:26+00:00

Comments: Accepted to CVPR 2026. Project page: https://gwxuan.github.io/AwareVLN/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in Habitat simulator show our AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: https://gwxuan.github.io/AwareVLN/.

Summary / 总结

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment.

The Value of Covariance Matching in Gaussian DDPMs and the Lanczos Sampler

Authors: Md Sahil Akhtar, Aymane El Gadarri, Vivek F. Farias, Adam D. Jozefiak

First: 2026-05-21T16:57:27+00:00 · Latest: 2026-05-21T16:57:27+00:00