arXiv Paper Digest

VoCap: Video Object Captioning and Segmentation from Any Prompt

Authors: Jasper Uijlings, Xingyi Zhou, Xiuye Gu, Arsha Nagrani, Anurag Arnab, Alireza Fathi, David Ross, Cordelia Schmid

First: 2025-08-29T17:43:58+00:00 · Latest: 2025-08-29T17:43:58+00:00

Abstract

Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such, our model simultaneously addresses the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feeding the result to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at https://github.com/google-deepmind/vocap.

Summary

VoCap is a video model that takes a video and a prompt (text, box, or mask) and produces a spatio-temporal masklet together with an object-centric caption. It jointly addresses promptable video object segmentation, referring expression segmentation, and object captioning. It is trained on SAV-Caption, a new dataset built by annotating the existing SAV segmentation dataset with VLM-generated pseudo object captions, together with a mix of other image and video datasets; the validation set carries manual annotations. The model reports state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning.

Tree-Guided Diffusion Planner

Authors: Hyeonseong Jeon, Cheolhong Min, Jaesik Park

First: 2025-08-29T17:27:44+00:00 · Latest: 2025-08-29T17:27:44+00:00

Comments: 20 pages, 11 figures, 14 tables (main paper + appendix) / under review / project page will be available after the paper becomes public in arxiv

Abstract

Planning with pretrained diffusion models has emerged as a promising approach for solving test-time guided control problems. However, standard gradient guidance typically performs optimally under convex and differentiable reward landscapes, showing substantially reduced effectiveness in real-world scenarios involving non-convex objectives, non-differentiable constraints, and multi-reward structures. Furthermore, recent supervised planning approaches require task-specific training or value estimators, which limits test-time flexibility and zero-shot generalization. We propose a Tree-guided Diffusion Planner (TDP), a zero-shot test-time planning framework that balances exploration and exploitation through structured trajectory generation. We frame test-time planning as a tree search problem using a bi-level sampling process: (1) diverse parent trajectories are produced via training-free particle guidance to encourage broad exploration, and (2) sub-trajectories are refined through fast conditional denoising guided by task objectives. TDP addresses the limitations of gradient guidance by exploring diverse trajectory regions and harnessing gradient information across this expanded solution space using only pretrained models and test-time reward signals. We evaluate TDP on three diverse tasks: maze gold-picking, robot arm block manipulation, and AntMaze multi-goal exploration. TDP consistently outperforms state-of-the-art approaches on all tasks. The project page can be found at: tree-diffusion-planner.github.io.
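
The two-level idea described in the abstract (spread candidates widely first, then refine each one locally with gradient information) can be illustrated on a toy 1-D reward. This is only an analogue under loud assumptions, not the TDP algorithm: there is no diffusion model, stratified random sampling stands in for particle guidance, and plain gradient ascent stands in for conditional denoising. The names `reward`, `refine`, and `tree_search` are hypothetical.

```python
import math
import random
from typing import Callable


def reward(x: float) -> float:
    """Toy non-convex reward: a local peak near x = -2 and the global peak near x = 3."""
    return 0.5 * math.exp(-((x + 2.0) ** 2)) + math.exp(-((x - 3.0) ** 2))


def gradient(f: Callable[[float], float], x: float, eps: float = 1e-4) -> float:
    """Central-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def refine(f: Callable[[float], float], x: float, steps: int = 200, lr: float = 0.5) -> float:
    """Plain gradient ascent (assumption: a stand-in for guided denoising)."""
    for _ in range(steps):
        x += lr * gradient(f, x)
    return x


def tree_search(f: Callable[[float], float], n_parents: int = 10, n_children: int = 4,
                low: float = -5.0, high: float = 5.0, seed: int = 0) -> float:
    """Two-level search: diverse parents first, then locally refined children."""
    rng = random.Random(seed)
    width = (high - low) / n_parents
    # Level 1: one jittered parent per stratum, so parents cover the whole range
    # (assumption: a stand-in for particle guidance).
    parents = [low + (i + rng.random()) * width for i in range(n_parents)]
    best = parents[0]
    for parent in parents:
        # Level 2: perturb each parent into a few children and refine each one.
        for _ in range(n_children):
            child = refine(f, parent + rng.uniform(-0.5, 0.5))
            if f(child) > f(best):
                best = child
    return best


# A single gradient-guided start gets stuck on the nearby local peak,
# while the two-level search also explores the basin of the global peak.
local = refine(reward, -1.0)
best = tree_search(reward)
```

In this toy setting the single start settles near x = -2, whereas the two-level search reaches the neighbourhood of x = 3, which mirrors the abstract's argument that gradient guidance alone struggles on non-convex reward landscapes.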

Summary

The Tree-guided Diffusion Planner (TDP) targets the limitations of gradient guidance on planning problems with non-convex objectives, non-differentiable constraints, and multi-reward structures. It is a zero-shot, test-time planning framework that frames planning as tree search and balances exploration and exploitation through a bi-level sampling process: diverse parent trajectories are generated via training-free particle guidance, then sub-trajectories are refined through fast conditional denoising guided by task objectives. TDP outperforms state-of-the-art approaches on all three evaluated tasks: maze gold-picking, robot arm block manipulation, and AntMaze multi-goal exploration.

PiCSAR: Probabilistic Confidence Selection And Ranking

Authors: Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen

First: 2025-08-29T17:03:47+00:00 · Latest: 2025-08-29T17:03:47+00:00

Abstract

Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
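
The scoring rule in the abstract can be sketched as follows: the joint log-likelihood of a candidate splits into a reasoning term and an answer term, and the candidate with the highest sum is selected. This is a minimal sketch under assumptions: the `Candidate` container and function names are hypothetical, per-token log-probabilities are assumed to be available from the sampling model, and raw sums are used without any normalisation, which may differ from what the paper actually does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One sampled generation with per-token log-probabilities (hypothetical container)."""
    answer: str
    reasoning_logprobs: List[float]  # log p(token | prompt, earlier tokens) over the reasoning
    answer_logprobs: List[float]     # log p(token | prompt, reasoning, earlier tokens) over the answer


def reasoning_confidence(c: Candidate) -> float:
    """log p(reasoning | prompt), as a sum of token log-probabilities."""
    return sum(c.reasoning_logprobs)


def answer_confidence(c: Candidate) -> float:
    """log p(answer | prompt, reasoning), as a sum of token log-probabilities."""
    return sum(c.answer_logprobs)


def joint_score(c: Candidate) -> float:
    """Joint log-likelihood: log p(reasoning, answer | prompt)."""
    return reasoning_confidence(c) + answer_confidence(c)


def select_best(candidates: List[Candidate]) -> Candidate:
    """Best-of-n selection by joint log-likelihood; no ground truth or reward model needed."""
    return max(candidates, key=joint_score)


# Made-up log-probabilities for three sampled candidates.
candidates = [
    Candidate("42", [-0.2, -0.1, -0.3], [-0.05]),
    Candidate("41", [-0.9, -1.2, -0.4], [-0.70]),
    Candidate("42", [-0.3, -0.6, -0.2], [-0.10]),
]
best = select_best(candidates)
```

Here the first candidate wins because both its reasoning and its answer are the most likely under the model. The log-probabilities are invented for illustration.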

Summary

PiCSAR is a training-free method for best-of-n sampling with large language models and large reasoning models: it scores each candidate generation by the joint log-likelihood of its reasoning and final answer, then selects the highest-scoring candidate. It reports gains of +10.18 on MATH500 and +9.81 on AIME2025 and outperforms baselines with at least 2x fewer samples in 16 out of 20 comparisons.

CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models

Authors: João Valente, Atabak Dehban, Rodrigo Ventura

First: 2025-08-29T15:57:43+00:00 · Latest: 2025-08-29T15:57:43+00:00

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities across various multimodal tasks. They continue, however, to struggle with trivial scenarios such as reading values from Digital Measurement Devices (DMDs), particularly in real-world conditions involving clutter, occlusions, extreme viewpoints, and motion blur, which are common in head-mounted cameras and Augmented Reality (AR) applications. Motivated by these limitations, this work introduces CAD2DMD-SET, a synthetic data generation tool designed to support visual question answering (VQA) tasks involving DMDs. By leveraging 3D CAD models, advanced rendering, and high-fidelity image composition, our tool produces diverse, VQA-labelled synthetic DMD datasets suitable for fine-tuning LVLMs. Additionally, we present DMDBench, a curated validation set of 1,000 annotated real-world images designed to evaluate model performance under practical constraints. Benchmarking three state-of-the-art LVLMs using Average Normalised Levenshtein Similarity (ANLS) and further fine-tuning LoRAs of these models with CAD2DMD-SET's generated dataset yielded substantial improvements, with InternVL showcasing a score increase of 200% without degrading on other tasks. This demonstrates that the CAD2DMD-SET training dataset substantially improves the robustness and performance of LVLMs when operating under the previously stated challenging conditions. The CAD2DMD-SET tool is expected to be released as open-source once the final version of this manuscript is prepared, allowing the community to add different measurement devices and generate their own datasets.
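
For readers unfamiliar with the ANLS metric named in the abstract, the sketch below shows how it is commonly computed: each prediction scores one minus its normalised edit distance to the closest reference, and scores whose normalised distance reaches a threshold are zeroed. The threshold of 0.5, the lowercasing and whitespace stripping, and the function names are assumptions based on common usage of the metric; the abstract does not state the exact evaluation settings.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def nls(prediction: str, target: str, threshold: float = 0.5) -> float:
    """Normalised Levenshtein similarity, zeroed when the normalised distance reaches the threshold."""
    p, t = prediction.strip().lower(), target.strip().lower()
    longest = max(len(p), len(t))
    if longest == 0:
        return 1.0
    distance = levenshtein(p, t) / longest
    return 1.0 - distance if distance < threshold else 0.0


def anls(predictions: Sequence[str], references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best similarity against any accepted reference.

    Each entry of `references` is assumed to be a non-empty list of accepted answers.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    scores = [max(nls(p, r, threshold) for r in refs)
              for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores)


# Made-up device readings: an exact match, a one-character error, and a wrong reading.
score = anls(["12.5", "98.6", "7.0"], [["12.5"], ["98.8"], ["120"]])
```

The three example readings score 1.0, 0.75, and 0.0 respectively, so a single wrong digit is penalised softly while an unrelated reading earns nothing. The readings are invented for illustration.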

Summary

This work targets the difficulty Large Vision-Language Models (LVLMs) have in reading values from Digital Measurement Devices (DMDs) under real-world conditions such as clutter, occlusions, extreme viewpoints, and motion blur. It introduces CAD2DMD-SET, a synthetic data generation tool that uses 3D CAD models, advanced rendering, and image composition to create diverse, VQA-labelled datasets, and DMDBench, a validation set of 1,000 annotated real-world images. Three state-of-the-art LVLMs were benchmarked using ANLS; fine-tuning LoRAs of these models on the dataset generated by CAD2DMD-SET yielded substantial improvements, with InternVL showing a 200% score increase without degrading on other tasks.

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

Authors: Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim

First: 2025-08-29T15:36:06+00:00 · Latest: 2025-08-29T15:36:06+00:00

Abstract

We present a novel training-free framework, PosterForest, for automated scientific poster generation. Unlike prior approaches, which largely neglect the hierarchical structure of scientific documents and the semantic integration of textual and visual elements, our method addresses both challenges directly. We introduce the Poster Tree, a hierarchical intermediate representation that jointly encodes document structure and visual-textual relationships at multiple levels. Our framework employs a multi-agent collaboration strategy, where agents specializing in content summarization and layout planning iteratively coordinate and provide mutual feedback. This approach enables the joint optimization of logical consistency, content fidelity, and visual coherence. Extensive experiments on multiple academic domains show that our method outperforms existing baselines in both qualitative and quantitative evaluations. The resulting posters achieve quality closest to expert-designed ground truth and deliver superior information preservation, structural clarity, and user preference.

Summary

PosterForest is a training-free framework for automated scientific poster generation that addresses two gaps in prior approaches: the hierarchical structure of scientific documents and the semantic integration of textual and visual elements. It uses the Poster Tree, a hierarchical intermediate representation, together with a multi-agent collaboration strategy in which content summarization and layout planning agents coordinate iteratively and give each other feedback. In experiments across multiple academic domains, PosterForest outperforms existing baselines in both qualitative and quantitative evaluations, producing posters closest in quality to expert-designed ground truth, with better information preservation, structural clarity, and user preference.