VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
Authors: Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng
First: 2025-08-26T17:59:47+00:00 · Latest: 2025-08-26T17:59:47+00:00
Comments: Project page: https://huanngzh.github.io/VoxHammer-Page/
Abstract
3D local editing of specified regions is crucial for game industry and robot
interaction. Recent methods typically edit rendered multi-view images and then
reconstruct 3D models, but they face challenges in precisely preserving
unedited regions and overall coherence. Inspired by structured 3D generative
models, we propose VoxHammer, a novel training-free approach that performs
precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer
first predicts its inversion trajectory and obtains its inverted latents and
key-value tokens at each timestep. Subsequently, in the denoising and editing
phase, we replace the denoising features of preserved regions with the
corresponding inverted latents and cached key-value tokens. By retaining these
contextual features, this approach ensures consistent reconstruction of
preserved areas and coherent integration of edited parts. To evaluate the
consistency of preserved regions, we constructed Edit3D-Bench, a
human-annotated dataset comprising hundreds of samples, each with carefully
labeled 3D editing regions. Experiments demonstrate that VoxHammer
significantly outperforms existing methods in terms of both 3D consistency of
preserved regions and overall quality. Our method holds promise for
synthesizing high-quality edited paired data, thereby laying the data
foundation for in-context 3D generation. See our project page at
https://huanngzh.github.io/VoxHammer-Page/.
Summary / 总结
3D local editing of specified regions is crucial for game industry and robot interaction.
Articulate3D: Zero-Shot Text-Driven 3D Object Posing
Authors: Oishi Deb, Anjun Hu, Ashkan Khakzar, Philip Torr, Christian Rupprecht
First: 2025-08-26T17:59:17+00:00 · Latest: 2025-08-26T17:59:17+00:00
Comments: Project page:https://odeb1.github.io/articulate3d_page_deb/
Abstract
We propose a training-free method, Articulate3D, to pose a 3D asset through
language control. Despite advances in vision and language models, this task
remains surprisingly challenging. To achieve this goal, we decompose the
problem into two steps. We modify a powerful image-generator to create target
images conditioned on the input image and a text instruction. We then align the
mesh to the target images through a multi-view pose optimisation step. In
detail, we introduce a self-attention rewiring mechanism (RSActrl) that
decouples the source structure from pose within an image generative model,
allowing it to maintain a consistent structure across varying poses. We
observed that differentiable rendering is an unreliable signal for articulation
optimisation; instead, we use keypoints to establish correspondences between
input and target images. The effectiveness of Articulate3D is demonstrated
across a diverse range of 3D objects and free-form text prompts, successfully
manipulating poses while maintaining the original identity of the mesh.
Quantitative evaluations and a comparative user study, in which our method was
preferred over 85\% of the time, confirm its superiority over existing
approaches. Project page:https://odeb1.github.io/articulate3d_page_deb/
Summary / 总结
We propose a training-free method, Articulate3D, to pose a 3D asset through language control.
Route-and-Execute: Auditable Model-Card Matching and Specialty-Level Deployment
Authors: Shayan Vassef, Soorya Ram Shimegekar, Abhay Goyal, Koustuv Saha, Pi Zonooz, Navin Kumar
First: 2025-08-22T23:34:37+00:00 · Latest: 2025-08-26T17:13:21+00:00
Abstract
Clinical workflows are fragmented as a patchwork of scripts and task-specific
networks that often handle triage, task selection, and model deployment. These
pipelines are rarely streamlined for data science pipeline, reducing efficiency
and raising operational costs. Workflows also lack data-driven model
identification (from imaging/tabular inputs) and standardized delivery of model
outputs. In response, we present a practical, healthcare-first framework that
uses a single vision-language model (VLM) in two complementary roles. First
(Solution 1), the VLM acts as an aware model-card matcher that routes an
incoming image to the appropriate specialist model via a three-stage workflow
(modality -> primary abnormality -> model-card id). Checks are provided by (i)
stagewise prompts that allow early exit via None/Normal/Other and (ii) a
stagewise answer selector that arbitrates between the top-2 candidates at each
stage, reducing the chance of an incorrect selection and aligning the workflow
with clinical risk tolerance. Second (Solution 2), we fine-tune the VLM on
specialty-specific datasets ensuring a single model covers multiple downstream
tasks within each specialty, maintaining performance while simplifying
deployment. Across gastroenterology, hematology, ophthalmology, and pathology,
our single-model deployment matches or approaches specialized baselines.
Compared with pipelines composed of many task-specific agents, this approach
shows that one VLM can both decide and do. It may reduce effort by data
scientists, shorten monitoring, increase the transparency of model selection
(with per-stage justifications), and lower integration overhead.
Summary / 总结
Clinical workflows are fragmented as a patchwork of scripts and task-specific networks that often handle triage, task selection, and model deployment.
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
Authors: Chan-Wei Hu, Yueqi Wang, Shuo Xing, Chia-Ju Chen, Suofei Feng, Ryan Rossi, Zhengzhong Tu
First: 2025-05-29T23:32:03+00:00 · Latest: 2025-08-26T16:42:37+00:00
Comments: 16 pages
Abstract
Large Vision-Language Models (LVLMs) have made remarkable strides in
multimodal tasks such as visual question answering, visual grounding, and
complex reasoning. However, they remain limited by static training data,
susceptibility to hallucinations, and inability to verify claims against
up-to-date, external evidence, compromising their performance in dynamic
real-world applications. Retrieval-Augmented Generation (RAG) offers a
practical solution to mitigate these challenges by allowing the LVLMs to access
large-scale knowledge databases via retrieval mechanisms, thereby grounding
model outputs in factual, contextually relevant information. Here in this
paper, we conduct the first systematic dissection of the multimodal RAG
pipeline for LVLMs, explicitly investigating (1) the retrieval phase: on the
modality configurations and retrieval strategies, (2) the re-ranking stage: on
strategies to mitigate positional biases and improve the relevance of retrieved
evidence, and (3) the generation phase: we further investigate how to best
integrate retrieved candidates into the final generation process. Finally, we
extend to explore a unified agentic framework that integrates re-ranking and
generation through self-reflection, enabling LVLMs to select relevant evidence
and suppress irrelevant context dynamically. Our full-stack exploration of RAG
for LVLMs yields substantial insights, resulting in an average performance
boost of 5% without any fine-tuning.
Summary / 总结
Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning.
Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA
Authors: Karishma Thakrar, Shreyas Basavatia, Akshay Daftardar
First: 2025-07-07T22:31:56+00:00 · Latest: 2025-08-26T14:02:57+00:00
Abstract
Dermatological care via telemedicine often lacks the rich context of
in-person visits. Clinicians must make diagnoses based on a handful of images
and brief descriptions, without the benefit of physical exams, second opinions,
or reference materials. While many medical AI systems attempt to bridge these
gaps with domain-specific fine-tuning, this work hypothesized that mimicking
clinical reasoning processes could offer a more effective path forward. This
study tested seven vision-language models on medical visual question answering
across six configurations: baseline models, fine-tuned variants, and both
augmented with either reasoning layers that combine multiple model
perspectives, analogous to peer consultation, or retrieval-augmented generation
that incorporates medical literature at inference time, serving a role similar
to reference-checking. While fine-tuning degraded performance in four of seven
models with an average 30% decrease, baseline models collapsed on test data.
Clinical-inspired architectures, meanwhile, achieved up to 70% accuracy,
maintaining performance on unseen data while generating explainable,
literature-grounded outputs critical for clinical adoption. These findings
demonstrate that medical AI succeeds by reconstructing the collaborative and
evidence-based practices fundamental to clinical diagnosis.
Summary / 总结
Dermatological care via telemedicine often lacks the rich context of in-person visits.
ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval
Authors: Yi Pan, Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao
Venue: EMNLP 2025
First: 2025-08-26T13:42:48+00:00 · Latest: 2025-08-26T13:42:48+00:00
Comments: Accepted by EMNLP 2025 Findings
Abstract
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task
that involves retrieving videos based on queries relevant to only specific
segments. While existing works follow the paradigm of developing models to
process unimodal features, powerful pretrained vision-language models like CLIP
remain underexplored in this field. To bridge this gap, we propose ProPy, a
model with systematic architectural adaption of CLIP specifically designed for
PRVR. Drawing insights from the semantic relevance of multi-granularity events,
ProPy introduces two key innovations: (1) A Prompt Pyramid structure that
organizes event prompts to capture semantics at multiple granularity levels,
and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that
enables dynamic semantic interaction among events. With these designs, ProPy
achieves SOTA performance on three public datasets, outperforming previous
models by significant margins. Code is available at
https://github.com/BUAAPY/ProPy.
Summary / 总结
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments.
ForgetMe: Evaluating Selective Forgetting in Generative Models
Authors: Zhenyu Yu, Mohd Yamani Inda Idris, Pei Wang
First: 2025-04-17T01:44:57+00:00 · Latest: 2025-08-26T13:04:59+00:00
Abstract
The widespread adoption of diffusion models in image generation has increased
the demand for privacy-compliant unlearning. However, due to the
high-dimensional nature and complex feature representations of diffusion
models, achieving selective unlearning remains challenging, as existing methods
struggle to remove sensitive information while preserving the consistency of
non-sensitive regions. To address this, we propose an Automatic Dataset
Creation Framework based on prompt-based layered editing and training-free
local feature removal, constructing the ForgetMe dataset and introducing the
Entangled evaluation metric. The Entangled metric quantifies unlearning
effectiveness by assessing the similarity and consistency between the target
and background regions and supports both paired (Entangled-D) and unpaired
(Entangled-S) image data, enabling unsupervised evaluation. The ForgetMe
dataset encompasses a diverse set of real and synthetic scenarios, including
CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. We
apply LoRA fine-tuning on Stable Diffusion to achieve selective unlearning on
this dataset and validate the effectiveness of both the ForgetMe dataset and
the Entangled metric, establishing them as benchmarks for selective unlearning.
Our work provides a scalable and adaptable solution for advancing
privacy-preserving generative AI.
Summary / 总结
The widespread adoption of diffusion models in image generation has increased the demand for privacy-compliant unlearning.
Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone
Authors: Shaivi Malik, Hasnat Md Abdullah, Sriparna Saha, Amit Sheth
First: 2025-08-26T12:41:35+00:00 · Latest: 2025-08-26T12:41:35+00:00
Abstract
As Vision Language Models (VLMs) become integral to real-world applications,
understanding their demographic biases is critical. We introduce GRAS, a
benchmark for uncovering demographic biases in VLMs across gender, race, age,
and skin tone, offering the most diverse coverage to date. We further propose
the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark
five state-of-the-art VLMs and reveal concerning bias levels, with the least
biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings
also reveal a methodological insight: evaluating bias in VLMs with visual
question answering (VQA) requires considering multiple formulations of a
question. Our code, data, and evaluation results are publicly available.
Summary / 总结
As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical.
Enhancing Document VQA Models via Retrieval-Augmented Generation
Authors: Eric López, Artemis Llabrés, Ernest Valveny
First: 2025-08-26T12:32:55+00:00 · Latest: 2025-08-26T12:32:55+00:00
Comments: Accepted at Workshop on Machine Learning in Document Analysis and
Recognition (ICDAR WML 2025), Wuhan, China
Abstract
Document Visual Question Answering (Document VQA) must cope with documents
that span dozens of pages, yet leading systems still concatenate every page or
rely on very large vision-language models, both of which are memory-hungry.
Retrieval-Augmented Generation (RAG) offers an attractive alternative, first
retrieving a concise set of relevant segments before generating answers from
this selected evidence. In this paper, we systematically evaluate the impact of
incorporating RAG into Document VQA through different retrieval variants -
text-based retrieval using OCR tokens and purely visual retrieval without OCR -
across multiple models and benchmarks. Evaluated on the multi-page datasets
MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the
"concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant
achieves +5.0 ANLS improvement without requiring any text extraction. An
ablation confirms that retrieval and reranking components drive most of the
gain, whereas the layout-guided chunking strategy - proposed in several recent
works to leverage page structure - fails to help on these datasets. Our
experiments demonstrate that careful evidence selection consistently boosts
accuracy across multiple model sizes and multi-page benchmarks, underscoring
its practical value for real-world Document VQA.
Summary / 总结
Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry.
Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
Authors: Somraj Gautam, Abhirama Subramanyam Penamakuri, Abhishek Bhandari, Gaurav Harit
First: 2025-08-24T12:43:27+00:00 · Latest: 2025-08-26T12:16:26+00:00
Abstract
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA)
on cricket scorecards, designed to evaluate large vision-language models
(LVLMs) on complex numerical and cross-lingual reasoning over semi-structured
tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated
scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English
QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English
scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi
scorecards, with all questions and answers kept in English to enable controlled
cross-script evaluation. The task demands reasoning over structured numerical
data, multi-image context, and implicit domain knowledge. Empirical results
show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle
on the English subset despite it being their primary training language and
exhibit a further drop in performance on the Hindi subset. This reveals key
limitations in structure-aware visual text understanding, numerical reasoning,
and cross-lingual generalization. The dataset is publicly available via Hugging
Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM
research in this direction.
Summary / 总结
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images.
Prototype-Guided Diffusion: Visual Conditioning without External Memory
Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah
First: 2025-08-13T16:18:35+00:00 · Latest: 2025-08-26T10:29:55+00:00
Abstract
Diffusion models have emerged as a leading framework for high-quality image
generation, offering stable training and strong performance across diverse
domains. However, they remain computationally intensive, particularly during
the iterative denoising process. Latent-space models like Stable Diffusion
alleviate some of this cost by operating in compressed representations, though
at the expense of fine-grained detail. More recent approaches such as
Retrieval-Augmented Diffusion Models (RDM) address efficiency by conditioning
denoising on similar examples retrieved from large external memory banks. While
effective, these methods introduce drawbacks: they require costly storage and
retrieval infrastructure, depend on static vision-language models like CLIP for
similarity, and lack adaptability during training. We propose the Prototype
Diffusion Model (PDM), a method that integrates prototype learning directly
into the diffusion process for efficient and adaptive visual conditioning -
without external memory. Instead of retrieving reference samples, PDM
constructs a dynamic set of compact visual prototypes from clean image features
using contrastive learning. These prototypes guide the denoising steps by
aligning noisy representations with semantically relevant visual patterns,
enabling efficient generation with strong semantic grounding. Experiments show
that PDM maintains high generation quality while reducing computational and
storage overhead, offering a scalable alternative to retrieval-based
conditioning in diffusion models.
Summary / 总结
Diffusion models have emerged as a leading framework for high-quality image generation, offering stable training and strong performance across diverse domains.
M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering
Authors: Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, Ruixiang Tang
First: 2025-04-06T22:02:21+00:00 · Latest: 2025-08-26T10:19:05+00:00
Comments: COLM 2025, 30 pages, 10 figures, 16 tables
Abstract
Multimodal in-context learning (ICL) equips Large Vision-language Models
(LVLMs) with the ability to adapt to new tasks via multiple user-provided
demonstrations, without requiring any model parameter updates. However, its
effectiveness is constrained by the token-intensive nature of multimodal inputs
and the complexity of cross-modal few-shot reasoning, which together hinder
LVLMs from extracting useful patterns from demonstrations. To address these
challenges, we propose \textbf{M$^2$IV}, a novel representation engineering
approach that replaces explicit token-level demonstrations with a set of
learnable Multimodal In-context Vectors directly injected into the residual
streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA)
and multi-layer perceptrons (MLP) in the ICL process, we design a training
strategy that enables M$^2$IV to perform fine-grained semantic distillation and
robust cross-modal representation learning. M$^2$IV not only improves
performance across diverse tasks and LVLMs but also significantly reduces token
overhead, enabling graceful scaling to many-shot scenarios. To further enhance
usability, we introduce \textbf{VLibrary}, a repository that stores trained
M$^2$IVs for flexible retrieval and injection. With VLibrary, users can steer
pre-trained LVLMs in a customized manner that meets diverse requirements.
Extensive experiments demonstrate that M$^2$IV consistently outperforms vanilla
ICL and prior representation engineering baselines, achieving an average
accuracy gain of 3.74\% with substantial improvements in overall efficiency.
Summary / 总结
Multimodal in-context learning (ICL) equips Large Vision-language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates.
Video CLIP Model for Multi-View Echocardiography Interpretation
Authors: Ryo Takizawa, Satoshi Kodera, Tempei Kabayama, Ryo Matsuoka, Yuta Ando, Yuto Nakamura, Haruki Settai, Norihiko Takeda
First: 2025-04-26T05:11:15+00:00 · Latest: 2025-08-26T10:06:14+00:00
Abstract
Echocardiography records ultrasound videos of the heart, enabling clinicians
to assess cardiac function. Recent advances in large-scale vision-language
models (VLMs) have spurred interest in automating echocardiographic
interpretation. However, most existing medical VLMs rely on single-frame
(image) inputs, which can reduce diagnostic accuracy for conditions
identifiable only through cardiac motion. In addition, echocardiographic videos
are captured from multiple views, each varying in suitability for detecting
specific conditions. Leveraging multiple views may therefore improve diagnostic
performance. We developed a video-language model that processes full video
sequences from five standard views, trained on 60,747 echocardiographic
video-report pairs. We evaluated the gains in retrieval performance from video
input and multi-view support, including the contributions of various pretrained
models.
Summary / 总结
Echocardiography records ultrasound videos of the heart, enabling clinicians to assess cardiac function.
Toward Robust Medical Fairness: Debiased Dual-Modal Alignment via Text-Guided Attribute-Disentangled Prompt Learning for Vision-Language Models
Authors: Yuexuan Xia, Benteng Ma, Jiang He, Zhiyong Wang, Qi Dou, Yong Xia
First: 2025-08-26T10:01:23+00:00 · Latest: 2025-08-26T10:01:23+00:00
Abstract
Ensuring fairness across demographic groups in medical diagnosis is essential
for equitable healthcare, particularly under distribution shifts caused by
variations in imaging equipment and clinical practice. Vision-language models
(VLMs) exhibit strong generalization, and text prompts encode identity
attributes, enabling explicit identification and removal of sensitive
directions. However, existing debiasing approaches typically address vision and
text modalities independently, leaving residual cross-modal misalignment and
fairness gaps. To address this challenge, we propose DualFairVL, a multimodal
prompt-learning framework that jointly debiases and aligns cross-modal
representations. DualFairVL employs a parallel dual-branch architecture that
separates sensitive and target attributes, enabling disentangled yet aligned
representations across modalities. Approximately orthogonal text anchors are
constructed via linear projections, guiding cross-attention mechanisms to
produce fused features. A hypernetwork further disentangles attribute-related
information and generates instance-aware visual prompts, which encode
dual-modal cues for fairness and robustness. Prototype-based regularization is
applied in the visual branch to enforce separation of sensitive features and
strengthen alignment with textual anchors. Extensive experiments on eight
medical imaging datasets across four modalities show that DualFairVL achieves
state-of-the-art fairness and accuracy under both in- and out-of-distribution
settings, outperforming full fine-tuning and parameter-efficient baselines with
only 3.6M trainable parameters. Code will be released upon publication.
Summary / 总结
Ensuring fairness across demographic groups in medical diagnosis is essential for equitable healthcare, particularly under distribution shifts caused by variations in imaging equipment and clinical practice.
Weakly-Supervised 3D Visual Grounding based on Visual Language Alignment
Authors: Xiaoxu Xu, Yitian Yuan, Qiudan Zhang, Wenhui Wu, Zequn Jie, Lin Ma, Xu Wang
First: 2023-12-15T09:08:14+00:00 · Latest: 2025-08-26T08:50:35+00:00
Abstract
Learning to ground natural language queries to target objects or regions in
3D point clouds is quite essential for 3D scene understanding. Nevertheless,
existing 3D visual grounding approaches require a substantial number of
bounding box annotations for text queries, which is time-consuming and
labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly
supervised approach for 3D visual grounding based on Visual Linguistic
Alignment. Our 3D-VLA exploits the superior ability of current large-scale
vision-language models (VLMs) on aligning the semantics between texts and 2D
images, as well as the naturally existing correspondences between 2D images and
3D point clouds, and thus implicitly constructs correspondences between texts
and 3D point clouds with no need for fine-grained box annotations in the
training procedure. During the inference stage, the learned text-3D
correspondence will help us ground the text queries to the 3D target objects
even without 2D images. To the best of our knowledge, this is the first work to
investigate 3D visual grounding in a weakly supervised manner by involving
large scale vision-language models, and extensive experiments on ReferIt3D and
ScanRefer datasets demonstrate that our 3D-VLA achieves comparable and even
superior results over the fully supervised methods.
Summary / 总结
Learning to ground natural language queries to target objects or regions in 3D point clouds is quite essential for 3D scene understanding.
Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models
Authors: Rui Zhang, Zihan Wang, Tianli Yang, Hongwei Li, Wenbo Jiang, Qingchuan Zhao, Yang Liu, Guowen Xu
First: 2025-08-26T08:40:22+00:00 · Latest: 2025-08-26T08:40:22+00:00
Abstract
Vision-Language Models (VLMs) are increasingly deployed in real-world
applications, but their high inference cost makes them vulnerable to resource
consumption attacks. Prior attacks attempt to extend VLM output sequences by
optimizing adversarial images, thereby increasing inference costs. However,
these extended outputs often introduce irrelevant abnormal content,
compromising attack stealthiness. This trade-off between effectiveness and
stealthiness poses a major limitation for existing attacks. To address this
challenge, we propose \textit{Hidden Tail}, a stealthy resource consumption
attack that crafts prompt-agnostic adversarial images, inducing VLMs to
generate maximum-length outputs by appending special tokens invisible to users.
Our method employs a composite loss function that balances semantic
preservation, repetitive special token induction, and suppression of the
end-of-sequence (EOS) token, optimized via a dynamic weighting strategy.
Extensive experiments show that \textit{Hidden Tail} outperforms existing
attacks, increasing output length by up to 19.2$\times$ and reaching the
maximum token limit, while preserving attack stealthiness. These results
highlight the urgent need to improve the robustness of VLMs against
efficiency-oriented adversarial threats. Our code is available at
https://github.com/zhangrui4041/Hidden_Tail.
Summary / 总结
Vision-Language Models (VLMs) are increasingly deployed in real-world applications, but their high inference cost makes them vulnerable to resource consumption attacks.
Robust and Label-Efficient Deep Waste Detection
Authors: Hassan Abid, Khan Muhammad, Muhammad Haris Khan
First: 2025-08-26T08:34:04+00:00 · Latest: 2025-08-26T08:34:04+00:00
Comments: Accepted to BMVC 2025
Abstract
Effective waste sorting is critical for sustainable recycling, yet AI
research in this domain continues to lag behind commercial systems due to
limited datasets and reliance on legacy object detectors. In this work, we
advance AI-driven waste detection by establishing strong baselines and
introducing an ensemble-based semi-supervised learning framework. We first
benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on
the real-world ZeroWaste dataset, demonstrating that while class-only prompts
perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy.
Next, to address domain-specific limitations, we fine-tune modern
transformer-based detectors, achieving a new baseline of 51.6 mAP. We then
propose a soft pseudo-labeling strategy that fuses ensemble predictions using
spatial and consensus-aware weighting, enabling robust semi-supervised
training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations
achieve performance gains that surpass fully supervised training, underscoring
the effectiveness of scalable annotation pipelines. Our work contributes to the
research community by establishing rigorous baselines, introducing a robust
ensemble-based pseudo-labeling pipeline, generating high-quality annotations
for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models
under real-world waste sorting conditions. Our code is available at:
https://github.com/h-abid97/robust-waste-detection.
Summary / 总结
Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors.
Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods
Authors: Qinqian Lei, Bo Wang, Robby T. Tan
First: 2025-08-26T07:30:53+00:00 · Latest: 2025-08-26T07:30:53+00:00
Abstract
Prior human-object interaction (HOI) detection methods have integrated early
vision-language models (VLMs) such as CLIP, but only as supporting components
within their frameworks. In contrast, recent advances in large, generative VLMs
suggest that these models may already possess strong ability to understand
images involving HOI. This naturally raises an important question: can
general-purpose standalone VLMs effectively solve HOI detection, and how do
they compare with specialized HOI methods? Answering this requires a benchmark
that can accommodate both paradigms. However, existing HOI benchmarks such as
HICO-DET were developed before the emergence of modern VLMs, and their
evaluation protocols require exact matches to annotated HOI classes. This is
poorly aligned with the generative nature of VLMs, which often yield multiple
valid interpretations in ambiguous cases. For example, a static image may
capture a person mid-motion with a frisbee, which can plausibly be interpreted
as either "throwing" or "catching". When only "catching" is annotated, the
other, though equally plausible for the image, is marked incorrect when exact
matching is used. As a result, correct predictions might be penalized,
affecting both VLMs and HOI-specific methods. To avoid penalizing valid
predictions, we introduce a new benchmark that reformulates HOI detection as a
multiple-answer multiple-choice task, where each question includes only
ground-truth positive options and a curated set of negatives that are
constructed to reduce ambiguity (e.g., when "catching" is annotated, "throwing"
is not selected as a negative to avoid penalizing valid predictions). The
proposed evaluation protocol is the first of its kind for both VLMs and HOI
methods, enabling direct comparison and offering new insight into the current
state of progress in HOI understanding.
Summary / 总结
Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks.
Prompting with Sign Parameters for Low-resource Sign Language Instruction Generation
Authors: Md Tariquzzaman, Md Farhan Ishmam, Saiyma Sittul Muna, Md Kamrul Hasan, Hasan Mahmud
Venue: ICCV 2025
First: 2025-08-22T04:11:28+00:00 · Latest: 2025-08-26T06:32:51+00:00
Comments: CV4A11y@ICCV 2025
Abstract
Sign Language (SL) enables two-way communication for the deaf and
hard-of-hearing community, yet many sign languages remain under-resourced in
the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step
textual instructions that enable non-SL users to imitate and learn SL gestures,
promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG
dataset, used to evaluate Vision Language Models (VLMs) (i) on under-resourced
SLIG tasks, and (ii) on long-tail visual concepts, as Bengali SL is unlikely to
appear in the VLM pre-training data. To enhance zero-shot performance, we
introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL
parameters, like hand shape, motion, and orientation, directly into the textual
prompts. Subsuming standard sign parameters into the prompt makes the
instructions more structured and reproducible than free-form natural text from
vanilla prompting. We envision that our work would promote inclusivity and
advancement in SL learning systems for the under-resourced communities.
Summary / 总结
Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space.
Utilizing Training Data to Improve LLM Reasoning for Tabular Understanding
Authors: Chufan Gao, Jintai Chen, Jimeng Sun
First: 2025-08-26T04:46:54+00:00 · Latest: 2025-08-26T04:46:54+00:00
Abstract
Automated tabular understanding and reasoning are essential tasks for data
scientists. Recently, Large language models (LLMs) have become increasingly
prevalent in tabular reasoning tasks. Previous work focuses on (1) finetuning
LLMs using labeled data or (2) Training-free prompting LLM agents using
chain-of-thought (CoT). Finetuning offers dataset-specific learning at the cost
of generalizability. Training-free prompting is highly generalizable but does
not take full advantage of training data. In this paper, we propose a novel
prompting-based reasoning approach, Learn then Retrieve: LRTab, which
integrates the benefits of both by retrieving relevant information learned from
training data. We first use prompting to obtain CoT responses over the training
data. For incorrect CoTs, we prompt the LLM to predict Prompt Conditions to
avoid the error, learning insights from the data. We validate the effectiveness
of Prompt Conditions using validation data. Finally, at inference time, we
retrieve the most relevant Prompt Conditions for additional context for table
understanding. We provide comprehensive experiments on WikiTQ and Tabfact,
showing that LRTab is interpretable, cost-efficient, and can outperform
previous baselines in tabular reasoning.
Summary / 总结
Automated tabular understanding and reasoning are essential tasks for data scientists.
PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality
Authors: Nanxi Li, Zhengyue Zhao, Chaowei Xiao
First: 2025-08-26T03:45:19+00:00 · Latest: 2025-08-26T03:45:19+00:00
Abstract
Safeguarding vision-language models (VLMs) is a critical challenge, as
existing methods often suffer from over-defense, which harms utility, or rely
on shallow alignment, failing to detect complex threats that require deep
reasoning. To this end, we introduce PRISM (Principled Reasoning for Integrated
Safety in Multimodality), a system2-like framework that aligns VLMs by
embedding a structured, safety-aware reasoning process. Our framework consists
of two key components: PRISM-CoT, a dataset that teaches safety-aware
chain-of-thought reasoning, and PRISM-DPO, generated via Monte Carlo Tree
Search (MCTS) to further refine this reasoning through Direct Preference
Optimization to help obtain a delicate safety boundary. Comprehensive
evaluations demonstrate PRISM's effectiveness, achieving remarkably low attack
success rates including 0.15% on JailbreakV-28K for Qwen2-VL and 90%
improvement over the previous best method on VLBreak for LLaVA-1.5. PRISM also
exhibits strong robustness against adaptive attacks, significantly increasing
computational costs for adversaries, and generalizes effectively to
out-of-distribution challenges, reducing attack success rates to just 8.70% on
the challenging multi-image MIS benchmark. Remarkably, this robust defense is
achieved while preserving, and in some cases enhancing, model utility. To
promote reproducibility, we have made our code, data, and model weights
available at https://github.com/SaFoLab-WISC/PRISM.
Summary / 总结
Safeguarding vision-language models (VLMs) is a critical challenge, as existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning.
Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
Authors: Zesen Lyu, Dandan Zhang, Wei Ye, Fangdi Li, Zhihang Jiang, Yao Yang
Venue: EMNLP 2025
First: 2025-05-27T05:17:41+00:00 · Latest: 2025-08-26T03:25:38+00:00
Comments: Accepted by EMNLP 2025 Main conference
Abstract
Spatial reasoning is a core component of human cognition, enabling
individuals to perceive, comprehend, and interact with the physical world. It
relies on a nuanced understanding of spatial structures and inter-object
relationships, serving as the foundation for complex reasoning and
decision-making. To investigate whether current vision-language models (VLMs)
exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark
consisting of 1,100 carefully curated real-world images with high spatial
complexity. Based on this dataset, we design five tasks to rigorously evaluate
VLMs' spatial perception, structural understanding, and reasoning capabilities,
while deliberately minimizing reliance on domain-specific knowledge to better
isolate and assess the general spatial reasoning capability. We conduct a
comprehensive evaluation across 24 state-of-the-art VLMs. The results show that
even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy
and performs particularly poorly on the Order Generation task, with only 30.00%
accuracy, far below the performance exceeding 90% achieved by human
participants. This persistent gap underscores the need for continued progress,
positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for
advancing spatial reasoning research in VLMs. Our project page is at
https://zesen01.github.io/jigsaw-puzzles.
Summary / 总结
Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world.
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Authors: Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin
Venue: EMNLP
First: 2024-11-25T02:15:30+00:00 · Latest: 2025-08-26T02:48:31+00:00
Comments: Accepted by EMNLP-2025 Main. Project page:
https://szhanz.github.io/zoomeye/
Abstract
An image, especially with high-resolution, typically consists of numerous
visual elements, ranging from dominant large objects to fine-grained detailed
objects. When perceiving such images, multimodal large language models~(MLLMs)
face limitations due to the restricted input resolution of the pretrained
vision encoder and the cluttered, dense context of the image, resulting in a
focus on primary objects while easily overlooking detailed ones. In this paper,
we propose Zoom Eye, a tree search algorithm designed to navigate the
hierarchical and visual nature of images to capture relevant information. Zoom
Eye conceptualizes an image as a tree, with each children node representing a
zoomed sub-patch of the parent node and the root represents the overall image.
Moreover, Zoom Eye is model-agnostic and training-free, so it enables any MLLMs
to simulate human zooming actions by searching along the image tree from root
to leaf nodes, seeking out pertinent information, and accurately responding to
related queries. We experiment on a series of elaborate high-resolution
benchmarks and the results demonstrate that Zoom Eye not only consistently
improves the performance of a series base MLLMs with large margin~(e.g.,
LLaVA-v1.5-7B increases by 34.57\% on $V^*$ Bench and 17.88\% on HR-Bench), but
also enables small 7B MLLMs to outperform strong large models such as GPT-4o.
Our code is available at
\href{https://github.com/om-ai-lab/ZoomEye}{https://github.com/om-ai-lab/ZoomEye}.
Summary / 总结
An image, especially with high-resolution, typically consists of numerous visual elements, ranging from dominant large objects to fine-grained detailed objects.
The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation
Authors: Girish A. Koushik, Fatemeh Nazarieh, Katherine Birch, Shenbin Qian, Diptesh Kanojia
First: 2025-08-26T00:04:01+00:00 · Latest: 2025-08-26T00:04:01+00:00
Comments: Under Review
Abstract
Visual metaphor generation is a challenging task that aims to generate an
image given an input text metaphor. Inherently, it needs language understanding
to bind a source concept with a target concept, in a way that preserves meaning
while ensuring visual coherence. We propose a self-evaluating visual metaphor
generation framework that focuses on metaphor alignment. Our self-evaluation
approach combines existing metrics with our newly proposed metaphor
decomposition score and a meaning alignment (MA) metric. Within this setup, we
explore two novel approaches: a training-free pipeline that explicitly
decomposes prompts into source-target-meaning (S-T-M) mapping for image
synthesis, and a complementary training-based pipeline that improves alignment
using our proposed self-evaluation reward schema, without any large-scale
retraining. On the held-out test set, the training-free approach surpasses
strong closed baselines (GPT-4o, Imagen) on decomposition, CLIP, and MA scores,
with the training-based approach close behind. We evaluate our framework output
using a user-facing study, and observed that participants preferred GPT-4o
overall, while our training-free pipeline led open-source methods and edged
Imagen on abstract metaphors. Our analyses show S-T-M prompting helps longer or
more abstract metaphors, with closed models excelling on short, concrete cases;
we also observe sensitivity to sampler settings. Overall, structured prompting
and lightweight RL perform metaphor alignment well under modest compute, and
remaining gaps to human preference appear driven by aesthetics and sampling.
Summary / 总结
Visual metaphor generation is a challenging task that aims to generate an image given an input text metaphor.
Generic Guard AI in Stealth Game with Composite Potential Fields
Authors: Kaijie Xu, Clark Verbrugge
First: 2025-08-25T21:56:13+00:00 · Latest: 2025-08-25T21:56:13+00:00
Abstract
Guard patrol behavior is central to the immersion and strategic depth of
stealth games, while most existing systems rely on hand-crafted routes or
specialized logic that struggle to balance coverage efficiency and responsive
pursuit with believable naturalness. We propose a generic, fully explainable,
training-free framework that integrates global knowledge and local information
via Composite Potential Fields, combining three interpretable maps-Information,
Confidence, and Connectivity-into a single kernel-filtered decision criterion.
Our parametric, designer-driven approach requires only a handful of decay and
weight parameters-no retraining-to smoothly adapt across both occupancy-grid
and NavMesh-partition abstractions. We evaluate on five representative game
maps, two player-control policies, and five guard modes, confirming that our
method outperforms classical baseline methods in both capture efficiency and
patrol naturalness. Finally, we show how common stealth mechanics-distractions
and environmental elements-integrate naturally into our framework as sub
modules, enabling rapid prototyping of rich, dynamic, and responsive guard
behaviors.
Summary / 总结
Guard patrol behavior is central to the immersion and strategic depth of stealth games, while most existing systems rely on hand-crafted routes or specialized logic that struggle to balance coverage efficiency and responsive pursuit with believable naturalness.
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation
Authors: Yihe Tang, Wenlong Huang, Yingke Wang, Chengshu Li, Roy Yuan, Ruohan Zhang, Jiajun Wu, Li Fei-Fei
First: 2025-06-10T22:47:16+00:00 · Latest: 2025-08-25T19:45:29+00:00
Abstract
Understanding fine-grained object affordances is imperative for robots to
manipulate objects in unstructured environments given open-ended task
instructions. However, existing methods of visual affordance predictions often
rely on manually annotated data or conditions only on a predefined set of
tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for
distilling affordance knowledge from foundation models into a task-conditioned
affordance model without any manual annotations. By leveraging the
complementary strengths of large vision models and vision-language models, UAD
automatically annotates a large-scale dataset with detailed $<$instruction,
visual affordance$>$ pairs. Training only a lightweight task-conditioned
decoder atop frozen features, UAD exhibits notable generalization to
in-the-wild robotic scenes and to various human activities, despite only being
trained on rendered objects in simulation. Using affordance provided by UAD as
the observation space, we show an imitation learning policy that demonstrates
promising generalization to unseen object instances, object categories, and
even variations in task instructions after training on as few as 10
demonstrations. Project website: https://unsup-affordance.github.io/
Summary / 总结
Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions.
CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering
Authors: Aranya Saha, Tanvir Ahmed Khan, Ismam Nur Swapnil, Mohammad Ariful Haque
First: 2025-08-25T19:22:16+00:00 · Latest: 2025-08-25T19:22:16+00:00
Comments: 10 pages, 8 figures, Prepared for submission to IEEE Transactions on
Human-Machine Systems
Abstract
Vision-language models (VLMs) have shown significant potential for medical
tasks; however, their general-purpose nature can limit specialized diagnostic
accuracy, and their large size poses substantial inference costs for real-world
clinical deployment. To address these challenges, we introduce CLARIFY, a
Specialist-Generalist framework for dermatological visual question answering
(VQA). CLARIFY combines two components: (i) a lightweight, domain-trained image
classifier (the Specialist) that provides fast and highly accurate diagnostic
predictions, and (ii) a powerful yet compressed conversational VLM (the
Generalist) that generates natural language explanations to user queries. In
our framework, the Specialist's predictions directly guide the Generalist's
reasoning, focusing it on the correct diagnostic path. This synergy is further
enhanced by a knowledge graph-based retrieval module, which grounds the
Generalist's responses in factual dermatological knowledge, ensuring both
accuracy and reliability. This hierarchical design not only reduces diagnostic
errors but also significantly improves computational efficiency. Experiments on
our curated multimodal dermatology dataset demonstrate that CLARIFY achieves an
18\% improvement in diagnostic accuracy over the strongest baseline, a
fine-tuned, uncompressed single-line VLM, while reducing the average VRAM
requirement and latency by at least 20\% and 5\%, respectively. These results
indicate that a Specialist-Generalist system provides a practical and powerful
paradigm for building lightweight, trustworthy, and clinically viable AI
systems.
Summary / 总结
Vision-language models (VLMs) have shown significant potential for medical tasks; however, their general-purpose nature can limit specialized diagnostic accuracy, and their large size poses substantial inference costs for real-world clinical deployment.
SafeBimanual: Diffusion-based Trajectory Optimization for Safe Bimanual Manipulation
Authors: Haoyuan Deng, Wenkai Guo, Qianzhun Wang, Zhenyu Wu, Ziwei Wang
First: 2025-08-25T17:59:02+00:00 · Latest: 2025-08-25T17:59:02+00:00
Comments: Project website is at: https://denghaoyuan123.github.io/SafeBimanip/
Abstract
Bimanual manipulation has been widely applied in household services and
manufacturing, which enables the complex task completion with coordination
requirements. Recent diffusion-based policy learning approaches have achieved
promising performance in modeling action distributions for bimanual
manipulation. However, they ignored the physical safety constraints of bimanual
manipulation, which leads to the dangerous behaviors with damage to robots and
objects. To this end, we propose a test-time trajectory optimization framework
named SafeBimanual for any pre-trained diffusion-based bimanual manipulation
policies, which imposes the safety constraints on bimanual actions to avoid
dangerous robot behaviors with improved success rate. Specifically, we design
diverse cost functions for safety constraints in different dual-arm cooperation
patterns including avoidance of tearing objects and collision between arms and
objects, which optimizes the manipulator trajectories with guided sampling of
diffusion denoising process. Moreover, we employ a vision-language model (VLM)
to schedule the cost functions by specifying keypoints and corresponding
pairwise relationship, so that the optimal safety constraint is dynamically
generated in the entire bimanual manipulation process. SafeBimanual
demonstrates superiority on 8 simulated tasks in RoboTwin with a 13.7% increase
in success rate and a 18.8% reduction in unsafe interactions over
state-of-the-art diffusion-based methods. Extensive experiments on 4 real-world
tasks further verify its practical value by improving the success rate by
32.5%.
Summary / 总结
Bimanual manipulation has been widely applied in household services and manufacturing, which enables the complex task completion with coordination requirements.
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Authors: Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian
First: 2025-08-25T17:57:49+00:00 · Latest: 2025-08-25T17:57:49+00:00
Comments: Project page: https://project.ironieser.cc/mmtok
Abstract
Vision-Language Models (VLMs) demonstrate impressive performance in
understanding visual content with language instruction by converting visual
input to vision tokens. However, redundancy in vision tokens results in the
degenerated inference efficiency of VLMs. While many algorithms have been
proposed to reduce the number of vision tokens, most of them apply only
unimodal information (i.e., vision/text) for pruning and ignore the inherent
multimodal property of vision-language tasks. Moreover, it lacks a generic
criterion that can be applied to different modalities. To mitigate this
limitation, in this work, we propose to leverage both vision and text tokens to
select informative vision tokens by the criterion of coverage. We first
formulate the subset selection problem as a maximum coverage problem.
Afterward, a subset of vision tokens is optimized to cover the text tokens and
the original set of vision tokens, simultaneously. Finally, a VLM agent can be
adopted to further improve the quality of text tokens for guiding vision
pruning. The proposed method MMTok is extensively evaluated on benchmark
datasets with different VLMs. The comparison illustrates that vision and text
information are complementary, and combining multimodal information can surpass
the unimodal baseline with a clear margin. Moreover, under the maximum coverage
criterion on the POPE dataset, our method achieves a 1.87x speedup while
maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Furthermore,
with only four vision tokens, it still preserves 87.7% of the original
performance on LLaVA-1.5-7B. These results highlight the effectiveness of
coverage in token selection.
Summary / 总结
Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual input to vision tokens.
SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
Authors: Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson
First: 2025-08-25T16:33:07+00:00 · Latest: 2025-08-25T16:33:07+00:00
Comments: COLM 2025
Abstract
Evaluating whether vision-language models (VLMs) reason consistently across
representations is challenging because modality comparisons are typically
confounded by task differences and asymmetric information. We introduce SEAM, a
benchmark that pairs semantically equivalent inputs across four domains that
have existing standardized textual and visual notations. By employing distinct
notation systems across modalities, in contrast to OCR-based image-text
pairing, SEAM provides a rigorous comparative assessment of the
textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21
contemporary models, we observe systematic modality imbalance: vision
frequently lags language in overall performance, despite the problems
containing semantically equivalent information, and cross-modal agreement is
relatively low. Our error analysis reveals two main drivers: textual perception
failures from tokenization in domain notation and visual perception failures
that induce hallucinations. We also show that our results are largely robust to
visual transformations. SEAM establishes a controlled, semantically equivalent
setting for measuring and improving modality-agnostic reasoning.
Summary / 总结
Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information.