arXiv 论文速递

2025-08-27 15:57
Snapshot: 20250827_1557
VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
Authors: Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng
First: 2025-08-26T17:59:47+00:00 · Latest: 2025-08-26T17:59:47+00:00
Comments: Project page: https://huanngzh.github.io/VoxHammer-Page/
Abstract
3D local editing of specified regions is crucial for game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.
Summary / 总结
3D local editing of specified regions is crucial for game industry and robot interaction.
Articulate3D: Zero-Shot Text-Driven 3D Object Posing
Authors: Oishi Deb, Anjun Hu, Ashkan Khakzar, Philip Torr, Christian Rupprecht
First: 2025-08-26T17:59:17+00:00 · Latest: 2025-08-26T17:59:17+00:00
Comments: Project page:https://odeb1.github.io/articulate3d_page_deb/
Abstract
We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image-generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85\% of the time, confirm its superiority over existing approaches. Project page:https://odeb1.github.io/articulate3d_page_deb/
Summary / 总结
We propose a training-free method, Articulate3D, to pose a 3D asset through language control.
Route-and-Execute: Auditable Model-Card Matching and Specialty-Level Deployment
Authors: Shayan Vassef, Soorya Ram Shimegekar, Abhay Goyal, Koustuv Saha, Pi Zonooz, Navin Kumar
First: 2025-08-22T23:34:37+00:00 · Latest: 2025-08-26T17:13:21+00:00
Abstract
Clinical workflows are fragmented as a patchwork of scripts and task-specific networks that often handle triage, task selection, and model deployment. These pipelines are rarely streamlined for data science pipeline, reducing efficiency and raising operational costs. Workflows also lack data-driven model identification (from imaging/tabular inputs) and standardized delivery of model outputs. In response, we present a practical, healthcare-first framework that uses a single vision-language model (VLM) in two complementary roles. First (Solution 1), the VLM acts as an aware model-card matcher that routes an incoming image to the appropriate specialist model via a three-stage workflow (modality -> primary abnormality -> model-card id). Checks are provided by (i) stagewise prompts that allow early exit via None/Normal/Other and (ii) a stagewise answer selector that arbitrates between the top-2 candidates at each stage, reducing the chance of an incorrect selection and aligning the workflow with clinical risk tolerance. Second (Solution 2), we fine-tune the VLM on specialty-specific datasets ensuring a single model covers multiple downstream tasks within each specialty, maintaining performance while simplifying deployment. Across gastroenterology, hematology, ophthalmology, and pathology, our single-model deployment matches or approaches specialized baselines. Compared with pipelines composed of many task-specific agents, this approach shows that one VLM can both decide and do. It may reduce effort by data scientists, shorten monitoring, increase the transparency of model selection (with per-stage justifications), and lower integration overhead.
Summary / 总结
Clinical workflows are fragmented as a patchwork of scripts and task-specific networks that often handle triage, task selection, and model deployment.
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
Authors: Chan-Wei Hu, Yueqi Wang, Shuo Xing, Chia-Ju Chen, Suofei Feng, Ryan Rossi, Zhengzhong Tu
First: 2025-05-29T23:32:03+00:00 · Latest: 2025-08-26T16:42:37+00:00
Comments: 16 pages
Abstract
Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. However, they remain limited by static training data, susceptibility to hallucinations, and inability to verify claims against up-to-date, external evidence, compromising their performance in dynamic real-world applications. Retrieval-Augmented Generation (RAG) offers a practical solution to mitigate these challenges by allowing the LVLMs to access large-scale knowledge databases via retrieval mechanisms, thereby grounding model outputs in factual, contextually relevant information. Here in this paper, we conduct the first systematic dissection of the multimodal RAG pipeline for LVLMs, explicitly investigating (1) the retrieval phase: on the modality configurations and retrieval strategies, (2) the re-ranking stage: on strategies to mitigate positional biases and improve the relevance of retrieved evidence, and (3) the generation phase: we further investigate how to best integrate retrieved candidates into the final generation process. Finally, we extend to explore a unified agentic framework that integrates re-ranking and generation through self-reflection, enabling LVLMs to select relevant evidence and suppress irrelevant context dynamically. Our full-stack exploration of RAG for LVLMs yields substantial insights, resulting in an average performance boost of 5% without any fine-tuning.
Summary / 总结
Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning.
Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA
Authors: Karishma Thakrar, Shreyas Basavatia, Akshay Daftardar
First: 2025-07-07T22:31:56+00:00 · Latest: 2025-08-26T14:02:57+00:00
Abstract
Dermatological care via telemedicine often lacks the rich context of in-person visits. Clinicians must make diagnoses based on a handful of images and brief descriptions, without the benefit of physical exams, second opinions, or reference materials. While many medical AI systems attempt to bridge these gaps with domain-specific fine-tuning, this work hypothesized that mimicking clinical reasoning processes could offer a more effective path forward. This study tested seven vision-language models on medical visual question answering across six configurations: baseline models, fine-tuned variants, and both augmented with either reasoning layers that combine multiple model perspectives, analogous to peer consultation, or retrieval-augmented generation that incorporates medical literature at inference time, serving a role similar to reference-checking. While fine-tuning degraded performance in four of seven models with an average 30% decrease, baseline models collapsed on test data. Clinical-inspired architectures, meanwhile, achieved up to 70% accuracy, maintaining performance on unseen data while generating explainable, literature-grounded outputs critical for clinical adoption. These findings demonstrate that medical AI succeeds by reconstructing the collaborative and evidence-based practices fundamental to clinical diagnosis.
Summary / 总结
Dermatological care via telemedicine often lacks the rich context of in-person visits.
ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval
Authors: Yi Pan, Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao
Venue: EMNLP 2025
First: 2025-08-26T13:42:48+00:00 · Latest: 2025-08-26T13:42:48+00:00
Comments: Accepted by EMNLP 2025 Findings
Abstract
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaption of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) A Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.
Summary / 总结
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments.
ForgetMe: Evaluating Selective Forgetting in Generative Models
Authors: Zhenyu Yu, Mohd Yamani Inda Idris, Pei Wang
First: 2025-04-17T01:44:57+00:00 · Latest: 2025-08-26T13:04:59+00:00
Abstract
The widespread adoption of diffusion models in image generation has increased the demand for privacy-compliant unlearning. However, due to the high-dimensional nature and complex feature representations of diffusion models, achieving selective unlearning remains challenging, as existing methods struggle to remove sensitive information while preserving the consistency of non-sensitive regions. To address this, we propose an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal, constructing the ForgetMe dataset and introducing the Entangled evaluation metric. The Entangled metric quantifies unlearning effectiveness by assessing the similarity and consistency between the target and background regions and supports both paired (Entangled-D) and unpaired (Entangled-S) image data, enabling unsupervised evaluation. The ForgetMe dataset encompasses a diverse set of real and synthetic scenarios, including CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. We apply LoRA fine-tuning on Stable Diffusion to achieve selective unlearning on this dataset and validate the effectiveness of both the ForgetMe dataset and the Entangled metric, establishing them as benchmarks for selective unlearning. Our work provides a scalable and adaptable solution for advancing privacy-preserving generative AI.
Summary / 总结
The widespread adoption of diffusion models in image generation has increased the demand for privacy-compliant unlearning.
Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone
Authors: Shaivi Malik, Hasnat Md Abdullah, Sriparna Saha, Amit Sheth
First: 2025-08-26T12:41:35+00:00 · Latest: 2025-08-26T12:41:35+00:00
Abstract
As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available.
Summary / 总结
As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical.
Enhancing Document VQA Models via Retrieval-Augmented Generation
Authors: Eric López, Artemis Llabrés, Ernest Valveny
First: 2025-08-26T12:32:55+00:00 · Latest: 2025-08-26T12:32:55+00:00
Comments: Accepted at Workshop on Machine Learning in Document Analysis and Recognition (ICDAR WML 2025), Wuhan, China
Abstract
Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.
Summary / 总结
Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry.
Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
Authors: Somraj Gautam, Abhirama Subramanyam Penamakuri, Abhishek Bhandari, Gaurav Harit
First: 2025-08-24T12:43:27+00:00 · Latest: 2025-08-26T12:16:26+00:00
Abstract
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi scorecards, with all questions and answers kept in English to enable controlled cross-script evaluation. The task demands reasoning over structured numerical data, multi-image context, and implicit domain knowledge. Empirical results show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle on the English subset despite it being their primary training language and exhibit a further drop in performance on the Hindi subset. This reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available via Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM research in this direction.
Summary / 总结
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images.
Prototype-Guided Diffusion: Visual Conditioning without External Memory
Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah
First: 2025-08-13T16:18:35+00:00 · Latest: 2025-08-26T10:29:55+00:00
Abstract
Diffusion models have emerged as a leading framework for high-quality image generation, offering stable training and strong performance across diverse domains. However, they remain computationally intensive, particularly during the iterative denoising process. Latent-space models like Stable Diffusion alleviate some of this cost by operating in compressed representations, though at the expense of fine-grained detail. More recent approaches such as Retrieval-Augmented Diffusion Models (RDM) address efficiency by conditioning denoising on similar examples retrieved from large external memory banks. While effective, these methods introduce drawbacks: they require costly storage and retrieval infrastructure, depend on static vision-language models like CLIP for similarity, and lack adaptability during training. We propose the Prototype Diffusion Model (PDM), a method that integrates prototype learning directly into the diffusion process for efficient and adaptive visual conditioning - without external memory. Instead of retrieving reference samples, PDM constructs a dynamic set of compact visual prototypes from clean image features using contrastive learning. These prototypes guide the denoising steps by aligning noisy representations with semantically relevant visual patterns, enabling efficient generation with strong semantic grounding. Experiments show that PDM maintains high generation quality while reducing computational and storage overhead, offering a scalable alternative to retrieval-based conditioning in diffusion models.
Summary / 总结
Diffusion models have emerged as a leading framework for high-quality image generation, offering stable training and strong performance across diverse domains.
M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering
Authors: Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, Ruixiang Tang
First: 2025-04-06T22:02:21+00:00 · Latest: 2025-08-26T10:19:05+00:00
Comments: COLM 2025, 30 pages, 10 figures, 16 tables
Abstract
Multimodal in-context learning (ICL) equips Large Vision-language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates. However, its effectiveness is constrained by the token-intensive nature of multimodal inputs and the complexity of cross-modal few-shot reasoning, which together hinder LVLMs from extracting useful patterns from demonstrations. To address these challenges, we propose \textbf{M$^2$IV}, a novel representation engineering approach that replaces explicit token-level demonstrations with a set of learnable Multimodal In-context Vectors directly injected into the residual streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLP) in the ICL process, we design a training strategy that enables M$^2$IV to perform fine-grained semantic distillation and robust cross-modal representation learning. M$^2$IV not only improves performance across diverse tasks and LVLMs but also significantly reduces token overhead, enabling graceful scaling to many-shot scenarios. To further enhance usability, we introduce \textbf{VLibrary}, a repository that stores trained M$^2$IVs for flexible retrieval and injection. With VLibrary, users can steer pre-trained LVLMs in a customized manner that meets diverse requirements. Extensive experiments demonstrate that M$^2$IV consistently outperforms vanilla ICL and prior representation engineering baselines, achieving an average accuracy gain of 3.74\% with substantial improvements in overall efficiency.
Summary / 总结
Multimodal in-context learning (ICL) equips Large Vision-language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates.
Video CLIP Model for Multi-View Echocardiography Interpretation
Authors: Ryo Takizawa, Satoshi Kodera, Tempei Kabayama, Ryo Matsuoka, Yuta Ando, Yuto Nakamura, Haruki Settai, Norihiko Takeda
First: 2025-04-26T05:11:15+00:00 · Latest: 2025-08-26T10:06:14+00:00
Abstract
Echocardiography records ultrasound videos of the heart, enabling clinicians to assess cardiac function. Recent advances in large-scale vision-language models (VLMs) have spurred interest in automating echocardiographic interpretation. However, most existing medical VLMs rely on single-frame (image) inputs, which can reduce diagnostic accuracy for conditions identifiable only through cardiac motion. In addition, echocardiographic videos are captured from multiple views, each varying in suitability for detecting specific conditions. Leveraging multiple views may therefore improve diagnostic performance. We developed a video-language model that processes full video sequences from five standard views, trained on 60,747 echocardiographic video-report pairs. We evaluated the gains in retrieval performance from video input and multi-view support, including the contributions of various pretrained models.
Summary / 总结
Echocardiography records ultrasound videos of the heart, enabling clinicians to assess cardiac function.
Toward Robust Medical Fairness: Debiased Dual-Modal Alignment via Text-Guided Attribute-Disentangled Prompt Learning for Vision-Language Models
Authors: Yuexuan Xia, Benteng Ma, Jiang He, Zhiyong Wang, Qi Dou, Yong Xia
First: 2025-08-26T10:01:23+00:00 · Latest: 2025-08-26T10:01:23+00:00
Abstract
Ensuring fairness across demographic groups in medical diagnosis is essential for equitable healthcare, particularly under distribution shifts caused by variations in imaging equipment and clinical practice. Vision-language models (VLMs) exhibit strong generalization, and text prompts encode identity attributes, enabling explicit identification and removal of sensitive directions. However, existing debiasing approaches typically address vision and text modalities independently, leaving residual cross-modal misalignment and fairness gaps. To address this challenge, we propose DualFairVL, a multimodal prompt-learning framework that jointly debiases and aligns cross-modal representations. DualFairVL employs a parallel dual-branch architecture that separates sensitive and target attributes, enabling disentangled yet aligned representations across modalities. Approximately orthogonal text anchors are constructed via linear projections, guiding cross-attention mechanisms to produce fused features. A hypernetwork further disentangles attribute-related information and generates instance-aware visual prompts, which encode dual-modal cues for fairness and robustness. Prototype-based regularization is applied in the visual branch to enforce separation of sensitive features and strengthen alignment with textual anchors. Extensive experiments on eight medical imaging datasets across four modalities show that DualFairVL achieves state-of-the-art fairness and accuracy under both in- and out-of-distribution settings, outperforming full fine-tuning and parameter-efficient baselines with only 3.6M trainable parameters. Code will be released upon publication.
Summary / 总结
Ensuring fairness across demographic groups in medical diagnosis is essential for equitable healthcare, particularly under distribution shifts caused by variations in imaging equipment and clinical practice.
Weakly-Supervised 3D Visual Grounding based on Visual Language Alignment
Authors: Xiaoxu Xu, Yitian Yuan, Qiudan Zhang, Wenhui Wu, Zequn Jie, Lin Ma, Xu Wang
First: 2023-12-15T09:08:14+00:00 · Latest: 2025-08-26T08:50:35+00:00
Abstract
Learning to ground natural language queries to target objects or regions in 3D point clouds is quite essential for 3D scene understanding. Nevertheless, existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries, which is time-consuming and labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) on aligning the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds with no need for fine-grained box annotations in the training procedure. During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images. To the best of our knowledge, this is the first work to investigate 3D visual grounding in a weakly supervised manner by involving large scale vision-language models, and extensive experiments on ReferIt3D and ScanRefer datasets demonstrate that our 3D-VLA achieves comparable and even superior results over the fully supervised methods.
Summary / 总结
Learning to ground natural language queries to target objects or regions in 3D point clouds is quite essential for 3D scene understanding.
Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models
Authors: Rui Zhang, Zihan Wang, Tianli Yang, Hongwei Li, Wenbo Jiang, Qingchuan Zhao, Yang Liu, Guowen Xu
First: 2025-08-26T08:40:22+00:00 · Latest: 2025-08-26T08:40:22+00:00
Abstract
Vision-Language Models (VLMs) are increasingly deployed in real-world applications, but their high inference cost makes them vulnerable to resource consumption attacks. Prior attacks attempt to extend VLM output sequences by optimizing adversarial images, thereby increasing inference costs. However, these extended outputs often introduce irrelevant abnormal content, compromising attack stealthiness. This trade-off between effectiveness and stealthiness poses a major limitation for existing attacks. To address this challenge, we propose \textit{Hidden Tail}, a stealthy resource consumption attack that crafts prompt-agnostic adversarial images, inducing VLMs to generate maximum-length outputs by appending special tokens invisible to users. Our method employs a composite loss function that balances semantic preservation, repetitive special token induction, and suppression of the end-of-sequence (EOS) token, optimized via a dynamic weighting strategy. Extensive experiments show that \textit{Hidden Tail} outperforms existing attacks, increasing output length by up to 19.2$\times$ and reaching the maximum token limit, while preserving attack stealthiness. These results highlight the urgent need to improve the robustness of VLMs against efficiency-oriented adversarial threats. Our code is available at https://github.com/zhangrui4041/Hidden_Tail.
Summary / 总结
Vision-Language Models (VLMs) are increasingly deployed in real-world applications, but their high inference cost makes them vulnerable to resource consumption attacks.
Robust and Label-Efficient Deep Waste Detection
Authors: Hassan Abid, Khan Muhammad, Muhammad Haris Khan
First: 2025-08-26T08:34:04+00:00 · Latest: 2025-08-26T08:34:04+00:00
Comments: Accepted to BMVC 2025
Abstract
Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, demonstrating that while class-only prompts perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy. Next, to address domain-specific limitations, we fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. We then propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting, enabling robust semi-supervised training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations achieve performance gains that surpass fully supervised training, underscoring the effectiveness of scalable annotation pipelines. Our work contributes to the research community by establishing rigorous baselines, introducing a robust ensemble-based pseudo-labeling pipeline, generating high-quality annotations for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models under real-world waste sorting conditions. Our code is available at: https://github.com/h-abid97/robust-waste-detection.
Summary / 总结
Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors.
Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods
Authors: Qinqian Lei, Bo Wang, Robby T. Tan
First: 2025-08-26T07:30:53+00:00 · Latest: 2025-08-26T07:30:53+00:00
Abstract
Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks. In contrast, recent advances in large, generative VLMs suggest that these models may already possess strong ability to understand images involving HOI. This naturally raises an important question: can general-purpose standalone VLMs effectively solve HOI detection, and how do they compare with specialized HOI methods? Answering this requires a benchmark that can accommodate both paradigms. However, existing HOI benchmarks such as HICO-DET were developed before the emergence of modern VLMs, and their evaluation protocols require exact matches to annotated HOI classes. This is poorly aligned with the generative nature of VLMs, which often yield multiple valid interpretations in ambiguous cases. For example, a static image may capture a person mid-motion with a frisbee, which can plausibly be interpreted as either "throwing" or "catching". When only "catching" is annotated, the other, though equally plausible for the image, is marked incorrect when exact matching is used. As a result, correct predictions might be penalized, affecting both VLMs and HOI-specific methods. To avoid penalizing valid predictions, we introduce a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, where each question includes only ground-truth positive options and a curated set of negatives that are constructed to reduce ambiguity (e.g., when "catching" is annotated, "throwing" is not selected as a negative to avoid penalizing valid predictions). The proposed evaluation protocol is the first of its kind for both VLMs and HOI methods, enabling direct comparison and offering new insight into the current state of progress in HOI understanding.
Summary / 总结
Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks.
Prompting with Sign Parameters for Low-resource Sign Language Instruction Generation
Authors: Md Tariquzzaman, Md Farhan Ishmam, Saiyma Sittul Muna, Md Kamrul Hasan, Hasan Mahmud
Venue: ICCV 2025
First: 2025-08-22T04:11:28+00:00 · Latest: 2025-08-26T06:32:51+00:00
Comments: CV4A11y@ICCV 2025
Abstract
Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step textual instructions that enable non-SL users to imitate and learn SL gestures, promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG dataset, used to evaluate Vision Language Models (VLMs) (i) on under-resourced SLIG tasks, and (ii) on long-tail visual concepts, as Bengali SL is unlikely to appear in the VLM pre-training data. To enhance zero-shot performance, we introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL parameters, like hand shape, motion, and orientation, directly into the textual prompts. Subsuming standard sign parameters into the prompt makes the instructions more structured and reproducible than free-form natural text from vanilla prompting. We envision that our work would promote inclusivity and advancement in SL learning systems for the under-resourced communities.
Summary / 总结
Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space.
Utilizing Training Data to Improve LLM Reasoning for Tabular Understanding
Authors: Chufan Gao, Jintai Chen, Jimeng Sun
First: 2025-08-26T04:46:54+00:00 · Latest: 2025-08-26T04:46:54+00:00
Abstract
Automated tabular understanding and reasoning are essential tasks for data scientists. Recently, Large language models (LLMs) have become increasingly prevalent in tabular reasoning tasks. Previous work focuses on (1) finetuning LLMs using labeled data or (2) Training-free prompting LLM agents using chain-of-thought (CoT). Finetuning offers dataset-specific learning at the cost of generalizability. Training-free prompting is highly generalizable but does not take full advantage of training data. In this paper, we propose a novel prompting-based reasoning approach, Learn then Retrieve: LRTab, which integrates the benefits of both by retrieving relevant information learned from training data. We first use prompting to obtain CoT responses over the training data. For incorrect CoTs, we prompt the LLM to predict Prompt Conditions to avoid the error, learning insights from the data. We validate the effectiveness of Prompt Conditions using validation data. Finally, at inference time, we retrieve the most relevant Prompt Conditions for additional context for table understanding. We provide comprehensive experiments on WikiTQ and Tabfact, showing that LRTab is interpretable, cost-efficient, and can outperform previous baselines in tabular reasoning.
Summary / 总结
Automated tabular understanding and reasoning are essential tasks for data scientists.
PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality
Authors: Nanxi Li, Zhengyue Zhao, Chaowei Xiao
First: 2025-08-26T03:45:19+00:00 · Latest: 2025-08-26T03:45:19+00:00
Abstract
Safeguarding vision-language models (VLMs) is a critical challenge, as existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning. To this end, we introduce PRISM (Principled Reasoning for Integrated Safety in Multimodality), a system2-like framework that aligns VLMs by embedding a structured, safety-aware reasoning process. Our framework consists of two key components: PRISM-CoT, a dataset that teaches safety-aware chain-of-thought reasoning, and PRISM-DPO, generated via Monte Carlo Tree Search (MCTS) to further refine this reasoning through Direct Preference Optimization to help obtain a delicate safety boundary. Comprehensive evaluations demonstrate PRISM's effectiveness, achieving remarkably low attack success rates including 0.15% on JailbreakV-28K for Qwen2-VL and 90% improvement over the previous best method on VLBreak for LLaVA-1.5. PRISM also exhibits strong robustness against adaptive attacks, significantly increasing computational costs for adversaries, and generalizes effectively to out-of-distribution challenges, reducing attack success rates to just 8.70% on the challenging multi-image MIS benchmark. Remarkably, this robust defense is achieved while preserving, and in some cases enhancing, model utility. To promote reproducibility, we have made our code, data, and model weights available at https://github.com/SaFoLab-WISC/PRISM.
Summary / 总结
Safeguarding vision-language models (VLMs) is a critical challenge, as existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning.
Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
Authors: Zesen Lyu, Dandan Zhang, Wei Ye, Fangdi Li, Zhihang Jiang, Yao Yang
Venue: EMNLP 2025
First: 2025-05-27T05:17:41+00:00 · Latest: 2025-08-26T03:25:38+00:00
Comments: Accepted by EMNLP 2025 Main conference
Abstract
Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world. It relies on a nuanced understanding of spatial structures and inter-object relationships, serving as the foundation for complex reasoning and decision-making. To investigate whether current vision-language models (VLMs) exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs' spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess the general spatial reasoning capability. We conduct a comprehensive evaluation across 24 state-of-the-art VLMs. The results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task, with only 30.00% accuracy, far below the performance exceeding 90% achieved by human participants. This persistent gap underscores the need for continued progress, positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for advancing spatial reasoning research in VLMs. Our project page is at https://zesen01.github.io/jigsaw-puzzles.
Summary / 总结
Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world.
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Authors: Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin
Venue: EMNLP
First: 2024-11-25T02:15:30+00:00 · Latest: 2025-08-26T02:48:31+00:00
Comments: Accepted by EMNLP-2025 Main. Project page: https://szhanz.github.io/zoomeye/
Abstract
An image, especially with high-resolution, typically consists of numerous visual elements, ranging from dominant large objects to fine-grained detailed objects. When perceiving such images, multimodal large language models~(MLLMs) face limitations due to the restricted input resolution of the pretrained vision encoder and the cluttered, dense context of the image, resulting in a focus on primary objects while easily overlooking detailed ones. In this paper, we propose Zoom Eye, a tree search algorithm designed to navigate the hierarchical and visual nature of images to capture relevant information. Zoom Eye conceptualizes an image as a tree, with each children node representing a zoomed sub-patch of the parent node and the root represents the overall image. Moreover, Zoom Eye is model-agnostic and training-free, so it enables any MLLMs to simulate human zooming actions by searching along the image tree from root to leaf nodes, seeking out pertinent information, and accurately responding to related queries. We experiment on a series of elaborate high-resolution benchmarks and the results demonstrate that Zoom Eye not only consistently improves the performance of a series base MLLMs with large margin~(e.g., LLaVA-v1.5-7B increases by 34.57\% on $V^*$ Bench and 17.88\% on HR-Bench), but also enables small 7B MLLMs to outperform strong large models such as GPT-4o. Our code is available at \href{https://github.com/om-ai-lab/ZoomEye}{https://github.com/om-ai-lab/ZoomEye}.
Summary / 总结
An image, especially with high-resolution, typically consists of numerous visual elements, ranging from dominant large objects to fine-grained detailed objects.
The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation
Authors: Girish A. Koushik, Fatemeh Nazarieh, Katherine Birch, Shenbin Qian, Diptesh Kanojia
First: 2025-08-26T00:04:01+00:00 · Latest: 2025-08-26T00:04:01+00:00
Comments: Under Review
Abstract
Visual metaphor generation is a challenging task that aims to generate an image given an input text metaphor. Inherently, it needs language understanding to bind a source concept with a target concept, in a way that preserves meaning while ensuring visual coherence. We propose a self-evaluating visual metaphor generation framework that focuses on metaphor alignment. Our self-evaluation approach combines existing metrics with our newly proposed metaphor decomposition score and a meaning alignment (MA) metric. Within this setup, we explore two novel approaches: a training-free pipeline that explicitly decomposes prompts into source-target-meaning (S-T-M) mapping for image synthesis, and a complementary training-based pipeline that improves alignment using our proposed self-evaluation reward schema, without any large-scale retraining. On the held-out test set, the training-free approach surpasses strong closed baselines (GPT-4o, Imagen) on decomposition, CLIP, and MA scores, with the training-based approach close behind. We evaluate our framework output using a user-facing study, and observed that participants preferred GPT-4o overall, while our training-free pipeline led open-source methods and edged Imagen on abstract metaphors. Our analyses show S-T-M prompting helps longer or more abstract metaphors, with closed models excelling on short, concrete cases; we also observe sensitivity to sampler settings. Overall, structured prompting and lightweight RL perform metaphor alignment well under modest compute, and remaining gaps to human preference appear driven by aesthetics and sampling.
Summary / 总结
Visual metaphor generation is a challenging task that aims to generate an image given an input text metaphor.
Generic Guard AI in Stealth Game with Composite Potential Fields
Authors: Kaijie Xu, Clark Verbrugge
First: 2025-08-25T21:56:13+00:00 · Latest: 2025-08-25T21:56:13+00:00
Abstract
Guard patrol behavior is central to the immersion and strategic depth of stealth games, while most existing systems rely on hand-crafted routes or specialized logic that struggle to balance coverage efficiency and responsive pursuit with believable naturalness. We propose a generic, fully explainable, training-free framework that integrates global knowledge and local information via Composite Potential Fields, combining three interpretable maps-Information, Confidence, and Connectivity-into a single kernel-filtered decision criterion. Our parametric, designer-driven approach requires only a handful of decay and weight parameters-no retraining-to smoothly adapt across both occupancy-grid and NavMesh-partition abstractions. We evaluate on five representative game maps, two player-control policies, and five guard modes, confirming that our method outperforms classical baseline methods in both capture efficiency and patrol naturalness. Finally, we show how common stealth mechanics-distractions and environmental elements-integrate naturally into our framework as sub modules, enabling rapid prototyping of rich, dynamic, and responsive guard behaviors.
Summary / 总结
Guard patrol behavior is central to the immersion and strategic depth of stealth games, while most existing systems rely on hand-crafted routes or specialized logic that struggle to balance coverage efficiency and responsive pursuit with believable naturalness.
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation
Authors: Yihe Tang, Wenlong Huang, Yingke Wang, Chengshu Li, Roy Yuan, Ruohan Zhang, Jiajun Wu, Li Fei-Fei
First: 2025-06-10T22:47:16+00:00 · Latest: 2025-08-25T19:45:29+00:00
Abstract
Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods of visual affordance predictions often rely on manually annotated data or conditions only on a predefined set of tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed $<$instruction, visual affordance$>$ pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes and to various human activities, despite only being trained on rendered objects in simulation. Using affordance provided by UAD as the observation space, we show an imitation learning policy that demonstrates promising generalization to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations. Project website: https://unsup-affordance.github.io/
Summary / 总结
Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions.
CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering
Authors: Aranya Saha, Tanvir Ahmed Khan, Ismam Nur Swapnil, Mohammad Ariful Haque
First: 2025-08-25T19:22:16+00:00 · Latest: 2025-08-25T19:22:16+00:00
Comments: 10 pages, 8 figures, Prepared for submission to IEEE Transactions on Human-Machine Systems
Abstract
Vision-language models (VLMs) have shown significant potential for medical tasks; however, their general-purpose nature can limit specialized diagnostic accuracy, and their large size poses substantial inference costs for real-world clinical deployment. To address these challenges, we introduce CLARIFY, a Specialist-Generalist framework for dermatological visual question answering (VQA). CLARIFY combines two components: (i) a lightweight, domain-trained image classifier (the Specialist) that provides fast and highly accurate diagnostic predictions, and (ii) a powerful yet compressed conversational VLM (the Generalist) that generates natural language explanations to user queries. In our framework, the Specialist's predictions directly guide the Generalist's reasoning, focusing it on the correct diagnostic path. This synergy is further enhanced by a knowledge graph-based retrieval module, which grounds the Generalist's responses in factual dermatological knowledge, ensuring both accuracy and reliability. This hierarchical design not only reduces diagnostic errors but also significantly improves computational efficiency. Experiments on our curated multimodal dermatology dataset demonstrate that CLARIFY achieves an 18\% improvement in diagnostic accuracy over the strongest baseline, a fine-tuned, uncompressed single-line VLM, while reducing the average VRAM requirement and latency by at least 20\% and 5\%, respectively. These results indicate that a Specialist-Generalist system provides a practical and powerful paradigm for building lightweight, trustworthy, and clinically viable AI systems.
Summary / 总结
Vision-language models (VLMs) have shown significant potential for medical tasks; however, their general-purpose nature can limit specialized diagnostic accuracy, and their large size poses substantial inference costs for real-world clinical deployment.
SafeBimanual: Diffusion-based Trajectory Optimization for Safe Bimanual Manipulation
Authors: Haoyuan Deng, Wenkai Guo, Qianzhun Wang, Zhenyu Wu, Ziwei Wang
First: 2025-08-25T17:59:02+00:00 · Latest: 2025-08-25T17:59:02+00:00
Comments: Project website is at: https://denghaoyuan123.github.io/SafeBimanip/
Abstract
Bimanual manipulation has been widely applied in household services and manufacturing, which enables the complex task completion with coordination requirements. Recent diffusion-based policy learning approaches have achieved promising performance in modeling action distributions for bimanual manipulation. However, they ignored the physical safety constraints of bimanual manipulation, which leads to the dangerous behaviors with damage to robots and objects. To this end, we propose a test-time trajectory optimization framework named SafeBimanual for any pre-trained diffusion-based bimanual manipulation policies, which imposes the safety constraints on bimanual actions to avoid dangerous robot behaviors with improved success rate. Specifically, we design diverse cost functions for safety constraints in different dual-arm cooperation patterns including avoidance of tearing objects and collision between arms and objects, which optimizes the manipulator trajectories with guided sampling of diffusion denoising process. Moreover, we employ a vision-language model (VLM) to schedule the cost functions by specifying keypoints and corresponding pairwise relationship, so that the optimal safety constraint is dynamically generated in the entire bimanual manipulation process. SafeBimanual demonstrates superiority on 8 simulated tasks in RoboTwin with a 13.7% increase in success rate and a 18.8% reduction in unsafe interactions over state-of-the-art diffusion-based methods. Extensive experiments on 4 real-world tasks further verify its practical value by improving the success rate by 32.5%.
Summary / 总结
Bimanual manipulation has been widely applied in household services and manufacturing, which enables the complex task completion with coordination requirements.
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Authors: Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian
First: 2025-08-25T17:57:49+00:00 · Latest: 2025-08-25T17:57:49+00:00
Comments: Project page: https://project.ironieser.cc/mmtok
Abstract
Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual input to vision tokens. However, redundancy in vision tokens results in the degenerated inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them apply only unimodal information (i.e., vision/text) for pruning and ignore the inherent multimodal property of vision-language tasks. Moreover, it lacks a generic criterion that can be applied to different modalities. To mitigate this limitation, in this work, we propose to leverage both vision and text tokens to select informative vision tokens by the criterion of coverage. We first formulate the subset selection problem as a maximum coverage problem. Afterward, a subset of vision tokens is optimized to cover the text tokens and the original set of vision tokens, simultaneously. Finally, a VLM agent can be adopted to further improve the quality of text tokens for guiding vision pruning. The proposed method MMTok is extensively evaluated on benchmark datasets with different VLMs. The comparison illustrates that vision and text information are complementary, and combining multimodal information can surpass the unimodal baseline with a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Furthermore, with only four vision tokens, it still preserves 87.7% of the original performance on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection.
Summary / 总结
Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual input to vision tokens.
SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
Authors: Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson
First: 2025-08-25T16:33:07+00:00 · Latest: 2025-08-25T16:33:07+00:00
Comments: COLM 2025
Abstract
Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains that have existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the problems containing semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures from tokenization in domain notation and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.
Summary / 总结
Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information.
History