arXiv Paper Digest

TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection

Authors: Zehong Yan, Peng Qi, Wynne Hsu, Mong Li Lee

Venue: EMNLP 2025

First: 2025-09-04T17:59:43+00:00 · Latest: 2025-09-04T17:59:43+00:00

Comments: EMNLP 2025; Project Homepage: https://yanzehong.github.io/trust-vl/

Abstract

Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model's ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.

Summary

The research aims to address the challenge of detecting multimodal misinformation, which includes textual, visual, and cross-modal distortions. The method involves a unified vision-language model, TRUST-VL, that incorporates a Question-Aware Visual Amplifier module to extract task-specific visual features. The model is trained on a large-scale instruction dataset, TRUST-Instruct, containing 198K samples. Experimental results show that TRUST-VL outperforms existing methods and demonstrates strong generalization and interpretability capabilities.

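The research addresses the challenge of detecting multimodal misinformation spanning textual, visual, and cross-modal distortions. The method uses joint training across distortion types to promote knowledge sharing and improve generalization. Key experimental results show that TRUST-VL, a unified vision-language model with a Question-Aware Visual Amplifier module, performs strongly on both in-domain and zero-shot benchmarks while offering strong interpretability and generalization.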

AnomalyLMM: Bridging Generative Knowledge and Discriminative Retrieval for Text-Based Person Anomaly Search

Authors: Hao Ju, Hu Zhang, Zhedong Zheng

First: 2025-09-04T16:34:46+00:00 · Latest: 2025-09-04T16:34:46+00:00

Abstract

With growing public safety demands, text-based person anomaly search has emerged as a critical task, aiming to retrieve individuals with abnormal behaviors via natural language descriptions. Unlike conventional person search, this task presents two unique challenges: (1) fine-grained cross-modal alignment between textual anomalies and visual behaviors, and (2) anomaly recognition under sparse real-world samples. While Large Multi-modal Models (LMMs) excel in multi-modal understanding, their potential for fine-grained anomaly retrieval remains underexplored, hindered by: (1) a domain gap between generative knowledge and discriminative retrieval, and (2) the absence of efficient adaptation strategies for deployment. In this work, we propose AnomalyLMM, the first framework that harnesses LMMs for text-based person anomaly search. Our key contributions are: (1) A novel coarse-to-fine pipeline integrating LMMs to bridge generative world knowledge with retrieval-centric anomaly detection; (2) A training-free adaptation cookbook featuring masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking, enabling zero-shot focus on subtle anomaly cues. As the first study to explore LMMs for this task, we conduct a rigorous evaluation on the PAB dataset, the only publicly available benchmark for text-based person anomaly search, with its curated real-world anomalies covering diverse scenarios (e.g., falling, collision, and being hit). Experiments show the effectiveness of the proposed method, surpassing the competitive baseline by +0.96% Recall@1 accuracy. Notably, our method reveals interpretable alignment between textual anomalies and visual behaviors, validated via qualitative analysis. Our code and models will be released for future research.

GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization

Authors: Pengyue Jia, Yingyi Zhang, Xiangyu Zhao, Yixuan Li

First: 2025-09-04T15:52:04+00:00 · Latest: 2025-09-04T15:52:04+00:00

Abstract

Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, compromising the accuracy of evaluating a model's actual geolocalization capability. Second, existing metrics primarily rely on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, the first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. Our platform has been deployed online for two months, during which we collected thousands of voting records. Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task.

Summary

GeoArena is an open platform designed to benchmark large vision-language models (LVLMs) on global image geolocalization tasks. It addresses the limitations of current evaluation methods by avoiding data leakage and using human judgments to assess model outputs. The platform has collected thousands of voting records over two months, leading to a detailed analysis and a leaderboard of different LVLMs.

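GeoArena is an open platform for evaluating large vision-language models (LVLMs) on worldwide image geolocalization. It addresses the limitations of existing evaluation methods by avoiding data leakage and using human judgments to assess model outputs. Key findings: during its two months online, the platform collected thousands of voting records, from which a leaderboard of different LVLMs was built.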
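
The abstract does not say how pairwise votes are turned into a leaderboard. As a minimal sketch, assuming an Elo-style update (a common choice for arena-style benchmarks, not confirmed as GeoArena's method), the ranking step might look like the following. The model names, the `k` factor, and the initial rating are all illustrative.

```python
def expected_score(r_a, r_b):
    """Probability that a player rated r_a beats one rated r_b under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0, initial=1000.0):
    """Apply one pairwise vote in place; unseen models start at `initial`."""
    r_w = ratings.get(winner, initial)
    r_l = ratings.get(loser, initial)
    gain = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + gain
    ratings[loser] = r_l - gain
    return ratings


def leaderboard(votes):
    """Replay (winner, loser) votes in order and sort models by rating."""
    ratings = {}
    for winner, loser in votes:
        update_elo(ratings, winner, loser)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
board = leaderboard(votes)
for name, rating in board:
    print(f"{name}: {rating:.1f}")
```

Sequential Elo depends on the order of votes; a Bradley-Terry fit over all votes at once is an order-independent alternative.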

OVGrasp: Open-Vocabulary Grasping Assistance via Multimodal Intent Detection

Authors: Chen Hu, Shan Luo, Letizia Gionfrida

First: 2025-09-04T15:42:36+00:00 · Latest: 2025-09-04T15:42:36+00:00

Abstract

Grasping assistance is essential for restoring autonomy in individuals with motor impairments, particularly in unstructured environments where object categories and user intentions are diverse and unpredictable. We present OVGrasp, a hierarchical control framework for soft exoskeleton-based grasp assistance that integrates RGB-D vision, open-vocabulary prompts, and voice commands to enable robust multimodal interaction. To enhance generalization in open environments, OVGrasp incorporates a vision-language foundation model with an open-vocabulary mechanism, allowing zero-shot detection of previously unseen objects without retraining. A multimodal decision-maker further fuses spatial and linguistic cues to infer user intent, such as grasp or release, in multi-object scenarios. We deploy the complete framework on a custom egocentric-view wearable exoskeleton and conduct systematic evaluations on 15 objects across three grasp types. Experimental results with ten participants demonstrate that OVGrasp achieves a grasping ability score (GAS) of 87.00%, outperforming state-of-the-art baselines and achieving improved kinematic alignment with natural hand motion.

Summary

OVGrasp is a hierarchical control framework for grasping assistance in unstructured environments, integrating RGB-D vision, open-vocabulary prompts, and voice commands. It uses a vision-language foundation model to detect unseen objects and a multimodal decision-maker to infer user intent. Evaluations with ten participants show that OVGrasp achieves a grasping ability score of 87.00%, outperforming existing methods and improving kinematic alignment with natural hand motion.

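OVGrasp is a grasping-assistance framework for unstructured environments that combines RGB-D vision, open-vocabulary prompts, and voice commands. It uses a vision-language foundation model to detect previously unseen objects and a multimodal decision-maker to infer user intent. Experimental results show that OVGrasp achieves a grasping ability score of 87.00%, outperforming existing methods, with better kinematic alignment with natural hand motion.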

Image Embedding Sampling Method for Diverse Captioning

Authors: Sania Waheed, Na Min An

First: 2025-02-14T12:33:19+00:00 · Latest: 2025-09-04T15:00:25+00:00

Comments: 17 pages, 5 figures, 9 tables

Abstract

Image captioning with state-of-the-art VLMs has significantly improved over time; however, this comes at the cost of increased computational complexity, making these models less accessible for resource-constrained applications such as mobile devices and assistive technologies. Comparatively smaller VLMs, on the other hand, prioritize high-level scene descriptions, overlooking finer details that contribute to a richer understanding of an image. In this paper, we introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions using a comparatively small VLM, BLIP, as the backbone. Our approach leverages structured segmentation to produce hierarchical representations that capture both global and localized semantics. Without requiring additional model training, we demonstrate that our method allows smaller VLMs to achieve performance comparable to larger models in terms of image-caption alignment, semantic integrity, and diversity. We evaluate our framework on the MSCOCO, Flickr30k, and Nocaps test datasets, achieving Div-2 scores of 0.735, 0.750, and 0.748, respectively, while maintaining strong image-caption relevancy and semantic integrity with the human-annotated captions.

Summary

This paper addresses the challenge of enhancing the diversity and informativeness of image captions using a small vision-language model (VLM) called BLIP, without additional training. By leveraging structured segmentation, the method captures both global and localized semantics, allowing smaller VLMs to match the performance of larger models in terms of image-caption alignment, semantic integrity, and diversity. The approach is evaluated on MSCOCO, Flickr30k, and Nocaps datasets, achieving Div-2 scores of 0.735, 0.750, and 0.748, respectively, while maintaining strong relevance and semantic integrity with human-annotated captions.

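This paper addresses the trade-off between computational complexity and caption diversity in image captioning models. It proposes a training-free framework that uses a small VLM (BLIP) and structured segmentation to enhance caption diversity and informativeness. The method achieves Div-2 scores of 0.735, 0.750, and 0.748 on MSCOCO, Flickr30k, and Nocaps, respectively, while maintaining strong relevance and semantic integrity with respect to human-annotated captions.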
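
Div-2 is not defined in the abstract. One common definition, assumed here, is the number of distinct bigrams across the set of captions generated for one image, divided by the total number of words in that set. The sketch below uses whitespace tokenization as a simplification, and the example captions are made up.

```python
def div_n(captions, n=2):
    """Distinct n-grams across a caption set divided by total word count.

    Assumed definition of Div-n; the paper may tokenize or normalize differently.
    """
    total_words = 0
    distinct = set()
    for caption in captions:
        tokens = caption.lower().split()
        total_words += len(tokens)
        # Collect every contiguous n-gram of this caption.
        distinct.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(distinct) / total_words if total_words else 0.0


varied = ["a dog runs", "a dog sits"]
repeated = ["a dog runs", "a dog runs"]
print(div_n(varied), div_n(repeated))
```

Under this definition, a caption set that repeats itself scores lower than one that describes different details, which is the behavior the metric is meant to reward.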

Straighter Flow Matching via a Diffusion-Based Coupling Prior

Authors: Siyu Xing, Jie Cao, Huaibo Huang, Haichao Shi, Xiao-Yu Zhang

First: 2023-11-28T06:19:30+00:00 · Latest: 2025-09-04T14:24:04+00:00

Abstract

Flow matching, as a paradigm for generative modeling, has achieved notable success across various domains. However, existing methods use either multi-round training or knowledge within minibatches, which makes it hard to find a favorable coupling strategy for straightening trajectories toward few-step generation. To address this issue, we propose a novel approach, Straighter trajectories of Flow Matching (StraightFM). It straightens trajectories with a coupling strategy at the level of the entire distribution. More specifically, during training, StraightFM creates couplings of images and noise via a diffusion model that serves as a coupling prior, straightening trajectories for few-step generation. Our coupling strategy can also integrate with the existing coupling direction from real data to noise, improving image quality in few-step generation. Experimental results on pixel space and latent space show that StraightFM yields attractive samples within 5 steps. Moreover, our unconditional StraightFM is seamlessly compatible with training-free multimodal conditional generation, maintaining high-quality image generation in few steps.
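
To make the role of straight trajectories concrete, here is a minimal sketch of the standard linear flow matching setup, assuming the convention that t = 0 is noise and t = 1 is the image (the paper's notation may differ). For a coupled pair (z, x), the interpolant is x_t = (1 - t) z + t x and the regression target is the constant velocity x - z. The pair below is hard-coded for illustration, whereas StraightFM obtains its pairs from a diffusion model; the function names are hypothetical.

```python
def interpolate(noise, image, t):
    """Point on the straight line between a coupled noise/image pair."""
    return [(1.0 - t) * z + t * x for z, x in zip(noise, image)]


def target_velocity(noise, image):
    """Regression target for the velocity network on a linear path."""
    return [x - z for z, x in zip(noise, image)]


def euler_sample(velocity_fn, noise, steps=5):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy 3-dimensional "image" and its coupled noise vector.
noise = [0.5, -1.0, 2.0]
image = [1.0, 0.0, -1.0]
v_star = target_velocity(noise, image)

# With a perfectly straight path the velocity is constant, so a few
# Euler steps recover the image up to floating-point error.
sample = euler_sample(lambda x, t: v_star, noise, steps=5)
print(sample)
```

When paths curve, coarse Euler steps accumulate error, which is why a coupling that yields straighter paths helps few-step generation.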

Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding

Authors: Wanfu Wang, Qipeng Huang, Guangquan Xue, Xiaobo Liang, Juntao Li

First: 2025-09-04T14:17:01+00:00 · Latest: 2025-09-04T14:17:01+00:00

Abstract

Vision Language Models (VLMs) have recently achieved significant progress in bridging visual perception and linguistic reasoning. OpenAI's o3 model recently introduced a zoom-in search strategy that effectively elicits active perception capabilities in VLMs, improving downstream task performance. However, enabling VLMs to reason effectively over appropriate image regions remains a core challenge in GUI grounding, particularly under high-resolution inputs and complex multi-element visual interactions. In this work, we propose LASER, a self-evolving framework that progressively endows VLMs with multi-step perception capabilities, enabling precise coordinate prediction. Specifically, our approach integrates Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to jointly encourage both accuracy and diversity in constructing high-quality preference data. This combination explicitly guides the model to focus on instruction-relevant key regions while adaptively allocating reasoning steps based on task complexity. Comprehensive experiments on the ScreenSpot-Pro and ScreenSpot-v2 benchmarks demonstrate consistent performance gains, validating the effectiveness of our method. Furthermore, when fine-tuned on GTA1-7B, LASER achieves a score of 55.7 on the ScreenSpot-Pro benchmark, establishing a new state-of-the-art (SoTA) among 7B-scale models.

Summary

This work addresses the challenge of enabling Vision Language Models (VLMs) to effectively reason over appropriate image regions in GUI grounding tasks. The authors propose LASER, a self-evolving framework that uses Monte Carlo quality estimation and IoU-based region quality evaluation to improve multi-step perception capabilities. Experiments show consistent performance gains on the ScreenSpot-Pro and ScreenSpot-v2 benchmarks, and LASER achieves a score of 55.7 on ScreenSpot-Pro, setting a new state-of-the-art among 7B-scale models.

The study tackles the challenge of getting VLMs to reason effectively over the appropriate image regions in GUI grounding. The authors propose LASER, a self-evolving framework that combines Monte Carlo quality estimation with IoU-based region quality evaluation to improve multi-step perception. Experiments on ScreenSpot-Pro and ScreenSpot-v2 show consistent gains, and when fine-tuned on GTA1-7B, LASER scores 55.7 on ScreenSpot-Pro, the best result among 7B-scale models.
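
The IoU term used for region quality can be sketched as below. The `(x1, y1, x2, y2)` box format is an assumption, and how LASER combines this score with the Monte Carlo estimate is not specified in the abstract.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted focus region versus the target element's region.
predicted = (0, 0, 2, 2)
target = (1, 1, 3, 3)
print(iou(predicted, target))
```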

Exposing Synthetic Speech: Model Attribution and Detection of AI-generated Speech via Audio Fingerprints

Authors: Matías Pizarro, Mike Laszkiewicz, Shawkat Hesso, Dorothea Kolossa, Asja Fischer

First: 2024-11-21T10:55:49+00:00 · Latest: 2025-09-04T12:43:52+00:00

Abstract

As speech generation technologies continue to advance in quality and accessibility, the risk of malicious use cases, including impersonation, misinformation, and spoofing, increases rapidly. This work addresses this threat by introducing a simple, training-free, yet effective approach for detecting AI-generated speech and attributing it to its source model. Specifically, we tackle three key tasks: (1) single-model attribution in an open-world setting, where the goal is to determine whether a given audio sample was generated by a specific target neural speech synthesis system (with access only to data from that system); (2) multi-model attribution in a closed-world setting, where the objective is to identify the generating system from a known pool of candidates; and (3) detection of synthetic versus real speech. Our approach leverages standardized average residuals, that is, the difference between an input audio signal and its filtered version obtained with either a low-pass filter or the EnCodec audio autoencoder. We demonstrate that these residuals consistently capture artifacts introduced by diverse speech synthesis systems, serving as distinctive, model-agnostic fingerprints for attribution. Across extensive experiments, our approach achieves AUROC scores exceeding 99% in most scenarios, evaluated on augmented benchmark datasets that pair real speech with synthetic audio generated by multiple synthesis systems. In addition, our robustness analysis underscores the method's ability to maintain high performance even in the presence of moderate additive noise. Due to its simplicity, efficiency, and strong generalization across speech synthesis systems and languages, this technique offers a practical tool for digital forensics and security applications.

Summary

This work aims to address the growing risk of malicious use of speech generation technologies by introducing a training-free method for detecting and attributing AI-generated speech. The method uses standardized average residuals to identify artifacts introduced by different speech synthesis systems, acting as model-agnostic fingerprints. Experiments show that the approach achieves AUROC scores over 99% in most scenarios and maintains high performance even with moderate additive noise, demonstrating its effectiveness in digital forensics and security applications.

The study counters malicious use of speech generation technology with a training-free method for detecting AI-generated speech and attributing it to its source model. The method uses standardized average residuals to identify the artifacts introduced by different speech synthesis systems, which serve as model-agnostic fingerprints. Experiments show AUROC scores above 99% in most scenarios, with high performance maintained even in the presence of noise.
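
The residual idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a moving average stands in for the low-pass filter, the residuals are averaged in the time domain over equal-length signals (a practical system would more plausibly work on spectra), and the "synthesis artifact" is a made-up alternating component. The score is a plain correlation with the fingerprint.

```python
import math


def moving_average(signal, width=5):
    """Crude low-pass filter: centered moving average, truncated at the edges."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def residual(signal, width=5):
    """Input minus its low-pass filtered version."""
    return [s - m for s, m in zip(signal, moving_average(signal, width))]


def standardize(vec):
    """Zero mean and unit variance (left unscaled if the variance is zero)."""
    mean = sum(vec) / len(vec)
    std = math.sqrt(sum((v - mean) ** 2 for v in vec) / len(vec)) or 1.0
    return [(v - mean) / std for v in vec]


def fingerprint(signals, width=5):
    """Standardized average residual over samples from one system."""
    residuals = [residual(s, width) for s in signals]
    n = len(residuals[0])
    avg = [sum(r[i] for r in residuals) / len(residuals) for i in range(n)]
    return standardize(avg)


def score(signal, fp, width=5):
    """Correlation between a sample's standardized residual and a fingerprint."""
    r = standardize(residual(signal, width))
    return sum(a * b for a, b in zip(r, fp)) / len(fp)


def tone(freq, n=64, artifact=0.0):
    """Toy signal: a sine plus an optional alternating 'synthesis artifact'."""
    return [math.sin(2 * math.pi * freq * i / n) + artifact * (-1) ** i
            for i in range(n)]


# Fingerprint built from two samples of a hypothetical synthesis system.
fp = fingerprint([tone(1, artifact=0.3), tone(2, artifact=0.3)])
fake = tone(3, artifact=0.3)   # unseen sample from the same system
real = tone(3)                 # clean sample without the artifact
print(score(fake, fp), score(real, fp))
```

Thresholding the score gives single-model attribution; comparing scores against several fingerprints gives closed-world attribution.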

TAGAL: Tabular Data Generation using Agentic LLM Methods

Authors: Benoît Ronval, Pierre Dupont, Siegfried Nijssen

First: 2025-09-04T12:25:14+00:00 · Latest: 2025-09-04T12:25:14+00:00

Abstract

Generating data is a common way to improve performance on machine learning tasks, including the training of classification models. In this paper, we present TAGAL, a collection of methods able to generate synthetic tabular data using an agentic workflow. The methods leverage Large Language Models (LLMs) for an automatic and iterative process that uses feedback to improve the generated data without any further LLM training. The use of LLMs also allows for the addition of external knowledge in the generation process. We evaluate TAGAL across diverse datasets and different aspects of quality for the generated data. We look at the utility of downstream ML models, both by training classifiers on synthetic data only and by combining real and synthetic data. Moreover, we compare the similarities between the real and the generated data. We show that TAGAL is able to perform on par with state-of-the-art approaches that require LLM training and generally outperforms other training-free approaches. These findings highlight the potential of agentic workflows and open new directions for LLM-based data generation methods.

Summary

TAGAL is a collection of methods for generating synthetic tabular data using an agentic workflow with Large Language Models (LLMs). It leverages LLMs for an automatic and iterative process that improves generated data through feedback without further LLM training. The methods are evaluated across various datasets and aspects of data quality, showing that TAGAL performs comparably to state-of-the-art approaches requiring LLM training and generally outperforms other training-free approaches in generating data for downstream ML models.

TAGAL generates synthetic tabular data with an agentic workflow built on Large Language Models (LLMs). The LLMs drive an automatic, iterative process that uses feedback to improve the generated data without further LLM training. Evaluated on multiple datasets and data-quality aspects, TAGAL performs on par with state-of-the-art methods that require LLM training and generally outperforms other training-free methods at generating data for downstream ML models.

MUNBa: Machine Unlearning via Nash Bargaining

Authors: Jing Wu, Mehrtash Harandi

First: 2024-11-23T12:18:28+00:00 · Latest: 2025-09-04T11:00:46+00:00

Abstract

Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts and dominance, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict and dominance issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain and balance their contributions. To this end, inspired by the Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto stationary point. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm's effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving a better trade-off between forgetting and preserving. Our results also highlight improvements in forgetting precision, preservation of generalization, and robustness against adversarial attacks.

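Summary

MUNBa reformulates Machine Unlearning (MU) as a two-player cooperative game using Nash bargaining theory to address gradient conflicts and dominance issues. This approach ensures an equilibrium solution that balances the forgetting and preserving objectives. Experiments on image classification and image generation with ResNet, CLIP, and text-to-image diffusion models show that MUNBa outperforms existing MU algorithms in achieving a better trade-off between forgetting and preserving, with improved forgetting precision and robustness against adversarial attacks.

The paper uses Nash bargaining theory to formulate machine unlearning (MU) as a two-player cooperative game, balancing the goals of forgetting specific data and preserving overall model performance while avoiding gradient conflicts. Experiments show that the proposed MUNBa outperforms existing MU algorithms on tasks such as image classification and generation, achieving a better trade-off between forgetting and preservation along with improved forgetting precision and robustness against adversarial attacks.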
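
The abstract does not give the closed form. As a hedged illustration, the sketch below assumes the first-order condition commonly used for Nash bargaining over gradients in multi-task learning, (G^T G) a = 1/a elementwise with G = [g_f, g_p], and solves it for two players. Working through the two equations gives a_i = 1 / (||g_i|| sqrt(1 + cos)), where cos is the cosine of the angle between the two proposals. Whether this matches the paper's derivation is an assumption, and the gradient values are made up.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def bargained_direction(g_forget, g_preserve, eps=1e-12):
    """Two-player solution of (G^T G) a = 1/a; assumes both gradients are nonzero.

    The result has a positive inner product with both proposals unless they
    point in exactly opposite directions (cos = -1), where eps keeps the
    scale finite.
    """
    n_f, n_p = norm(g_forget), norm(g_preserve)
    cos = dot(g_forget, g_preserve) / (n_f * n_p)
    scale = math.sqrt(max(1.0 + cos, eps))
    alpha_f = 1.0 / (n_f * scale)
    alpha_p = 1.0 / (n_p * scale)
    return [alpha_f * f + alpha_p * p for f, p in zip(g_forget, g_preserve)]


# Conflicting proposals where the forgetting gradient dominates in norm.
g_forget = [10.0, 0.0]
g_preserve = [-1.0, 1.0]

naive = [f + p for f, p in zip(g_forget, g_preserve)]
direction = bargained_direction(g_forget, g_preserve)

print(dot(g_preserve, naive))      # negative: the plain sum works against preservation
print(dot(g_preserve, direction))  # positive: the bargained direction does not
```

The weights scale inversely with each gradient's norm, so neither player dominates the combined direction.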

SMooGPT: Stylized Motion Generation using Large Language Models

Authors: Lei Zhong, Yi Yang, Changjian Li

First: 2025-09-04T09:41:18+00:00 · Latest: 2025-09-04T09:41:18+00:00

Abstract

Stylized motion generation is actively studied in computer graphics, especially benefiting from the rapid advances in diffusion models. The goal of this task is to produce a novel motion respecting both the motion content and the desired motion style, e.g., "walking in a loop like a Monkey". Existing research attempts to address this problem via motion style transfer or conditional motion generation. These methods typically embed the motion style into a latent space and guide the motion implicitly in a latent space as well. Despite the progress, they suffer from low interpretability and control and limited generalization to new styles, and they fail to produce motions other than "walking" due to the strong bias in the public stylization dataset. In this paper, we propose to solve the stylized motion generation problem from a new perspective of reasoning-composition-generation, based on our observations: i) human motion can often be effectively described using natural language in a body-part centric manner, ii) LLMs exhibit a strong ability to understand and reason about human motion, and iii) human motion has an inherently compositional nature, facilitating the generation of new motion content or styles via effective recomposition. We thus propose utilizing body-part text space as an intermediate representation, and present SMooGPT, a fine-tuned LLM, acting as a reasoner, composer, and generator when generating the desired stylized motion. Our method executes in the body-part text space with much higher interpretability, enabling fine-grained motion control, effectively resolving potential conflicts between motion content and style, and generalizing well to new styles thanks to the open-vocabulary ability of LLMs. Comprehensive experiments and evaluations, and a user perceptual study, demonstrate the effectiveness of our approach, especially under the pure text-driven stylized motion generation.

Summary

The paper addresses the challenge of generating stylized motion by proposing SMooGPT, which leverages large language models to reason, compose, and generate motion in a body-part text space. This approach enhances interpretability and control over motion generation, resolves conflicts between content and style, and generalizes well to new styles. Experiments and user studies confirm its effectiveness, particularly in pure text-driven stylized motion generation.

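The paper proposes SMooGPT, which uses large language models (LLMs) for reasoning, composition, and generation, with a body-part text space as the intermediate representation, giving higher interpretability and finer control over motion content and style. Experiments show that SMooGPT handles new styles effectively and can generate motions such as "walking in a loop like a monkey", indicating better interpretability and generalization than existing methods.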

DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model

Authors: Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, Chi Zhang

First: 2025-08-18T03:28:57+00:00 · Latest: 2025-09-04T08:05:29+00:00

Abstract

Recent advances in large vision-language models (LVLMs) have enabled a new paradigm of end-to-end document image parsing, excelling in Optical Character Recognition (OCR) tasks such as text, table, and formula recognition. However, generative LVLMs, similarly to large language models (LLMs), are prone to hallucinations, that is, generating words that do not exist in input images. Furthermore, LVLMs are designed for general purposes and tend to be less effective on OCR tasks compared to expert models that are trained on domain-specific datasets. In this paper, we propose DianJin-OCR-R1, a reasoning-enhanced framework designed to address these limitations through training reasoning-and-tool interleaved VLMs. Given a recognition instruction, our DianJin-OCR-R1 model first recognizes the content in the input image using its own OCR capabilities, then calls other tools (i.e., other expert models) to obtain their results as references, and finally "looks again" at the image and rethinks its reasoning to provide the final recognized content. Since the architectures of expert models are tailored for specific OCR tasks, which makes them less prone to hallucinations, their results can help VLMs mitigate hallucinations. We evaluate our model on ReST and OmniDocBench, and experimental results show that our DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models, which demonstrates the effectiveness of our method. Additionally, the results indicate that enhancing expert models, which are typically small and easy to iterate, enables performance improvements for VLMs.

Summary

The research aims to improve OCR capabilities by addressing hallucinations and the limited effectiveness of general-purpose large vision-language models (LVLMs) on domain-specific OCR tasks. The DianJin-OCR-R1 model is proposed, which integrates reasoning and tool calls to enhance OCR performance. The model first uses its own OCR capabilities, then calls expert models for references, and finally rethinks the reasoning process. Evaluation on ReST and OmniDocBench shows that DianJin-OCR-R1 outperforms non-reasoning counterparts and expert OCR models, demonstrating the effectiveness of this approach.

The research aims to improve OCR capabilities by addressing hallucinations and limited effectiveness on domain-specific tasks. It proposes DianJin-OCR-R1, a reasoning-enhanced framework that interleaves reasoning with tool calls. The model first applies its own OCR capabilities, then calls expert models for reference results, and finally rethinks its reasoning. Evaluation on ReST and OmniDocBench shows that DianJin-OCR-R1 outperforms its non-reasoning counterparts and expert OCR models, demonstrating the effectiveness of the approach.

Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection

Authors: Yijun Zhou, Yikui Zhai, Zilu Ying, Tingfeng Xian, Wenlve Zhou, Zhiheng Zhou, Xiaolin Tian, Xudong Jia, Hongsheng Zhang, C. L. Philip Chen

First: 2025-09-04T07:39:18+00:00 · Latest: 2025-09-04T07:39:18+00:00

Abstract

Although deep learning has advanced remote sensing change detection (RSCD), most methods rely solely on the image modality, limiting feature representation, change pattern modeling, and generalization, especially under illumination and noise disturbances. To address this, we propose MMChange, a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. An Image Feature Refinement (IFR) module is introduced to highlight key regions and suppress environmental noise. To overcome the semantic limitations of image features, we employ a vision-language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module then captures fine-grained semantic shifts, guiding the model toward meaningful changes. To bridge the heterogeneity between modalities, we design an Image-Text Feature Fusion (ITFF) module that enables deep cross-modal integration. Extensive experiments on LEVIR-CD, WHU-CD, and SYSU-CD demonstrate that MMChange consistently surpasses state-of-the-art methods across multiple metrics, validating its effectiveness for multimodal RSCD. Code is available at: https://github.com/yikuizhai/MMChange.

中文标题/摘要

标题：基于文本差异增强的多模态特征融合网络在遥感变化检测中的应用

尽管深度学习已推动遥感变化检测（RSCD）的进步，但大多数方法仅依赖图像模态，限制了特征表示、变化模式建模和泛化能力，尤其是在光照和噪声干扰下。为解决这一问题，我们提出了一种名为MMChange的多模态RSCD方法，结合图像和文本模态以提高准确性和鲁棒性。引入了图像特征精炼（IFR）模块以突出关键区域并抑制环境噪声。为克服图像特征的语义限制，我们采用视觉语言模型（VLM）生成双时相图像的语义描述。随后，文本差异增强（TDE）模块捕捉细微的语义变化，引导模型关注有意义的变化。为弥合模态之间的异质性，我们设计了图像文本特征融合（ITFF）模块，实现深层次的跨模态整合。在LEVIRCD、WHUCD和SYSUCD上的广泛实验表明，MMChange在多个指标上均超越了现有方法，验证了其在多模态RSCD中的有效性。代码可在：https://github.com/yikuizhai/MMChange 获取。

Summary / 总结

The research aims to improve the accuracy and robustness of remote sensing change detection by integrating image and text modalities. The method, MMChange, includes an Image Feature Refinement module to highlight key regions, a Textual Difference Enhancement module to capture semantic shifts, and an Image Text Feature Fusion module to integrate modalities. Experiments show MMChange outperforms existing methods on multiple metrics across different datasets, validating its effectiveness.

研究旨在通过结合图像和文本模态来提高遥感变化检测的准确性和鲁棒性。方法MMChange引入了图像特征精炼模块以突出关键区域并抑制噪声，文本差异增强模块以捕捉语义变化，并设计了图像文本特征融合模块以实现跨模态的深度集成。在LEVIRCD、WHUCD和SYSUCD上的实验表明，MMChange在多个指标上优于现有方法，验证了其在多模态遥感变化检测中的有效性。

ANTS: Shaping the Adaptive Negative Textual Space by MLLM for OOD Detection

Authors: Zhu Wenjie, Zhang Yabin, Xin Jin, Wenjun Zeng, Lei Zhang

First: 2025-09-04T07:26:20+00:00 · Latest: 2025-09-04T07:26:20+00:00

Abstract

The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. In addition, the presence of false negative labels significantly degrades their near-OOD performance. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we identify images likely to be OOD samples as negative images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we first identify the subset of ID classes that are visually similar to negative images and then leverage the reasoning capability of MLLMs to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD) without relying on task-specific prior knowledge, making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 4.2%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
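
For readers unfamiliar with the metric quoted above, FPR95 is the fraction of OOD samples wrongly accepted as in-distribution at the score threshold that keeps 95% of ID samples. The following is a minimal sketch, assuming a higher score means "more in-distribution"; the function name and the toy scores are illustrative and do not come from the paper.

```python
import math

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate on OOD data at the threshold that keeps `tpr` of ID data.

    Assumes a higher score means "more in-distribution".
    """
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of top-ranked ID samples that covers the requested fraction
    # (the small epsilon guards against floating-point overshoot before ceil).
    k = max(1, math.ceil(tpr * len(ranked) - 1e-9))
    threshold = ranked[k - 1]
    # OOD samples scoring at or above the threshold are wrongly accepted as ID.
    false_pos = sum(1 for s in ood_scores if s >= threshold)
    return false_pos / len(ood_scores)

# Toy scores, chosen only for illustration.
id_scores = [0.99, 0.97, 0.95, 0.93, 0.91, 0.90, 0.88, 0.85, 0.80, 0.40]
ood_scores = [0.10, 0.20, 0.30, 0.35, 0.45, 0.50, 0.60, 0.70, 0.82, 0.92]
print(fpr_at_tpr(id_scores, ood_scores, tpr=0.9))
```

Lower is better: a method that separates ID from OOD scores more cleanly admits fewer OOD samples at the same ID acceptance rate.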

中文标题/摘要

标题：ANTS: 通过MLLM塑造适应性负文本空间以进行OOD检测

引入负标签（NLs）已被证明能有效提升Out-of-Distribution（OOD）检测。然而，现有方法往往缺乏对OOD图像的理解，难以构建准确的负空间。此外，假负标签的存在显著降低了其近OOD性能。为解决这些问题，我们提出利用多模态大语言模型（MLLM）的理解和推理能力，塑造适应性负文本空间（ANTS）。具体而言，我们识别出可能为OOD样本的图像作为负图像，并提示MLLM描述这些图像，生成能够精确刻画OOD分布并增强远OOD检测的表达性负句子。对于近OOD设置，其中OOD样本与分布内（ID）子集相似，我们首先识别出与负图像视觉相似的ID类子集，然后利用MLLM的推理能力生成针对该子集的视觉相似负标签，有效减少假负标签并提高近OOD检测。为了平衡这两种类型的负文本空间，我们设计了一种自适应加权得分，使方法能够在无需依赖特定任务先验知识的情况下处理不同的OOD任务设置（近OOD和远OOD），使其在开放环境中具有高度适应性。在ImageNet基准测试上，我们的ANTS显著降低了FPR95，建立了新的最佳水平。此外，我们的方法无需训练且零样本，具有高可扩展性。

Summary / 总结

The paper introduces ANTS, a method that uses multimodal large language models (MLLMs) to shape an adaptive negative textual space for enhancing Out-of-Distribution (OOD) detection. By identifying OOD samples and prompting MLLMs to generate precise negative descriptions, ANTS improves far-OOD detection. For near-OOD samples, ANTS generates visually similar negative labels to reduce false negatives. ANTS uses an adaptive weighted score to balance far-OOD and near-OOD detection, achieving a 4.2% reduction in FPR95 on the ImageNet benchmark and demonstrating high scalability without requiring training or specific prior knowledge.

该论文提出了一种名为ANTS的方法，利用多模态大语言模型（MLLMs）塑造自适应的负文本空间，以增强Out-of-Distribution (OOD)检测。通过识别OOD样本并促使MLLMs生成精确的负描述，ANTS提升了远OOD检测效果。对于近OOD样本，ANTS生成视觉上相似的负标签以减少误负。ANTS使用自适应加权分数来平衡远OOD和近OOD检测，实现了在ImageNet基准上4.2%的FPR95降低，并且无需训练和特定先验知识，具有高可扩展性。

Defending LVLMs Against Vision Attacks through Partial-Perception Supervision

Authors: Qi Zhou, Tianlin Li, Qing Guo, Dongxia Wang, Yun Lin, Yang Liu, Jin Song Dong

Venue: ICML 2025

First: 2024-12-17T09:38:58+00:00 · Latest: 2025-09-04T06:43:22+00:00

Comments: Accepted to ICML 2025

Abstract

Recent studies have raised significant concerns regarding the vulnerability of Large Vision Language Models (LVLMs) to maliciously injected or perturbed input images, which can mislead their responses. Existing defense methods show that such vision attacks are sensitive to image modifications especially cropping, using majority voting across responses of modified images as corrected responses. However, these modifications often result in partial images and distort the semantics, which reduces response quality on clean images after voting. Instead of directly using responses from partial images for voting, we investigate using them to supervise the LVLM's responses to the original images. We propose a black-box, training-free method called DPS (Defense through Partial-Perception Supervision). In this approach, the model is prompted using the responses generated by a model that perceives only a partial image. With DPS, the model can adjust its response based on partial image understanding when under attack, while confidently maintaining its original response for clean input. Our findings show that the weak model can supervise the strong model: when faced with an attacked input, the strong model becomes less confident and adjusts its response based on the weak model's partial understanding, effectively defending against the attack. With clean input, it confidently maintains its original response. Empirical experiments show our method outperforms the baseline, cutting the average attack success rate by 76.3% across six datasets on three popular models.
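
The difference between the voting baseline and the supervision idea described above can be sketched in a few lines. The prompt wording and function names here are hypothetical, and the paper's actual prompts may differ; the sketch only shows that voting replaces the answer outright, while DPS-style supervision passes the partial-view answers back as context for a second look at the full image.

```python
from collections import Counter

def majority_vote(partial_responses):
    """Voting baseline: return the most common answer among cropped-image responses."""
    return Counter(partial_responses).most_common(1)[0][0]

def build_supervised_prompt(question, partial_responses):
    """DPS-style idea (hypothetical wording): pass partial-view answers as hints
    and let the model decide for itself on the full image."""
    hints = "\n".join(f"- {r}" for r in partial_responses)
    return (
        f"Question: {question}\n"
        "Answers from a model that saw only part of the image:\n"
        f"{hints}\n"
        "Look at the full image. Keep your own answer if you are confident; "
        "otherwise reconsider it in light of the partial-view answers."
    )

responses = ["a cat", "a cat", "a dog"]
print(majority_vote(responses))
print(build_supervised_prompt("What animal is shown?", responses))
```

The resulting prompt would be sent to the LVLM together with the original image; that call is omitted because it depends on the model API in use.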

中文标题/摘要

标题：通过部分感知监督防御LVLMs的视觉攻击

近期研究对大型视觉语言模型（LVLMs）在恶意注入或扰动输入图像时的脆弱性提出了严重关切，这些攻击可以误导模型的响应。现有的防御方法表明，这类视觉攻击对图像修改特别敏感，尤其是裁剪，通过在修改图像的响应中使用多数投票来获得正确的响应。然而，这些修改通常会导致部分图像，从而扭曲语义，这在投票后降低了干净图像的响应质量。我们不直接使用部分图像的响应进行投票，而是研究使用它们来监督LVLM对原始图像的响应。我们提出了一种无需训练的黑盒方法，称为DPS（通过部分感知监督的防御）。在此方法中，模型使用仅感知部分图像的模型生成的响应进行提示。使用DPS，模型在受到攻击时可以根据部分图像的理解调整其响应，同时自信地保持其原始响应以应对干净输入。我们的研究发现，弱模型可以监督强模型：当面对攻击输入时，强模型变得不那么自信，并根据弱模型的部分理解调整其响应，从而有效防御攻击。在干净输入时，它自信地保持其原始响应。实验证明，我们的方法优于基线，六个数据集上三个流行模型的平均攻击成功率降低了76.3%。

Summary / 总结

This study addresses the vulnerability of Large Vision Language Models (LVLMs) to vision attacks by proposing a black-box, training-free method called DPS (Defense through Partial-Perception Supervision). DPS uses responses from a model that perceives only a partial image to supervise the LVLM's responses to the original image. The method enhances the model's ability to adjust its response when under attack while maintaining confidence for clean inputs. Experiments show that DPS significantly reduces the average attack success rate by 76.3% across six datasets on three popular models.

该研究提出了一种名为DPS（通过部分感知监督防御）的黑盒、无需训练的方法，用于解决大型视觉语言模型（LVLMs）对视觉攻击的脆弱性问题。DPS 使用仅感知部分图像的模型生成的响应来监督LVLM 对原始图像的响应。该方法允许模型在受到攻击时根据部分图像的理解调整其响应，同时在干净输入时保持其原始响应。实验结果显示，DPS 在三个流行模型的六个数据集上将平均攻击成功率降低了76.3%，优于基线方法。

Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model

Authors: Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo

Venue: ICCV 2025

First: 2025-09-04T05:42:02+00:00 · Latest: 2025-09-04T05:42:02+00:00

Comments: ICCV 2025 - LIMIT Workshop

Abstract

Contrastive vision-language models excel in zero-shot image recognition but face challenges in few-shot scenarios due to computationally intensive offline fine-tuning using prompt learning, which risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.
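
As a rough illustration of refining a category embedding with support examples, the sketch below applies a single cross-attention step with a residual connection. It is a simplification under stated assumptions: there are no learned projections, the vectors are toy values, and `refine_class_embedding` is a hypothetical name, not the paper's Memory Attn-Adapter.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refine_class_embedding(class_emb, support_embs):
    """Cross-attention with a residual: the class embedding queries the support set."""
    scale = math.sqrt(len(class_emb))
    # Attention weights: scaled dot-product similarity to each support embedding.
    weights = softmax([dot(class_emb, s) / scale for s in support_embs])
    # Weighted average of the support embeddings.
    attended = [sum(w * s[i] for w, s in zip(weights, support_embs))
                for i in range(len(class_emb))]
    # Residual connection keeps the original embedding in the result.
    return [c + a for c, a in zip(class_emb, attended)], weights

class_emb = [1.0, 0.0]                            # toy text embedding of one class
support = [[0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]    # toy image embeddings of labeled samples
refined, weights = refine_class_embedding(class_emb, support)
print(refined, weights)
```

Support examples that resemble the class embedding receive larger weights, so the refined embedding is pulled toward the few labeled samples without changing the base model.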

中文标题/摘要

标题：Attn-Adapter：注意力即视觉-语言模型在线少样本学习所需的一切

对比视觉-语言模型在零样本图像识别中表现出色，但在少样本场景中由于使用提示学习进行计算密集型离线微调而面临挑战，这可能导致过拟合。为克服这些限制，我们提出了一种名为Attn-Adapter的新颖在线少样本学习框架，通过双重注意力机制增强CLIP的适应性。我们的设计通过两个组件整合了数据集特定的信息：Memory Attn-Adapter，通过支持样本细化类别嵌入；Local-Global Attn-Adapter，通过整合局部和全局特征丰富图像嵌入。该架构能够在少量标记样本下实现动态适应，而无需重新训练基础模型。Attn-Adapter在跨类别和跨数据集泛化方面优于现有方法，保持高效的推理并适用于各种CLIP基础模型。

Summary / 总结

The research aims to address the limitations of contrastive vision-language models in few-shot scenarios by proposing Attn-Adapter, which enhances CLIP's adaptability through a dual attention mechanism. The method includes Memory Attn-Adapter for refining category embeddings and Local-Global Attn-Adapter for enriching image embeddings. Experimental results show that Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization while maintaining efficient inference and scalability across CLIP backbones.

研究旨在提高如CLIP这样的视觉-语言模型在少量标注样本下的学习能力，这些模型在零样本图像识别方面表现出色，但在少量样本场景下面临挑战，因为需要进行计算密集型的离线微调。提出的Attn-Adapter框架引入了双注意力机制，能够动态适应新类别，增强跨类别和跨数据集的泛化能力。实验表明，Attn-Adapter在保持高效推理的同时，能够跨不同CLIP骨干网络进行扩展，并优于现有方法。

Weakly-Supervised Learning of Dense Functional Correspondences

Authors: Stefan Stojanov, Linan Zhao, Yunzhi Zhang, Daniel L. K. Yamins, Jiajun Wu

Venue: ICCV 2025

First: 2025-09-04T05:39:16+00:00 · Latest: 2025-09-04T05:39:16+00:00

Comments: Accepted at ICCV 2025. Project website: https://dense-functional-correspondence.github.io/

Abstract

Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly-supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-language models to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision language models.

中文标题/摘要

标题：弱监督学习密集功能对应

在形状重建和机器人操作等任务中，跨图像对建立密集对应关系是必不可少的。在不同类别之间的匹配挑战中，物体的功能，即物体对其他物体造成的影响，可以指导对应关系的建立。因为能够执行特定功能的物体部分在形状和外观上往往具有相似性。我们基于这一观察推导出密集功能对应的概念，并提出了一种弱监督学习范式来解决预测任务。我们方法的核心思想是，可以利用视觉语言模型对多视角图像进行伪标签，以获得功能部分。然后，我们将此与基于像素对应关系的密集对比学习相结合，将功能和空间知识提炼到一个新模型中，以建立密集功能对应。此外，我们还整理了合成和真实评估数据集作为任务基准。我们的结果表明，我们的方法在基线解决方案（包括现成的自监督图像表示和基于视觉语言模型）上具有优势。

Summary / 总结

This paper addresses the challenge of establishing dense correspondences across different object categories by leveraging the functional role of object parts. It proposes a weakly-supervised learning paradigm that uses vision-language models to pseudo-label functional parts and integrates this with dense contrastive learning. The approach demonstrates superior performance compared to baseline methods on both synthetic and real datasets, showing the effectiveness of incorporating functional knowledge into correspondence prediction.

该论文通过利用物体部分的功能作用来解决不同类别间密集对应关系的建立问题，提出了一种弱监督学习范式，利用视觉语言模型进行伪标签标注功能部分，并将其与基于像素对应关系的密集对比学习相结合。该方法在合成和真实数据集上的表现优于基线方法，展示了将功能知识融入对应关系预测的有效性。

Expedition & Expansion: Leveraging Semantic Representations for Goal-Directed Exploration in Continuous Cellular Automata

Authors: Sina Khajehabdollahi, Gautier Hamon, Marko Cvjetko, Pierre-Yves Oudeyer, Clément Moulin-Frier, Cédric Colas

First: 2025-09-04T03:44:44+00:00 · Latest: 2025-09-04T03:44:44+00:00

Abstract

Discovering diverse visual patterns in continuous cellular automata (CA) is challenging due to the vastness and redundancy of high-dimensional behavioral spaces. Traditional exploration methods like Novelty Search (NS) expand locally by mutating known novel solutions but often plateau when local novelty is exhausted, failing to reach distant, unexplored regions. We introduce Expedition and Expansion (E&E), a hybrid strategy where exploration alternates between local novelty-driven expansions and goal-directed expeditions. During expeditions, E&E leverages a Vision-Language Model (VLM) to generate linguistic goals: descriptions of interesting but hypothetical patterns that drive exploration toward uncharted regions. By operating in semantic spaces that align with human perception, E&E both evaluates novelty and generates goals in conceptually meaningful ways, enhancing the interpretability and relevance of discovered behaviors. Tested on Flow Lenia, a continuous CA known for its rich, emergent behaviors, E&E consistently uncovers more diverse solutions than existing exploration methods. A genealogical analysis further reveals that solutions originating from expeditions disproportionately influence long-term exploration, unlocking new behavioral niches that serve as stepping stones for subsequent search. These findings highlight E&E's capacity to break through local novelty boundaries and explore behavioral landscapes in human-aligned, interpretable ways, offering a promising template for open-ended exploration in artificial life and beyond.
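
The novelty criterion that Novelty Search relies on is commonly computed as the mean distance from a candidate to its k nearest neighbors in an archive of known behaviors. The sketch below shows that score on toy 2-D points; in E&E the behavior descriptors live in a semantic embedding space, so the points, the value of k, and the function name here are illustrative assumptions only.

```python
import math

def novelty(candidate, archive, k=3):
    """Mean Euclidean distance from `candidate` to its k nearest archive entries."""
    dists = sorted(math.dist(candidate, other) for other in archive)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)

# Toy archive: three behaviors clustered near the origin and one outlier.
archive = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0)]
near = (0.05, 0.05)   # sits inside the explored cluster
far = (3.0, 3.0)      # sits in an unexplored region
print(novelty(near, archive) < novelty(far, archive))
```

Because mutation produces candidates close to their parents, scores of this kind tend to flatten once a neighborhood is well covered, which is the plateau that goal-directed expeditions are meant to escape.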

中文标题/摘要

标题：探险与扩展：利用语义表示在连续细胞自动机中进行目标导向探索

在连续细胞自动机（CA）中发现多样的视觉模式具有挑战性，因为高维行为空间既庞大又冗余。传统探索方法如新颖性搜索（NS）通过突变已知的新颖解进行局部扩展，但在局部新颖性耗尽时往往会停滞，无法到达遥远的未探索区域。我们引入了探险与扩展（E&E）这一混合策略，其中探索交替进行局部新颖性驱动的扩展和目标导向的探险。在探险期间，E&E 利用视觉语言模型（VLM）生成语言目标——对有趣但假设的模式的描述，从而驱动探索向未开发区域前进。通过在与人类感知相一致的语义空间中操作，E&E 既评估新颖性又以概念上有意义的方式生成目标，从而增强发现行为的可解释性和相关性。在 Flow Lenia 上进行测试，这是一种以丰富、涌现行为著称的连续 CA，E&E 一致地发现了比现有探索方法更多的多样化解决方案。进一步的谱系分析表明，源自探险的解决方案在长期探索中不成比例地产生影响，解锁了作为后续搜索踏脚石的新行为生态位。这些发现突显了 E&E 能够突破局部新颖性边界，在与人类对齐、可解释的方式下探索行为景观的能力，为人工生命中的开放探索提供了有希望的模板，超越了人工生命领域。

Summary / 总结

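The paper introduces Expedition and Expansion (E&E), a hybrid exploration strategy for continuous cellular automata that alternates between local novelty-driven expansions and goal-directed expeditions. During expeditions, a Vision-Language Model generates linguistic goals describing interesting but hypothetical patterns, steering the search toward unexplored regions. On Flow Lenia, E&E uncovers more diverse solutions than existing exploration methods, and a genealogical analysis shows that solutions originating from expeditions disproportionately influence long-term exploration.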

Measuring How (Not Just Whether) VLMs Build Common Ground

Authors: Saki Imai, Mert İnan, Anthony Sicilia, Malihe Alikhani

First: 2025-09-04T01:43:49+00:00 · Latest: 2025-09-04T01:43:49+00:00

Abstract

Large vision language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.

中文标题/摘要

标题：测量VLMs如何（而不仅仅是是否）建立共同基础

大型视觉语言模型（VLMs）越来越多地声称具备推理能力，但当前基准测试仅在单轮或问答设置中评估它们。然而，定位是一个互动过程，在这个过程中，人们通过持续沟通逐渐发展出共享理解。我们引入了一套四指标体系（定位效率、内容一致性、词汇适应性和类人度），系统评估VLM在互动定位环境中的表现。我们部署该体系于150场三款专有VLM之间的自我对弈互动指称游戏中，并将其与人类双人组进行比较。所有三个模型在至少三个指标上偏离了人类模式，而GPT4o-mini总体上最接近人类。我们发现，(i) 任务成功率分数并不能表明成功的定位，(ii) 高图像-语句对齐并不一定预示任务成功。我们的指标体系和研究结果为未来VLM定位研究提供了一个框架。

Causality-guided Prompt Learning for Vision-language Models via Visual Granulation

Authors: Mengyu Gao, Qiulei Dong

Venue: ICCV 2025

First: 2025-09-04T01:40:41+00:00 · Latest: 2025-09-04T01:40:41+00:00

Comments: ICCV 2025 Accepted

Abstract

Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most existing CLIP-based prompt learning methods show only a limited ability to handle fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, in which the visual granulation technique constructs sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through causal inference. The CaPL method contains the following two modules: (1) An attribute disentanglement module is proposed to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model; (2) A granule learning module is proposed to construct visual granules by integrating the aforementioned attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, a more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets.

中文标题/摘要

标题：基于视觉粒化引导因果性的提示学习方法在视觉-语言模型中的应用

提示学习最近引起了对适应预训练视觉-语言模型（例如CLIP）到下游识别任务的极大关注。然而，现有的大多数基于CLIP的提示学习方法在处理细粒度数据集时能力有限。为了解决这一问题，我们提出了一种基于视觉粒化的因果性引导文本提示学习方法，称为CaPL，其中探索的视觉粒化技术可以为文本提示构建视觉粒子集，通过因果推理捕捉不同细粒度类之间的细微差异。CaPL方法包含以下两个模块：（1）提出了一种属性解耦模块，使用布朗桥扩散模型将视觉特征分解为非个体化属性（由某些类共享）和个体化属性（仅对单一类特定）；（2）提出了一种粒子学习模块，通过结合上述属性在两种因果推理策略下构建视觉粒子进行识别。由于学习到的视觉粒子，期望能够学习到更具区分性的文本提示。在15个数据集上的广泛实验结果表明，我们的CaPL方法显著优于最先进的提示学习方法，尤其是在细粒度数据集上。

Summary / 总结

The research aims to enhance the capability of pre-trained vision-language models like CLIP in handling fine-grained datasets through prompt learning. The proposed method, CaPL, introduces a causality-guided text prompt learning approach using visual granulation. It includes an attribute disentanglement module to decompose visual features and a granule learning module to construct visual granules for better causal inference. Experimental results on 15 datasets show that CaPL outperforms existing methods, particularly on fine-grained datasets.

论文提出了一种名为CaPL的因果引导文本提示学习方法，用于视觉-语言模型，特别解决了细粒度数据集处理的挑战。CaPL通过视觉粒化将视觉特征分解为共享和个体属性，并构建用于更好因果推理的视觉粒度。实验结果表明，CaPL在15个数据集上优于现有方法，特别是在细粒度数据集上表现更佳。

MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting

Authors: Yuheng Li, Yenho Chen, Yuxiang Lai, Jike Zhong, Vanessa Wildman, Xiaofeng Yang

First: 2025-09-04T01:28:44+00:00 · Latest: 2025-09-04T01:28:44+00:00

Abstract

Radiologic diagnostic errors (under-reading errors, inattentional blindness, and communication failures) remain prevalent in clinical practice. These issues often stem from missed localized abnormalities, limited global context, and variability in report language. These challenges are amplified in 3D imaging, where clinicians must examine hundreds of slices per scan. Addressing them requires systems with precise localized detection, global volume-level reasoning, and semantically consistent natural language reporting. However, existing 3D vision-language models are unable to meet all three needs jointly, lacking local-global understanding for spatial reasoning and struggling with the variability and noise of uncurated radiology reports. We present MedVista3D, a multi-scale semantic-enriched vision-language pretraining framework for 3D CT analysis. To enable joint disease detection and holistic interpretation, MedVista3D performs local and global image-text alignment for fine-grained representation learning within full-volume context. To address report variability, we apply language model rewrites and introduce a Radiology Semantic Matching Bank for semantics-aware alignment. MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, while transferring well to organ segmentation and prognosis prediction. Code and datasets will be released.

中文标题/摘要

标题：MedVista3D：用于减少3D CT疾病检测、理解和报告中的诊断错误的视觉-语言建模

放射学诊断错误，如漏诊、注意盲区和沟通失败，在临床实践中仍然普遍存在。这些问题通常源于局部异常的遗漏、全局上下文的不足以及报告语言的差异性。这些问题在3D成像中被放大，因为临床医生必须检查每次扫描中的数百个切片。解决这些问题需要具备精确局部检测、全局体积级推理和语义一致自然语言报告的系统。然而，现有的3D视觉-语言模型无法同时满足这三个需求，缺乏空间推理所需的局部-全局理解，并且难以处理未经整理的放射学报告的差异性和噪声。我们提出了MedVista3D，一种用于3D CT分析的多尺度语义增强视觉-语言预训练框架。为了实现联合疾病检测和整体解释，MedVista3D 在全体积上下文中进行局部和全局图像-文本对齐，以实现细粒度的表示学习。为了应对报告的差异性，我们应用了语言模型重写，并引入了放射学语义匹配库，以实现语义感知的对齐。MedVista3D 在零样本疾病分类、报告检索和医学视觉问答方面达到了最先进的性能，同时在器官分割和预后预测方面表现出良好的迁移能力。代码和数据集将被发布。

Summary / 总结

MedVista3D is designed to reduce diagnostic errors in 3D CT disease detection, understanding, and reporting by addressing issues such as under-reading and communication failures. It uses a multi-scale semantic-enriched vision-language pretraining framework to perform local and global image-text alignment, enabling fine-grained representation learning and semantics-aware alignment. The model achieves state-of-the-art performance in zero-shot disease classification, report retrieval, and medical visual question answering, and also transfers well to organ segmentation and prognosis prediction.

研究旨在通过开发MedVista3D，一种结合局部和全局图像-文本对齐的视觉-语言模型，来解决3D CT成像中的诊断错误问题。MedVista3D 使用多尺度语义增强的预训练框架对齐图像和文本，并引入放射学语义匹配库以处理报告的变异性。该模型在零样本疾病分类、报告检索和医学视觉问答任务中表现出色，且在器官分割和预后预测等其他任务上也表现出良好的迁移性。

STA-Net: A Decoupled Shape and Texture Attention Network for Lightweight Plant Disease Classification

Authors: Zongsen Qiu

First: 2025-09-03T22:46:20+00:00 · Latest: 2025-09-03T22:46:20+00:00

Abstract

Responding to rising global food security needs, precision agriculture and deep learning-based plant disease diagnosis have become crucial. Yet, deploying high-precision models on edge devices is challenging. Most lightweight networks use attention mechanisms designed for generic object recognition, which poorly capture subtle pathological features like irregular lesion shapes and complex textures. To overcome this, we propose a twofold solution: first, using a training-free neural architecture search method (DeepMAD) to create an efficient network backbone for edge devices; second, introducing the Shape-Texture Attention Module (STAM). STAM splits attention into two branches: one using deformable convolutions (DCNv4) for shape awareness and the other using a Gabor filter bank for texture awareness. On the public CCMT plant disease dataset, our STA-Net model (with 401K parameters and 51.1M FLOPs) reached 89.00% accuracy and an F1 score of 88.96%. Ablation studies confirm STAM significantly improves performance over baseline and standard attention models. Integrating domain knowledge via decoupled attention thus presents a promising path for edge-deployed precision agriculture AI. The source code is available at https://github.com/RzMY/STA-Net.
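
To make the texture branch more concrete, the sketch below builds a small Gabor filter bank from the standard real-valued Gabor formula (a Gaussian envelope multiplied by a cosine carrier). The kernel size, orientations, and parameter values are illustrative assumptions; how STAM configures and applies its filters is not stated in the abstract.

```python
import math

def gabor_kernel(size, sigma, theta, lambd, gamma=0.5, psi=0.0):
    """Real part of a 2-D Gabor filter sampled on a size x size grid (size odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the sampling coordinates by theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope (gamma controls the aspect ratio).
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            # Sinusoidal carrier with wavelength lambd and phase offset psi.
            carrier = math.cos(2 * math.pi * xr / lambd + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

# A small bank: four orientations at one scale (values chosen for illustration).
bank = [gabor_kernel(7, sigma=2.0, theta=k * math.pi / 4, lambd=4.0) for k in range(4)]
print(len(bank), len(bank[0]), len(bank[0][0]))
```

Each kernel responds most strongly to stripes of a particular orientation and spacing, which is why such banks are a common hand-crafted descriptor for texture.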

中文标题/摘要

标题：STA-Net：一种用于轻量级植物病害分类的解耦形状和纹理注意力网络

为应对全球粮食安全需求的上升，精准农业和基于深度学习的植物病害诊断变得至关重要。然而，在边缘设备上部署高精度模型具有挑战性。大多数轻量级网络使用设计用于通用对象识别的注意力机制，这不能很好地捕捉到如不规则病斑形状和复杂纹理等细微的病理特征。为克服这一问题，我们提出了一种两步解决方案：首先，使用无训练神经架构搜索方法（DeepMAD）为边缘设备创建一个高效的网络骨干；其次，引入形状-纹理注意力模块（STAM）。STAM将注意力机制分为两个分支——一个使用可变形卷积（DCNv4）进行形状感知，另一个使用Gabor滤波器组进行纹理感知。在公共CCMT植物病害数据集上，我们的STA-Net模型（参数量40.1万，FLOPs 51.1百万）达到了89.00%的准确率和88.96%的F1分数。消融研究证实，STAM在性能上显著优于基线和标准注意力模型。通过解耦注意力机制整合领域知识，为边缘部署的精准农业AI提供了一条有前景的道路。源代码可在https://github.com/RzMY/STA-Net 获取。

Singular Value Few-shot Adaptation of Vision-Language Models

Authors: Taha Koleilat, Hassan Rivaz, Yiming Xiao

First: 2025-09-03T22:00:23+00:00 · Latest: 2025-09-03T22:00:23+00:00

Comments: 10 pages, 2 figures, 8 tables

Abstract

Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
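
The idea of adapting only the singular values can be illustrated on a toy matrix. In the sketch below a 2x2 weight is written directly in factored form W = U diag(s) V^T, with rotation matrices standing in for the frozen bases; changing s alone rescales the basis directions while U and V stay fixed. All values and helper names are illustrative assumptions, and a real implementation would obtain the factors from an SVD routine and update s by gradient descent.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(r) for r in zip(*m)]

def rotation(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]

def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

# Toy "pretrained" weight in factored form: W = U diag(s) V^T.
U, V = rotation(0.3), rotation(1.1)   # frozen orthonormal bases
s = [2.0, 0.5]                        # singular values: the only trainable part
W = matmul(matmul(U, diag(s)), transpose(V))

# "Adaptation" changes s only; U and V are reused unchanged.
s_adapted = [2.2, 0.4]
W_adapted = matmul(matmul(U, diag(s_adapted)), transpose(V))

# Projecting back onto the frozen bases recovers the new singular values.
recovered = matmul(matmul(transpose(U), W_adapted), V)
print([round(recovered[i][i], 6) for i in range(2)])
```

For an m x n matrix this leaves min(m, n) trainable numbers instead of m * n, which is where the parameter efficiency of the approach comes from.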

中文标题/摘要

标题：视觉语言模型的奇异值少样本适应

视觉语言模型（VLMs）如CLIP在多种应用中展示了令人印象深刻的零样本和少量样本学习能力。然而，由于依赖于提示工程和全模型微调的高昂成本，将这些模型适应到新的细粒度领域仍然具有挑战性。现有的适应方法依赖于增强组件，如提示标记和适配器模块，这可能会限制适应质量，使模型不稳定，并损害其在预训练期间学到的丰富知识。在本文中，我们提出了CLIP-SVD，这是一种新颖的多模态和参数高效的适应技术，利用奇异值分解（SVD）修改CLIP的内部参数空间，而不注入额外模块。具体来说，我们仅微调CLIP参数矩阵的奇异值以重新缩放基向量进行领域适应，同时保留预训练模型。此设计仅使用模型总参数的0.04%便能实现增强的适应性能，并更好地保留其泛化能力。CLIP-SVD在11个自然和10个生物医学数据集上实现了最先进的分类结果，在少量样本设置中在准确性和泛化方面均优于先前的方法。此外，我们利用基于自然语言的方法分析CLIP适应的有效性和动态，以实现CLIP-SVD的可解释性。代码可在https://github.com/HealthX-Lab/CLIP-SVD上公开获取。

Summary / 总结

The research aims to improve the adaptation of vision-language models like CLIP to new fine-grained domains by reducing the reliance on prompt engineering and full model fine-tuning. CLIP-SVD uses Singular Value Decomposition to modify only 0.04% of the model's parameters, achieving state-of-the-art results on 11 natural and 10 biomedical datasets with better generalization under few-shot settings.

该研究解决了将如CLIP的视觉-语言模型适应到新领域时，仅通过少量微调参数（0.04%）实现最佳性能的问题。CLIP-SVD 方法使用奇异值分解来修改模型参数，实现了在11个自然和10个生物医学数据集上的最佳分类结果，同时保持了泛化能力。

Short-Form Video Recommendations with Multimodal Embeddings: Addressing Cold-Start and Bias Challenges

Authors: Andrii Dzhoha, Katya Mirylenka, Egor Malykh, Marco-Andrea Buchmann, Francesca Catino

First: 2025-07-25T14:57:04+00:00 · Latest: 2025-09-03T20:09:30+00:00

Abstract

In recent years, social media users have spent significant amounts of time on short-form video platforms. As a result, established platforms in other domains, such as e-commerce, have begun introducing short-form video content to engage users and increase their time spent on the platform. The success of these experiences is due not only to the content itself but also to a unique UI innovation: instead of offering users a list of choices to click, platforms actively recommend content for users to watch one at a time. This creates new challenges for recommender systems, especially when launching a new video experience. Beyond the limited interaction data, immersive feed experiences introduce stronger position bias due to the UI and duration bias when optimizing for watch-time, as models tend to favor shorter videos. These issues, together with the feedback loop inherent in recommender systems, make it difficult to build effective solutions. In this paper, we highlight the challenges faced when introducing a new short-form video experience and present our experience showing that, even with sufficient video interaction data, it can be more beneficial to leverage a video retrieval system using a fine-tuned multimodal vision-language model to overcome these challenges. This approach demonstrated greater effectiveness compared to conventional supervised learning methods in online experiments conducted on our e-commerce platform.

中文标题/摘要

标题：基于多模态嵌入的短视频推荐：应对冷启动和偏差挑战

近年来，社交媒体用户在短视频平台上花费了大量时间。因此，其他领域的已建立平台，如电子商务，也开始引入短视频内容以吸引用户并增加他们在平台上的停留时间。这些体验的成功不仅归功于内容本身，还归功于一种独特的UI创新：平台不再为用户提供可供点击的选择列表，而是主动为用户推荐他们可以逐个观看的内容。这为推荐系统带来了新的挑战，尤其是在推出新的视频体验时。除了有限的交互数据外，沉浸式流体验还由于UI和优化观看时间时的时长偏差而引入了更强的位置偏差，模型倾向于偏好较短的视频。这些问题，加上推荐系统固有的反馈循环，使得构建有效的解决方案变得困难。在本文中，我们强调了引入新的短视频体验所面临的挑战，并展示了即使有足够的视频交互数据，利用微调的多模态视觉-语言模型的视频检索系统也可以更有效地克服这些挑战。这种方法在我们电子商务平台进行的在线实验中比传统的监督学习方法更有效。

Summary / 总结

This paper addresses the challenges of recommending short-form videos when launching a new video experience, including limited interaction data, position bias, and duration bias. It presents a video retrieval system built on a fine-tuned multimodal vision-language model. In online experiments on an e-commerce platform, this approach was more effective than conventional supervised learning methods, even when sufficient video interaction data was available.

本文探讨了推荐短格式视频内容的挑战，特别是在电子商务平台上的背景下。它指出了冷启动和用户交互数据中的偏差问题，以及优化观看时间时的位置偏差和持续时间偏差。作者提出使用微调的多模态视觉-语言模型进行视频检索，发现在其平台上进行的在线实验中，这种方法比传统的监督学习方法更有效。

E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition

Authors: Aryan Gupta, Anupam Purwar

First: 2025-09-03T18:08:41+00:00 · Latest: 2025-09-03T18:08:41+00:00

Comments: Sprinklr OCR provides a fast and compute light way of performing OCR

Abstract

Optical Character Recognition (OCR) in multilingual, noisy, and diverse real-world images remains a significant challenge. With the rise of Large Vision-Language Models (LVLMs), there is growing interest in their ability to generalize and reason beyond fixed OCR pipelines. In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system specifically optimized for edge deployment in resource-constrained environments. We present a large-scale comparative evaluation of five state-of-the-art LVLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and two traditional OCR systems (Sprinklr-Edge-OCR, SuryaOCR) on a proprietary, doubly hand-annotated dataset of multilingual (54 languages) images. Our benchmark covers a broad range of metrics including accuracy, semantic consistency, language coverage, computational efficiency (latency, memory, GPU usage), and deployment cost. To better reflect real-world applicability, we also conducted an edge case deployment analysis, evaluating model performance in CPU-only environments. Among the results, Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR delivered the best overall F1 score (0.46) and outperformed the others in efficiency, processing images 35 times faster (0.17 seconds per image on average) and at less than 0.01 times the cost (0.006 USD per 1,000 images) compared to LVLMs. Our findings demonstrate that traditional OCR systems remain the best choice for edge deployment even in the era of LLMs, owing to their low compute requirements, low latency, and very high affordability.
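
The efficiency figures above can be unpacked with a little arithmetic. The sketch below assumes the reported speed advantage means 35 times faster and treats the 0.01 cost ratio as an upper bound, so the LVLM numbers it derives are implied estimates rather than values reported in the abstract. It also includes the standard F1 formula for reference.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

edge_latency_s = 0.17        # reported: seconds per image for Sprinklr-Edge-OCR
edge_cost_per_1k = 0.006     # reported: USD per 1,000 images
speedup = 35                 # assumed reading of the reported speed advantage
cost_ratio = 0.01            # reported upper bound on the cost ratio

# Implied averages for the LVLMs under the assumptions above.
implied_lvlm_latency_s = edge_latency_s * speedup          # roughly 6 seconds per image
implied_lvlm_cost_per_1k = edge_cost_per_1k / cost_ratio   # at least roughly 0.60 USD

print(implied_lvlm_latency_s, implied_lvlm_cost_per_1k)
print(f1_score(0.5, 0.5))
```

Because F1 is a harmonic mean, a system with the highest precision can still lose on F1 if its recall is low, which is consistent with Qwen leading on precision while Sprinklr-Edge-OCR leads on F1.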

中文标题/摘要

标题：E-ARMOR：边缘案例评估与多语言光学字符识别审查

多语言、嘈杂和多样化的现实世界图像中的光学字符识别（OCR）仍然是一个重大挑战。随着大型视觉-语言模型（LVLM）的兴起，人们越来越关注它们在固定OCR流水线之外的泛化和推理能力。在本文中，我们介绍了Sprinklr-Edge-OCR，这是一种专门针对资源受限环境边缘部署优化的新型OCR系统。我们对五种最先进的LVLM（InternVL、Qwen、GOT OCR、LLaMA、MiniCPM）和两种传统OCR系统（Sprinklr-Edge-OCR、SuryaOCR）在我们专有的、双人工标注的多语言（54种语言）图像数据集上进行了大规模比较评估。我们的基准测试涵盖了准确性、语义一致性、语言覆盖率、计算效率（延迟、内存、GPU使用情况）和部署成本等一系列指标。为了更好地反映实际应用情况，我们还进行了边缘案例部署分析，评估了模型在仅CPU环境下的性能。结果中，Qwen达到了最高的精确度（0.54），而Sprinklr-Edge-OCR在整体F1分数（0.46）上表现最佳，并且在效率方面优于其他系统：与LVLM相比，其图像处理速度快35倍（平均每张图像0.17秒），成本不到LVLM的0.01倍（每1000张图像0.006美元）。我们的研究结果表明，即使在LLM时代，传统的OCR系统仍然是边缘部署的最佳选择，因为它们具有低计算需求、低延迟和极低的成本。

Summary / 总结

This study addresses the challenge of OCR in multilingual and noisy images by evaluating five state-of-the-art Large Vision-Language Models (LVLMs) and two traditional OCR systems. The evaluation was conducted on a proprietary dataset of 54 languages, covering metrics such as accuracy, semantic consistency, computational efficiency, and deployment cost. Sprinklr-Edge-OCR, a traditional OCR system, achieved the best overall F1 score and was the most efficient, processing images 35 times faster and at a lower cost compared to LVLMs. The research highlights that traditional OCR systems remain optimal for edge deployment due to their low compute requirements and high affordability.

该研究通过评估五种最先进的大型视觉-语言模型和两种传统OCR系统，解决了多语言和噪声图像中的OCR挑战。使用包含54种语言的专有数据集进行了大规模比较评估，涵盖了准确性、语义一致性、计算效率和部署成本等指标。研究结果表明，Qwen在精确度方面表现最佳，而Sprinklr-Edge-OCR在整体F1分数上表现最佳，处理速度比LVLM快35倍，并且成本不到LVLM的百分之一。研究结论认为，传统OCR系统更适合边缘部署，因为它们具有低计算需求、低延迟和低成本的特点。
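
The E-ARMOR entry above reports precision and F1 but does not say how the scores are computed. The following is a minimal sketch of one common convention, word-level overlap between OCR output and a reference transcription. The function name `word_prf`, the whitespace tokenization, and the sample strings are illustrative assumptions, not the authors' evaluation protocol.

```python
from collections import Counter
from typing import Tuple


def word_prf(predicted: str, reference: str) -> Tuple[float, float, float]:
    """Word-level precision, recall and F1 (hypothetical helper, not from the paper)."""
    # Multiset of whitespace-separated words, so repeated words count separately.
    pred = Counter(predicted.split())
    ref = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: the OCR output misses one word of the reference.
precision, recall, f1 = word_prf("total due 42 USD", "total amount due 42 USD")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A character-level or edit-distance-based metric would be an equally plausible choice; the sketch only makes the precision/F1 distinction in the entry concrete.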

LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence

Authors: Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, Ningbo Dai, Renzhe Xu, Shuyang Li, Tianyang Zhang, Yue He, Yuanrui Wang, Yunjia Zhang, Zijing Xu, Dongzhe Li, Fang Gao, Hao Zou, Jiandong Liu, Jiashuo Liu, Jiawei Xu, Kaijie Cheng, Kehan Li, Linjun Zhou, Qing Li, Shaohua Fan, Xiaoyu Lin, Xinyan Han, Xuanyue Li, Yan Lu, Yuan Xue, Yuanyuan Jiang, Zimu Wang, Zhenlei Wang, Peng Cui

First: 2025-09-03T17:39:08+00:00 · Latest: 2025-09-03T17:39:08+00:00

Comments: 56 pages

Abstract

We argue that progress toward general intelligence requires complementary foundation models grounded in language, the physical world, and structured data. This report presents LimiX, the first installment of our large structured-data models (LDMs). LimiX treats structured data as a joint distribution over variables and missingness, and is thus capable of addressing a wide range of tabular tasks through query-based conditional prediction via a single model. LimiX is pretrained using masked joint-distribution modeling with an episodic, context-conditional objective, where the model predicts for query subsets conditioned on dataset-specific contexts, supporting rapid, training-free adaptation at inference. We evaluate LimiX across 10 large structured-data benchmarks spanning broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratio. With a single model and a unified interface, LimiX consistently surpasses strong baselines including gradient-boosting trees, deep tabular networks, recent tabular foundation models, and automated ensembles, as shown in Figure 1 and Figure 2. The superiority holds across a wide range of tasks, such as classification, regression, missing value imputation, and data generation, often by substantial margins, while avoiding task-specific architectures or bespoke training per task. All LimiX models are publicly accessible under Apache 2.0.

中文标题/摘要

标题：LimiX：释放结构化数据建模能力以促进通用智能

我们认为通向通用智能的进展需要分别以语言、物理世界和结构化数据为基础的互补基础模型。本报告介绍了LimiX，这是我们大型结构化数据模型（LDMs）系列的首个成果。LimiX 将结构化数据视为变量和缺失情况的联合分布，因此能够通过单个模型以基于查询的条件预测来解决广泛的表格任务。LimiX 使用掩码联合分布建模进行预训练，采用情景式（episodic）、以上下文为条件的目标，其中模型根据数据集特定的上下文对查询子集进行预测，从而支持推理时快速、无需训练的适应。我们在10个大型结构化数据基准测试中评估了LimiX，这些基准测试涵盖了样本大小、特征维度、类别数量、分类到数值特征的比例、缺失值以及样本到特征比率的广泛范围。使用单一模型和统一接口，LimiX 一致地超越了包括梯度提升树、深度表格网络、近期的表格基础模型和自动化集成在内的强大基线，如图1和图2所示。这种优越性在分类、回归、缺失值填充和数据生成等多种任务中普遍存在，通常差距显著，同时避免了特定任务的架构或针对每个任务的定制训练。所有LimiX模型均在Apache 2.0许可下公开。

WildFireCan-MMD: A Multimodal Dataset for Classification of User-Generated Content During Wildfires in Canada

Authors: Braeden Sherritt, Isar Nejadgholi, Efstratios Aivaliotis, Khaled Mslmani, Marzieh Amini

First: 2025-04-17T14:43:56+00:00 · Latest: 2025-09-03T16:22:06+00:00

Abstract

Rapid information access is vital during wildfires, yet traditional data sources are slow and costly. Social media offers real-time updates, but extracting relevant insights remains a challenge. In this work, we focus on multimodal wildfire social media data, which, although present in existing datasets, is underrepresented in Canadian contexts. We present WildFireCan-MMD, a new multimodal dataset of X posts from recent Canadian wildfires, annotated across twelve key themes. We evaluate zero-shot vision-language models on this dataset and compare their results with those of custom-trained and baseline classifiers. We show that while baseline methods and zero-shot prompting offer quick deployment, custom-trained models outperform them when labelled data is available. Our best-performing custom model reaches 84.48% f-score, outperforming VLMs and baseline classifiers. We also demonstrate how this model can be used to uncover trends during wildfires, through the collection and analysis of a large unlabeled dataset. Our dataset facilitates future research in wildfire response, and our findings highlight the importance of tailored datasets and task-specific training. Importantly, such datasets should be localized, as disaster response requirements vary across regions and contexts.

中文标题/摘要

标题：WildFireCan-MMD：加拿大野火期间用户生成内容分类的多模态数据集

在野火期间快速获取信息至关重要，但传统数据源速度慢且成本高。社交媒体可以提供实时更新，但提取相关见解仍具挑战性。在本研究中，我们关注多模态野火社交媒体数据，尽管这类数据已存在于现有数据集中，但加拿大语境下的数据仍然不足。我们介绍了WildFireCan-MMD，这是一个由加拿大近期野火期间X平台帖子构成的新多模态数据集，并在十二个关键主题上进行了标注。我们评估了零样本视觉-语言模型在该数据集上的表现，并将其结果与自定义训练和基线分类器进行了比较。我们表明，虽然基线方法和零样本提示可以快速部署，但在有标注数据时，自定义训练模型的表现更优。我们最好的自定义模型达到了84.48%的F分数，优于视觉语言模型和基线分类器。我们还展示了如何使用该模型来揭示野火期间的趋势，通过收集和分析大量未标注的数据集。我们的数据集促进了未来在野火响应方面的研究，我们的发现强调了定制数据集和任务特定训练的重要性。重要的是，这样的数据集应该本地化，因为灾害响应需求在不同地区和背景下有所不同。

Summary / 总结

This study addresses the need for rapid information access during wildfires by leveraging social media data. The researchers developed WildFireCan-MMD, a multimodal dataset of user-generated content from recent Canadian wildfires, annotated across twelve themes. They evaluated zero-shot vision-language models and custom-trained classifiers, finding that custom-trained models outperformed zero-shot models and baselines, achieving an f-score of 84.48%. The study also highlights the importance of localized datasets for tailored disaster response.

本研究旨在通过开发WildFireCan-MMD数据集，解决野火期间快速获取信息的需求，该数据集包含来自加拿大近期野火的多模态用户生成内容，并按十二个关键主题进行标注。研究在该数据集上评估了零样本视觉-语言模型、自定义训练的分类器和基线分类器，其中自定义训练的模型表现最佳，达到84.48%的F分数。研究还强调了灾害响应中本地化数据集的重要性。

TinyDrop: Tiny Model Guided Token Dropping for Vision Transformers

Authors: Guoxin Wang, Qingyuan Wang, Binhua Huang, Shaowu Chen, Deepu John

First: 2025-09-03T14:55:49+00:00 · Latest: 2025-09-03T14:55:49+00:00

Abstract

Vision Transformers (ViTs) achieve strong performance in image classification but incur high computational costs from processing all image tokens. To reduce inference costs in large ViTs without compromising accuracy, we propose TinyDrop, a training-free token dropping framework guided by a lightweight vision model. The guidance model estimates the importance of tokens while performing inference, so that low-importance tokens are selectively discarded whenever the large ViT model needs to perform attention calculations. The framework operates plug-and-play, requires no architectural modifications, and is compatible with diverse ViT architectures. Evaluations on standard image classification benchmarks demonstrate that our framework reduces FLOPs by up to 80% for ViTs with minimal accuracy degradation, highlighting its generalization capability and practical utility for efficient ViT-based classification.

Summary / 总结

TinyDrop is a training-free token dropping framework for Vision Transformers (ViTs) that uses a lightweight guidance model to estimate token importance during inference. This allows for selective discarding of low-importance tokens, reducing FLOPs by up to 80% while maintaining minimal accuracy degradation. The framework is compatible with various ViT architectures and can be applied without architectural modifications, making it a practical solution for efficient ViT-based image classification.

TinyDrop 是一种无需训练的 token 舍弃框架，用于降低 Vision Transformers (ViTs) 的计算成本同时不牺牲准确性。它使用一个轻量级的指导模型在推理过程中估计 token 的重要性，从而可以选择性地舍弃低重要性的 token。实验表明，TinyDrop 可以将 FLOPs 减少高达 80%，而准确性下降极小，展示了其在基于 ViT 的高效分类中的泛化能力和实用价值。
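
The token-dropping idea in the TinyDrop entry above can be illustrated with a short sketch: given one importance score per token, keep only the highest-scoring fraction and pass those on to the large model. How TinyDrop derives the scores and picks the number of tokens to keep is not stated in the abstract, so the `keep_ratio` parameter, the function names, and the sample scores below are all assumptions made for illustration.

```python
from typing import List, Sequence


def keep_top_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Indices of the highest-scoring tokens, returned in original order.

    Hypothetical helper: `scores` stands in for the importance estimates that a
    small guidance model would produce; `keep_ratio` is an assumed knob.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank token indices by descending importance.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay in sequence.
    return sorted(ranked[:n_keep])


def drop_tokens(tokens: Sequence[str], scores: Sequence[float], keep_ratio: float) -> List[str]:
    """Return only the tokens selected by keep_top_tokens."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    return [tokens[i] for i in keep_top_tokens(scores, keep_ratio)]


# Made-up importance scores for six image patches.
patches = ["p0", "p1", "p2", "p3", "p4", "p5"]
patch_scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]
kept = drop_tokens(patches, patch_scores, keep_ratio=0.5)
print(kept)
```

Because self-attention cost grows roughly quadratically with the number of tokens, halving the token count as in this toy example cuts the attention work by considerably more than half; that is the general motivation for token dropping, not a figure from the paper.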

ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP

Authors: Zhiyuan Wang, Bokui Chen

First: 2025-06-24T13:22:06+00:00 · Latest: 2025-09-03T12:23:15+00:00

Comments: Accepted by the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2025)

Abstract

Continual learning (CL) empowers pre-trained vision-language models to adapt effectively to novel or previously underrepresented data distributions without comprehensive retraining, enhancing their adaptability and efficiency. While vision-language models like CLIP show great promise, they struggle to maintain performance across domains in incremental learning scenarios. Existing prompt learning methods face two main limitations: 1) they primarily focus on class-incremental learning scenarios, lacking specific strategies for multi-domain task incremental learning; 2) most current approaches employ single-modal prompts, neglecting the potential benefits of cross-modal information exchange. To address these challenges, we propose the ChordPrompt framework, which facilitates a harmonious interplay between visual and textual prompts. ChordPrompt introduces cross-modal prompts to leverage interactions between visual and textual information. Our approach also employs domain-adaptive text prompts to select appropriate prompts for continual adaptation across multiple domains. Comprehensive experiments on multi-domain incremental learning benchmarks demonstrate that ChordPrompt outperforms state-of-the-art methods in zero-shot generalization and downstream task performance.

中文标题/摘要

标题：ChordPrompt：面向CLIP多域增量学习的跨模态提示协同编排

持续学习（CL）使预训练的视觉-语言模型能够有效地适应新的或以前未充分代表的数据分布，而无需进行全面的重新训练，从而增强其适应性和效率。尽管视觉-语言模型如CLIP表现出巨大的潜力，但在增量学习场景中，它们难以在不同领域中保持性能。现有的提示学习方法面临两个主要限制：1）它们主要集中在类别增量学习场景上，缺乏针对多域任务增量学习的具体策略；2）大多数当前方法使用单模态提示，忽视了跨模态信息交换的潜在好处。为了解决这些挑战，我们提出了ChordPrompt框架，该框架促进了视觉和文本提示之间的和谐互动。ChordPrompt引入了跨模态提示，以利用视觉和文本信息之间的交互。我们的方法还使用了领域自适应文本提示，以选择适合持续适应多个领域的提示。在多域增量学习基准上的全面实验表明，ChordPrompt在零样本泛化和下游任务性能方面优于现有方法。

Summary / 总结

ChordPrompt addresses the limitations of existing prompt learning methods in multi-domain incremental learning by introducing cross-modal prompts and domain-adaptive text prompts. The framework demonstrates superior performance in zero-shot generalization and downstream task performance compared to state-of-the-art methods on multi-domain incremental learning benchmarks.

ChordPrompt通过引入跨模态提示和领域自适应文本提示解决了现有提示学习方法在多域增量学习中的局限性。该框架在多域增量学习基准测试中展示了在零样本泛化和下游任务性能方面优于最新方法的性能。

Mitigating Hallucination in Large Vision-Language Models through Aligning Attention Distribution to Information Flow

Authors: Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng

Venue: EMNLP 2025

First: 2025-05-20T12:10:13+00:00 · Latest: 2025-09-03T11:34:49+00:00

Comments: Accepted to Findings of EMNLP 2025

Abstract

Due to the unidirectional masking mechanism, Decoder-Only models propagate information from left to right. LVLMs (Large Vision-Language Models) follow the same architecture, with visual information gradually integrated into semantic representations during forward propagation. Through systematic analysis, we observe that the majority of the visual information is absorbed into the semantic representations. However, the model's attention distribution does not exhibit sufficient emphasis on semantic representations. This misalignment between the attention distribution and the actual information flow undermines the model's visual understanding ability and contributes to hallucinations. To address this issue, we enhance the model's visual understanding by leveraging the core information embedded in semantic representations. Specifically, we identify attention heads that focus on core semantic representations based on their attention distributions. Then, through a two-stage optimization paradigm, we propagate the advantages of these attention heads across the entire model, aligning the attention distribution with the actual information flow. We evaluate our method on three image captioning benchmarks using five different LVLMs, demonstrating its effectiveness in significantly reducing hallucinations. Further experiments reveal a trade-off between reduced hallucinations and richer details. Notably, our method allows for manual adjustment of the model's conservativeness, enabling flexible control to meet diverse real-world requirements.

中文标题/摘要

标题：通过将注意力分布与信息流对齐来缓解大型视觉-语言模型中的幻觉

由于单向掩码机制，仅解码器（Decoder-Only）模型从左到右传播信息。大型视觉-语言模型（LVLMs）遵循相同的架构，在前向传播过程中，视觉信息逐渐整合到语义表示中。通过系统分析，我们观察到大部分视觉信息被吸收到了语义表示中。然而，模型的注意力分布并未充分强调语义表示。这种注意力分布与实际信息流之间的不匹配削弱了模型的视觉理解能力，并导致幻觉的产生。为了解决这一问题，我们通过利用嵌入在语义表示中的核心信息来增强模型的视觉理解能力。具体来说，我们根据注意力分布识别出专注于核心语义表示的注意力头。然后，通过两阶段优化范式，我们将这些注意力头的优势在整个模型中传播，使注意力分布与实际信息流对齐。我们在三个图像字幕基准上使用五种不同的LVLMs评估了我们的方法，证明了其在显著减少幻觉方面的有效性。进一步的实验揭示了减少幻觉与更丰富的细节之间的权衡。值得注意的是，我们的方法允许手动调整模型的保守性，从而灵活地控制以满足多样化的现实需求。

Summary / 总结

The research aims to mitigate hallucinations in large vision-language models by aligning the attention distribution with the actual information flow. The method involves identifying attention heads that focus on core semantic representations and then propagating their advantages through a two-stage optimization process. Experiments on three image captioning benchmarks with five different LVLMs show significant reduction in hallucinations, though there is a trade-off between reduced hallucinations and richer details. Manual adjustment of the model's conservativeness is also possible.

本文通过将模型的注意力分布与实际信息流对齐来解决大型视觉语言模型（LVLM）中的幻觉问题。作者观察到，尽管视觉信息被整合到语义表示中，但模型的注意力并未充分强调这些表示，导致幻觉。他们提出了一种两阶段优化方法，通过利用核心语义表示来增强模型的视觉理解。在三个图像字幕基准上的实验表明，该方法显著减少了幻觉，尽管可能会牺牲一些细节。此外，该方法还允许手动调整模型的保守性，提供灵活控制以满足各种实际需求。