arXiv Paper Digest

TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection

Authors: Zehong Yan, Peng Qi, Wynne Hsu, Mong Li Lee

Venue: EMNLP 2025

First: 2025-09-04T17:59:43+00:00 · Latest: 2025-09-04T17:59:43+00:00

Comments: EMNLP 2025; Project Homepage: https://yanzehong.github.io/trust-vl/

Abstract

Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model's ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.

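Summary

TRUST-VL is a unified, explainable vision-language model for multimodal misinformation detection, trained jointly across textual, visual, and cross-modal distortion types. It adds a Question-Aware Visual Amplifier module to extract task-specific visual features and is trained on TRUST-Instruct, a 198K-sample instruction dataset with structured reasoning chains aligned with human fact-checking workflows. The authors report state-of-the-art performance on both in-domain and zero-shot benchmarks.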

AnomalyLMM: Bridging Generative Knowledge and Discriminative Retrieval for Text-Based Person Anomaly Search

Authors: Hao Ju, Hu Zhang, Zhedong Zheng

First: 2025-09-04T16:34:46+00:00 · Latest: 2025-09-04T16:34:46+00:00

Abstract

With growing public safety demands, text-based person anomaly search has emerged as a critical task, aiming to retrieve individuals with abnormal behaviors via natural language descriptions. Unlike conventional person search, this task presents two unique challenges: (1) fine-grained cross-modal alignment between textual anomalies and visual behaviors, and (2) anomaly recognition under sparse real-world samples. While Large Multi-modal Models (LMMs) excel in multi-modal understanding, their potential for fine-grained anomaly retrieval remains underexplored, hindered by: (1) a domain gap between generative knowledge and discriminative retrieval, and (2) the absence of efficient adaptation strategies for deployment. In this work, we propose AnomalyLMM, the first framework that harnesses LMMs for text-based person anomaly search. Our key contributions are: (1) A novel coarse-to-fine pipeline integrating LMMs to bridge generative world knowledge with retrieval-centric anomaly detection; (2) A training-free adaptation cookbook featuring masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking, enabling zero-shot focus on subtle anomaly cues. As the first study to explore LMMs for this task, we conduct a rigorous evaluation on the PAB dataset, the only publicly available benchmark for text-based person anomaly search, with its curated real-world anomalies covering diverse scenarios (e.g., falling, collision, and being hit). Experiments show the effectiveness of the proposed method, surpassing the competitive baseline by +0.96% Recall@1 accuracy. Notably, our method reveals interpretable alignment between textual anomalies and visual behaviors, validated via qualitative analysis. Our code and models will be released for future research.
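Recall@1, the metric quoted above, is the fraction of text queries whose top-ranked gallery image is the correct one. A minimal sketch, assuming a precomputed query-by-gallery similarity matrix and one ground-truth gallery index per query; the function name and toy scores are illustrative and not taken from the paper:

```python
import numpy as np

def recall_at_k(similarity, gt_index, k=1):
    """Fraction of queries whose ground-truth item is among the top-k scores."""
    similarity = np.asarray(similarity, dtype=float)
    gt_index = np.asarray(gt_index)
    # Indices of the k highest-scoring gallery items for each query.
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = (topk == gt_index[:, None]).any(axis=1)
    return float(hits.mean())

# Toy data (hypothetical): 3 text queries scored against 4 gallery images.
sim = [[0.9, 0.1, 0.3, 0.2],
       [0.2, 0.4, 0.8, 0.1],
       [0.5, 0.6, 0.1, 0.7]]
gt = [0, 1, 3]
r1 = recall_at_k(sim, gt, k=1)
r2 = recall_at_k(sim, gt, k=2)
```

In this toy case the second query ranks a wrong image first, so Recall@1 is lower than Recall@2.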

GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization

Authors: Pengyue Jia, Yingyi Zhang, Xiangyu Zhao, Yixuan Li

First: 2025-09-04T15:52:04+00:00 · Latest: 2025-09-04T15:52:04+00:00

Abstract

Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, compromising the accuracy of evaluating a model's actual geolocalization capability. Second, existing metrics primarily rely on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, the first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. Our platform has been deployed online for two months, during which we collected thousands of voting records. Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task.

Summary

GeoArena is an open platform designed to benchmark large vision-language models on global image geolocalization tasks. It addresses the issues of data leakage and reliance on exact geographic coordinates by allowing users to upload diverse in-the-wild images and using pairwise human judgments to evaluate model outputs. The platform has collected thousands of voting records over two months, leading to a detailed analysis and a leaderboard of different vision-language models.
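The abstract does not say how pairwise votes are turned into a leaderboard. One common choice for arena-style platforms is an Elo-style rating, sketched below purely as an assumption; the model names, votes, and the `k` factor are hypothetical:

```python
def update_elo(ratings, winner, loser, k=32.0):
    """Apply one Elo update for a single pairwise vote (winner beat loser)."""
    ra, rb = ratings[winner], ratings[loser]
    # Expected win probability of the winner before the vote.
    expected = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[winner] = ra + k * (1.0 - expected)
    ratings[loser] = rb - k * (1.0 - expected)
    return ratings

# Hypothetical models and votes; every model starts at 1000.
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Leaderboard: model names sorted by rating, best first.
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Each update moves the same number of points from loser to winner, so the total rating mass stays constant and only the ordering changes.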

OVGrasp: Open-Vocabulary Grasping Assistance via Multimodal Intent Detection

Authors: Chen Hu, Shan Luo, Letizia Gionfrida

First: 2025-09-04T15:42:36+00:00 · Latest: 2025-09-04T15:42:36+00:00

Abstract

Grasping assistance is essential for restoring autonomy in individuals with motor impairments, particularly in unstructured environments where object categories and user intentions are diverse and unpredictable. We present OVGrasp, a hierarchical control framework for soft exoskeleton-based grasp assistance that integrates RGB-D vision, open-vocabulary prompts, and voice commands to enable robust multimodal interaction. To enhance generalization in open environments, OVGrasp incorporates a vision-language foundation model with an open-vocabulary mechanism, allowing zero-shot detection of previously unseen objects without retraining. A multimodal decision-maker further fuses spatial and linguistic cues to infer user intent, such as grasp or release, in multi-object scenarios. We deploy the complete framework on a custom egocentric-view wearable exoskeleton and conduct systematic evaluations on 15 objects across three grasp types. Experimental results with ten participants demonstrate that OVGrasp achieves a grasping ability score (GAS) of 87.00%, outperforming state-of-the-art baselines and achieving improved kinematic alignment with natural hand motion.

Image Embedding Sampling Method for Diverse Captioning

Authors: Sania Waheed, Na Min An

First: 2025-02-14T12:33:19+00:00 · Latest: 2025-09-04T15:00:25+00:00

Comments: 17 pages, 5 figures, 9 tables

Abstract

Image captioning with state-of-the-art VLMs has significantly improved over time; however, this comes at the cost of increased computational complexity, making these models less accessible for resource-constrained applications such as mobile devices and assistive technologies. Comparatively smaller VLMs, by contrast, prioritize high-level scene descriptions and overlook finer details that contribute to a richer understanding of an image. In this paper, we introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions using a comparatively small VLM, BLIP, as the backbone. Our approach leverages structured segmentation to produce hierarchical representations that capture both global and localized semantics. Without requiring additional model training, we demonstrate that our method allows smaller VLMs to achieve performance comparable to larger models in terms of image-caption alignment, semantic integrity, and diversity. We evaluate our framework on the MSCOCO, Flickr30k, and Nocaps test datasets, achieving Div-2 scores of 0.735, 0.750, and 0.748, respectively, while maintaining strong image-caption relevancy and semantic integrity with the human-annotated captions.

Summary

This paper addresses the challenge of enhancing the diversity and informativeness of image captions using a small vision-language model (VLM) called BLIP. By leveraging structured segmentation, the method captures both global and localized semantics without additional training. The approach achieves performance comparable to larger models on MSCOCO, Flickr30k, and Nocaps datasets, with Div-2 scores of 0.735, 0.750, and 0.748 respectively, while maintaining strong relevance and semantic integrity with human-annotated captions.
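For readers unfamiliar with Div-2, the sketch below uses one common definition from the diverse-captioning literature: the number of distinct bigrams in a set of captions divided by the total number of words in that set. Whether the paper uses exactly this variant is an assumption, and the captions are made up for illustration:

```python
def div_n(captions, n=2):
    """Distinct n-grams across a caption set divided by total word count."""
    distinct = set()
    total_words = 0
    for caption in captions:
        words = caption.lower().split()
        total_words += len(words)
        # Collect every n-gram of this caption; the set removes repeats.
        distinct.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return len(distinct) / total_words if total_words else 0.0

captions = [
    "a dog runs on grass",
    "a dog plays with a ball",
    "green grass under a blue sky",
]
score = div_n(captions, n=2)
repeated = div_n(["a dog", "a dog"], n=2)
```

Captions that repeat each other share bigrams, so the score drops as a caption set becomes more redundant.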

Straighter Flow Matching via a Diffusion-Based Coupling Prior

Authors: Siyu Xing, Jie Cao, Huaibo Huang, Haichao Shi, Xiao-Yu Zhang

First: 2023-11-28T06:19:30+00:00 · Latest: 2025-09-04T14:24:04+00:00

Abstract

Flow matching, as a paradigm for generative modeling, has achieved notable success across various domains. However, existing methods rely on either multi-round training or knowledge within minibatches, which makes it difficult to find a favorable coupling strategy for straightening trajectories toward few-step generation. To address this issue, we propose a novel approach, Straighter trajectories of Flow Matching (StraightFM), which straightens trajectories with a coupling strategy derived at the level of the entire distribution. More specifically, during training, StraightFM creates couplings of images and noise via a diffusion model, used as a coupling prior, to straighten trajectories for few-step generation. Our coupling strategy can also be integrated with the existing coupling direction from real data to noise, improving image quality in few-step generation. Experimental results in pixel space and latent space show that StraightFM yields attractive samples within 5 steps. Moreover, our unconditional StraightFM is seamlessly compatible with training-free multimodal conditional generation, maintaining high-quality image generation in few steps.
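To see why straighter trajectories help few-step sampling, consider the standard linear-path formulation of flow matching, in which a noise sample x0 is paired with a data sample x1 and the regression target is the constant velocity x1 - x0. The sketch below is a toy illustration under that assumption: it uses an oracle velocity for a fixed random pairing rather than a trained network or the paper's diffusion-based coupling.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Constant velocity along a straight path; the regression target."""
    return x1 - x0

def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Toy coupling (hypothetical): each noise sample is paired with one data point.
x0 = rng.normal(size=(4, 2))   # noise
x1 = rng.normal(size=(4, 2))   # stand-in for data

def oracle_velocity(x, t):
    # Oracle for this fixed pairing; a real model would predict this.
    return target_velocity(x0, x1)

midpoint = interpolate(x0, x1, 0.5)
one_step = euler_sample(x0, oracle_velocity, steps=1)
five_step = euler_sample(x0, oracle_velocity, steps=5)
```

When the path is perfectly straight, a single Euler step already lands on the data point, so extra steps add nothing. Curved trajectories are what force samplers to take many small steps.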

Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding

Authors: Wanfu Wang, Qipeng Huang, Guangquan Xue, Xiaobo Liang, Juntao Li

First: 2025-09-04T14:17:01+00:00 · Latest: 2025-09-04T14:17:01+00:00

Abstract

Vision Language Models (VLMs) have recently achieved significant progress in bridging visual perception and linguistic reasoning. OpenAI's o3 model introduced a zoom-in search strategy that effectively elicits active perception capabilities in VLMs, improving downstream task performance. However, enabling VLMs to reason effectively over appropriate image regions remains a core challenge in GUI grounding, particularly under high-resolution inputs and complex multi-element visual interactions. In this work, we propose LASER, a self-evolving framework that progressively endows VLMs with multi-step perception capabilities, enabling precise coordinate prediction. Specifically, our approach integrates Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to jointly encourage both accuracy and diversity in constructing high-quality preference data. This combination explicitly guides the model to focus on instruction-relevant key regions while adaptively allocating reasoning steps based on task complexity. Comprehensive experiments on the ScreenSpot-Pro and ScreenSpot-v2 benchmarks demonstrate consistent performance gains, validating the effectiveness of our method. Furthermore, when fine-tuned on GTA1-7B, LASER achieves a score of 55.7 on the ScreenSpot-Pro benchmark, establishing a new state-of-the-art (SoTA) among 7B-scale models.

Summary / 总结

This paper addresses the challenge of enabling Vision Language Models (VLMs) to effectively reason over appropriate image regions in GUI grounding tasks. The authors propose LASER, a self-evolving framework that integrates Monte Carlo quality estimation with IoU-based region quality evaluation to improve multi-step perception capabilities. Experiments on ScreenSpot Pro and ScreenSpot-v2 benchmarks show consistent performance gains, and LASER achieves a score of 55.7 on the ScreenSpot-Pro benchmark, setting a new state-of-the-art among 7B-scale models.
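The IoU-based region quality mentioned above rests on the standard Intersection-over-Union between a candidate region and the target element. The sketch below shows only that computation. How the paper combines it with Monte Carlo estimates is not described in the abstract, and the boxes here are hypothetical:

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical target element and candidate zoom-in regions, in pixels.
target = (100, 100, 200, 200)
candidates = [(90, 90, 210, 210), (150, 150, 250, 250), (300, 300, 400, 400)]
scores = [box_iou(c, target) for c in candidates]
best = max(range(len(candidates)), key=scores.__getitem__)
```

A region that tightly contains the target scores highest, a partially overlapping one scores lower, and a disjoint one scores zero.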

Exposing Synthetic Speech: Model Attribution and Detection of AI-generated Speech via Audio Fingerprints

Authors: Matías Pizarro, Mike Laszkiewicz, Shawkat Hesso, Dorothea Kolossa, Asja Fischer

First: 2024-11-21T10:55:49+00:00 · Latest: 2025-09-04T12:43:52+00:00

Abstract

As speech generation technologies continue to advance in quality and accessibility, the risk of malicious use cases, including impersonation, misinformation, and spoofing, increases rapidly. This work addresses this threat by introducing a simple, training-free, yet effective approach for detecting AI-generated speech and attributing it to its source model. Specifically, we tackle three key tasks: (1) single-model attribution in an open-world setting, where the goal is to determine whether a given audio sample was generated by a specific target neural speech synthesis system (with access only to data from that system); (2) multi-model attribution in a closed-world setting, where the objective is to identify the generating system from a known pool of candidates; and (3) detection of synthetic versus real speech. Our approach leverages standardized average residuals, that is, the difference between an input audio signal and its filtered version obtained with either a low-pass filter or the EnCodec audio autoencoder. We demonstrate that these residuals consistently capture artifacts introduced by diverse speech synthesis systems, serving as distinctive, model-agnostic fingerprints for attribution. Across extensive experiments, our approach achieves AUROC scores exceeding 99% in most scenarios, evaluated on augmented benchmark datasets that pair real speech with synthetic audio generated by multiple synthesis systems. In addition, our robustness analysis underscores the method's ability to maintain high performance even in the presence of moderate additive noise. Due to its simplicity, efficiency, and strong generalization across speech synthesis systems and languages, this technique offers a practical tool for digital forensics and security applications.

Summary

This research aims to address the growing threat of malicious use of AI-generated speech by developing a simple, training-free method for detecting and attributing AI-generated speech. The approach uses standardized average residuals to identify unique fingerprints of different speech synthesis models. Experiments show high accuracy, with AUROC scores over 99% in various scenarios, and robust performance even with noise.
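The residual-fingerprint idea can be sketched in a few lines. Everything specific below is an assumption made for illustration: a moving-average filter stands in for the low-pass filter, the fingerprint is a standardized average of log-magnitude residual spectra, attribution uses a correlation score, and the "systems" are synthetic signals with band-limited noise as their artifact. The paper's actual filter, features, and scoring may differ.

```python
import numpy as np

def lowpass(signal, width=9):
    """Moving-average FIR filter, a crude stand-in for the low-pass filter."""
    kernel = np.ones(width) / width
    return np.convolve(signal, kernel, mode="same")

def residual_spectrum(signal):
    """Log-magnitude spectrum of the residual (input minus filtered input)."""
    residual = signal - lowpass(signal)
    return np.log1p(np.abs(np.fft.rfft(residual)))

def fingerprint(clips):
    """Average the residual spectra over clips, then standardize."""
    mean_spec = np.mean([residual_spectrum(c) for c in clips], axis=0)
    return (mean_spec - mean_spec.mean()) / mean_spec.std()

def score(clip, fp):
    """Correlation between a clip's standardized residual spectrum and fp."""
    spec = residual_spectrum(clip)
    spec = (spec - spec.mean()) / spec.std()
    return float(np.mean(spec * fp))

rng = np.random.default_rng(0)
sr, n = 16000, 4096
t = np.arange(n) / sr

def fake_clip(band):
    """Tone plus faint noise, plus band-limited noise as the system artifact."""
    spec = np.fft.rfft(0.2 * rng.normal(size=n))
    mask = np.zeros(spec.shape)
    mask[band[0]:band[1]] = 1.0          # keep only the artifact band
    artifact = np.fft.irfft(spec * mask, n=n)
    base = np.sin(2 * np.pi * 250 * t) + 0.02 * rng.normal(size=n)
    return base + artifact

band_a, band_b = (500, 900), (1200, 1600)   # hypothetical artifact bands (FFT bins)
fp_a = fingerprint([fake_clip(band_a) for _ in range(8)])
fp_b = fingerprint([fake_clip(band_b) for _ in range(8)])
test_clip = fake_clip(band_a)
score_a, score_b = score(test_clip, fp_a), score(test_clip, fp_b)
```

In a closed-world setting, the clip would be attributed to the system with the highest score. Here the unseen clip from the first synthetic system matches that system's fingerprint more closely than the other one.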

TAGAL: Tabular Data Generation using Agentic LLM Methods

Authors: Benoît Ronval, Pierre Dupont, Siegfried Nijssen

First: 2025-09-04T12:25:14+00:00 · Latest: 2025-09-04T12:25:14+00:00

Abstract

The generation of data is a common approach to improve the performance of machine learning tasks, among which is the training of models for classification. In this paper, we present TAGAL, a collection of methods able to generate synthetic tabular data using an agentic workflow. The methods leverage Large Language Models (LLMs) for an automatic and iterative process that uses feedback to improve the generated data without any further LLM training. The use of LLMs also allows for the addition of external knowledge in the generation process. We evaluate TAGAL across diverse datasets and different aspects of quality for the generated data. We look at the utility of downstream ML models, both by training classifiers on synthetic data only and by combining real and synthetic data. Moreover, we compare the similarities between the real and the generated data. We show that TAGAL is able to perform on par with state-of-the-art approaches that require LLM training and generally outperforms other training-free approaches. These findings highlight the potential of agentic workflow and open new directions for LLM-based data generation methods.

Summary

The paper introduces TAGAL, a method for generating synthetic tabular data using Large Language Models (LLMs) in an agentic workflow. The approach leverages LLMs for an iterative process that improves generated data through feedback without further LLM training. Experiments across various datasets show that TAGAL performs comparably to state-of-the-art approaches requiring LLM training and outperforms other training-free methods, indicating the potential of agentic workflows in LLM-based data generation.

MUNBa: Machine Unlearning via Nash Bargaining

Authors: Jing Wu, Mehrtash Harandi

First: 2024-11-23T12:18:28+00:00 · Latest: 2025-09-04T11:00:46+00:00

Abstract

Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts and dominance, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict and dominance issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain and balance their contributions. To this end, inspired by the Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto stationary point. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm's effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving a better trade-off between forgetting and preserving. Our results also highlight improvements in forgetting precision, preservation of generalization, and robustness against adversarial attacks.

Summary

The paper addresses the challenge of Machine Unlearning (MU) by formulating it as a two-player cooperative game using Nash Bargaining theory. This approach aims to balance the objectives of forgetting specific concepts and preserving overall model performance. The method derives a closed-form solution to guide the model towards a Pareto stationary point, ensuring an optimal trade-off. Experiments on various tasks show that the proposed method outperforms existing MU algorithms, achieving better precision in forgetting and preservation of generalization.
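The abstract does not state the closed-form solution itself. As an illustration only, assume the bargaining problem takes the form common in gradient-bargaining methods such as Nash-MTL: choose an update d = α1·g_f + α2·g_p that maximizes log(d·g_f) + log(d·g_p) over a norm-bounded d, which gives the condition (GᵀG)α = 1/α. For two players this solves to α_i = 1 / (‖g_i‖·sqrt(1 + cos φ)), where φ is the angle between the two gradients. The gradients below are toy values, and this may not match MUNBa's exact derivation:

```python
import numpy as np

def nash_bargaining_direction(g_forget, g_preserve, eps=1e-12):
    """Two-player solution of (G^T G) alpha = 1 / alpha; returns (G @ alpha, alpha)."""
    n1, n2 = np.linalg.norm(g_forget), np.linalg.norm(g_preserve)
    cos = float(g_forget @ g_preserve) / (n1 * n2 + eps)
    if cos <= -1.0 + 1e-8:
        # Exactly opposed gradients leave no direction that helps both players.
        raise ValueError("gradients are exactly opposed; no bargaining solution")
    scale = 1.0 / np.sqrt(1.0 + cos)
    alpha = np.array([scale / n1, scale / n2])
    return alpha[0] * g_forget + alpha[1] * g_preserve, alpha

# Toy conflicting gradients (negative inner product) with unequal norms.
g_f = np.array([1.0, 0.0])     # hypothetical forgetting gradient
g_p = np.array([-0.5, 2.0])    # hypothetical preservation gradient
d, alpha = nash_bargaining_direction(g_f, g_p)
gains = np.array([d @ g_f, d @ g_p])
```

The resulting direction is proportional to the sum of the two unit-normalized gradients, so neither player dominates through gradient magnitude, and it has a positive inner product with both gradients even though they conflict.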

SMooGPT: Stylized Motion Generation using Large Language Models

Authors: Lei Zhong, Yi Yang, Changjian Li

First: 2025-09-04T09:41:18+00:00 · Latest: 2025-09-04T09:41:18+00:00

Abstract

Stylized motion generation is actively studied in computer graphics, especially benefiting from the rapid advances in diffusion models. The goal of this task is to produce a novel motion respecting both the motion content and the desired motion style, e.g., "walking in a loop like a monkey". Existing research attempts to address this problem via motion style transfer or conditional motion generation. These methods typically embed the motion style into a latent space and also guide the motion implicitly in that latent space. Despite the progress, they suffer from low interpretability and control, limited generalization to new styles, and an inability to produce motions other than "walking" due to the strong bias in the public stylization dataset. In this paper, we propose to solve the stylized motion generation problem from a new perspective of reasoning-composition-generation, based on our observations: i) human motion can often be effectively described using natural language in a body-part centric manner, ii) LLMs exhibit a strong ability to understand and reason about human motion, and iii) human motion has an inherently compositional nature, facilitating the generation of new motion content or styles via effective recomposition. We thus propose utilizing body-part text space as an intermediate representation, and present SMooGPT, a fine-tuned LLM, acting as a reasoner, composer, and generator when generating the desired stylized motion. Our method executes in the body-part text space with much higher interpretability, enabling fine-grained motion control, effectively resolving potential conflicts between motion content and style, and generalizing well to new styles thanks to the open-vocabulary ability of LLMs. Comprehensive experiments and evaluations, and a user perceptual study, demonstrate the effectiveness of our approach, especially under the pure text-driven stylized motion generation.

Summary

The paper aims to improve stylized motion generation by leveraging large language models (LLMs) to enhance interpretability and control over motion content and style. The method, SMooGPT, uses a body-part text space as an intermediate representation, allowing the model to act as a reasoner, composer, and generator. Experiments show that SMooGPT offers better fine-grained control, resolves conflicts between motion content and style, and generalizes well to new styles due to the open-vocabulary nature of LLMs.

DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model

Authors: Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, Chi Zhang

First: 2025-08-18T03:28:57+00:00 · Latest: 2025-09-04T08:05:29+00:00

Abstract

Recent advances in large vision-language models (LVLMs) have enabled a new paradigm of end-to-end document image parsing, excelling in Optical Character Recognition (OCR) tasks such as text, table, and formula recognition. However, generative LVLMs, similarly to large language models (LLMs), are prone to hallucinations, that is, generating words that do not exist in the input images. Furthermore, LVLMs are designed for general purposes and tend to be less effective on OCR tasks compared to expert models that are trained on domain-specific datasets. In this paper, we propose DianJin-OCR-R1, a reasoning-enhanced framework designed to address these limitations through training reasoning-and-tool interleaved VLMs. Given a recognition instruction, our DianJin-OCR-R1 model first recognizes the content in the input image using its own OCR capabilities, then calls other tools (i.e., other expert models) to obtain their results as references, and finally "looks again" at the image and rethinks its reasoning process to provide the final recognized content. Since the architectures of expert models are tailored to specific OCR tasks, making them less prone to hallucinations, their results can help VLMs mitigate hallucinations. We evaluate our model on ReST and OmniDocBench, and experimental results show that our DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models, which demonstrates the effectiveness of our method. Additionally, the results indicate that enhancing expert models, which are typically small and easy to iterate, enables performance improvements for VLMs.

Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection

Authors: Yijun Zhou, Yikui Zhai, Zilu Ying, Tingfeng Xian, Wenlve Zhou, Zhiheng Zhou, Xiaolin Tian, Xudong Jia, Hongsheng Zhang, C. L. Philip Chen

First: 2025-09-04T07:39:18+00:00 · Latest: 2025-09-04T07:39:18+00:00

Abstract

Although deep learning has advanced remote sensing change detection (RSCD), most methods rely solely on the image modality, limiting feature representation, change pattern modeling, and generalization, especially under illumination and noise disturbances. To address this, we propose MMChange, a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. An Image Feature Refinement (IFR) module is introduced to highlight key regions and suppress environmental noise. To overcome the semantic limitations of image features, we employ a vision-language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module then captures fine-grained semantic shifts, guiding the model toward meaningful changes. To bridge the heterogeneity between modalities, we design an Image Text Feature Fusion (ITFF) module that enables deep cross-modal integration. Extensive experiments on LEVIR-CD, WHU-CD, and SYSU-CD demonstrate that MMChange consistently surpasses state-of-the-art methods across multiple metrics, validating its effectiveness for multimodal RSCD. Code is available at: https://github.com/yikuizhai/MMChange.

中文标题/摘要

标题：基于文本差异增强的多模态特征融合网络在遥感变化检测中的应用

尽管深度学习已推动遥感变化检测（RSCD）的进步，但大多数方法仅依赖图像模态，限制了特征表示、变化模式建模和泛化能力，尤其是在光照和噪声干扰下。为解决这一问题，我们提出了一种名为MMChange的多模态RSCD方法，结合图像和文本模态以提高准确性和鲁棒性。引入了图像特征精炼（IFR）模块以突出关键区域并抑制环境噪声。为克服图像特征的语义限制，我们采用视觉语言模型（VLM）生成双时相图像的语义描述。随后，文本差异增强（TDE）模块捕捉细微的语义变化，引导模型关注有意义的变化。为弥合模态之间的异质性，我们设计了图像文本特征融合（ITFF）模块，实现深层次的跨模态整合。在LEVIRCD、WHUCD和SYSUCD上的广泛实验表明，MMChange在多个指标上均超越了现有方法，验证了其在多模态RSCD中的有效性。代码可在：https://github.com/yikuizhai/MMChange 获取。

Summary / 总结

The paper proposes MMChange, a multimodal RSCD method combining image and text modalities to enhance accuracy and robustness. It introduces an Image Feature Refinement (IFR) module to highlight key regions and suppress noise, a Textual Difference Enhancement (TDE) module to capture semantic shifts, and an Image Text Feature Fusion (ITFF) module to integrate modalities. Experiments show MMChange outperforms existing methods on the LEVIR-CD, WHU-CD, and SYSU-CD datasets, validating its effectiveness for multimodal RSCD.

研究旨在通过结合图像和文本模态来提高遥感变化检测的准确性和鲁棒性。提出的MMChange方法包括图像特征精炼模块以突出关键区域、文本差异增强模块以捕捉语义变化，以及图像文本特征融合模块以实现模态间的深度跨模态集成。在LEVIRCD、WHUCD和SYSUCD上的实验表明，MMChange在多个指标上优于现有方法，验证了其在多模态遥感变化检测中的有效性。

ANTS: Shaping the Adaptive Negative Textual Space by MLLM for OOD Detection

Authors: Zhu Wenjie, Zhang Yabin, Xin Jin, Wenjun Zeng, Lei Zhang

First: 2025-09-04T07:26:20+00:00 · Latest: 2025-09-04T07:26:20+00:00


Abstract

The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. In addition, the presence of false negative labels significantly degrades their near-OOD performance. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we identify images likely to be OOD samples as negative images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we first identify the subset of ID classes that are visually similar to negative images and then leverage the reasoning capability of MLLMs to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD) without relying on task-specific prior knowledge, making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 4.2%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
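
For readers unfamiliar with the FPR95 metric quoted above, the following is a minimal sketch of how it is commonly computed: fix the score threshold that still accepts about 95% of in-distribution samples, then measure how many OOD samples pass it. The function name `fpr_at_95_tpr` and the "higher score means more in-distribution" convention are assumptions for illustration, not the paper's code, and the adaptive weighted score of ANTS itself is not reproduced here.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted at the threshold that keeps ~95% of ID samples.

    Assumed convention: a higher score means "more in-distribution".
    """
    ranked = sorted(id_scores, reverse=True)
    n = len(ranked)
    # Smallest count covering at least 95% of ID samples (integer ceiling of 0.95 * n).
    keep = max(1, (95 * n + 99) // 100)
    threshold = ranked[keep - 1]
    # OOD samples scoring at or above the threshold are wrongly accepted as ID.
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


if __name__ == "__main__":
    id_scores = list(range(1, 21))   # toy ID scores 1..20
    ood_scores = [0, 1, 2, 3]        # toy OOD scores
    print(fpr_at_95_tpr(id_scores, ood_scores))
```

A lower value is better: it means fewer OOD images slip through while the detector still accepts 95% of ID images.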

中文标题/摘要

标题：ANTS: 通过MLLM塑造适应性负文本空间以进行OOD检测

引入负标签（NLs）已被证明能有效提升分布外（OOD）检测。然而，现有方法往往缺乏对OOD图像的理解，难以构建准确的负空间。此外，假负标签的存在显著降低了其近OOD性能。为解决这些问题，我们提出利用多模态大型语言模型（MLLM）的理解和推理能力，塑造适应性负文本空间（ANTS）。具体而言，我们识别出可能为OOD样本的图像作为负图像，并提示MLLM描述这些图像，生成能够精确刻画OOD分布并增强远OOD检测的表达性负句子。对于近OOD设置，其中OOD样本与分布内（ID）子集相似，我们首先识别出与负图像视觉相似的ID类子集，然后利用MLLM的推理能力生成针对该子集的视觉相似负标签，有效减少假负标签并提高近OOD检测。为了平衡这两种类型的负文本空间，我们设计了一个自适应加权分数，使方法能够在无需依赖特定任务先验知识的情况下处理不同的OOD任务设置（近OOD和远OOD），使其在开放环境中具有高度适应性。在ImageNet基准测试中，我们的ANTS将FPR95显著降低了4.2%，建立了新的最佳水平。此外，我们的方法无需训练且零样本，具有高可扩展性。

Summary / 总结

The paper introduces ANTS, which uses MLLMs to generate adaptive negative textual spaces for OOD detection. It addresses the limitations of existing methods by leveraging MLLMs to create precise negative descriptions for far-OOD samples and visually similar negative labels for near-OOD samples. The method reduces FPR95 by 4.2% on ImageNet, setting a new state of the art. Because it is training-free and zero-shot, it is also highly scalable.

该研究提出ANTS方法，利用MLLM生成精确的负样本描述以增强OOD检测。ANTS识别可能为OOD的样本并提示MLLM描述它们，生成精准的负句子用于远OOD检测。对于近OOD场景，它生成视觉上相似的负标签以减少假负标签。ANTS通过自适应加权评分平衡这两种负文本空间，使其适用于不同的OOD任务。在ImageNet基准测试中，ANTS将FPR95降低了4.2%，达到新的最佳水平。

Defending LVLMs Against Vision Attacks through Partial-Perception Supervision

Authors: Qi Zhou, Tianlin Li, Qing Guo, Dongxia Wang, Yun Lin, Yang Liu, Jin Song Dong

Venue: ICML 2025

First: 2024-12-17T09:38:58+00:00 · Latest: 2025-09-04T06:43:22+00:00

Comments: Accepted to ICML 2025


Abstract

Recent studies have raised significant concerns regarding the vulnerability of Large Vision Language Models (LVLMs) to maliciously injected or perturbed input images, which can mislead their responses. Existing defense methods show that such vision attacks are sensitive to image modifications especially cropping, using majority voting across responses of modified images as corrected responses. However, these modifications often result in partial images and distort the semantics, which reduces response quality on clean images after voting. Instead of directly using responses from partial images for voting, we investigate using them to supervise the LVLM's responses to the original images. We propose a black-box, training-free method called DPS (Defense through Partial-Perception Supervision). In this approach, the model is prompted using the responses generated by a model that perceives only a partial image. With DPS, the model can adjust its response based on partial image understanding when under attack, while confidently maintaining its original response for clean input. Our findings show that the weak model can supervise the strong model: when faced with an attacked input, the strong model becomes less confident and adjusts its response based on the weak model's partial understanding, effectively defending against the attack. With clean input, it confidently maintains its original response. Empirical experiments show our method outperforms the baseline, cutting the average attack success rate by 76.3% across six datasets on three popular models.

中文标题/摘要

标题：通过部分感知监督防御LVLM的视觉攻击

近期研究对大型视觉语言模型（LVLMs）在恶意注入或扰动输入图像时的脆弱性提出了严重关切，这些攻击可以误导模型的响应。现有防御方法表明，此类视觉攻击对图像修改特别敏感，尤其是裁剪，通过跨修改图像响应的多数投票作为正确响应。然而，这些修改通常会导致部分图像，从而扭曲语义，这在投票后降低了干净图像的响应质量。我们不直接使用部分图像的响应进行投票，而是研究使用它们来监督LVLM对原始图像的响应。我们提出了一种无需训练的黑盒方法，称为DPS（通过部分感知监督防御）。在此方法中，模型使用仅感知部分图像的模型生成的响应进行提示。使用DPS，模型在受到攻击时可以根据部分图像的理解调整其响应，同时自信地保持其原始响应以应对干净输入。我们的研究发现，弱模型可以监督强模型：面对攻击输入时，强模型变得不那么自信，并根据弱模型的部分理解调整其响应，从而有效防御攻击。在干净输入时，它自信地保持其原始响应。实验证明，我们的方法优于基线，六个数据集上三个流行模型的平均攻击成功率降低了76.3%。

Summary / 总结

The paper addresses the vulnerability of Large Vision Language Models (LVLMs) to vision attacks by proposing a defense method called DPS (Defense through Partial-Perception Supervision). DPS uses responses from a model that perceives only a partial image to supervise the LVLM's responses to the original image, allowing the model to adjust its response when under attack while maintaining its original response for clean inputs. The method significantly reduces the average attack success rate by 76.3% across six datasets on three popular models.

本文提出了一种名为DPS（通过部分感知监督防御）的方法，用于防御大型视觉语言模型（LVLM）对视觉攻击的脆弱性。DPS 使用来自部分图像的响应来监督 LVLM 对原始图像的响应，使模型在受到攻击时能够根据部分图像的理解调整其响应，同时在干净输入时保持其原始响应。实验结果显示，DPS 在六个数据集上将三种流行模型的平均攻击成功率降低了 76.3%，优于基线方法。

Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model

Authors: Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo

Venue: ICCV 2025

First: 2025-09-04T05:42:02+00:00 · Latest: 2025-09-04T05:42:02+00:00

Comments: ICCV 2025 - LIMIT Workshop


Abstract

Contrastive vision-language models excel in zero-shot image recognition but face challenges in few-shot scenarios due to computationally intensive offline fine-tuning using prompt learning, which risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.
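
The idea of refining a category embedding with a few support examples can be illustrated with plain scaled dot-product attention. This is only a sketch under stated assumptions: the function names, the residual mixing weight `alpha`, and the use of a single attention head without learned projections are all hypothetical choices made here, not the Memory Attn-Adapter architecture from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def refine_embedding(class_emb, support_embs, alpha=0.5):
    """Blend a class embedding with an attention-weighted mix of support embeddings.

    class_emb acts as the query; support_embs act as both keys and values.
    alpha (hypothetical) controls how much of the attended vector is mixed in.
    """
    scale = math.sqrt(len(class_emb))
    scores = [sum(q * k for q, k in zip(class_emb, s)) / scale for s in support_embs]
    weights = softmax(scores)
    attended = [
        sum(w * s[i] for w, s in zip(weights, support_embs))
        for i in range(len(class_emb))
    ]
    # Residual blend keeps the original (zero-shot) embedding as an anchor.
    return [(1 - alpha) * c + alpha * a for c, a in zip(class_emb, attended)]


if __name__ == "__main__":
    text_emb = [1.0, 0.0]
    supports = [[0.8, 0.6], [0.6, 0.8]]
    print(refine_embedding(text_emb, supports))
```

Because nothing in the base model is updated, adaptation of this kind happens at inference time from the support set alone, which is the property the abstract describes as online few-shot learning.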

中文标题/摘要

标题：Attn-Adapter：注意力即是视觉-语言模型在线少样本学习器所需的一切

对比视觉-语言模型在零样本图像识别中表现出色，但在少样本场景中由于使用提示学习进行计算密集型离线微调而面临过拟合风险。为克服这些限制，我们提出了一种名为Attn-Adapter的新颖在线少样本学习框架，通过双重注意力机制增强CLIP的适应性。我们的设计通过两个组件整合了数据集特定的信息：Memory Attn-Adapter，通过支持样本细化类别嵌入；Local-Global Attn-Adapter，通过整合局部和全局特征丰富图像嵌入。该架构能够在少量标记样本上实现动态适应，而无需重新训练基础模型。Attn-Adapter在跨类别和跨数据集泛化方面优于现有方法，保持高效的推理并适用于各种CLIP基础模型。

Summary / 总结

The research aims to improve the few-shot learning capability of vision-language models by addressing the computational challenges of offline fine-tuning. Attn-Adapter, a novel framework, enhances CLIP's adaptability through a dual attention mechanism, incorporating dataset-specific information via Memory Attn-Adapter and Local-Global Attn-Adapter. The method enables dynamic adaptation from a few labeled samples without retraining the base model, outperforming state-of-the-art methods in cross-category and cross-dataset generalization while maintaining efficient inference and scalability across CLIP backbones.

研究旨在通过提出Attn-Adapter，一种新颖的在线少样本学习框架，解决对比视觉-语言模型在少样本场景中的局限性。该框架通过双重注意力机制增强CLIP的适应性，通过Memory Attn-Adapter和Local-Global Attn-Adapter整合数据集特定信息。关键实验发现是，Attn-Adapter在跨类别和跨数据集泛化方面优于现有最佳方法，同时保持高效的推理和在CLIP基础模型上的可扩展性。

Weakly-Supervised Learning of Dense Functional Correspondences

Authors: Stefan Stojanov, Linan Zhao, Yunzhi Zhang, Daniel L. K. Yamins, Jiajun Wu

Venue: ICCV 2025

First: 2025-09-04T05:39:16+00:00 · Latest: 2025-09-04T05:39:16+00:00

Comments: Accepted at ICCV 2025. Project website: https://dense-functional-correspondence.github.io/


Abstract

Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly-supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-language models to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision language models.

中文标题/摘要

标题：弱监督学习密集功能对应

在形状重建和机器人操作等任务中，跨图像对建立密集对应关系是必不可少的。在跨类别匹配这一具有挑战性的设置中，对象的功能，即对象对其他对象可能产生的效果，可以指导对应关系的建立。这是因为能够实现特定功能的对象部件在形状和外观上往往具有相似性。我们基于这一观察推导出密集功能对应的定义，并提出了一种弱监督学习范式来解决预测任务。我们方法的核心洞察是，可以利用视觉语言模型为多视角图像生成伪标签，以获得功能部件。然后，我们将其与基于像素对应关系的密集对比学习相结合，将功能和空间知识蒸馏到一个新模型中，以建立密集功能对应。此外，我们还整理了合成和真实评估数据集作为任务基准。我们的结果表明，与由现成的自监督图像表示和具备定位能力的视觉语言模型构成的基线方案相比，我们的方法具有优势。

Summary / 总结

The paper addresses the challenge of establishing dense functional correspondences across different object categories by leveraging the function of objects. It proposes a weakly-supervised learning paradigm that uses vision-language models to pseudo-label multi-view images and integrates this with dense contrastive learning to distill both functional and spatial knowledge. The approach outperforms baseline solutions on synthetic and real datasets, demonstrating its effectiveness in predicting dense functional correspondences.

论文通过利用物体的功能来解决不同类别物体间密集功能对应关系的建立挑战。提出了一种弱监督学习范式，使用视觉-语言模型对多视角图像进行伪标签，并将其与基于像素对应关系的密集对比学习相结合，以提取功能和空间知识。该方法在合成和真实数据集上的表现优于基线解决方案，证明了其在预测密集功能对应关系方面的有效性。

Expedition & Expansion: Leveraging Semantic Representations for Goal-Directed Exploration in Continuous Cellular Automata

Authors: Sina Khajehabdollahi, Gautier Hamon, Marko Cvjetko, Pierre-Yves Oudeyer, Clément Moulin-Frier, Cédric Colas

First: 2025-09-04T03:44:44+00:00 · Latest: 2025-09-04T03:44:44+00:00


Abstract

Discovering diverse visual patterns in continuous cellular automata (CA) is challenging due to the vastness and redundancy of high-dimensional behavioral spaces. Traditional exploration methods like Novelty Search (NS) expand locally by mutating known novel solutions but often plateau when local novelty is exhausted, failing to reach distant, unexplored regions. We introduce Expedition and Expansion (E&E), a hybrid strategy where exploration alternates between local novelty-driven expansions and goal-directed expeditions. During expeditions, E&E leverages a Vision-Language Model (VLM) to generate linguistic goals--descriptions of interesting but hypothetical patterns that drive exploration toward uncharted regions. By operating in semantic spaces that align with human perception, E&E both evaluates novelty and generates goals in conceptually meaningful ways, enhancing the interpretability and relevance of discovered behaviors. Tested on Flow Lenia, a continuous CA known for its rich, emergent behaviors, E&E consistently uncovers more diverse solutions than existing exploration methods. A genealogical analysis further reveals that solutions originating from expeditions disproportionately influence long-term exploration, unlocking new behavioral niches that serve as stepping stones for subsequent search. These findings highlight E&E's capacity to break through local novelty boundaries and explore behavioral landscapes in human-aligned, interpretable ways, offering a promising template for open-ended exploration in artificial life and beyond.
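
The local expansion step that Novelty Search relies on is usually driven by a novelty score: the mean distance from a candidate's behavior descriptor to its k nearest neighbors in an archive. The sketch below shows that standard score only. The descriptor vectors, the value of `k`, and the function name are illustrative assumptions, and the VLM-driven expedition step of E&E is not modeled.

```python
import math


def novelty(candidate, archive, k=3):
    """Mean Euclidean distance from candidate to its k nearest archive entries."""
    if not archive:
        # With nothing to compare against, treat the candidate as maximally novel.
        return math.inf
    distances = sorted(math.dist(candidate, other) for other in archive)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


if __name__ == "__main__":
    archive = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
    print(novelty((0.0, 0.0), archive, k=2))    # close to known behaviors
    print(novelty((30.0, 40.0), archive, k=2))  # far from everything seen so far
```

When every mutation of the archive lands near existing entries, scores like this flatten out, which is the plateau the abstract says goal-directed expeditions are meant to escape.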

中文标题/摘要

标题：探险与扩展：利用语义表示在连续细胞自动机中进行目标导向探索

在连续细胞自动机（CA）中发现多样的视觉模式具有挑战性，因为高维行为空间既庞大又冗余。传统探索方法如新颖性搜索（NS）通过突变已知的新颖解进行局部扩展，但在局部新颖性耗尽时往往会停滞，无法到达遥远的未探索区域。我们引入了探险与扩展（E&E）这一混合策略，其中探索在局部新颖性驱动的扩展和目标导向的探险之间交替进行。在探险期间，E&E 利用视觉语言模型（VLM）生成语言目标，即对有趣但尚属假设的模式的描述，从而驱动探索向未知区域前进。通过在与人类感知相一致的语义空间中操作，E&E 既评估新颖性又以概念上有意义的方式生成目标，从而增强所发现行为的可解释性和相关性。在以丰富的涌现行为著称的连续CA Flow Lenia上进行测试时，E&E 始终能发现比现有探索方法更加多样的解。进一步的谱系分析表明，源自探险的解对长期探索具有不成比例的影响，解锁了可作为后续搜索踏脚石的新行为生态位。这些发现突显了E&E突破局部新颖性边界、以与人类对齐且可解释的方式探索行为景观的能力，为人工生命及其他领域的开放式探索提供了有前景的模板。

Measuring How (Not Just Whether) VLMs Build Common Ground

Authors: Saki Imai, Mert İnan, Anthony Sicilia, Malihe Alikhani

First: 2025-09-04T01:43:49+00:00 · Latest: 2025-09-04T01:43:49+00:00


Abstract

Large vision language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.

中文标题/摘要

标题：测量VLMs如何（而不仅仅是是否）建立共同基础

大型视觉语言模型（VLMs）越来越多地声称具备推理能力，但当前基准测试仅在单轮或问答设置中评估它们。然而，接地是一个互动过程，在此过程中，人们通过持续沟通逐渐发展共同理解。我们引入了一套四指标体系（接地效率、内容对齐、词汇适应性和类人度），系统评估VLM在互动接地环境中的表现。我们在150场自我对弈的互动指称游戏中部署了这套体系，并将它们与人类双人组进行比较。所有三个模型在至少三个指标上偏离了人类模式，而GPT4o-mini总体上最接近人类。我们发现，(i) 任务成功率分数并不能表明成功的接地，(ii) 高图像-语句对齐并不一定预测任务成功。我们的指标体系和研究结果为未来VLM接地研究提供了一个框架。

Summary / 总结

The research aims to evaluate how VLMs develop shared understanding in interactive contexts, beyond just their ability to answer questions. The study uses a four-metric suite to assess VLMs in interactive referential games, comparing them to human dyads. Key findings show that VLMs perform differently from humans on multiple metrics, with GPT4o-mini being the closest. The study also reveals that task success scores and image-utterance alignment do not always correlate with successful grounding.

研究旨在评估VLMs在互动情境中如何逐步建立共同理解，而不仅仅是回答问题的能力。研究使用一套四项指标来评估VLMs在互动指称游戏中的表现，并将其与人类双人组进行比较。主要发现表明，VLMs在多个指标上与人类表现不同，GPT4o-mini表现最接近人类。研究还发现，任务成功率和图像-语句对齐并不总是与成功的共同理解相关。

Causality-guided Prompt Learning for Vision-language Models via Visual Granulation

Authors: Mengyu Gao, Qiulei Dong

Venue: ICCV 2025

First: 2025-09-04T01:40:41+00:00 · Latest: 2025-09-04T01:40:41+00:00

Comments: ICCV 2025 Accepted


Abstract

Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most of the existing CLIP-based prompt learning methods only show a limited ability for handling fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, where the explored visual granulation technique could construct sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through causal inference. The CaPL method contains the following two modules: (1) An attribute disentanglement module is proposed to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model; (2) A granule learning module is proposed to construct visual granules by integrating the aforementioned attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, a more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets.

中文标题/摘要

标题：基于视觉粒化引导因果性的提示学习方法在视觉语言模型中的应用

提示学习最近引起了对适应预训练视觉语言模型（例如CLIP）到下游识别任务的广泛关注。然而，现有的大多数基于CLIP的提示学习方法在处理细粒度数据集时能力有限。为了解决这一问题，我们提出了一种基于视觉粒化的因果性引导文本提示学习方法，称为CaPL，其中探索的视觉粒化技术可以为文本提示构建视觉粒子集，通过因果推理捕捉不同细粒度类之间的细微差异。CaPL方法包含以下两个模块：（1）提出了一种属性解耦模块，使用布朗桥扩散模型将视觉特征分解为非个体化属性（由某些类共享）和个体化属性（仅对单一类特定）；（2）提出了一种粒子学习模块，通过结合上述属性在两种因果推理策略下构建视觉粒子进行识别。由于学习到的视觉粒子，期望能够学习到更具区分性的文本提示。在15个数据集上的广泛实验结果表明，我们的CaPL方法显著优于最先进的提示学习方法，特别是在细粒度数据集上。

Summary / 总结

The research aims to enhance the capability of vision-language models, particularly CLIP, in handling fine-grained datasets through a causality-guided text prompt learning method called CaPL. CaPL uses visual granulation to decompose visual features into non-individualized and individualized attributes, and constructs visual granules for better causal inference. Experimental results on 15 datasets show that CaPL significantly outperforms existing methods, especially on fine-grained datasets.

研究旨在通过因果引导的文本提示学习方法CaPL来增强如CLIP等视觉-语言模型在细粒度识别任务中的能力。CaPL使用视觉粒化技术构建视觉粒集，通过因果推理捕捉不同细粒度类别的细微差异。该方法包括一个属性解耦模块来分解视觉特征，以及一个粒学习模块来整合这些属性进行识别。实验结果表明，CaPL在15个数据集上的表现优于现有提示学习方法，特别是在细粒度数据集上表现更佳。

MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting

Authors: Yuheng Li, Yenho Chen, Yuxiang Lai, Jike Zhong, Vanessa Wildman, Xiaofeng Yang

First: 2025-09-04T01:28:44+00:00 · Latest: 2025-09-04T01:28:44+00:00


Abstract

Radiologic diagnostic errors (under-reading errors, inattentional blindness, and communication failures) remain prevalent in clinical practice. These issues often stem from missed localized abnormalities, limited global context, and variability in report language. These challenges are amplified in 3D imaging, where clinicians must examine hundreds of slices per scan. Addressing them requires systems with precise localized detection, global volume-level reasoning, and semantically consistent natural language reporting. However, existing 3D vision-language models are unable to meet all three needs jointly, lacking local-global understanding for spatial reasoning and struggling with the variability and noise of uncurated radiology reports. We present MedVista3D, a multi-scale semantic-enriched vision-language pretraining framework for 3D CT analysis. To enable joint disease detection and holistic interpretation, MedVista3D performs local and global image-text alignment for fine-grained representation learning within full-volume context. To address report variability, we apply language model rewrites and introduce a Radiology Semantic Matching Bank for semantics-aware alignment. MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, while transferring well to organ segmentation and prognosis prediction. Code and datasets will be released.

中文标题/摘要

标题：MedVista3D：用于减少3D CT疾病检测、理解和报告中的诊断错误的视觉-语言建模

放射学诊断错误，包括漏读错误、非注意盲视和沟通失败，在临床实践中仍然普遍存在。这些问题通常源于局部异常的遗漏、全局上下文有限以及报告语言的差异性。这些问题在3D成像中被放大，因为临床医生必须检查每次扫描中的数百个切片。解决这些问题需要具备精确局部检测、全局体积级推理和语义一致的自然语言报告能力的系统。然而，现有的3D视觉-语言模型无法同时满足这三个需求，缺乏空间推理所需的局部-全局理解，并且难以处理未经整理的放射学报告的差异性和噪声。我们提出了MedVista3D，一种用于3D CT分析的多尺度语义增强视觉-语言预训练框架。为了实现联合疾病检测和整体解释，MedVista3D 在全体积上下文中进行局部和全局图像-文本对齐，以实现细粒度的表示学习。为了解决报告的差异性，我们应用了语言模型重写，并引入了放射学语义匹配库，以实现语义感知的对齐。MedVista3D 在零样本疾病分类、报告检索和医学视觉问答方面达到了最先进的性能，同时在器官分割和预后预测方面表现出良好的迁移能力。代码和数据集将被发布。

Summary / 总结

The research aims to reduce diagnostic errors in 3D CT disease detection, understanding, and reporting by addressing issues such as under-reading, inattentional blindness, and communication failures. MedVista3D, a multi-scale semantic-enriched vision-language pretraining framework, is developed to perform local and global image-text alignment for fine-grained representation learning and to handle report variability through language model rewrites and a Radiology Semantic Matching Bank. The model achieves state-of-the-art performance in zero-shot disease classification, report retrieval, and medical visual question answering, and also transfers well to organ segmentation and prognosis prediction.

MedVista3D旨在通过解决漏诊和沟通失误等问题，减少3D CT疾病检测和报告中的诊断错误。它采用多尺度语义增强的视觉-语言预训练框架，进行局部和全局图像-文本对齐，并引入放射学语义匹配库以处理报告的差异性。该模型在零样本疾病分类、报告检索和医学视觉问答方面达到了最先进的性能，并且也适用于器官分割和预后预测任务。

STA-Net: A Decoupled Shape and Texture Attention Network for Lightweight Plant Disease Classification

Authors: Zongsen Qiu

First: 2025-09-03T22:46:20+00:00 · Latest: 2025-09-03T22:46:20+00:00


Abstract

Responding to rising global food security needs, precision agriculture and deep learning-based plant disease diagnosis have become crucial. Yet, deploying high-precision models on edge devices is challenging. Most lightweight networks use attention mechanisms designed for generic object recognition, which poorly capture subtle pathological features like irregular lesion shapes and complex textures. To overcome this, we propose a twofold solution: first, using a training-free neural architecture search method (DeepMAD) to create an efficient network backbone for edge devices; second, introducing the Shape-Texture Attention Module (STAM). STAM splits attention into two branches -- one using deformable convolutions (DCNv4) for shape awareness and the other using a Gabor filter bank for texture awareness. On the public CCMT plant disease dataset, our STA-Net model (with 401K parameters and 51.1M FLOPs) reached 89.00% accuracy and an F1 score of 88.96%. Ablation studies confirm STAM significantly improves performance over baseline and standard attention models. Integrating domain knowledge via decoupled attention thus presents a promising path for edge-deployed precision agriculture AI. The source code is available at https://github.com/RzMY/STA-Net.
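
To make the texture branch more concrete, here is a minimal sketch of a Gabor filter bank: a set of oriented kernels, each a Gaussian envelope multiplied by a cosine carrier. The kernel size, `sigma`, `wavelength`, `gamma`, and the number of orientations are illustrative assumptions, and the code does not reproduce how STAM wires such a bank into its attention module.

```python
import math


def gabor_kernel(size, theta, sigma=2.0, wavelength=4.0, gamma=0.5, psi=0.0):
    """Real-valued Gabor kernel of shape size x size (size should be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier wave runs along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(size=7, orientations=4):
    """Kernels at evenly spaced orientations in [0, pi)."""
    return [gabor_kernel(size, math.pi * i / orientations) for i in range(orientations)]


if __name__ == "__main__":
    bank = gabor_bank()
    print(len(bank), "kernels of size", len(bank[0]), "x", len(bank[0][0]))
```

Convolving a leaf image with each kernel responds to stripe-like or speckled texture at that orientation, which is the kind of cue generic attention modules are said to capture poorly.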

中文标题/摘要

标题：STA-Net：一种用于轻量级植物病害分类的解耦形状和纹理注意力网络

响应全球粮食安全需求的上升，精准农业和基于深度学习的植物病害诊断变得至关重要。然而，在边缘设备上部署高精度模型具有挑战性。大多数轻量级网络使用设计用于通用对象识别的注意力机制，这不能很好地捕捉到如不规则病斑形状和复杂纹理等细微的病理特征。为了解决这一问题，我们提出了一种两步解决方案：首先，使用无训练神经架构搜索方法（DeepMAD）为边缘设备创建一个高效的网络骨干；其次，引入形状-纹理注意力模块（STAM）。STAM将注意力机制分为两个分支——一个使用可变形卷积（DCNv4）进行形状感知，另一个使用Gabor滤波器组进行纹理感知。在公共的CCMT植物病害数据集上，我们的STA-Net模型（参数量40.1万，FLOPs 51.1百万）达到了89.00%的准确率和88.96%的F1分数。消融研究证实，STAM在基线和标准注意力模型上显著提高了性能。通过解耦注意力机制整合领域知识，为边缘部署的精准农业AI提供了一条有前景的道路。源代码可在https://github.com/RzMY/STA-Net获取。

Summary / 总结

The research aims to address the challenge of deploying high-precision plant disease classification models on edge devices in precision agriculture. The authors propose STA-Net, which uses a training-free neural architecture search method and introduces the Shape-Texture Attention Module (STAM) to better capture subtle pathological features. On the CCMT dataset, STA-Net achieved 89.00% accuracy and an F1 score of 88.96%, demonstrating significant performance improvements over baseline models.

研究旨在解决在精准农业中将高精度植物疾病分类模型部署到边缘设备的挑战。作者提出了STA-Net，该模型使用无训练神经架构搜索方法，并引入了形状-纹理注意力模块（STAM）以更好地捕捉细微的病理特征。在CCMT数据集上，STA-Net实现了89.00%的准确率和88.96%的F1分数，显著优于基线模型。

Singular Value Few-shot Adaptation of Vision-Language Models

Authors: Taha Koleilat, Hassan Rivaz, Yiming Xiao

First: 2025-09-03T22:00:23+00:00 · Latest: 2025-09-03T22:00:23+00:00

Comments: 10 pages, 2 figures, 8 tables


Abstract

Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
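
The core mechanism, keeping the singular vectors fixed and adjusting only the singular values, can be sketched as follows. The factors `U`, `s`, and `Vt` are assumed to come from an SVD computed elsewhere, the helper names are hypothetical, and the per-matrix fraction printed here is not the 0.04% whole-model figure from the abstract, which also depends on which matrices are adapted.

```python
def reconstruct(U, s, Vt):
    """Rebuild W = U @ diag(s) @ Vt from fixed U, Vt and adjustable singular values s."""
    rows, rank, cols = len(U), len(s), len(Vt[0])
    return [
        [sum(U[i][r] * s[r] * Vt[r][j] for r in range(rank)) for j in range(cols)]
        for i in range(rows)
    ]


def trainable_fraction(rows, cols):
    """Share of a rows x cols matrix's entries that its singular values represent."""
    return min(rows, cols) / (rows * cols)


if __name__ == "__main__":
    # Toy orthogonal factors standing in for a pretrained weight's SVD (assumed given).
    U = [[1, 0], [0, 1]]
    Vt = [[0, 1], [1, 0]]
    pretrained_s = [3, 2]
    adapted_s = [3.3, 1.8]  # only these values would receive gradient updates
    print(reconstruct(U, pretrained_s, Vt))
    print(reconstruct(U, adapted_s, Vt))
    print(trainable_fraction(768, 768))
```

Rescaling `s` stretches or shrinks the existing basis directions without rotating them, which is one way to read the abstract's claim that the pretrained structure is retained.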

中文标题/摘要

标题：视觉-语言模型的奇异值少样本适应

视觉-语言模型（VLMs）如CLIP在多种应用中展示了令人印象深刻的零样本和少样本学习能力。然而，将这些模型适应到新的细粒度领域仍然困难重重，因为这依赖于提示工程和全模型微调的高成本。现有的适应方法依赖于增强组件，如提示标记和适配模块，这些组件可能会限制适应质量，使模型不稳定，并损害其在预训练期间学到的丰富知识。在本文中，我们提出了**CLIP-SVD**，这是一种新颖的**多模态**和**参数高效**的适应技术，利用奇异值分解（SVD）修改CLIP的内部参数空间，而不注入额外模块。具体来说，我们仅微调CLIP参数矩阵的奇异值以重新缩放基向量进行领域适应，同时保留预训练模型。此设计仅使用模型总参数的**0.04%**便能实现增强的适应性能，并更好地保持其泛化能力。CLIP-SVD在11个自然和10个生物医学数据集上实现了最先进的分类结果，在少样本设置中在准确性和泛化方面均优于先前的方法。此外，我们利用基于自然语言的方法分析CLIP适应的有效性和动态，以实现CLIP-SVD的可解释性。代码可在https://github.com/HealthX-Lab/CLIP-SVD上公开获取。

Summary / 总结

This work addresses the challenge of adapting vision-language models like CLIP to new domains with limited fine-tuning. It introduces CLIP-SVD, a parameter-efficient adaptation technique that uses Singular Value Decomposition to modify only the singular values of CLIP's parameter matrices, thereby achieving state-of-the-art results on 11 natural and 10 biomedical datasets with just 0.04% of the model's parameters. This method enhances adaptation performance while preserving the model's generalization ability.

CLIP-SVD 是一种新颖的参数高效适应技术，用于像 CLIP 这样的视觉-语言模型，它利用奇异值分解来修改内部参数空间而不添加额外模块。它仅微调模型参数的 0.04%，在 11 个自然和 10 个生物医学数据集上取得了最先进的分类结果，优于先前方法的准确性和泛化能力。此外，通过自然语言方法分析适应效果和动态，提供可解释性。

Short-Form Video Recommendations with Multimodal Embeddings: Addressing Cold-Start and Bias Challenges

Authors: Andrii Dzhoha, Katya Mirylenka, Egor Malykh, Marco-Andrea Buchmann, Francesca Catino

First: 2025-07-25T14:57:04+00:00 · Latest: 2025-09-03T20:09:30+00:00


Abstract

In recent years, social media users have spent significant amounts of time on short-form video platforms. As a result, established platforms in other domains, such as e-commerce, have begun introducing short-form video content to engage users and increase their time spent on the platform. The success of these experiences is due not only to the content itself but also to a unique UI innovation: instead of offering users a list of choices to click, platforms actively recommend content for users to watch one at a time. This creates new challenges for recommender systems, especially when launching a new video experience. Beyond the limited interaction data, immersive feed experiences introduce stronger position bias due to the UI and duration bias when optimizing for watch-time, as models tend to favor shorter videos. These issues, together with the feedback loop inherent in recommender systems, make it difficult to build effective solutions. In this paper, we highlight the challenges faced when introducing a new short-form video experience and present our experience showing that, even with sufficient video interaction data, it can be more beneficial to leverage a video retrieval system using a fine-tuned multimodal vision-language model to overcome these challenges. This approach demonstrated greater effectiveness compared to conventional supervised learning methods in online experiments conducted on our e-commerce platform.

中文标题/摘要

标题：基于多模态嵌入的短视频推荐：解决冷启动和偏差挑战

近年来，社交媒体用户在短视频平台上花费了大量时间。因此，其他领域的成熟平台，如电子商务，已经开始引入短视频内容以吸引用户并增加他们在平台上的停留时间。这些体验的成功不仅归功于内容本身，还归功于一种独特的UI创新：平台不再为用户提供可供点击的选择列表，而是主动推荐内容，让用户逐个观看。这为推荐系统带来了新的挑战，尤其是在推出新的视频体验时。除了交互数据有限之外，沉浸式信息流体验还会因UI而引入更强的位置偏差，并在以观看时长为优化目标时引入时长偏差，因为模型倾向于偏好较短的视频。这些问题，加上推荐系统固有的反馈循环，使得构建有效的解决方案变得困难。在本文中，我们强调了引入新的短视频体验时面临的挑战，并分享了我们的经验：即使有足够的视频交互数据，利用基于微调多模态视觉-语言模型的视频检索系统来克服这些挑战也可能更有益。在我们电子商务平台进行的在线实验中，这种方法比传统的监督学习方法更有效。

Summary / 总结

This paper addresses the challenges of recommending short-form videos, particularly in the context of e-commerce platforms. It highlights issues such as cold-start and bias in user interactions and watch-time optimization. The authors propose using a fine-tuned multimodal vision-language model for video retrieval, which shows better performance than traditional supervised learning methods in online experiments on their platform.

本文探讨了推荐短格式视频所面临的挑战，特别是在电子商务平台上的情境。作者提出使用微调的多模态视觉-语言模型进行视频检索，以克服冷启动和偏差问题。在其平台上的在线实验表明，这种方法的表现优于传统的监督学习方法。

E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition

Authors: Aryan Gupta, Anupam Purwar

First: 2025-09-03T18:08:41+00:00 · Latest: 2025-09-03T18:08:41+00:00

Comments: Sprinklr OCR provides a fast and compute light way of performing OCR


Abstract

Optical Character Recognition (OCR) in multilingual, noisy, and diverse real-world images remains a significant challenge. With the rise of Large Vision-Language Models (LVLMs), there is growing interest in their ability to generalize and reason beyond fixed OCR pipelines. In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system specifically optimized for edge deployment in resource-constrained environments. We present a large-scale comparative evaluation of five state-of-the-art LVLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and two traditional OCR systems (Sprinklr-Edge-OCR, SuryaOCR) on a proprietary, doubly hand-annotated dataset of multilingual (54 languages) images. Our benchmark covers a broad range of metrics including accuracy, semantic consistency, language coverage, computational efficiency (latency, memory, GPU usage), and deployment cost. To better reflect real-world applicability, we also conducted edge case deployment analysis, evaluating model performance in CPU-only environments. Among the results, Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR delivered the best overall F1 score (0.46) and outperformed others in efficiency, processing images 35 times faster (0.17 seconds per image on average) and at less than 0.01 of the cost (0.006 USD per 1,000 images) compared to LVLMs. Our findings demonstrate that traditional OCR systems remain the best fit for edge deployment even in the era of LLMs due to their low compute requirements, low latency, and very high affordability.
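
The efficiency figures above can be unpacked with a little arithmetic. The sketch below only restates numbers from the abstract (0.17 seconds per image, a factor of 35, 0.006 USD per 1,000 images, a cost ratio below 0.01) and derives what they would imply for the LVLM side. Which LVLM the ratios refer to is not stated in the abstract, so the derived values are an interpretation, not reported results. The F1 helper is the standard harmonic mean of precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract.
edge_latency_s = 0.17      # Sprinklr-Edge-OCR, seconds per image
speedup = 35               # "35 times faster"
edge_cost_usd = 0.006      # per 1,000 images
cost_ratio_bound = 0.01    # "less than 0.01 of the cost"

# Implied values for the LVLM baseline under that reading (derived, not reported).
implied_lvlm_latency_s = edge_latency_s * speedup          # roughly 5.95 s per image
implied_lvlm_cost_floor = edge_cost_usd / cost_ratio_bound  # more than roughly 0.6 USD per 1,000

if __name__ == "__main__":
    print(round(implied_lvlm_latency_s, 2), "seconds per image (implied)")
    print(round(implied_lvlm_cost_floor, 2), "USD per 1,000 images (implied lower bound)")
    print(f1_score(0.5, 0.5))
```

Note that the highest precision and the best F1 belong to different systems here, which is possible because F1 also rewards recall.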

中文标题/摘要

标题：E-ARMOR：边缘案例评估与多语言光学字符识别审查

多语言、嘈杂和多样化的现实世界图像中的光学字符识别（OCR）仍然是光学字符识别系统的一个重大挑战。随着大型视觉-语言模型（LVLM）的兴起，人们越来越关注它们在固定OCR流水线之外的泛化和推理能力。在本文中，我们介绍了Sprinklr-Edge-OCR，这是一种专门针对资源受限环境边缘部署优化的新型OCR系统。我们对五种最先进的LVLM（InternVL、Qwen、GOT OCR、LLaMA、MiniCPM）和两种传统OCR系统（Sprinklr-Edge-OCR、SuryaOCR）在专有的、双人工标注的多语言（54种语言）图像数据集上进行了大规模比较评估。我们的基准测试涵盖了准确性、语义一致性、语言覆盖率、计算效率（延迟、内存、GPU使用情况）和部署成本等广泛指标。为了更好地反映实际应用情况，我们还进行了边缘案例部署分析，评估模型在仅CPU环境下的性能。结果中，Qwen的精确度最高（0.54），而Sprinklr-Edge-OCR的整体F1分数最高（0.46），在效率方面也优于其他系统，处理图像速度比其他系统快35倍（平均每张图像0.17秒），成本也低得多（每1,000张图像0.006美元），比LVLM低99%。我们的研究结果表明，即使在LLM时代，传统的OCR系统仍然是边缘部署的最佳选择，因为它们具有低计算需求、低延迟和极高的性价比。

Summary / 总结

This work addresses the challenge of OCR in multilingual and noisy images by evaluating five state-of-the-art Large Vision-Language Models (LVLMs) and two traditional OCR systems on a proprietary dataset of 54 languages. The evaluation covers accuracy, semantic consistency, language coverage, computational efficiency, and deployment cost. Qwen showed the highest precision, but Sprinklr-Edge-OCR achieved the best overall F1 score, processing images 35 times faster and at a much lower cost. The study suggests that traditional OCR systems are more suitable for edge deployment due to their low compute requirements, low latency, and high affordability.

该研究通过评估五种最先进的大型视觉-语言模型和两种传统OCR系统，在包含54种语言的专有数据集上，考察了准确率、语义一致性、语言覆盖率、计算效率和部署成本。Qwen在精确度上表现最佳，但Sprinklr-Edge-OCR在整体F1分数上表现最好，处理速度提高了35倍，并且成本低得多。研究结果表明，传统OCR系统更适合边缘部署，因为它们具有低计算要求、低延迟和高性价比的特点。
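
The efficiency figures quoted for Sprinklr-Edge-OCR can be unpacked with a little arithmetic. The sketch below is a back-of-envelope calculation, not something reported in the abstract: it assumes the "35 times faster" and "less than 0.01 times the cost" ratios apply to a single LVLM baseline, and the function names are hypothetical. The derived LVLM latency and cost floor are therefore implied values only.

```python
# Figures quoted in the abstract for Sprinklr-Edge-OCR.
SECONDS_PER_IMAGE = 0.17       # average latency per image
USD_PER_1000_IMAGES = 0.006    # deployment cost per 1,000 images
SPEEDUP_VS_LVLM = 35           # "35 times faster"
COST_RATIO_UPPER_BOUND = 0.01  # "less than 0.01 times the cost"


def images_per_hour(seconds_per_image: float) -> float:
    """Sequential throughput implied by a per-image latency."""
    return 3600.0 / seconds_per_image


def implied_lvlm_latency(seconds_per_image: float, speedup: float) -> float:
    """Latency of the slower system if the faster one is `speedup` times faster."""
    return seconds_per_image * speedup


def implied_lvlm_cost_floor(usd_per_1000: float, ratio_upper_bound: float) -> float:
    """Lower bound on the baseline cost per 1,000 images.

    If cost_ocr < ratio * cost_lvlm, then cost_lvlm > cost_ocr / ratio.
    """
    return usd_per_1000 / ratio_upper_bound


if __name__ == "__main__":
    print(f"throughput: {images_per_hour(SECONDS_PER_IMAGE):.0f} images/hour")
    print(f"implied LVLM latency: "
          f"{implied_lvlm_latency(SECONDS_PER_IMAGE, SPEEDUP_VS_LVLM):.2f} s/image")
    print(f"implied LVLM cost: more than "
          f"{implied_lvlm_cost_floor(USD_PER_1000_IMAGES, COST_RATIO_UPPER_BOUND):.2f} "
          f"USD per 1,000 images")
```

Under these assumptions, the quoted 0.17 seconds per image corresponds to roughly 21,000 images per hour on a single sequential worker, and the implied LVLM baseline takes about 6 seconds per image.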

LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence

Authors: Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, Ningbo Dai, Renzhe Xu, Shuyang Li, Tianyang Zhang, Yue He, Yuanrui Wang, Yunjia Zhang, Zijing Xu, Dongzhe Li, Fang Gao, Hao Zou, Jiandong Liu, Jiashuo Liu, Jiawei Xu, Kaijie Cheng, Kehan Li, Linjun Zhou, Qing Li, Shaohua Fan, Xiaoyu Lin, Xinyan Han, Xuanyue Li, Yan Lu, Yuan Xue, Yuanyuan Jiang, Zimu Wang, Zhenlei Wang, Peng Cui

First: 2025-09-03T17:39:08+00:00 · Latest: 2025-09-03T17:39:08+00:00

Comments: 56 pages

Abstract

We argue that progress toward general intelligence requires complementary foundation models grounded in language, the physical world, and structured data. This report presents LimiX, the first installment of our large structured-data models (LDMs). LimiX treats structured data as a joint distribution over variables and missingness, thus capable of addressing a wide range of tabular tasks through query-based conditional prediction via a single model. LimiX is pretrained using masked joint-distribution modeling with an episodic, context-conditional objective, where the model predicts for query subsets conditioned on dataset-specific contexts, supporting rapid, training-free adaptation at inference. We evaluate LimiX across 10 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios. With a single model and a unified interface, LimiX consistently surpasses strong baselines including gradient-boosting trees, deep tabular networks, recent tabular foundation models, and automated ensembles, as shown in Figure 1 and Figure 2. The superiority holds across a wide range of tasks, such as classification, regression, missing value imputation, and data generation, often by substantial margins, while avoiding task-specific architectures or bespoke training per task. All LimiX models are publicly accessible under Apache 2.0.

中文标题/摘要

标题：LimiX：释放通用智能的结构化数据建模能力

我们认为，迈向通用智能需要分别扎根于语言、物理世界和结构化数据的互补基础模型。本报告介绍了LimiX，这是我们大型结构化数据模型（LDMs）的第一部分。LimiX 将结构化数据视为变量和缺失值的联合分布，因此能够通过单个模型基于查询的条件预测来解决广泛的表格任务。LimiX 使用掩码联合分布建模进行预训练，采用情景式、以上下文为条件的目标，其中模型根据数据集特定的上下文对查询子集进行预测，支持在推理时快速、无需训练的适应。我们在10个大型结构化数据基准测试中评估了LimiX，这些基准测试涵盖了样本大小、特征维度、类别数量、分类到数值特征的比例、缺失值以及样本到特征比率的广泛范围。使用单一模型和统一接口，LimiX 一致地超越了包括梯度提升树、深度表格网络、近期的表格基础模型和自动化集成在内的强大基线，如图1和图2所示。这种优越性在诸如分类、回归、缺失值填充和数据生成等广泛任务中普遍存在，通常差距显著，同时避免了特定任务的架构或针对每个任务的定制训练。所有LimiX模型均在Apache 2.0许可下公开提供。

Summary / 总结

LimiX starts from the view that progress toward general intelligence requires complementary foundation models grounded in language, the physical world, and structured data. It treats structured data as a joint distribution over variables and missingness, enabling query-based conditional prediction through a single model. With a unified model and interface, LimiX outperforms strong baselines across 10 large structured-data benchmarks on tasks including classification, regression, missing value imputation, and data generation, without task-specific architectures or bespoke per-task training.

研究旨在通过构建扎根于结构化数据、语言和物理世界的基础模型来推进通用智能。LimiX 是首个大型结构化数据模型，将结构化数据视为变量和缺失值的联合分布，能够通过单个模型进行基于查询的条件预测。在10个基准测试中，LimiX 在分类、回归和缺失值填充等任务上超越了梯度提升树、深度表格网络等强基线模型，使用单一统一的模型和接口。该模型在样本大小和特征维度等各种数据特性上表现出一致的性能。

WildFireCan-MMD: A Multimodal Dataset for Classification of User-Generated Content During Wildfires in Canada

Authors: Braeden Sherritt, Isar Nejadgholi, Efstratios Aivaliotis, Khaled Mslmani, Marzieh Amini

First: 2025-04-17T14:43:56+00:00 · Latest: 2025-09-03T16:22:06+00:00

Abstract

Rapid information access is vital during wildfires, yet traditional data sources are slow and costly. Social media offers real-time updates, but extracting relevant insights remains a challenge. In this work, we focus on multimodal wildfire social media data, which, although present in existing datasets, is underrepresented in Canadian contexts. We present WildFireCan-MMD, a new multimodal dataset of X posts from recent Canadian wildfires, annotated across twelve key themes. We evaluate zero-shot vision-language models on this dataset and compare their results with those of custom-trained and baseline classifiers. We show that while baseline methods and zero-shot prompting offer quick deployment, custom-trained models outperform them when labelled data is available. Our best-performing custom model reaches an 84.48% F-score, outperforming VLMs and baseline classifiers. We also demonstrate how this model can be used to uncover trends during wildfires, through the collection and analysis of a large unlabeled dataset. Our dataset facilitates future research in wildfire response, and our findings highlight the importance of tailored datasets and task-specific training. Importantly, such datasets should be localized, as disaster response requirements vary across regions and contexts.

中文标题/摘要

标题：WildFireCan-MMD：加拿大野火期间用户生成内容多模态分类数据集

在野火期间快速获取信息至关重要，但传统数据源速度慢且成本高。社交媒体可以提供实时更新，但提取相关见解仍是一项挑战。在本研究中，我们关注多模态野火社交媒体数据，尽管这些数据存在于现有数据集中，但在加拿大语境下却相对不足。我们介绍了WildFireCan-MMD，这是一个由近期加拿大野火期间的X平台帖子构成的新多模态数据集，并在十二个关键主题上进行了标注。我们评估了零样本视觉-语言模型在该数据集上的表现，并将其结果与自定义训练和基线分类器进行了比较。我们表明，虽然基线方法和零样本提示可以快速部署，但在有标注数据时，自定义训练模型的表现更优。我们最好的自定义模型达到了84.48%的F分数，优于视觉语言模型和基线分类器。我们还展示了如何通过收集和分析大量未标注数据，使用该模型来揭示野火期间的趋势。我们的数据集促进了未来在野火响应方面的研究，我们的发现强调了定制数据集和任务特定训练的重要性。重要的是，这样的数据集应该本地化，因为灾害响应需求在不同地区和背景下有所不同。

Summary / 总结

This study addresses the need for rapid information access during wildfires by leveraging social media data. The researchers developed WildFireCan-MMD, a multimodal dataset of user-generated content from recent Canadian wildfires, annotated across twelve themes. They evaluated zero-shot vision-language models and custom-trained classifiers, finding that custom-trained models outperformed zero-shot methods and baseline classifiers, achieving an 84.48% f-score. The study also demonstrates the utility of this dataset in uncovering trends during wildfires through the analysis of large unlabeled datasets.

本研究旨在利用社交媒体数据快速获取野火信息，开发了WildFireCan-MMD，这是一个包含最近加拿大野火用户生成内容的多模态数据集，涵盖了十二个关键主题的注释。研究人员评估了零样本视觉-语言模型和自定义训练分类器，发现自定义训练模型优于零样本模型和基线模型，达到了84.48%的F分数。研究还强调了为特定灾害响应需求定制数据集的重要性。

TinyDrop: Tiny Model Guided Token Dropping for Vision Transformers

Authors: Guoxin Wang, Qingyuan Wang, Binhua Huang, Shaowu Chen, Deepu John

First: 2025-09-03T14:55:49+00:00 · Latest: 2025-09-03T14:55:49+00:00

Abstract

Vision Transformers (ViTs) achieve strong performance in image classification but incur high computational costs from processing all image tokens. To reduce inference costs in large ViTs without compromising accuracy, we propose TinyDrop, a training-free token dropping framework guided by a lightweight vision model. The guidance model estimates the importance of tokens while performing inference, so that low-importance tokens can be selectively discarded when the large ViT model needs to perform attention calculations. The framework operates plug-and-play, requires no architectural modifications, and is compatible with diverse ViT architectures. Evaluations on standard image classification benchmarks demonstrate that our framework reduces FLOPs by up to 80% for ViTs with minimal accuracy degradation, highlighting its generalization capability and practical utility for efficient ViT-based classification.

中文标题/摘要

标题：TinyDrop：面向Vision Transformer的微小模型引导标记丢弃

Vision Transformers (ViTs) 在图像分类中表现出强大的性能，但处理所有图像标记会带来高昂的计算成本。为了在不牺牲准确性的前提下降低大型ViTs的推理成本，我们提出TinyDrop，这是一种基于轻量级视觉模型的无需训练的标记丢弃框架。指导模型在进行推理时估计标记的重要性，从而在大型ViT模型需要进行注意力计算时选择性地丢弃低重要性的标记。该框架即插即用，无需进行架构修改，并且兼容多种ViT架构。在标准图像分类基准上的评估表明，我们的框架可以将ViTs的FLOPs最多减少80%，同时准确率下降极小，突显了其泛化能力以及在高效ViT分类中的实用价值。

Summary / 总结

TinyDrop is a training-free token dropping framework for Vision Transformers (ViTs) that uses a lightweight guidance model to estimate the importance of tokens during inference. This allows for selective discarding of low-importance tokens, reducing FLOPs by up to 80% in large ViTs with minimal accuracy degradation. The framework is plug-and-play and compatible with various ViT architectures, demonstrating its generalization capability and practical utility for efficient ViT-based classification.

TinyDrop 是一种无需训练的 Vision Transformer (ViT) 的 token 舍弃框架，通过轻量级的指导模型在推理时估计 token 的重要性，从而舍弃低重要性的 token，最多可减少 80% 的 FLOPs，同时保持最小的准确率下降。该框架兼容多种 ViT 架构，并且可以轻松集成而无需任何架构修改。
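
The token selection step described for TinyDrop can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the guidance model scores tokens or how many are kept, so the importance scores are taken as given here, and `keep_indices`, `drop_tokens`, and `n_keep` are hypothetical names introduced for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def keep_indices(importance: Sequence[float], n_keep: int) -> List[int]:
    """Indices of the n_keep highest-scoring tokens, in original token order."""
    if not 0 < n_keep <= len(importance):
        raise ValueError("n_keep must be between 1 and the number of tokens")
    # Rank token positions by importance, highest first.
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    # Restore positional order so the kept tokens stay in sequence.
    return sorted(ranked[:n_keep])


def drop_tokens(tokens: Sequence[T], importance: Sequence[float],
                n_keep: int) -> List[T]:
    """Discard low-importance tokens before the large model processes them."""
    if len(tokens) != len(importance):
        raise ValueError("one importance score is required per token")
    return [tokens[i] for i in keep_indices(importance, n_keep)]


if __name__ == "__main__":
    # Hypothetical patch labels and scores, standing in for the
    # guidance model's output.
    patches = ["sky", "dog", "grass", "ball", "fence", "cloud"]
    scores = [0.05, 0.40, 0.12, 0.30, 0.08, 0.05]
    print(drop_tokens(patches, scores, n_keep=3))
```

Because self-attention compares every token with every other token, halving the number of tokens cuts the number of attention pairs to a quarter, which is why dropping tokens before the attention layers is an effective way to reduce compute.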

ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP

Authors: Zhiyuan Wang, Bokui Chen

First: 2025-06-24T13:22:06+00:00 · Latest: 2025-09-03T12:23:15+00:00

Comments: Accepted by the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2025)

Abstract

Continual learning (CL) empowers pre-trained vision-language models to adapt effectively to novel or previously underrepresented data distributions without comprehensive retraining, enhancing their adaptability and efficiency. While vision-language models like CLIP show great promise, they struggle to maintain performance across domains in incremental learning scenarios. Existing prompt learning methods face two main limitations: 1) they primarily focus on class-incremental learning scenarios, lacking specific strategies for multi-domain task incremental learning; 2) most current approaches employ single-modal prompts, neglecting the potential benefits of cross-modal information exchange. To address these challenges, we propose the ChordPrompt framework, which facilitates a harmonious interplay between visual and textual prompts. ChordPrompt introduces cross-modal prompts to leverage interactions between visual and textual information. Our approach also employs domain-adaptive text prompts to select appropriate prompts for continual adaptation across multiple domains. Comprehensive experiments on multi-domain incremental learning benchmarks demonstrate that ChordPrompt outperforms state-of-the-art methods in zero-shot generalization and downstream task performance.

中文标题/摘要

标题：ChordPrompt：协调跨模态提示协同，用于 CLIP 的多域增量学习

持续学习（CL）使预训练的视觉-语言模型能够有效地适应新的或以前未充分代表的数据分布，而无需进行全面的重新训练，从而增强其适应性和效率。虽然像CLIP这样的视觉-语言模型显示出巨大的潜力，但在增量学习场景中，它们难以在不同领域中保持性能。现有的提示学习方法面临两个主要限制：1）它们主要集中在类别增量学习场景上，缺乏针对多域任务增量学习的具体策略；2）大多数当前方法使用单模态提示，忽视了跨模态信息交换的潜在益处。为了解决这些挑战，我们提出了ChordPrompt框架，该框架促进了视觉和文本提示之间的和谐互动。ChordPrompt引入了跨模态提示，以利用视觉和文本信息之间的交互。我们的方法还使用领域自适应文本提示来选择适合持续适应多个领域的提示。在多域增量学习基准上的全面实验表明，ChordPrompt在零样本泛化和下游任务性能方面优于最先进的方法。

Summary / 总结

The research aims to improve the adaptability of pre-trained vision-language models like CLIP in multi-domain incremental learning scenarios. The ChordPrompt framework introduces cross-modal prompts to enhance the interaction between visual and textual information and employs domain-adaptive text prompts for continual adaptation. Experiments show that ChordPrompt outperforms existing methods in zero-shot generalization and downstream task performance.

研究旨在提高预训练的视觉-语言模型如CLIP在多领域增量学习场景中的适应性。ChordPrompt框架引入了跨模态提示以增强视觉和文本信息之间的交互，并使用领域自适应文本提示进行持续适应。实验表明，ChordPrompt在多领域基准上的零样本泛化和下游任务性能上优于现有方法。

Mitigating Hallucination in Large Vision-Language Models through Aligning Attention Distribution to Information Flow

Authors: Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng

Venue: EMNLP 2025

First: 2025-05-20T12:10:13+00:00 · Latest: 2025-09-03T11:34:49+00:00

Comments: Accepted to Findings of EMNLP 2025

Abstract

Due to the unidirectional masking mechanism, Decoder-Only models propagate information from left to right. LVLMs (Large Vision-Language Models) follow the same architecture, with visual information gradually integrated into semantic representations during forward propagation. Through systematic analysis, we observe that the majority of the visual information is absorbed into the semantic representations. However, the model's attention distribution does not exhibit sufficient emphasis on semantic representations. This misalignment between the attention distribution and the actual information flow undermines the model's visual understanding ability and contributes to hallucinations. To address this issue, we enhance the model's visual understanding by leveraging the core information embedded in semantic representations. Specifically, we identify attention heads that focus on core semantic representations based on their attention distributions. Then, through a two-stage optimization paradigm, we propagate the advantages of these attention heads across the entire model, aligning the attention distribution with the actual information flow. We evaluate our method on three image captioning benchmarks using five different LVLMs, demonstrating its effectiveness in significantly reducing hallucinations. Further experiments reveal a trade-off between reduced hallucinations and richer details. Notably, our method allows for manual adjustment of the model's conservativeness, enabling flexible control to meet diverse real-world requirements.

中文标题/摘要

标题：通过使注意力分布与信息流对齐来缓解大型视觉-语言模型中的幻觉

由于单向掩码机制，仅解码器（Decoder-Only）模型从左到右传播信息。大型视觉-语言模型（LVLMs）遵循相同的架构，在前向传播过程中，视觉信息逐渐整合到语义表示中。通过系统分析，我们发现大部分视觉信息被整合到语义表示中。然而，模型的注意力分布并未充分强调语义表示。这种注意力分布与实际信息流之间的不一致削弱了模型的视觉理解能力，并导致幻觉。为了解决这一问题，我们通过利用嵌入在语义表示中的核心信息来增强模型的视觉理解能力。具体来说，我们根据注意力分布识别出专注于核心语义表示的注意力头。然后，通过两阶段优化范式，我们将这些注意力头的优势在整个模型中传播，使注意力分布与实际信息流对齐。我们在三个图像字幕基准上使用五种不同的LVLMs评估了我们的方法，证明了其在显著减少幻觉方面的有效性。进一步的实验揭示了减少幻觉与更丰富的细节之间的权衡。值得注意的是，我们的方法允许手动调整模型的保守性，从而实现灵活控制以满足多样化的现实需求。

Summary / 总结

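The research aims to mitigate hallucination in large vision-language models by aligning the attention distribution with the actual information flow. The method identifies attention heads that focus on core semantic representations and propagates their advantages across the entire model through a two-stage optimization paradigm. Experiments on three image captioning benchmarks with five different LVLMs show a significant reduction in hallucinations, with a trade-off between fewer hallucinations and richer details. The model's conservativeness can also be adjusted manually to meet diverse real-world requirements.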

研究旨在通过使注意力分布与信息流对齐来减轻大型视觉-语言模型中的幻觉。方法是识别专注于核心语义表示的注意力头，并通过优化这些头在整个模型中传播其优势。实验结果表明，这种方法在三个图像字幕基准上使用五种不同的LVLM显著减少了幻觉，尽管这会带来幻觉减少与细节丰富之间的权衡。还可以手动调整模型的保守性，以灵活适应不同的实际需求。
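
The head identification step can be pictured with a small sketch. It rests on assumptions the abstract does not confirm: a head is scored by the share of its attention mass that lands on a given set of "semantic" key positions, and heads above a fixed threshold are selected. The functions `semantic_attention_share` and `select_heads`, the threshold, and the toy attention values are all hypothetical, and the two-stage optimization that follows in the paper is not covered.

```python
from typing import List, Sequence

# One head's attention: one row per query position, each row a
# distribution over key positions.
HeadAttention = Sequence[Sequence[float]]


def semantic_attention_share(head_attention: HeadAttention,
                             semantic_positions: Sequence[int]) -> float:
    """Fraction of a head's total attention mass placed on semantic positions."""
    total = sum(sum(row) for row in head_attention)
    if total == 0:
        return 0.0
    on_semantic = sum(row[j] for row in head_attention
                      for j in semantic_positions)
    return on_semantic / total


def select_heads(attention_by_head: Sequence[HeadAttention],
                 semantic_positions: Sequence[int],
                 threshold: float) -> List[int]:
    """Indices of heads whose semantic attention share meets the threshold."""
    return [h for h, attention in enumerate(attention_by_head)
            if semantic_attention_share(attention, semantic_positions) >= threshold]


if __name__ == "__main__":
    # Toy example: key positions 0-1 stand for visual tokens,
    # positions 2-3 for semantic (text) tokens.
    head_0 = [[0.7, 0.2, 0.05, 0.05], [0.6, 0.2, 0.1, 0.1]]  # mostly visual
    head_1 = [[0.1, 0.1, 0.4, 0.4], [0.2, 0.1, 0.3, 0.4]]    # mostly semantic
    print(select_heads([head_0, head_1], semantic_positions=[2, 3], threshold=0.5))
```

In this toy setup only the second head places most of its attention on the semantic positions, so it is the only one selected.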