arXiv Paper Digest

2025-12-05 03:31
Snapshot: 20251205_0331
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
Authors: Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, Jonathan Tremblay
First: 2025-12-03T18:50:04+00:00 · Latest: 2025-12-03T18:50:04+00:00
Abstract
Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.
中文标题/摘要
标题:SpaceTools:通过双交互式强化学习实现工具增强的空间推理
视觉语言模型(VLMs)展示了强大的定性视觉理解能力,但在具身应用所需的度量级精确空间推理方面存在困难。智能体范式有望让VLMs利用各种工具来增强这些能力,例如深度估计器、分割模型和姿态估计器。然而,如何在不完全依赖手工设计的提示策略、也不强制采用限制VLMs发现最优工具使用模式的固定预定义工具流水线的情况下实现这一愿景,仍是一个开放性挑战。强化学习可以弥补这一差距,但由于多工具推理中的搜索空间庞大,它目前仅限于使用单个视觉工具的推理。我们引入了双交互式强化学习(DIRL),这是一种两阶段训练框架,其中VLMs通过交互式探索和反馈学习协调多种工具。在教学阶段,我们将通过交互式RL训练的单工具专家的演示与使用所有工具的前沿模型的轨迹结合起来。在探索阶段,模型通过持续的RL进一步细化多工具协调。我们的模型SpaceTools具有工具增强的空间推理能力,在空间理解基准测试(RoboSpatial-Home、BLINK、BOP-ASK)上实现了最先进的性能,并展示了将7-DOF机器人作为工具的可靠真实世界操作。DIRL相对于vanilla SFT基线(在RoboSpatial上+12%)和RL基线(在RoboSpatial上+16%)带来了显著改进。项目页面:https://spacetools.github.io/
Summary / 总结
SpaceTools uses Double Interactive Reinforcement Learning (DIRL) to teach Vision Language Models (VLMs) to coordinate multiple tools for metrically precise spatial reasoning. DIRL has two phases: a teaching phase that combines demonstrations from a single-tool specialist trained via interactive RL with traces from a frontier model using all tools, and an exploration phase that refines multi-tool coordination through continued RL. SpaceTools achieves state-of-the-art results on RoboSpatial-Home, BLINK, and BOP-ASK, demonstrates reliable real-world manipulation with a 7-DOF robot as a tool, and improves over the vanilla SFT and RL baselines on RoboSpatial by 12% and 16%, respectively.
SpaceTools 使用 Double Interactive Reinforcement Learning (DIRL) 让 Vision Language Models 协调多种工具进行精确的空间推理,提升了在空间理解基准测试和真实世界操作任务上的表现。DIRL 包括一个教学阶段(结合单工具专家的演示与前沿模型的轨迹)和一个通过持续 RL 进一步细化多工具协调的探索阶段,取得了最先进的结果,并在 RoboSpatial 上分别比 vanilla SFT 和 RL 基线提高了 12% 和 16%。
Jina-VLM: Small Multilingual Vision Language Model
Authors: Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao
First: 2025-12-03T18:13:41+00:00 · Latest: 2025-12-03T18:13:41+00:00
Comments: 18 pages, 1-7 main content
Abstract
We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.
中文标题/摘要
标题:Jina-VLM:小型多语言视觉语言模型
我们提出了Jina-VLM,这是一种参数量为24亿的视觉-语言模型,在开放的2B规模的视觉语言模型中,其在多语言视觉问答方面达到了最先进的水平。该模型通过一种注意力池化连接器将SigLIP2视觉编码器与Qwen3语言骨干网络耦合在一起,从而能够高效处理任意分辨率的图像。在标准的视觉问答基准测试和多语言评估中,Jina-VLM 在保持与同类模型相当的纯文本性能的同时,表现更优。
Summary / 总结
Jina-VLM is a 2.4-billion-parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. It couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector, enabling token-efficient processing of arbitrary-resolution images. The model outperforms comparable models on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance.
Jina-VLM 是一个 2.4B 参数的视觉语言模型,旨在实现多语言视觉问答,其性能在开放的 2B 级别模型中处于领先水平。该模型通过注意力池化连接器将 SigLIP2 视觉编码器与 Qwen3 语言骨干连接起来,实现对任意分辨率图像的高效处理。在标准 VQA 基准测试和多语言评估中,Jina-VLM 的表现优于同类模型,同时保持了竞争力的纯文本性能。
C3G: Learning Compact 3D Representations with 2K Gaussians
Authors: Honggyu An, Jaewoo Jung, Mungyeom Kim, Sunghwan Hong, Chaehyun Kim, Kazumi Fukuda, Minkyeong Jeon, Jisang Han, Takuya Narihira, Hyuna Ko, Junsu Kim, Yuki Mitsufuji, Seungryong Kim
First: 2025-12-03T17:59:05+00:00 · Latest: 2025-12-03T17:59:05+00:00
Comments: Project Page : https://cvlab-kaist.github.io/C3G/
Abstract
Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation, leading to degraded novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns for Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.
中文标题/摘要
标题:C3G:使用2K高斯分布学习紧凑的3D表示
从无相机位姿的稀疏视图以前馈方式重建和理解3D场景仍然是3D计算机视觉中的一个具有挑战性的任务。最近的方法使用逐像素3D高斯泼溅进行重建,随后通过2D到3D特征提升阶段进行场景理解。然而,它们生成了过多的冗余高斯分布,导致高内存开销和多视图特征聚合的次优性能,从而降低了新视图合成和场景理解的效果。我们提出了C3G,这是一种新颖的前馈框架,仅在关键空间位置估计紧凑的3D高斯分布,从而最小化冗余并使特征提升有效。我们引入了可学习的令牌,通过自注意力聚合多视图特征以指导高斯分布的生成,确保每个高斯分布整合来自不同视图的相关视觉特征。然后利用学习到的注意力模式进行高斯解码,以高效地提升特征。在无位姿新视图合成、3D开放词汇分割和视角不变特征聚合方面的广泛实验表明了我们方法的有效性。结果表明,一个紧凑且几何上有意义的表示足以实现高质量的场景重建和理解,与现有方法相比,具有更高的内存效率和特征保真度。
Summary / 总结
The paper addresses the challenge of reconstructing 3D scenes from sparse unposed views using a feed-forward approach. It introduces C3G, which estimates compact 3D Gaussians only at essential spatial locations to reduce redundancy and improve feature aggregation. The method uses learnable tokens and self-attention to guide Gaussian generation and feature lifting, leading to better novel view synthesis and scene understanding. Experiments show C3G outperforms existing methods in terms of memory efficiency and feature fidelity.
论文解决了从无位姿的稀疏视图以前馈方式重建和理解3D场景的挑战。提出C3G,仅在关键空间位置估计紧凑的3D高斯分布,以减少冗余并增强特征提升。实验表明,C3G在新视图合成、3D开放词汇分割和特征聚合方面表现出色,与现有方法相比具有更高的内存效率和特征保真度。
TARA: Test-by-Adaptive-Ranks for Quantum Anomaly Detection with Conformal Prediction Guarantees
Authors: Davut Emre Tasar, Ceren Ocal Tasar
First: 2025-12-03T17:53:38+00:00 · Latest: 2025-12-03T17:53:38+00:00
Abstract
Quantum key distribution (QKD) security fundamentally relies on the ability to distinguish genuine quantum correlations from classical eavesdropper simulations, yet existing certification methods lack rigorous statistical guarantees under finite-sample conditions and adversarial scenarios. We introduce TARA (Test by Adaptive Ranks), a novel framework combining conformal prediction with sequential martingale testing for quantum anomaly detection that provides distribution-free validity guarantees. TARA offers two complementary approaches: TARA-k, based on Kolmogorov-Smirnov calibration against local hidden variable (LHV) null distributions, which achieves ROC AUC = 0.96 for quantum-classical discrimination; and TARA-m, which employs betting martingales for streaming detection with anytime-valid type I error control, enabling real-time monitoring of quantum channels. We establish theoretical guarantees proving that under (context-conditional) exchangeability, conformal p-values remain uniformly distributed even for strongly contextual quantum data, confirming that quantum contextuality does not break conformal prediction validity, a result with implications beyond quantum certification to any application of distribution-free methods to nonclassical data. Extensive validation on both IBM Torino (superconducting, CHSH = 2.725) and IonQ Forte Enterprise (trapped ion, CHSH = 2.716) quantum processors demonstrates cross-platform robustness, achieving 36% security margins above the classical CHSH bound of 2. Critically, our framework reveals a methodological concern affecting quantum certification more broadly: same-distribution calibration can inflate detection performance by up to 44 percentage points compared to proper cross-distribution calibration, suggesting that prior quantum certification studies using standard train-test splits may have systematically overestimated adversarial robustness.
中文标题/摘要
标题:TARA:基于自适应秩检验的量子异常检测及其保形预测保证
量子密钥分发(QKD)的安全性从根本上依赖于区分真正的量子相关性和经典窃听者模拟的能力,但现有的认证方法在有限样本条件下和对抗场景下缺乏严格的统计保证。我们引入了TARA(自适应秩检验),这是一种将保形预测(conformal prediction)与序贯鞅检验相结合的新型框架,用于量子异常检测,并提供无分布假设的有效性保证。TARA提供了两种互补的方法。TARA-k基于针对局部隐变量(LHV)零分布的柯尔莫哥洛夫-斯米尔诺夫校准,实现了量子-经典区分的ROC AUC = 0.96。TARA-m则利用投注鞅进行流式检测,具有随时有效的第一类错误控制,能够实时监控量子信道。我们建立了理论保证,证明在(上下文条件)可交换性下,即使对于强上下文性的量子数据,保形p值仍然服从均匀分布,这证实了量子上下文性不会破坏保形预测的有效性;这一结果的意义不限于量子认证,还适用于将无分布方法应用于非经典数据的任何场景。在IBM Torino(超导,CHSH = 2.725)和IonQ Forte Enterprise(囚禁离子,CHSH = 2.716)量子处理器上的广泛验证展示了跨平台的稳健性,实现了高于经典CHSH界限2的36%安全裕度。关键的是,我们的框架揭示了一个更广泛地影响量子认证的方法论问题:与恰当的跨分布校准相比,同分布校准可能会将检测性能夸大高达44个百分点,这表明先前使用标准训练-测试划分的量子认证研究可能系统性地高估了对抗鲁棒性。
Summary / 总结
TARA (Test by Adaptive Ranks) is a novel framework for quantum anomaly detection that integrates conformal prediction with sequential martingale testing, providing distribution-free validity guarantees. TARA offers two methods: TARA-k, which uses Kolmogorov-Smirnov calibration against local hidden variable (LHV) null distributions to achieve an ROC AUC of 0.96 for quantum-classical discrimination, and TARA-m, which employs betting martingales for real-time monitoring with anytime-valid type I error control. Theoretical guarantees confirm that conformal p-values remain uniformly distributed under (context-conditional) exchangeability, even for strongly contextual quantum data. Experimental validation on IBM Torino and IonQ Forte Enterprise processors shows robust cross-platform performance, with 36% security margins above the classical CHSH bound of 2. Additionally, TARA highlights a methodological concern in quantum certification: same-distribution calibration can inflate detection performance by up to 44 percentage points compared to cross-distribution calibration.
TARA (Test by Adaptive Ranks) 是一个结合了保形预测(conformal prediction)和序贯鞅检验的框架,用于量子异常检测,提供了无分布假设的有效性保证。TARA 提供了两种方法:TARA-k 实现了量子-经典区分的 ROC AUC 为 0.96,而 TARA-m 允许实时监控并具有随时有效的第一类错误控制。理论保证确认,在(上下文条件)可交换性下,即使对于强上下文性的量子数据,保形 p 值仍然服从均匀分布,这意味着量子上下文性不会破坏保形预测的有效性。广泛的验证表明,TARA 在 IBM 和 IonQ 量子处理器上表现出跨平台的稳健性,安全裕度达到 36%。研究还指出,与跨分布校准相比,同分布校准可能会将检测性能夸大高达 44 个百分点。
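The two ingredients named in the TARA abstract, conformal p-values and a betting martingale, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the nonconformity scores, the power betting function with exponent `eps`, and all function names are assumptions chosen for illustration only.

```python
def conformal_p_value(calibration_scores, test_score):
    """Smoothed rank of the test score among calibration scores.

    Larger scores are treated as more anomalous (an assumption of this sketch).
    """
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least + 1) / (len(calibration_scores) + 1)


def power_martingale(p_values, eps=0.5):
    """Wealth process that bets against uniformly distributed p-values.

    Each step multiplies the wealth by eps * p**(eps - 1). That factor
    integrates to 1 over p in (0, 1], so the expected wealth stays at 1
    when the p-values are independent and uniform.
    """
    wealth, path = 1.0, []
    for p in p_values:
        wealth *= eps * p ** (eps - 1.0)
        path.append(wealth)
    return path


def first_alarm(path, alpha=0.05):
    """Index of the first step where the wealth reaches 1/alpha, else None."""
    threshold = 1.0 / alpha
    for i, wealth in enumerate(path):
        if wealth >= threshold:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical calibration scores under the null and a hypothetical test stream.
    calibration = [i / 100 for i in range(1, 100)]
    stream = [1.5, 1.5, 1.5]
    p_vals = [conformal_p_value(calibration, s) for s in stream]
    print(p_vals, first_alarm(power_martingale(p_vals)))
```

Under the null hypothesis of independent uniform p-values, Ville's inequality bounds the probability that the wealth ever reaches 1/alpha by alpha, which is the sense in which such a test is valid at any stopping time.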
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
Authors: Jialuo Li, Bin Li, Jiahao Li, Yan Lu
First: 2025-12-03T17:36:06+00:00 · Latest: 2025-12-03T17:36:06+00:00
Abstract
The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global queries and localized queries. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.
中文标题/摘要
标题:先划分,再定位:面向长视频理解的按查询类型自适应帧选择
大型多模态模型(LMMs)在长视频理解中的应用受到上下文长度有限和处理密集视频标记的计算成本高昂的限制。因此,最近的研究集中在查询感知的帧选择方法上,这些方法通常会带来显著的计算开销。本文挑战了复杂搜索机制在所有情况下都是必要的这一假设。我们首先识别并验证了一种查询类型学,区分全局查询和局部查询。我们证明,均匀采样对于全局查询既有效又高效,而对于局部查询,确实需要查询感知的选择以获得最佳性能。基于这一洞察,我们提出了DIG,一种无需训练的帧选择框架,其策略根据查询类型进行调整。具体而言,DIG 对全局查询使用高效的均匀采样,而对局部查询则激活专门的流水线以提取与查询相关的帧。在三个长视频理解基准上的实验表明,DIG 一致地优于现有基线,并且即使将输入帧数扩展到 256,也能稳健地提高 LMM 的性能。
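The dispatch idea described in the DIG abstract (uniform sampling for global queries, query-aware selection for localized ones) can be sketched as follows. The query-type label and the per-frame relevance scores are hypothetical inputs here; the abstract does not describe how DIG classifies queries or scores frames, so this is only an illustration of the control flow.

```python
def uniform_indices(num_frames, budget):
    """Pick `budget` frame indices spread evenly over the video."""
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    # Take the midpoint of each of the `budget` equal-length segments.
    return [int(step * i + step / 2) for i in range(budget)]


def top_relevant_indices(relevance, budget):
    """Pick the `budget` highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:budget])


def select_frames(query_type, num_frames, budget, relevance=None):
    """Route to a sampling strategy based on a (hypothetical) query-type label."""
    if query_type == "global":
        return uniform_indices(num_frames, budget)
    if query_type == "localized":
        if relevance is None:
            raise ValueError("localized queries need per-frame relevance scores")
        return top_relevant_indices(relevance, budget)
    raise ValueError(f"unknown query type: {query_type!r}")


if __name__ == "__main__":
    print(select_frames("global", num_frames=100, budget=4))
    print(select_frames("localized", 5, 2, relevance=[0.1, 0.9, 0.2, 0.8, 0.3]))
```

The design point is that the expensive relevance computation is only needed on the localized branch; global queries skip it entirely.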
Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs
Authors: Oren Rachmil, Roy Betser, Itay Gershon, Omer Hofman, Nitay Yakoby, Yuval Meron, Idan Yankelev, Asaf Shabtai, Yuval Elovici, Roman Vainshtein
Venue: AAAI 2026 Deployable AI (DAI) Workshop
First: 2025-12-03T17:23:39+00:00 · Latest: 2025-12-03T17:23:39+00:00
Comments: Accepted to the AAAI 2026 Deployable AI (DAI) Workshop
Abstract
Aligning proprietary large language models (LLMs) with internal organizational policies has become an urgent priority as organizations increasingly deploy LLMs in sensitive domains such as legal support, finance, and medical services. Beyond generic safety filters, enterprises require reliable mechanisms to detect policy violations within their regulatory and operational frameworks, where breaches can trigger legal and reputational risks. Existing content moderation frameworks, such as guardrails, remain largely confined to the safety domain and lack the robustness to capture nuanced organizational policies. LLM-as-a-judge and fine-tuning approaches, though flexible, introduce significant latency and lack interpretability. To address these limitations, we propose a training-free and efficient method that treats policy violation detection as an out-of-distribution (OOD) detection problem. Inspired by whitening techniques, we apply a linear transformation to decorrelate the model's hidden activations and standardize them to zero mean and unit variance, yielding a near-identity covariance matrix. In this transformed space, we use the Euclidean norm as a compliance score to detect policy violations. The method requires only the policy text and a small number of illustrative samples, which makes it lightweight and easily deployable. On a challenging policy benchmark, our approach achieves state-of-the-art results, surpassing both existing guardrails and fine-tuned reasoning models. This work provides organizations with a practical and statistically grounded framework for policy-aware oversight of LLMs, advancing the broader goal of deployable AI governance. Code is available at: https://tinyurl.com/policy-violation-detection
中文标题/摘要
标题:通过LLM激活空间白化进行无训练政策违规检测
随着组织越来越多地在法律支持、金融和医疗服务等敏感领域部署专有大型语言模型(LLMs),将LLMs与内部组织政策对齐已成为当务之急。除了通用的安全过滤器外,企业还需要可靠的机制来在其监管和运营框架内检测政策违规,因为违规可能会引发法律和声誉风险。现有的内容审核框架,如护栏,主要局限于安全领域,缺乏捕捉复杂组织政策的稳健性。LLM作为法官和微调方法虽然灵活,但会引入显著的延迟并缺乏可解释性。为了解决这些限制,我们提出了一种无训练且高效的检测方法,将政策违规检测视为离群值检测问题。受去相关技术的启发,我们应用线性变换来解相关模型的隐藏激活,并标准化为零均值和单位方差,得到接近单位协方差矩阵。在变换后的空间中,我们使用欧几里得范数作为合规评分来检测政策违规。该方法仅需政策文本和少量示例样本,使其轻量级且易于部署。在一项具有挑战性的政策基准测试中,我们的方法达到了最先进的效果,超越了现有的护栏和微调推理模型。这项工作为企业提供了一种实用且统计上合理的框架,用于LLM的政策意识监督,推动了可部署AI治理的更广泛目标。代码可在:https://tinyurl.com/policy-violation-detection 获取
Summary / 总结
The paper addresses the need for reliable policy violation detection in large language models (LLMs) used in sensitive domains. It proposes a training-free method that treats policy violation detection as an out-of-distribution detection problem, using a linear transformation to decorrelate and standardize the model's hidden activations. This approach uses the Euclidean norm as a compliance score to detect violations, requiring only policy text and a few illustrative samples. The method outperforms existing guardrails and fine-tuned reasoning models on a challenging benchmark, providing a practical and statistically grounded framework for policy-aware LLM oversight.
论文针对大型语言模型(LLMs)在敏感领域中需要可靠的政策违规检测的需求,提出了一种无需训练的方法,将政策违规检测视为分布外(OOD)检测问题,通过激活空间白化对模型的隐藏激活进行去相关和标准化,并以欧几里得范数作为合规评分。该方法只需要政策文本和少量示例样本,在一项具有挑战性的政策基准测试中取得了最先进的结果,超越了现有的护栏和微调推理模型。
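The whitening step described in the abstract (decorrelate activations, standardize to unit variance, then score by Euclidean norm) can be sketched with NumPy as below. This is a generic PCA-whitening illustration, not the paper's code: the function names, the `eps` regularizer, and the assumption that the reference activations come from compliant examples (so that a large norm flags a possible violation) are all choices made for this sketch.

```python
import numpy as np


def fit_whitener(reference, eps=1e-6):
    """Estimate the mean and a whitening matrix from reference activations.

    `reference` has shape (num_samples, hidden_dim).
    """
    mean = reference.mean(axis=0)
    cov = np.cov(reference - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # PCA whitening: W = diag(1 / sqrt(lambda + eps)) @ V^T
    whitening = (eigvecs / np.sqrt(eigvals + eps)).T
    return mean, whitening


def compliance_score(x, mean, whitening):
    """Euclidean norm of the whitened activation(s).

    In this sketch a larger value means further from the reference distribution.
    """
    z = (x - mean) @ whitening.T
    return np.linalg.norm(z, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical correlated "activations" standing in for real hidden states.
    mixing = np.array([[2.0, 0.0], [1.0, 0.5]])
    reference = rng.normal(size=(1000, 2)) @ mixing
    mean, whitening = fit_whitener(reference)
    print(compliance_score(reference[:3], mean, whitening))
    print(compliance_score(mean + np.array([10.0, -10.0]), mean, whitening))
```

After whitening, the Euclidean norm coincides with the Mahalanobis distance to the reference mean, which is why a plain norm can serve as a distribution-aware score.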
DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation
Authors: Zexin Lin, Hawen Wan, Yebin Zhong, Xiaoqiang
First: 2025-12-03T17:22:29+00:00 · Latest: 2025-12-03T17:22:29+00:00
Abstract
Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.
中文标题/摘要
标题:DIQ-H:评估在时间视觉退化下VLM的幻觉持久性
部署在自动驾驶等安全关键应用中的视觉-语言模型(VLMs)必须在不完美的视觉流下处理连续的视觉信息。然而,现有的基准测试主要集中在静态、高质量的图像上,忽略了时间退化和错误传播,这是关键故障模式,其中短暂的视觉损坏会引发持续到后续帧的幻觉。我们引入了DIQ-H,这是第一个用于评估VLM在时间序列中动态视觉退化下的鲁棒性的基准测试。DIQ-H 应用了基于物理的损坏,包括运动模糊、传感器噪声和压缩伪影,并通过多轮问答任务来衡量幻觉持久性、错误恢复和时间一致性。为了实现可扩展的注释,我们提出了基于不确定性迭代细化(UIR)的方法,该方法使用具有不确定性过滤的轻量级VLM生成可靠的伪地面真值,实现了15.3%的准确率提升。在16个最先进的VLM上的实验揭示了显著的鲁棒性差距:即使是先进的模型如GPT-4o也只能实现78.5%的恢复率,而开源模型在时间一致性方面低于60%。DIQ-H 提供了一个全面的平台,用于评估VLM在实际部署中的可靠性。
Summary / 总结
The research aims to evaluate the robustness of Vision-Language Models (VLMs) under dynamic visual degradation, which is crucial for safety-critical applications like autonomous driving. The study introduces DIQ-H, a benchmark that applies physics-based corruptions to temporal sequences and measures hallucination persistence, error recovery, and temporal consistency. Experiments show that even advanced models like GPT-4o have a recovery rate of only 78.5%, while open-source models struggle with temporal consistency. To enable scalable annotation, the study proposes Uncertainty-Guided Iterative Refinement (UIR), which improves accuracy by 15.3%.
研究旨在评估视觉-语言模型(VLMs)在动态视觉降级条件下的鲁棒性,这对于自动驾驶等安全关键应用至关重要。研究引入了DIQ-H基准,该基准对动态视觉序列应用物理基础的破坏,并测量幻觉持续性、错误恢复和时间一致性。实验表明,即使是先进的模型如GPT-4o,恢复率也只有78.5%,开源模型在时间一致性方面表现不佳。为了实现注释的可扩展性,研究提出了不确定性引导的迭代改进(UIR)方法,该方法通过使用具有不确定性过滤的轻量级VLM生成可靠的伪地面真值,准确率提高了15.3%。
Refining Machine Learning Potentials through Thermodynamic Theory of Phase Transitions
Authors: Paul Fuchs, Julija Zavadlav
First: 2025-12-03T17:06:26+00:00 · Latest: 2025-12-03T17:06:26+00:00
Abstract
Foundational Machine Learning Potentials can resolve the accuracy and transferability limitations of classical force fields. They enable microscopic insights into material behavior through Molecular Dynamics simulations, which can crucially expedite material design and discovery. However, insufficiently broad and systematically biased reference data affect the predictive quality of the learned models. Often, these models exhibit significant deviations from experimentally observed phase transition temperatures, on the order of several hundred kelvins. Thus, fine-tuning is necessary to achieve adequate accuracy in many practical problems. This work proposes a fine-tuning strategy via top-down learning, directly correcting the wrongly predicted transition temperatures to match the experimental reference data. Our approach leverages the Differentiable Trajectory Reweighting algorithm to minimize the free energy differences between phases at the experimental target pressures and temperatures. We demonstrate that our approach can accurately correct the phase diagram of pure Titanium in a pressure range of up to 5 GPa, matching the experimental reference within tenths of kelvins and improving the liquid-state diffusion constant. Our approach is model-agnostic, applicable to multi-component systems with solid-solid and solid-liquid transitions, and compliant with top-down training on other experimental properties. Therefore, our approach can serve as an essential step towards highly accurate application-specific and foundational machine learning potentials.
中文标题/摘要
标题:通过相变热力学理论精炼机器学习势能
基础机器学习势能可以解决经典力场的准确性和可转移性限制。它们通过分子动力学模拟提供材料行为的微观洞察,从而加速材料设计和发现。然而,不够广泛且存在系统性偏差的参考数据影响了学习模型的预测质量。通常,这些模型预测的相变温度与实验观察结果存在显著偏差,偏差可达数百开尔文。因此,需要进行微调以在许多实际问题中实现足够的准确性。本研究提出了一种通过自上而下学习进行微调的策略,直接纠正错误预测的相变温度以匹配实验参考数据。我们的方法利用可微轨迹重加权算法,在实验目标压力和温度下最小化各相之间的自由能差异。我们证明,我们的方法可以在高达5 GPa的压力范围内准确修正纯钛的相图,与实验参考数据的偏差在零点几开尔文以内,并改善液态扩散常数。我们的方法与模型无关,适用于具有固态-固态和固态-液态转变的多组分系统,并且可以与针对其他实验性质的自上而下训练相结合。因此,我们的方法可以作为实现高度准确的应用特定和基础机器学习势能的重要步骤。
Summary / 总结
This study addresses the limitations of machine learning potentials in accurately predicting phase transition temperatures, which are crucial for material design. It introduces a fine-tuning strategy using top-down learning and the Differentiable Trajectory Reweighting algorithm to correct these inaccuracies. The approach corrects the phase diagram of pure titanium at pressures up to 5 GPa, matching the experimental reference within tenths of kelvins, and also improves the liquid-state diffusion constant. The method is model-agnostic and applicable to multi-component systems with solid-solid and solid-liquid transitions, making it a step towards highly accurate machine learning potentials.
该研究解决了机器学习势在预测相变温度方面的局限性,这些温度对于材料设计至关重要。它提出了一种基于自上而下学习和可微轨迹重加权算法的微调策略来修正这些偏差。该方法在高达5 GPa的压力范围内修正了纯钛的相图,与实验数据的偏差在零点几开尔文以内,并改善了液态扩散常数;该方法与模型无关,可适用于多组分系统。
Training for Identity, Inference for Controllability: A Unified Approach to Tuning-Free Face Personalization
Authors: Lianyu Pang, Ji Zhou, Qiping Wang, Baoquan Zhao, Zhenguo Yang, Qing Li, Xudong Mao
First: 2025-12-03T16:57:50+00:00 · Latest: 2025-12-03T16:57:50+00:00
Comments: 17 pages, 13 figures
Abstract
Tuning-free face personalization methods have developed along two distinct paradigms: text embedding approaches that map facial features into the text embedding space, and adapter-based methods that inject features through auxiliary cross-attention layers. While both paradigms have shown promise, existing methods struggle to simultaneously achieve high identity fidelity and flexible text controllability. We introduce UniID, a unified tuning-free framework that synergistically integrates both paradigms. Our key insight is that when merging these approaches, they should mutually reinforce only identity-relevant information while preserving the original diffusion prior for non-identity attributes. We realize this through a principled training-inference strategy: during training, we employ an identity-focused learning scheme that guides both branches to capture identity features exclusively; at inference, we introduce a normalized rescaling mechanism that recovers the text controllability of the base diffusion model while enabling complementary identity signals to enhance each other. This principled design enables UniID to achieve high-fidelity face personalization with flexible text controllability. Extensive experiments against six state-of-the-art methods demonstrate that UniID achieves superior performance in both identity preservation and text controllability. Code will be available at https://github.com/lyuPang/UniID
中文标题/摘要
标题:身份训练,推理控制:一种无需调优的统一面部个性化方法
无需调优的面部个性化方法沿着两个不同的范式发展:文本嵌入方法将面部特征映射到文本嵌入空间,以及基于适配器的方法通过辅助交叉注意力层注入特征。虽然这两种范式都显示出潜力,但现有方法难以同时实现高身份保真度和灵活的文本控制。我们引入了UniID,这是一种统一的无需调优框架,将这两种范式协同整合。我们的关键见解是,在合并这些方法时,它们应该仅相互强化与身份相关的信息,同时保留非身份属性的原始扩散先验。我们通过一个原则性的训练-推理策略实现这一点:在训练期间,我们采用一种以身份为中心的学习方案,引导两个分支仅捕获身份特征;在推理期间,我们引入一种归一化重缩放机制,恢复基础扩散模型的文本控制能力,同时使互补的身份信号相互增强。这种原则性设计使UniID能够实现高保真度的面部个性化,同时具有灵活的文本控制能力。与六种最先进的方法的广泛实验表明,UniID在身份保留和文本控制方面均表现出更优的性能。代码将在https://github.com/lyuPang/UniID上提供
Summary / 总结
The paper introduces UniID, a unified tuning-free framework for face personalization that combines text embedding and adapter-based methods. By mutually reinforcing identity-relevant information and preserving the original diffusion prior for non-identity attributes, UniID achieves high identity fidelity and flexible text controllability. Experiments show that UniID outperforms six state-of-the-art methods in both identity preservation and text controllability.
论文提出了UniID,这是一种结合了文本嵌入和适配器方法的统一无调优框架,用于面部个性化。通过在训练期间强化身份相关信息并在推理时使用归一化缩放机制,该框架解决了同时保持高身份保真度和灵活文本控制的挑战。实验表明,UniID 在身份保真度和文本控制方面均优于六种最先进的方法。
Tada-DIP: Input-adaptive Deep Image Prior for One-shot 3D Image Reconstruction
Authors: Evan Bell, Shijun Liang, Ismail Alkhouri, Saiprasad Ravishankar
First: 2025-12-03T16:56:38+00:00 · Latest: 2025-12-03T16:56:38+00:00
Comments: 6 pages, 8 figures, 2025 Asilomar Conference on Signals, Systems, and Computers. Code is available at github.com/evanbell02/Tada-DIP/
Abstract
Deep Image Prior (DIP) has recently emerged as a promising one-shot neural-network based image reconstruction method. However, DIP has seen limited application to 3D image reconstruction problems. In this work, we introduce Tada-DIP, a highly effective and fully 3D DIP method for solving 3D inverse problems. By combining input-adaptation and denoising regularization, Tada-DIP produces high-quality 3D reconstructions while avoiding the overfitting phenomenon that is common in DIP. Experiments on sparse-view X-ray computed tomography reconstruction validate the effectiveness of the proposed method, demonstrating that Tada-DIP produces much better reconstructions than training-data-free baselines and achieves reconstruction performance on par with a supervised network trained using a large dataset with fully-sampled volumes.
中文标题/摘要
标题:Tada-DIP:输入自适应深度图像先验用于单次3D图像重建
深度图像先验(DIP)最近作为一种有前途的一次性神经网络图像重建方法而崭露头角。然而,DIP在3D图像重建问题中的应用有限。在本文中,我们引入了Tada-DIP,这是一种高度有效且完全3D的DIP方法,用于解决3D逆问题。通过结合输入自适应和去噪正则化,Tada-DIP能够产生高质量的3D重建结果,同时避免了DIP中常见的过拟合现象。在稀疏视图X射线计算机断层扫描重建实验中,验证了所提出方法的有效性,表明Tada-DIP产生的重建结果明显优于无训练数据基线,并且达到了与使用包含完整采样体数据的大型数据集训练的监督网络相当的重建性能。
Summary / 总结
Tada-DIP is a novel 3D image reconstruction method that builds upon Deep Image Prior (DIP) by incorporating input-adaptive techniques and denoising regularization. This approach effectively addresses overfitting issues and produces high-quality 3D reconstructions. Experiments on sparse-view X-ray computed tomography show that Tada-DIP outperforms training-data-free baselines and achieves reconstruction performance comparable to a supervised network trained on large datasets.
Tada-DIP 是一种结合输入自适应和去噪正则化的 3D 图像重建方法,能够从稀疏视图 X 射线计算机断层扫描数据中生成高质量的 3D 重建图像。该方法解决了 Deep Image Prior (DIP) 中常见的过拟合问题,并优于无训练数据基线,重建性能与大型数据集训练的监督网络相当。
MUT3R: Motion-aware Updating Transformer for Dynamic 3D Reconstruction
Authors: Guole Shen, Tianchen Deng, Xingrui Qin, Nailin Wang, Jianyu Wang, Yanbo Wang, Yongtao Chen, Hesheng Wang, Jingchuan Wang
First: 2025-12-03T16:36:53+00:00 · Latest: 2025-12-03T16:36:53+00:00
Abstract
Recent stateful recurrent neural networks have achieved remarkable progress on static 3D reconstruction but remain vulnerable to motion-induced artifacts, where non-rigid regions corrupt attention propagation between the spatial memory and image feature. By analyzing the internal behaviors of the state and image token updating mechanism, we find that aggregating self-attention maps across layers reveals a consistent pattern: dynamic regions are naturally down-weighted, exposing an implicit motion cue that the pretrained transformer already encodes but never explicitly uses. Motivated by this observation, we introduce MUT3R, a training-free framework that applies the attention-derived motion cue to suppress dynamic content in the early layers of the transformer during inference. Our attention-level gating module suppresses the influence of dynamic regions before their artifacts propagate through the feature hierarchy. Notably, we do not retrain or fine-tune the model; we let the pretrained transformer diagnose its own motion cues and correct itself. This early regulation stabilizes geometric reasoning in streaming scenarios and leads to improvements in temporal consistency and camera pose robustness across multiple dynamic benchmarks, offering a simple and training-free pathway toward motion-aware streaming reconstruction.
中文标题/摘要
标题:MUT3R:运动感知更新变换器用于动态3D重建
近期的状态依赖循环神经网络在静态3D重建方面取得了显著进展,但在运动引起的伪影方面仍然脆弱,其中非刚性区域破坏了空间记忆与图像特征之间的注意力传播。通过对状态和图像标记更新机制的内部行为进行分析,我们发现,跨层聚合自注意力图揭示了一致的模式:动态区域自然被下调,暴露了一个预训练变换器已编码但从未明确使用的隐式运动线索。受此观察的启发,我们引入了MUT3R,这是一种无需训练的框架,该框架在推理过程中将注意力衍生的运动线索应用于抑制变换器早期层中的动态内容。我们的注意力级门控模块在动态区域的伪影传播到特征层次之前抑制其影响。值得注意的是,我们没有重新训练或微调模型;我们让预训练的变换器诊断其自身的运动线索并自我修正。这种早期调节在流式场景中稳定了几何推理,并在多个动态基准测试中提高了时间一致性和相机姿态的鲁棒性,提供了一条简单且无需训练的运动感知流式重建途径。
Summary / 总结
The research addresses the issue of motion-induced artifacts in 3D reconstruction by introducing MUT3R, a training-free framework that utilizes an attention-derived motion cue to suppress dynamic content in the early layers of the transformer during inference. This approach stabilizes geometric reasoning in streaming scenarios and improves temporal consistency and camera pose robustness across multiple benchmarks.
MUT3R 是一个无需训练的框架,通过在推理过程中抑制早期层中的动态内容,利用注意力机制提取的运动线索来稳定流式场景中的几何推理,并提高时间一致性和相机姿态鲁棒性。这种方法无需重新训练模型,提供了一种简单且有效的运动感知动态 3D 重建方案。
A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models
Authors: X. Y. Han, Yuan Zhong
First: 2025-12-03T16:00:02+00:00 · Latest: 2025-12-03T16:00:02+00:00
Abstract
In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of (costly) GPUs. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure -- proposed by DeepSeek's Wang et al. (2024) -- by casting it as a one-step-per-iteration primal-dual method for an assignment problem. First, in a stylized deterministic setting, our framework yields several insightful structural properties: (i) a monotonic improvement of a Lagrangian objective, (ii) a preference rule that moves tokens from overloaded to underloaded experts, and (iii) an approximate-balancing guarantee. Then, we incorporate the stochastic and dynamic nature of AI training using a generalized online optimization formulation. In the online setting, we derive a strong convexity property of the objective that leads to a logarithmic expected regret bound under certain step-size choices. Additionally, we present real experiments on 1B-parameter DeepSeekMoE models to complement our theoretical findings. Together, these results build a principled framework for analyzing the Auxiliary-Loss-Free Load Balancing of s-MoE in AI models.
Summary / 总结
The research aims to address the load balancing challenge in Sparse Mixture-of-Experts (s-MoE) layers by providing a theoretical framework for the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure. The method is analyzed as a primal-dual optimization approach, yielding insights into its structural properties and performance guarantees. Experiments on 1B-parameter DeepSeekMoE models confirm the theoretical findings, offering a principled framework for efficient load balancing in large-scale AI models.
论文提供了一种分析Sparse Mixture-of-Experts (s-MoE)中Auxiliary-Loss-Free Load Balancing (ALF-LB)程序的理论框架,这对于大型AI训练中高效利用GPU至关重要。该框架被表述为一个分配问题的原始-对偶方法,并包括了拉格朗日目标单调改进、令牌路由偏好规则以及近似平衡保证的见解。研究还考虑了AI训练的随机性和动态性,推导出在某些步长选择下的对数期望后悔界。通过在DeepSeekMoE模型上的实际实验验证了理论发现。
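The abstract above describes ALF-LB only at a high level. As a rough illustration, the sketch below simulates one common reading of the procedure: a per-expert bias is added to the routing scores when experts are selected, and after each batch the bias of every overloaded expert is lowered and that of every underloaded expert is raised by a fixed step. In the primal-dual view, the biases play the role of dual variables and the top-k routing is the primal step. The sign-based update, the synthetic scores, and all function names and parameters are assumptions made for illustration; they are not the authors' code or experimental setup.

```python
import random


def route(scores, bias, k):
    """Primal step: each token picks its top-k experts by biased score."""
    num_experts = len(bias)
    assignments = []
    for token_scores in scores:
        ranked = sorted(
            range(num_experts),
            key=lambda e: token_scores[e] + bias[e],
            reverse=True,
        )
        assignments.append(ranked[:k])
    return assignments


def expert_loads(assignments, num_experts):
    """Count how many tokens were routed to each expert."""
    loads = [0] * num_experts
    for chosen in assignments:
        for e in chosen:
            loads[e] += 1
    return loads


def update_bias(bias, loads, step):
    """Dual step (assumed sign-based rule): raise the bias of underloaded
    experts and lower the bias of overloaded ones by a fixed step."""
    target = sum(loads) / len(loads)

    def sign(x):
        return (x > 0) - (x < 0)

    return [b + step * sign(target - load) for b, load in zip(bias, loads)]


def simulate(num_tokens=512, num_experts=8, k=2, step=0.01, iters=150, seed=0):
    """Route synthetic batches and record the load imbalance per iteration."""
    rng = random.Random(seed)
    # Hypothetical skew: higher-indexed experts get a constant score bonus,
    # so unbiased routing would overload them.
    popularity = [0.3 * e / (num_experts - 1) for e in range(num_experts)]
    bias = [0.0] * num_experts
    history = []
    for _ in range(iters):
        scores = [
            [rng.random() + popularity[e] for e in range(num_experts)]
            for _ in range(num_tokens)
        ]
        loads = expert_loads(route(scores, bias, k), num_experts)
        history.append(max(loads) - min(loads))
        bias = update_bias(bias, loads, step)
    return history, bias


if __name__ == "__main__":
    history, bias = simulate()
    print("imbalance at first iteration:", history[0])
    print("mean imbalance over last 20 iterations:", sum(history[-20:]) / 20)
```

In this toy setting the gap between the busiest and the idlest expert shrinks as the biases adapt, which is the qualitative behavior the paper's preference rule and approximate-balancing guarantee describe. The formal statements and step-size conditions are in the paper itself.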
Hierarchical Vision Language Action Model Using Success and Failure Demonstrations
Authors: Jeongeun Park, Jihwan Yoon, Byungwoo Jeon, Juhan Park, Jinwoo Shin, Namhoon Cho, Kyungjae Lee, Sangdoo Yun, Sungjoon Choi
First: 2025-12-03T15:58:38+00:00 · Latest: 2025-12-03T15:58:38+00:00
Comments: https://vine-vla.github.io/
Abstract
Prior Vision-Language-Action (VLA) models are typically trained on teleoperated successful demonstrations, while discarding numerous failed attempts that occur naturally during data collection. However, these failures encode where and how policies can be fragile, information that can be exploited to improve robustness. We address this problem by leveraging mixed-quality datasets to learn failure-aware reasoning at planning time. We introduce VINE, a hierarchical vision-language-action model that separates high-level reasoning (System 2) from low-level control (System 1) under a hierarchical reinforcement learning formalism, making failures usable as a structured learning signal rather than noisy supervision. System 2 performs feasibility-guided tree search over a 2D scene-graph abstraction: it proposes subgoal transitions, predicts success probabilities from both successes and failures, and prunes brittle branches before execution, effectively casting plan evaluation as feasibility scoring. The selected subgoal sequence is then passed to System 1, which executes low-level actions without modifying the agent's core skills. Trained entirely from offline teleoperation data, VINE integrates negative experience directly into the decision loop. Across challenging manipulation tasks, this approach consistently improves success rates and robustness, demonstrating that failure data is an essential resource for converting the broad competence of VLAs into robust execution.
中文标题/摘要
标题:基于成功与失败示范的分层视觉语言行动模型
先前的视觉-语言-行动(VLA)模型通常仅基于远程操作的成功示范进行训练,而忽略了数据收集过程中自然产生的大量失败尝试。然而,这些失败包含了策略可能在何处以及如何变得脆弱的信息,这些信息可以被利用以提高鲁棒性。我们通过利用混合质量的数据集来解决这一问题,在规划时学习具备失败感知的推理。我们引入了VINE,一种基于分层强化学习形式的分层视觉语言行动模型,将高层推理(系统2)与低层控制(系统1)分离,使失败能够作为结构化的学习信号而非噪声监督使用。系统2在二维场景图抽象上进行可行性引导的树搜索:它提出子目标转换,从成功和失败中预测成功概率,并在执行前修剪脆弱分支,有效将计划评估转化为可行性评分。选择的子目标序列随后传递给系统1,它执行低级动作而不修改智能体的核心技能。VINE完全从离线远程操作数据中训练,将负面经验直接整合到决策循环中。在具有挑战性的操作任务中,这种方法一致地提高了成功率和鲁棒性,证明了失败数据是将VLA的广泛能力转化为稳健执行的关键资源。
Summary / 总结
The research addresses a limitation of prior Vision-Language-Action (VLA) models, which are trained only on successful demonstrations, by learning from both successes and failures. It introduces VINE, a hierarchical model that separates high-level reasoning (System 2) from low-level control (System 1) under a hierarchical reinforcement learning formalism. System 2 uses failure data to predict subgoal success probabilities and prune brittle branches during planning, which improves success rates and robustness on challenging manipulation tasks.
研究通过引入VINE模型,解决了先前Vision-Language-Action (VLA)模型仅使用成功演示的局限性,该模型利用成功和失败的演示数据。VINE将高层推理和低层控制分离,使用层次化强化学习框架将失败数据作为结构化的学习信号。这种方法在复杂的操作任务中提高了成功率和鲁棒性。
Autonomous Reinforcement Learning Robot Control with Intel's Loihi 2 Neuromorphic Hardware
Authors: Kenneth Stewart, Roxana Leontie, Samantha Chapin, Joe Hays, Sumit Bam Shrestha, Carl Glen Henshaw
First: 2025-12-03T15:56:39+00:00 · Latest: 2025-12-03T15:56:39+00:00
Comments: Submitted for review at NICE 2026 (Neuro-Inspired Computational Elements) conference
Abstract
We present an end-to-end pipeline for deploying reinforcement learning (RL) trained Artificial Neural Networks (ANNs) on neuromorphic hardware by converting them into spiking Sigma-Delta Neural Networks (SDNNs). We demonstrate that an ANN policy trained entirely in simulation can be transformed into an SDNN compatible with Intel's Loihi 2 architecture, enabling low-latency and energy-efficient inference. As a test case, we use an RL policy for controlling the Astrobee free-flying robot, similar to a controller previously validated on hardware in space. The policy, trained with Rectified Linear Units (ReLUs), is converted to an SDNN and deployed on Intel's Loihi 2, then evaluated in NVIDIA's Omniverse Isaac Lab simulation environment for closed-loop control of Astrobee's motion. We compare execution performance between GPU and Loihi 2. The results highlight the feasibility of using neuromorphic platforms for robotic control and establish a pathway toward energy-efficient, real-time neuromorphic computation in future space and terrestrial robotics applications.
中文标题/摘要
标题:使用英特尔Loihi 2神经形态硬件的自主强化学习机器人控制
我们提出了一种端到端的流水线,用于将强化学习(RL)训练的人工神经网络(ANNs)部署到神经形态硬件上,方法是将其转换为脉冲Sigma-Delta神经网络(SDNNs)。我们证明了一个完全在仿真中训练的ANN策略可以被转换为与英特尔Loihi 2架构兼容的SDNN,从而实现低延迟和节能的推理。作为测试案例,我们使用了一个用于控制Astrobee自由飞行机器人的RL策略,类似于之前在太空硬件上验证过的控制器。该策略使用修正线性单元(ReLUs)进行训练,然后被转换为SDNN并部署在英特尔Loihi 2上,在NVIDIA的Omniverse Isaac Lab仿真环境中对Astrobee的运动进行闭环控制评估。我们比较了GPU和Loihi 2之间的执行性能。结果突显了使用神经形态平台进行机器人控制的可行性,并为未来太空和陆地机器人应用中的节能、实时神经形态计算开辟了途径。
Summary / 总结
The research aims to deploy reinforcement learning-trained artificial neural networks on neuromorphic hardware by converting them into spiking sigma-delta neural networks. The method involves training an ANN policy in simulation and then transforming it into an SDNN compatible with Intel's Loihi 2 architecture. The policy, trained with ReLUs, was deployed on Loihi 2 and evaluated in a simulation environment for controlling Astrobee's motion. The results show that neuromorphic platforms can be used for robotic control, offering low-latency and energy-efficient inference.
研究提出了一种将强化学习训练的人工神经网络部署到神经形态硬件上的端到端管道,通过将其转换为脉冲Sigma-Delta神经网络实现。一个用于控制Astrobee自由飞行机器人的RL策略被转换为与Intel的Loihi 2架构兼容的SDNN,并在仿真环境中进行了评估。结果表明了使用神经形态平台进行机器人控制的可行性,并展示了其在未来机器人应用中实现节能、实时计算的潜力。
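The entry above does not spell out how a sigma-delta network carries a ReLU activation. As a minimal sketch of the general idea, assuming a simple threshold-based delta encoder, the code below transmits only quantized changes in an activation and lets the receiver integrate them back. This is a generic illustration, not the paper's conversion pipeline or any Loihi 2 / Lava API; the function names and the threshold value are hypothetical.

```python
import math


def relu(x):
    """Rectified linear unit, the activation used by the source ANN policy."""
    return max(0.0, x)


def delta_encode(activations, threshold):
    """Delta stage: emit a quantized change only when the activation has moved
    at least `threshold` away from what has already been transmitted."""
    sent_total = 0.0
    messages = []
    for a in activations:
        diff = a - sent_total
        if abs(diff) >= threshold:
            steps = int(diff / threshold)  # truncates toward zero
            message = steps * threshold
        else:
            message = 0.0  # nothing is sent: this is where sparsity comes from
        sent_total += message
        messages.append(message)
    return messages


def sigma_decode(messages):
    """Sigma stage: the receiver accumulates the sparse messages."""
    total = 0.0
    reconstructed = []
    for m in messages:
        total += m
        reconstructed.append(total)
    return reconstructed


if __name__ == "__main__":
    # A slowly varying ReLU activation over 200 time steps.
    signal = [relu(math.sin(2 * math.pi * t / 100)) for t in range(200)]
    messages = delta_encode(signal, threshold=0.1)
    decoded = sigma_decode(messages)
    sent = sum(1 for m in messages if m != 0.0)
    worst = max(abs(a - d) for a, d in zip(signal, decoded))
    print("time steps:", len(signal), "messages sent:", sent)
    print("worst reconstruction error:", worst)
```

With this encoder the reconstruction error stays below the threshold, and no message is sent while the activation is constant. That trade-off between message sparsity and precision is what makes the scheme attractive for low-power inference on slowly changing control signals.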
OmniDexVLG: Learning Dexterous Grasp Generation from Vision Language Model-Guided Grasp Semantics, Taxonomy and Functional Affordance
Authors: Lei Zhang, Diwen Zheng, Kaixin Bai, Zhenshan Bing, Zoltan-Csaba Marton, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang
First: 2025-12-03T15:28:23+00:00 · Latest: 2025-12-03T15:28:23+00:00
Comments: Project Website: https://sites.google.com/view/omnidexvlg, 16 pages
Abstract
Dexterous grasp generation aims to produce grasp poses that align with task requirements and human-interpretable grasp semantics. However, achieving semantically controllable dexterous grasp synthesis remains highly challenging due to the lack of unified modeling of multiple semantic dimensions, including grasp taxonomy, contact semantics, and functional affordance. To address these limitations, we present OmniDexVLG, a multimodal, semantics-aware grasp generation framework capable of producing structurally diverse and semantically coherent dexterous grasps under joint language and visual guidance. Our approach begins with OmniDexDataGen, a semantically rich dexterous grasp dataset generation pipeline that integrates grasp-taxonomy-guided configuration sampling, functional-affordance contact point sampling, taxonomy-aware differential force closure grasp sampling, and physics-based optimization and validation, enabling systematic coverage of diverse grasp types. We further introduce OmniDexReasoner, a multimodal grasp-type semantic reasoning module that leverages multi-agent collaboration, retrieval-augmented generation, and chain-of-thought reasoning to infer grasp-related semantics and generate high-quality annotations that align language instructions with task-specific grasp intent. Building upon these components, we develop a unified Vision-Language Grasping generation model that explicitly incorporates grasp taxonomy, contact structure, and functional affordance semantics, enabling fine-grained control over grasp synthesis from natural language instructions. Extensive experiments in simulation and real-world object grasping, together with ablation studies, demonstrate that our method substantially outperforms state-of-the-art approaches in terms of grasp diversity, contact semantic diversity, functional affordance diversity, and semantic consistency.
中文标题/摘要
标题:OmniDexVLG:从视觉语言模型引导的抓取语义、分类与功能适用性中学习灵巧抓取生成
灵巧抓取生成旨在生成与任务要求和人类可解释的抓取语义相匹配的抓取姿态。然而,由于缺乏对多个语义维度的统一建模,包括抓取分类、接触语义和功能适用性,实现语义可控的灵巧抓取合成仍然极具挑战性。为了解决这些限制,我们提出了OmniDexVLG,这是一种多模态、语义感知的抓取生成框架,能够在联合语言和视觉引导下生成结构多样且语义一致的灵巧抓取。我们的方法始于OmniDexDataGen,这是一个语义丰富的灵巧抓取数据集生成管道,该管道结合了抓取分类引导的配置采样、功能适用性接触点采样、分类意识下的差异力闭合抓取采样、基于物理的优化和验证,从而实现对各种抓取类型的系统覆盖。我们进一步引入了OmniDexReasoner,这是一种多模态抓取类型语义推理模块,利用多智能体协作、检索增强生成和链式推理来推断与语言指令和任务特定抓取意图相一致的抓取相关语义和高质量注释。基于这些组件,我们开发了一种统一的视觉语言抓取生成模型,明确地结合了抓取分类、接触结构和功能适用性语义,从而从自然语言指令中实现对抓取合成的精细控制。在模拟和真实世界物体抓取中的广泛实验以及消融研究均表明,我们的方法在抓取多样性、接触语义多样性、功能适用性多样性以及语义一致性方面显著优于现有方法。
Summary / 总结
OmniDexVLG is a multimodal grasp generation framework that addresses the challenge of semantically controllable dexterous grasp synthesis by integrating a rich dexterous grasp dataset generation pipeline and a multimodal semantic reasoning module. The framework generates diverse and semantically coherent grasps under language and visual guidance. Experimental results show that OmniDexVLG outperforms existing methods in terms of grasp diversity, contact semantic diversity, functional affordance diversity, and semantic consistency.
OmniDexVLG 是一个多模态抓取生成框架,通过整合抓取分类学、接触语义和功能可利用性,解决了可控的灵巧抓取合成难题。它使用数据生成管道(OmniDexDataGen)和推理模块(OmniDexReasoner),在语言和视觉引导下生成多样且一致的抓取。实验结果表明,OmniDexVLG 在抓取和接触语义多样性、功能可利用性多样性以及语义一致性方面优于现有方法。
PULSE: A Unified Multi-Task Architecture for Cardiac Segmentation, Diagnosis, and Few-Shot Cross-Modality Clinical Adaptation
Authors: Hania Ghouse, Maryam Alsharqi, Farhad R. Nezami, Muzammil Behzad
First: 2025-12-03T14:49:01+00:00 · Latest: 2025-12-03T14:49:01+00:00
Abstract
Cardiac image analysis remains fragmented across tasks: anatomical segmentation, disease classification, and grounded clinical report generation are typically handled by separate networks trained under different data regimes. No existing framework unifies these objectives within a single architecture while retaining generalization across imaging modalities and datasets. We introduce PULSE, a multi-task vision-language framework built on self-supervised representations and optimized through a composite supervision strategy that balances region overlap learning, pixel-wise classification fidelity, and boundary-aware IoU refinement. A multi-scale token reconstruction decoder enables anatomical segmentation, while shared global representations support disease classification and clinically grounded text output, allowing the model to transition from pixels to structures and finally to clinical reasoning within one architecture. Unlike prior task-specific pipelines, PULSE learns task-invariant cardiac priors, generalizes robustly across datasets, and can be adapted to new imaging modalities with minimal supervision. This moves the field closer to a scalable, foundation-style cardiac analysis framework.
中文标题/摘要
标题:PULSE:一种统一的多任务架构,用于心脏分割、诊断和少量样本跨模态临床适应
心脏图像分析仍然分散在各个任务中:解剖分割、疾病分类和有临床依据的报告生成通常由在不同数据条件下分别训练的独立网络处理。目前没有任何框架能够在单一架构中统一这些目标,同时在成像模态和数据集之间保持泛化能力。我们引入了PULSE,这是一种基于自监督表示的多任务视觉-语言框架,并通过一种综合监督策略进行优化,该策略平衡了区域重叠学习、像素级分类准确性和边界感知的IoU细化。多尺度标记重建解码器支持解剖分割,共享的全局表示支持疾病分类和有临床依据的文本输出,使模型能够在单一架构中从像素过渡到结构,最终进行临床推理。与先前的任务特定管道不同,PULSE 学习任务不变的心脏先验知识,能够在数据集之间稳健泛化,并且在最少监督的情况下可以适应新的成像模态。这使该领域更接近一种可扩展的、基础模型风格的心脏分析框架。
CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation
Authors: Letian Zhou, Songhua Liu, Xinchao Wang
First: 2025-12-03T14:45:57+00:00 · Latest: 2025-12-03T14:45:57+00:00
Comments: 34 pages, 24 figures
Abstract
Prevailing Dataset Distillation (DD) methods leveraging generative models confront two fundamental limitations. First, despite pioneering the use of diffusion models in DD and delivering impressive performance, the vast majority of approaches paradoxically require a diffusion model pre-trained on the full target dataset, undermining the very purpose of DD and incurring prohibitive training costs. Second, although some methods turn to general text-to-image models without relying on such target-specific training, they suffer from a significant distributional mismatch, as the web-scale priors encapsulated in these foundation models fail to faithfully capture the target-specific semantics, leading to suboptimal performance. To tackle these challenges, we propose Core Distribution Alignment (CoDA), a framework that enables effective DD using only an off-the-shelf text-to-image model. Our key idea is to first identify the "intrinsic core distribution" of the target dataset using a robust density-based discovery mechanism. We then steer the generative process to align the generated samples with this core distribution. By doing so, CoDA effectively bridges the gap between general-purpose generative priors and target semantics, yielding highly representative distilled datasets. Extensive experiments suggest that, without relying on a generative model specifically trained on the target dataset, CoDA achieves performance on par with or even superior to previous methods with such reliance across all benchmarks, including ImageNet-1K and its subsets. Notably, it establishes a new state-of-the-art accuracy of 60.4% at the 50-images-per-class (IPC) setup on ImageNet-1K. Our code is available on the project webpage: https://github.com/zzzlt422/CoDA
中文标题/摘要
标题:CoDA:从文本到图像的扩散模型到无需训练的数据集蒸馏
现有的利用生成模型进行数据集蒸馏(DD)的方法面临两个根本性的限制。首先,尽管这些方法开创性地使用了扩散模型进行DD并取得了令人印象深刻的效果,但大多数方法仍然需要一个在完整目标数据集上预训练的扩散模型,这违背了DD的初衷,并导致了高昂的训练成本。其次,尽管一些方法转向了通用的文本到图像模型,而不依赖于特定目标数据集的训练,但它们仍然遭受了显著的分布不匹配问题,因为这些基础模型中的网络规模先验未能忠实捕捉目标特定的语义,导致性能不佳。为了解决这些挑战,我们提出了核心分布对齐(CoDA)框架,该框架仅使用现成的文本到图像模型即可实现有效的DD。我们的核心思想是,首先使用鲁棒的基于密度的发现机制识别目标数据集的“内在核心分布”,然后引导生成过程使生成的样本与该核心分布对齐。通过这种方式,CoDA 有效地弥合了通用生成先验与目标语义之间的差距,产生了高度代表性的蒸馏数据集。广泛的实验表明,在不依赖于专门在目标数据集上训练的生成模型的情况下,CoDA 在包括 ImageNet-1K 及其子集在内的所有基准测试中,性能与依赖此类模型的方法相当甚至更优。值得注意的是,它在 ImageNet-1K 每类 50 张图像(IPC)的设置下取得了 60.4% 的新最佳准确率。我们的代码可在项目网页上获得:https://github.com/zzzlt422/CoDA
Summary / 总结
CoDA addresses the limitations of existing Dataset Distillation (DD) methods by proposing a framework that uses an off-the-shelf text-to-image model. It identifies the intrinsic core distribution of the target dataset and aligns generated samples with this distribution, thereby bridging the gap between general-purpose generative priors and target-specific semantics. Experiments show that CoDA achieves performance comparable to or better than previous methods that rely on target-specific training, with a new state-of-the-art accuracy of 60.4% on ImageNet-1K at the 50-images-per-class setup.
CoDA通过提出一个框架,使用现成的文本到图像模型来解决现有数据集蒸馏(DD)方法的局限性。该框架首先识别目标数据集的内在核心分布,并使生成的样本与该分布对齐,从而在通用生成先验和目标特定语义之间建立桥梁。实验表明,CoDA在不依赖于专门在目标数据集上训练的生成模型的情况下,其性能与依赖此类训练的方法相当或更好,在ImageNet-1K每类50张图像的设置下达到了新的最佳准确率60.4%。
Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in $\{\pm 1, \pm i\}$
Authors: Feiyu Wang, Xinyu Tan, Bokai Huang, Yihao Zhang, Guoan Wang, Peizhuang Cong, Tong Yang
First: 2025-12-02T16:14:08+00:00 · Latest: 2025-12-03T14:15:05+00:00
Comments: 15 pages, 3 figures
Abstract
Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.
中文标题/摘要
标题:Fairy2i:由实值LLM训练全部参数取值于$\{\pm 1, \pm i\}$的复值LLM
大型语言模型(LLM)已经彻底改变了人工智能,但它们巨大的内存和计算需求迫使人们进行激进的量化,越来越接近理论上的单比特表示极限。虽然复数LLM,如iFairy,相比实数LLM提供了更低比特表示的更好机会,但它们需要从头开始训练,无法利用庞大的预训练实数基础模型生态系统。在这里,我们提出了Fairy2i,这是一种通用框架,可以将预训练的实数层转换为等效的广泛线性复数形式,从而实现极低比特量化并重用现有检查点。通过证明实数和广泛线性映射之间的无损数学等价性,我们将标准Transformer转换到复数域,并采用一种基于四次单位根的高效码本的相位感知量化方案。此外,我们引入了一种递归残差量化机制,该机制迭代地最小化量化误差,允许通过高效的无乘法累加进行推理。我们证明,Fairy2i可以在有效2比特精度下恢复LLaMA-2 7B的性能,几乎与全精度基线相当,显著优于最先进的实数二进制和三进制量化方法。这项工作在复数算术的表示效率和预训练模型的实际用途之间架起了一座桥梁,为在普通硬件上高效推理开辟了一条新途径。
Summary / 总结
Fairy2i is a framework that converts pre-trained real-valued LLM layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. It combines a phase-aware quantization scheme over the fourth roots of unity with a recursive residual quantization mechanism that iteratively reduces quantization error. At an effective 2-bit precision, it restores LLaMA-2 7B to performance nearly comparable with full-precision baselines and outperforms state-of-the-art real-valued binary and ternary quantization methods.
Fairy2i 是一个框架,将预训练的实值语言模型转换为复值模型,同时保持性能并实现低比特量化。它使用相位感知量化方案和递归残差量化机制来最小化量化误差。实验结果表明,Fairy2i 可以将 LLaMA-2 7B 在 2 比特精度下的性能恢复到接近全精度基线的水平,并且优于现有的实值二进制和三进制量化方法。
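Two ideas in the Fairy2i abstract can be checked in a few lines. The sketch below shows, for the smallest possible case, (1) that a real 2x2 matrix acting on (Re x, Im x) can be rewritten exactly as a widely-linear complex map A*x + B*conj(x), and (2) a simple residual quantizer whose codewords are the fourth roots of unity {1, i, -1, -i}. The per-stage least-squares scale, the single shared scale per stage, and all function names are assumptions for illustration; the abstract does not give the paper's layer-wise construction or training procedure.

```python
import random

# Codebook: the fourth roots of unity.
CODEBOOK = (1 + 0j, 1j, -1 + 0j, -1j)


def widely_linear_from_real(a, b, c, d):
    """Return (A, B) such that A*x + B*conj(x) reproduces the real matrix
    [[a, b], [c, d]] acting on the vector (Re x, Im x)."""
    A = complex((a + d) / 2, (c - b) / 2)
    B = complex((a - d) / 2, (c + b) / 2)
    return A, B


def nearest_root(w):
    """Pick the codeword whose phase is closest to w."""
    return max(CODEBOOK, key=lambda u: (w * u.conjugate()).real)


def quantize_stage(weights):
    """One stage: a code per weight plus one shared least-squares scale."""
    codes = [nearest_root(w) for w in weights]
    scale = sum((w * u.conjugate()).real for w, u in zip(weights, codes)) / len(weights)
    return scale, codes


def residual_quantize(weights, stages):
    """Recursively quantize what the previous stages failed to capture."""
    approx = [0j] * len(weights)
    residual = list(weights)
    for _ in range(stages):
        scale, codes = quantize_stage(residual)
        approx = [p + scale * u for p, u in zip(approx, codes)]
        residual = [w - p for w, p in zip(weights, approx)]
    return approx


def mse(xs, ys):
    """Mean squared error between two lists of complex numbers."""
    return sum(abs(x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    # (1) Lossless rewrite of a real 2x2 map as a widely-linear complex map.
    a, b, c, d = 0.3, -1.2, 0.7, 2.0
    A, B = widely_linear_from_real(a, b, c, d)
    x = complex(0.5, -0.25)
    print("real map:     ", (a * x.real + b * x.imag, c * x.real + d * x.imag))
    print("widely-linear:", A * x + B * x.conjugate())

    # (2) The residual error shrinks as stages are added.
    rng = random.Random(0)
    w = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(256)]
    for stages in (1, 2, 3):
        print(stages, "stage(s), mse =", mse(w, residual_quantize(w, stages)))
```

Because every codeword is one of ±1 or ±i, multiplying an activation by a quantized weight reduces to a sign flip and/or a swap of real and imaginary parts. This is consistent with the multiplication-free accumulation the abstract mentions.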
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Authors: Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye
First: 2025-12-03T13:43:30+00:00 · Latest: 2025-12-03T13:43:30+00:00
Comments: 15 pages, 9 figures
Abstract
Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
中文标题/摘要
标题:AdaptVision:通过自适应视觉获取提高效率的视觉-语言模型
视觉-语言模型(VLMs)在视觉问答任务中取得了显著的成功,但它们对大量视觉标记的依赖引入了显著的计算开销。虽然现有的高效VLM方法通过固定比例压缩视觉标记来减少视觉标记,但它们是被动的,缺乏适应不同任务需求的能力。这促使了一个基本问题:VLMs能否自主确定每个样本所需的最小视觉标记数量?受人类主动视觉机制的启发,我们引入了AdaptVision,这是一种通过粗到细的方法实现自适应视觉标记获取的高效VLM范式。我们的模型最初处理来自低分辨率图像的压缩视觉标记,并在必要时通过调用边界框工具裁剪关键区域以选择性地获取额外的视觉信息。我们使用强化学习框架训练AdaptVision,该框架仔细平衡准确性和效率。我们方法的核心是解耦轮次策略优化(DTPO),它将学习目标分解为两个部分:(1)工具学习,优化正确工具的使用;(2)准确度改进,通过细化生成的响应来提高答案的正确性。基于这种表述,我们进一步通过为每个目标计算单独的优势来解耦优势估计。这种表述使得AdaptVision的优化比vanilla GRPO更有效。在多个VQA基准上的全面实验表明,AdaptVision在消耗远少于最先进的高效VLM方法的视觉标记的情况下实现了更好的性能。
Summary / 总结
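AdaptVision is an efficient VLM paradigm that acquires visual tokens adaptively in a coarse-to-fine manner: it starts from compressed tokens of a low-resolution image and invokes a bounding-box tool to crop key regions only when needed. The model is trained with a reinforcement learning framework built on Decoupled Turn Policy Optimization (DTPO), which separates tool learning from accuracy improvement and estimates advantages for each objective separately. Experiments across multiple VQA benchmarks show that AdaptVision achieves better performance than state-of-the-art efficient VLM methods while consuming substantially fewer visual tokens.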
AdaptVision 是一种自主确定每个样本所需最小视觉令牌数量的高效 VLM 架构,灵感来源于人类的主动视觉机制。它采用粗到细的方法,初始处理压缩的视觉令牌,并在必要时选择性地获取额外的信息。AdaptVision 使用强化学习框架和 Decoupled Turn Policy Optimization (DTPO) 来平衡准确性和效率。实验表明,AdaptVision 在使用更少的视觉令牌时优于最先进的高效 VLM 方法。
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
Authors: Wenwen Tong, Hewei Guo, Dongchuan Ran, Jiangnan Chen, Jiefan Lu, Kaibin Wang, Keqiang Li, Xiaoxu Zhu, Jiakui Li, Kehan Li, Xueheng Li, Lumin Li, Chenxu Guo, Jiasheng Zhou, Jiandong Chen, Xianye Wu, Jiahao Wang, Silei Wu, Lei Chen, Hanming Deng, Yuxuan Song, Dinghao Zhou, Guiping Zhong, Ken Zheng, Shiyin Kang, Lewei Lu
First: 2025-10-15T16:52:48+00:00 · Latest: 2025-12-03T13:16:15+00:00
Abstract
We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction, available at sizes from 4B to 8B parameters and designed to lead the field of lightweight models by offering comprehensive omni-modal understanding and speech generation capabilities. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding, followed by post-training with speech conversation and audio-visual interaction. To enable human-like long-term conversational ability, we meticulously curate a multi-turn training dataset that enhances the model's ability to handle complex and multi-turn interactions. To effectively evaluate the multi-turn memory and speech interaction capabilities, we construct the multi-modal multi-turn memory benchmark and the multi-turn speech interaction benchmark. Experiments demonstrate that InteractiveOmni significantly outperforms leading open-source models and provides a more intelligent multi-turn audio-visual experience, particularly in its long-term memory capabilities. Notably, InteractiveOmni-4B is comparable to much larger models such as Qwen2.5-Omni-7B on general benchmarks, and it retains 97% of the performance of InteractiveOmni-8B at only 50% of the model size. Achieving state-of-the-art results against similarly sized models across image, audio, video understanding, and speech generation tasks, InteractiveOmni is an accessible, open-source foundation for next-generation intelligent interactive systems.
中文标题/摘要
标题:InteractiveOmni:一种统一的多模态模型用于音频-视觉多轮对话
我们介绍了InteractiveOmni,这是一种统一且开源的多模态大型语言模型,适用于从4B到8B参数的音频-视觉多轮交互,旨在通过提供全面的多模态理解和语音生成能力引领轻量级模型领域。为此,我们将视觉编码器、音频编码器、大型语言模型和语音解码器整合到一个统一的模型中,用于理解和生成任务。我们设计了多阶段训练策略以确保稳健的跨模态能力,包括预训练以实现多模态理解,然后通过语音对话和音频-视觉交互进行后训练。为了使模型具备类似人类的长期对话能力,我们精心策划了一个多轮训练数据集,以增强模型处理复杂和多轮交互的能力。为了有效评估多轮记忆和语音交互能力,我们构建了多模态多轮记忆基准和多轮语音交互基准。实验表明,InteractiveOmni在多轮音频-视觉体验方面显著优于领先开源模型,并且在长期记忆能力方面尤为突出。值得注意的是,InteractiveOmni-4B在通用基准测试中与更大的Qwen2.5-Omni-7B相当,同时仅使用50%的模型大小即可保留InteractiveOmni-8B 97%的性能。InteractiveOmni在图像、音频、视频理解和语音生成任务中均达到同类最佳结果,是下一代智能交互系统的基础开源平台。
Summary / 总结
InteractiveOmni is a unified omni-modal model for audio-visual multi-turn dialogue that integrates a vision encoder, audio encoder, large language model, and speech decoder into a single model. It employs a multi-stage training strategy to build cross-modal understanding and speech generation. Experiments show that InteractiveOmni outperforms leading open-source models, especially in long-term memory, and achieves state-of-the-art results against similarly sized models. InteractiveOmni-4B is comparable to Qwen2.5-Omni-7B on general benchmarks and retains 97% of the 8B model's performance at half the size.
InteractiveOmni 是一个统一的多模态模型,用于音频-视觉多轮对话,整合了视觉、音频、语言和语音组件于一体。它使用多阶段训练策略来增强跨模态理解和语音生成能力。实验表明,InteractiveOmni 在长期记忆能力方面优于其他开源模型,并在各种任务中取得了最先进的成果,且模型大小比 Qwen2.5-Omni-7B 更小。
Universally Converging Representations of Matter Across Scientific Foundation Models
Authors: Sathya Edamadaka, Soojung Yang, Ju Li, Rafael Gómez-Bombarelli
Venue: NeurIPS 2025 Oral
First: 2025-12-03T12:47:06+00:00 · Latest: 2025-12-03T12:47:06+00:00
Comments: Oral spotlight at NeurIPS 2025 UniReps Workshop
Abstract
Machine learning models of vastly different modalities and architectures are being trained to predict the behavior of molecules, materials, and proteins. However, it remains unclear whether they learn similar internal representations of matter. Understanding their latent structure is essential for building scientific foundation models that generalize reliably beyond their training domains. Although representational convergence has been observed in language and vision, its counterpart in the sciences has not been systematically explored. Here, we show that representations learned by nearly sixty scientific models, spanning string-, graph-, 3D atomistic, and protein-based modalities, are highly aligned across a wide range of chemical systems. Models trained on different datasets have highly similar representations of small molecules, and machine learning interatomic potentials converge in representation space as they improve in performance, suggesting that foundation models learn a common underlying representation of physical reality. We then show two distinct regimes of scientific models: on inputs similar to those seen during training, high-performing models align closely and weak models diverge into local sub-optima in representation space; on vastly different structures from those seen during training, nearly all models collapse onto a low-information representation, indicating that today's models remain limited by training data and inductive bias and do not yet encode truly universal structure. Our findings establish representational alignment as a quantitative benchmark for foundation-level generality in scientific models. More broadly, our work can be used to track the emergence of universal representations of matter as models scale, and to select and distill models whose learned representations transfer best across modalities, domains of matter, and scientific tasks.
中文标题/摘要
标题:物质在科学基础模型中的普遍收敛表示
不同模态和架构的机器学习模型被训练以预测分子、材料和蛋白质的行为。然而,尚不清楚它们是否学习了相似的物质内部表示。理解其潜在结构对于构建能够在其训练领域之外可靠泛化的科学基础模型至关重要。尽管在语言和视觉中观察到了表示收敛,但在科学领域中其对应物尚未系统地进行探索。在这里,我们展示了近六十个科学模型学习的表示,在广泛的化学系统中高度一致。在不同数据集上训练的模型对小分子具有高度相似的表示,随着性能的提高,机器学习原子间势在表示空间中收敛,表明基础模型学习了物理现实的共同底层表示。然后我们展示了两种不同的科学模型范式:在与训练输入相似的输入上,高绩效模型在表示空间中紧密对齐,而弱模型则在局部次优解中发散;在与训练中看到的结构截然不同的结构上,几乎所有模型都塌缩到一个低信息量的表示,表明当前的模型仍然受限于训练数据和归纳偏见,并未真正编码普遍结构。我们的研究结果确立了表示一致性作为科学模型基础级泛化的定量基准。更广泛地说,我们的工作可以跟踪模型规模时物质的普遍表示的出现,并选择和提炼出其学习表示在不同模态、物质领域和科学任务中转移最佳的模型。
Summary / 总结
The study investigates whether machine learning models of different modalities and architectures learn similar internal representations of matter, which is crucial for building reliable scientific foundation models. Analyzing nearly sixty models across string-, graph-, 3D atomistic, and protein-based modalities, the researchers find that models trained on different datasets have highly similar representations of small molecules, and that machine learning interatomic potentials converge in representation space as their performance improves, suggesting a common underlying representation of physical reality. On inputs similar to the training data, high-performing models align closely while weak models diverge; on vastly different structures, nearly all models collapse onto a low-information representation, indicating that current models remain limited by training data and inductive bias.
研究旨在了解不同模态和架构的机器学习模型是否学习了相似的物质内部表示,这对于构建可靠的科学基础模型至关重要。研究显示,包括字符串、图、3D 原子和蛋白质模型在内的各种模型在不同化学系统中学习的表示高度一致。不同数据集训练的模型对小分子有相似的表示,并且随着性能的提高,机器学习原子间势的表示趋于一致,表明它们学习了物理现实的共同底层表示。研究还区分了两种情形:在与训练数据相似的输入上,高性能模型紧密对齐,而弱模型会发散并陷入表示空间中的局部次优解;在与训练数据截然不同的结构上,几乎所有模型都会坍缩到低信息量的表示,表明当前模型仍受到训练数据和归纳偏置的限制,尚未真正编码出普遍结构。这些发现为科学模型的基础级通用性提供了一个基准,并有助于跟踪随着模型规模扩大而出现的物质的普遍表示的演变,以及选择和提炼出在不同模态、物质领域和科学任务中表示转移最佳的模型。
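The abstract does not say which alignment metric the authors use. As one widely used option for comparing representations of different width, the sketch below computes linear centered kernel alignment (CKA) between two feature matrices whose rows are the same inputs embedded by two different models. Treat the choice of CKA and the function names as assumptions for illustration rather than as the paper's method.

```python
import math


def center(matrix):
    """Subtract the per-feature (column) mean from every row."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_gram_sq_norm(x, y):
    """Squared Frobenius norm of x^T y, summed over all feature pairs."""
    total = 0.0
    for col_x in zip(*x):
        for col_y in zip(*y):
            total += sum(a * b for a, b in zip(col_x, col_y)) ** 2
    return total


def linear_cka(x, y):
    """Linear CKA between two (n_samples x n_features) representations.

    The two matrices must describe the same n samples in the same order;
    their feature dimensions may differ.
    """
    x, y = center(x), center(y)
    numerator = cross_gram_sq_norm(x, y)
    denominator = math.sqrt(cross_gram_sq_norm(x, x) * cross_gram_sq_norm(y, y))
    return numerator / denominator


if __name__ == "__main__":
    # Model 1: a 2-D embedding of four inputs.
    x = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    # Model 2: the same geometry, rotated and rescaled.
    c, s = math.cos(0.7), math.sin(0.7)
    y = [[3 * (c * a - s * b), 3 * (s * a + c * b)] for a, b in x]
    # Model 3: a 1-D embedding that groups the inputs differently.
    z = [[1.0], [1.0], [-1.0], [-1.0]]
    print("CKA(x, rotated and scaled x):", linear_cka(x, y))
    print("CKA(x, z):", linear_cka(x, z))
```

Linear CKA is invariant to rotations and uniform rescaling of either representation, so it scores two models by the geometry of their embeddings rather than by their coordinate systems. Some invariance of this kind is needed when comparing models built on unrelated architectures.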
Dual-level Modality Debiasing Learning for Unsupervised Visible-Infrared Person Re-Identification
Authors: Jiaze Li, Yan Lu, Bin Liu, Guojun Yin, Mang Ye
First: 2025-12-03T12:43:16+00:00 · Latest: 2025-12-03T12:43:16+00:00
Abstract
The two-stage learning pipeline has achieved promising results in unsupervised visible-infrared person re-identification (USL-VI-ReID). It first performs single-modality learning and then performs cross-modality learning to tackle the modality discrepancy. Although promising, this pipeline inevitably introduces modality bias: modality-specific cues learned in the single-modality training naturally propagate into the following cross-modality learning, impairing identity discrimination and generalization. To address this issue, we propose a Dual-level Modality Debiasing Learning (DMDL) framework that implements debiasing at both the model and optimization levels. At the model level, we propose a Causality-inspired Adjustment Intervention (CAI) module that replaces likelihood-based modeling with causal modeling, preventing modality-induced spurious patterns from being introduced and leading to a low-biased model. At the optimization level, a Collaborative Bias-free Training (CBT) strategy is introduced to interrupt the propagation of modality bias across data, labels, and features by integrating modality-specific augmentation, label refinement, and feature alignment. Extensive experiments on benchmark datasets demonstrate that DMDL enables modality-invariant feature learning and a more generalized model.
中文标题/摘要
标题:双层次模态去偏学习在无监督可见光-红外行人重识别中的应用
两阶段学习管道在无监督可见光-红外行人重识别(USL-VI-ReID)中取得了令人鼓舞的结果。它首先进行单模态学习,然后进行跨模态学习以解决模态差异问题。尽管如此,该管道不可避免地引入了模态偏见:单模态训练中学到的模态特定线索自然传播到后续的跨模态学习中,影响身份识别和泛化能力。为了解决这一问题,我们提出了一种双层次模态去偏学习(DMDL)框架,该框架在模型和优化两个层面实施去偏。在模型层面,我们提出了一种因果启发的调整干预(CAI)模块,用因果建模替代基于似然的建模,防止由模态引起的虚假模式被引入,从而得到低偏见的模型。在优化层面,引入了一种协作无偏训练(CBT)策略,通过集成模态特定增强、标签细化和特征对齐来中断模态偏见在数据、标签和特征之间的传播。在基准数据集上的广泛实验表明,DMDL能够实现模态不变特征学习和更泛化的模型。
Summary / 总结
The paper addresses the issue of modality bias in unsupervised visible-infrared person re-identification by proposing a Dual-level Modality Debiasing Learning (DMDL) framework. This framework includes a Causality-inspired Adjustment Intervention (CAI) module at the model level and a Collaborative Bias-free Training (CBT) strategy at the optimization level. The CAI module uses causal modeling to prevent spurious patterns from modality-specific cues, while the CBT strategy interrupts the propagation of modality bias through data, labels, and features. Experiments show that DMDL improves modality-invariant feature learning and generalization.
论文提出了一种双重模态去偏学习框架(DMDL),以解决无监督可见-红外行人再识别中的模态偏见问题。该框架包括模型层面的因果启发调整干预(CAI)模块和优化层面的协作无偏训练(CBT)策略。CAI模块防止模态特定线索引入的虚假模式,而CBT策略中断了模态偏见在数据、标签和特征间的传播。实验表明,DMDL能够实现模态不变的特征学习和更泛化的模型。
Crossing the Sim2Real Gap Between Simulation and Ground Testing to Space Deployment of Autonomous Free-flyer Control
Authors: Kenneth Stewart, Samantha Chapin, Roxana Leontie, Carl Glen Henshaw
First: 2025-12-03T12:33:35+00:00 · Latest: 2025-12-03T12:33:35+00:00
Comments: published at iSpaRo 2025
Abstract
Reinforcement learning (RL) offers transformative potential for robotic control in space. We present the first on-orbit demonstration of RL-based autonomous control of a free-flying robot, the NASA Astrobee, aboard the International Space Station (ISS). Using NVIDIA's Omniverse physics simulator and curriculum learning, we trained a deep neural network to replace Astrobee's standard attitude and translation control, enabling it to navigate in microgravity. Our results validate a novel training pipeline that bridges the simulation-to-reality (Sim2Real) gap, utilizing a GPU-accelerated, scientific-grade simulation environment for efficient Monte Carlo RL training. This successful deployment demonstrates the feasibility of training RL policies terrestrially and transferring them to space-based applications. This paves the way for future work in In-Space Servicing, Assembly, and Manufacturing (ISAM), enabling rapid on-orbit adaptation to dynamic mission requirements.
中文标题/摘要
标题:跨越Sim2Real鸿沟:自主自由飞行器控制从仿真、地面测试到太空部署
强化学习(RL)为太空中的机器人控制提供了变革性的潜力。我们首次在国际空间站(ISS)上展示了基于RL的自主控制自由飞行器NASA Astrobee的在轨演示。利用NVIDIA的Omniverse物理模拟器和课程学习,我们训练了一个深度神经网络来替代Astrobee的标准姿态和位移控制,使其能够在微重力环境中导航。我们的结果验证了一种新颖的训练管道,该管道利用GPU加速的科学级模拟环境来实现高效的蒙特卡洛RL训练,从而弥合了仿真到现实(Sim2Real)的鸿沟。这一成功的部署证明了在地面上训练RL策略并将其转移到太空应用的可行性。这为未来在轨服务、组装和制造(ISAM)工作铺平了道路,使快速适应动态任务需求成为可能。
Summary / 总结
The research aims to leverage reinforcement learning (RL) for robotic control in space, specifically demonstrating the first on-orbit RL-based autonomous control of NASA's Astrobee aboard the ISS. Using NVIDIA's Omniverse simulator and curriculum learning, a deep neural network was trained to replace Astrobee's standard control systems, allowing it to navigate in microgravity. The study validates a novel training pipeline that bridges the simulation-to-reality gap, showing the feasibility of training RL policies on Earth and applying them in space, which is crucial for future In-Space Servicing, Assembly, and Manufacturing (ISAM) missions.
研究旨在利用强化学习(RL)在太空中进行机器人控制,具体展示了在国际空间站(ISS)上使用RL自主控制NASA的Astrobee。通过使用NVIDIA的Omniverse模拟器和课程学习,训练了一个深度神经网络来替代Astrobee的标准控制系统,使其能够在微重力环境中导航。研究验证了一种新的训练管道,能够弥合模拟与现实之间的差距,展示了在地球上训练RL策略并应用于太空应用的可行性,这对于未来的在轨服务、组装和制造(ISAM)任务至关重要。
Autonomous Planning In-space Assembly Reinforcement-learning free-flYer (APIARY) International Space Station Astrobee Testing
Authors: Samantha Chapin, Kenneth Stewart, Roxana Leontie, Carl Glen Henshaw
First: 2025-12-03T12:16:52+00:00 · Latest: 2025-12-03T12:16:52+00:00
Comments: iSpaRo 2025, Best Paper Award in Orbital Robotics
Abstract
The US Naval Research Laboratory's (NRL's) Autonomous Planning In-space Assembly Reinforcement-learning free-flYer (APIARY) experiment pioneers the use of reinforcement learning (RL) for control of free-flying robots in the zero-gravity (zero-G) environment of space. On Tuesday, May 27th, 2025, the APIARY team conducted what is, to our knowledge, the first-ever RL control of a free-flyer in space, using the NASA Astrobee robot on board the International Space Station (ISS). A robust 6-degree-of-freedom (DOF) control policy was trained using an actor-critic Proximal Policy Optimization (PPO) network within the NVIDIA Isaac Lab simulation environment, randomizing over goal poses and mass distributions to enhance robustness. This paper details the simulation testing, ground testing, and flight validation of this experiment. This on-orbit demonstration validates the transformative potential of RL for improving robotic autonomy, enabling rapid development and deployment (in minutes to hours) of tailored behaviors for space exploration, logistics, and real-time mission needs.
中文标题/摘要
标题:自主规划在轨组装强化学习自由飞行器(APIARY)国际空间站Astrobee测试
美国海军研究实验室(NRL)的自主规划在轨组装强化学习自由飞行器(APIARY)实验开创了在零重力(零-G)空间环境中使用强化学习(RL)控制自由飞行机器人之先河。2025年5月27日星期二,APIARY团队使用NASA的Astrobee机器人在国际空间站(ISS)上进行了我们所知的首次在轨自由飞行器的RL控制实验。一个稳健的6自由度(DOF)控制策略是在NVIDIA Isaac Lab模拟环境中使用演员-评论家近端策略优化(PPO)网络训练的,通过随机化目标姿态和质量分布来增强鲁棒性。本文详细介绍了该实验的模拟测试、地面测试和飞行验证。该在轨演示验证了RL在提高机器人自主性方面的变革潜力,使其能够快速开发和部署(几分钟到几小时)针对太空探索、物流和实时任务需求的定制行为。
PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention
Authors: Ziwen Li, Xin Wang, Hanlue Zhang, Runnan Chen, Runqi Lin, Xiao He, Han Huang, Yandong Guo, Fakhri Karray, Tongliang Liu, Mingming Gong
First: 2025-12-03T12:14:29+00:00 · Latest: 2025-12-03T12:14:29+00:00
Abstract
Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive scenarios. In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments. To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.
中文标题/摘要
标题:PosA-VLA: 基于姿态条件锚注意力增强的动作生成
视觉-语言-动作(VLA)模型在执行体感任务方面表现出色,并展示了在实际应用中的巨大潜力。然而,当前的VLA模型仍然难以生成一致且精确的目标导向动作,因为它们经常沿轨迹生成冗余或不稳定的动作,限制了其在时间敏感场景中的应用。在本工作中,我们将这些冗余动作归因于现有VLA模型的空间均匀感知场,这导致它们容易被与目标无关的物体所吸引,尤其是在复杂环境中。为了解决这一问题,我们提出了一种高效的PosA-VLA框架,通过姿态条件监督锚定视觉注意力,一致地引导模型的感知朝向与任务相关的区域。姿态条件锚注意力机制使模型能够更好地将指令语义与可操作的视觉线索对齐,从而提高动作生成的精确性和效率。此外,我们的框架采用轻量级架构,并不需要辅助感知模块(例如,分割或语义网络),确保了高效的推理。广泛的实验验证了我们的方法在各种机器人操作基准测试中执行精确且高效的行为,并在多种具有挑战性的环境中展示了鲁棒的泛化能力。
Summary / 总结
This work addresses the issue of inconsistent and imprecise actions generated by existing Vision-Language-Action models, which often get distracted by irrelevant objects. To tackle this, the authors propose PosA-VLA, which uses pose-conditioned anchor attention to guide the model's perception towards task-relevant regions. This method improves action precision and efficiency and does not require additional perception modules, making it suitable for real-world applications. Experiments show that PosA-VLA performs well across various robotic manipulation benchmarks.
本文提出PosA-VLA框架,通过姿态条件化的锚注意力机制引导模型的感知聚焦于任务相关区域,以解决现有Vision-Language-Action模型在生成冗余和不稳定动作的问题。该方法提高了动作生成的精确性和效率,无需额外的感知模块。实验结果表明,PosA-VLA在多种机器人操作基准测试中表现出精确和高效的行为,并在复杂环境中具有良好的泛化能力。
Flowchart2Mermaid: A Vision-Language Model Powered System for Converting Flowcharts into Editable Diagram Code
Authors: Pritam Deka, Barry Devereux
First: 2025-12-01T20:07:59+00:00 · Latest: 2025-12-03T11:47:04+00:00
Comments: Submitted to EACL 2026 Demo Track
Abstract
Flowcharts are common tools for communicating processes but are often shared as static images that cannot be easily edited or reused. We present Flowchart2Mermaid, a lightweight web system that uses a detailed system prompt and vision-language models to convert flowchart images into editable Mermaid.js code, a markup language for visual workflows. The interface supports mixed-initiative refinement through inline text editing, drag-and-drop node insertion, and natural-language commands interpreted by an integrated AI assistant. Unlike prior image-to-diagram tools, our approach produces a structured, version-controllable textual representation that remains synchronized with the rendered diagram. We further introduce evaluation metrics to assess structural accuracy, flow correctness, syntax validity, and completeness across multiple models.
中文标题/摘要
标题:Flowchart2Mermaid:一种基于视觉-语言模型的流程图转换系统
流程图是用于传达过程的常见工具,但通常以静态图像形式共享,无法轻松编辑或重复使用。我们提出了Flowchart2Mermaid,这是一种轻量级的网络系统,使用详细的系统提示和视觉-语言模型将流程图图像转换为可编辑的Mermaid.js代码,这是一种用于可视化工作流的标记语言。该界面支持通过内联文本编辑、拖放节点插入和由集成AI助手解释的自然语言命令进行混合主动细化。与之前的图像到图表工具不同,我们的方法生成了一种结构化、版本可控的文本表示,始终保持与渲染图表的同步。我们还引入了评估指标来评估结构准确性、流程正确性、语法有效性以及多个模型的完整性。
Summary / 总结
Flowchart2Mermaid is a lightweight web system that converts flowchart images into editable Mermaid.js code using vision-language models and a detailed system prompt. It supports mixed-initiative refinement through inline text editing, drag-and-drop node insertion, and natural-language commands. The system produces a structured, version-controllable textual representation that remains synchronized with the rendered diagram. The authors also introduce evaluation metrics covering structural accuracy, flow correctness, syntax validity, and completeness, applied across multiple models.
Flowchart2Mermaid 是一个使用视觉-语言模型和详细系统提示将流程图图像转换为可编辑的 Mermaid.js 代码的轻量级网页系统。它支持通过内联文本编辑、拖放节点插入和自然语言命令进行混合主动细化。该系统生成一个结构化、版本可控的文本表示,始终保持与渲染图的同步。作者还引入了评估指标,用于在多个模型上评估结构准确性、流程正确性、语法有效性和完整性。
Optical Context Compression Is Just (Bad) Autoencoding
Authors: Ivan Yee Lee, Cheng Yang, Taylor Berg-Kirkpatrick
First: 2025-12-03T10:27:27+00:00 · Latest: 2025-12-03T10:27:27+00:00
Abstract
DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing their vision encoder against simple alternatives--parameter-free mean pooling and a learned hierarchical encoder--we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios, and outperform it for language modeling--where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
中文标题/摘要
标题:光学上下文压缩只是(糟糕的)自编码
DeepSeek-OCR 证明,可以从少量的视觉标记中以高保真度重建渲染的文本。这一发现激发了对基于视觉的上下文压缩在语言模型中的应用的兴趣。但评估仅限于重建;这些表示是否有助于语言建模尚未得到测试。我们测试了光学压缩叙述中隐含的两个假设:基于视觉的压缩为从压缩表示重建文本提供了独特的优势,以及 DeepSeek-OCR 的重建结果是基于视觉的压缩将对语言建模有用的证据。将他们的视觉编码器与无参数的均值池化和学习到的分层编码器等简单替代方案进行比较,我们发现这些简单方法在匹配的压缩比下与视觉编码器匹配或超越其表现,并在语言建模中表现更好——在这一点上,基于视觉的压缩无法超越截断。光学上下文压缩的兴奋程度超过了证据。代码和检查点可在 https://github.com/ivnle/bad-autoencoding 获取
Summary / 总结
The study tests whether vision-based context compression offers unique advantages by comparing DeepSeek-OCR's vision encoder against two simpler alternatives: parameter-free mean pooling and a learned hierarchical encoder. At matched compression ratios, the simple approaches match or surpass the vision encoder on text reconstruction and outperform it on language modeling, where vision-based compression fails to beat plain truncation. The authors conclude that the excitement around optical context compression outpaces the evidence.
研究通过将DeepSeek-OCR中的视觉编码器与参数无关的均值池化和学习的分层编码器等简单替代方案进行比较,质疑视觉上下文压缩在语言建模中的独特优势。研究发现,在相似的压缩比下,这些简单方法在文本重构方面表现得同样好或更好,并且在语言建模任务中超越了基于视觉的压缩方法。这表明,关于光学上下文压缩的兴奋可能在缺乏其在语言模型中的优势证据之前为时过早。
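As a rough illustration of the parameter-free mean-pooling baseline named in the abstract, the sketch below averages each run of consecutive token embeddings into one vector. The function name `mean_pool`, the `ratio` parameter, and the toy embeddings are assumptions made for this example; the paper's actual implementation (see the linked repository) may differ.

```python
from typing import List


def mean_pool(embeddings: List[List[float]], ratio: int) -> List[List[float]]:
    """Compress a token sequence by averaging each run of `ratio` vectors.

    Hypothetical helper for illustration only: no learned parameters,
    and a shorter final window is averaged over the vectors it contains.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    pooled = []
    for start in range(0, len(embeddings), ratio):
        window = embeddings[start:start + ratio]
        dim = len(window[0])
        # Average each embedding dimension across the window.
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


# Five toy 2-dimensional token embeddings, compressed at a 2:1 ratio.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]
compressed = mean_pool(tokens, ratio=2)
print(compressed)  # [[2.0, 1.0], [6.0, 5.0], [9.0, 8.0]]
```

With `ratio=2`, five input vectors become three, so the compressed sequence is roughly half as long; a larger `ratio` trades more compression for coarser representations.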
AlignCheck: a Semantic Open-Domain Metric for Factual Consistency Assessment
Authors: Ahmad Aghaebrahimian
First: 2025-12-03T10:14:31+00:00 · Latest: 2025-12-03T10:14:31+00:00
Abstract
Large Language Models have significantly advanced natural language processing tasks, but remain prone to generating incorrect or misleading but plausible arguments. This issue, known as hallucination, is particularly concerning in high-stakes domains like clinical applications, where factual inaccuracies can have severe consequences. Existing evaluation metrics fail to adequately assess factual consistency and lack interpretability, making diagnosing and mitigating errors difficult. We propose an interpretable framework for factual consistency assessment for in-domain and open-domain texts to address these limitations. Our approach decomposes text into atomic facts and introduces a flexible, schema-free methodology. Unlike previous methods with an absolute metric, we incorporate a weighted metric to enhance factual evaluation. Additionally, we propose a mechanism to control assessment complexity in intricate domains. We benchmark our approach on popular general and clinical datasets and release our code to support fact-aware model training in future research.
中文标题/摘要
标题:AlignCheck:一种语义开放域事实一致性评估度量
大型语言模型在自然语言处理任务中取得了显著进展,但仍容易生成错误或误导但看似合理的论点。这一问题被称为幻觉,在临床应用等高风险领域尤其令人担忧,因为事实不准确可能会产生严重后果。现有评估指标无法充分评估事实一致性且缺乏可解释性,使得诊断和缓解错误变得困难。我们提出了一种可解释的框架,用于评估领域内和开放域文本的事实一致性,以解决这些局限性。我们的方法将文本分解为原子事实,并引入了一种灵活的、无模式的方法。与之前的使用绝对度量的方法不同,我们引入了加权度量来增强事实评估。此外,我们还提出了一种机制来控制复杂领域的评估复杂性。我们在流行的通用和临床数据集上对我们的方法进行了基准测试,并发布了代码以支持未来研究中的事实感知模型训练。
Summary / 总结
The research addresses hallucination in large language models, which can generate incorrect but plausible arguments, a particular concern in high-stakes domains like clinical applications. The proposed AlignCheck framework assesses factual consistency by decomposing text into atomic facts with a flexible, schema-free methodology and scoring them with a weighted metric rather than an absolute one, which improves interpretability. It also includes a mechanism to control assessment complexity in intricate domains. The approach is benchmarked on popular general and clinical datasets, and the code is released.
研究旨在解决大型语言模型中幻觉的问题,这些模型可能会生成错误但看似合理的论点,特别是在临床等高风险领域。提出的AlignCheck框架通过将文本分解为原子事实并使用加权的灵活度量来评估事实一致性。关键发现包括提高了可解释性,并且能够更好地处理复杂领域,这些通过在通用和临床数据集上的基准测试得到了验证。
MemVerse: Multimodal Memory for Lifelong Learning Agents
Authors: Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, Yi Yu, Shuyue Hu, Botian Shi, Ding Wang
First: 2025-12-03T10:06:14+00:00 · Latest: 2025-12-03T10:06:14+00:00
Comments: 11 pages, 2 figures, 2 tables
Abstract
Despite rapid progress in large-scale language and vision models, AI agents still suffer from a fundamental limitation: they cannot remember. Without reliable memory, agents catastrophically forget past experiences, struggle with long-horizon reasoning, and fail to operate coherently in multimodal or interactive environments. We introduce MemVerse, a model-agnostic, plug-and-play memory framework that bridges fast parametric recall with hierarchical retrieval-based memory, enabling scalable and adaptive multimodal intelligence. MemVerse maintains short-term memory for recent context while transforming raw multimodal experiences into structured long-term memories organized as hierarchical knowledge graphs. This design supports continual consolidation, adaptive forgetting, and bounded memory growth. To handle real-time demands, MemVerse introduces a periodic distillation mechanism that compresses essential knowledge from long-term memory into the parametric model, allowing fast, differentiable recall while preserving interpretability. Extensive experiments demonstrate that MemVerse significantly improves multimodal reasoning and continual learning efficiency, empowering agents to remember, adapt, and reason coherently across extended interactions.
中文标题/摘要
标题:MemVerse:终身学习代理的多模态记忆
尽管大规模语言和视觉模型取得了快速进展,但AI代理仍然面临一个根本性的限制:它们无法记忆。缺乏可靠的记忆,代理会灾难性地忘记过去的经历,难以进行长期推理,并且在多模态或交互环境中无法协调地运行。我们引入了MemVerse,这是一种模型无关的即插即用记忆框架,它将快速参数检索与分层检索记忆相结合,从而实现可扩展和自适应的多模态智能。MemVerse保持了近期上下文的短期记忆,同时将原始多模态体验转化为结构化的长期记忆,组织成分层知识图谱。这种设计支持持续的巩固、适应性遗忘和有界的记忆增长。为了应对实时需求,MemVerse引入了一种周期性蒸馏机制,将长期记忆中的关键知识压缩到参数模型中,从而实现快速、可微的检索,同时保持可解释性。广泛的实验表明,MemVerse显著提高了多模态推理和持续学习的效率,使代理能够在长时间交互中记住、适应并协调推理。
Summary / 总结
MemVerse is a model-agnostic memory framework designed to enhance the memory capabilities of AI agents, addressing the issue of catastrophic forgetting. It combines fast parametric recall with hierarchical retrieval-based memory, enabling efficient and adaptive multimodal reasoning. Key findings show that MemVerse improves multimodal reasoning and continual learning efficiency, allowing agents to remember and adapt coherently over extended interactions.
MemVerse 是一个模型无关的记忆框架,旨在解决 AI 代理在记住过去经验方面的局限性。它结合了快速参数检索和层次检索记忆,以支持可扩展和适应性的多模态智能。关键发现表明,MemVerse 提高了多模态推理和持续学习的效率,使代理能够在长时间交互中记住并进行连贯的推理。
The promising potential of vision language models for the generation of textual weather forecasts
Authors: Edward C. C. Steele, Dinesh Mane, Emilio Monti, Luis Orus, Rebecca Chantrill-Cheyette, Matthew Couch, Kirstine I. Dale, Simon Eaton, Govindarajan Rangarajan, Amir Majlesi, Steven Ramsdale, Michael Sharpe, Craig Smith, Jonathan Smith, Rebecca Yates, Holly Ellis, Charles Ewen
First: 2025-12-03T10:00:15+00:00 · Latest: 2025-12-03T10:00:15+00:00
Comments: 7 pages, 2 tables
Abstract
Despite the promising capability of multimodal foundation models, their application to the generation of meteorological products and services remains nascent. To accelerate aspiration and adoption, we explore the novel use of a vision language model for writing the iconic Shipping Forecast text directly from video-encoded gridded weather data. These early results demonstrate promising scalable technological opportunities for enhancing production efficiency and service innovation within the weather enterprise and beyond.
中文标题/摘要
标题:视觉语言模型在生成天气预报文本方面的前景
尽管多模态基础模型展现出令人鼓舞的能力,但其在气象产品和服务生成方面的应用仍处于起步阶段。为了激发兴趣并推动采用,我们探索了一种新用途:使用视觉语言模型直接从视频编码的网格化天气数据撰写标志性的航海预报(Shipping Forecast)文本。这些初步结果表明,在气象行业及其他领域提升生产效率和服务创新方面,存在前景可观的可扩展技术机会。
Summary / 总结
The research aims to leverage the potential of vision language models to generate textual weather forecasts, particularly focusing on the iconic Shipping Forecast text. The study employs a vision language model to directly generate this text from video-encoded weather data, showing promising results that could enhance production efficiency and service innovation in the weather industry.
该研究探索使用视觉语言模型直接从视频编码的天气数据生成文本天气预报,旨在提高天气企业在生产效率和服务创新方面的效率和创新。研究采用视觉语言模型编写航海预报文本,并展示了气象产品和服务中具有前景的可扩展技术机会。