arXiv Paper Digest

LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence

Authors: Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, Ningbo Dai, Renzhe Xu, Shuyang Li, Tianyang Zhang, Yue He, Yuanrui Wang, Yunjia Zhang, Zijing Xu, Dongzhe Li, Fang Gao, Hao Zou, Jiandong Liu, Jiashuo Liu, Jiawei Xu, Kaijie Cheng, Kehan Li, Linjun Zhou, Qing Li, Shaohua Fan, Xiaoyu Lin, Xinyan Han, Xuanyue Li, Yan Lu, Yuan Xue, Yuanyuan Jiang, Zimu Wang, Zhenlei Wang, Peng Cui

First: 2025-09-03T17:39:08+00:00 · Latest: 2025-09-03T17:39:08+00:00

Comments: 56 pages

Abstract

We argue that progress toward general intelligence requires complementary foundation models grounded in language, the physical world, and structured data. This report presents LimiX, the first installment of our large structured-data models (LDMs). LimiX treats structured data as a joint distribution over variables and missingness, thus capable of addressing a wide range of tabular tasks through query-based conditional prediction via a single model. LimiX is pretrained using masked joint-distribution modeling with an episodic, context-conditional objective, where the model predicts for query subsets conditioned on dataset-specific contexts, supporting rapid, training-free adaptation at inference. We evaluate LimiX across 10 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios. With a single model and a unified interface, LimiX consistently surpasses strong baselines including gradient-boosting trees, deep tabular networks, recent tabular foundation models, and automated ensembles, as shown in Figure 1 and Figure 2. The superiority holds across a wide range of tasks, such as classification, regression, missing value imputation, and data generation, often by substantial margins, while avoiding task-specific architectures or bespoke training per task. All LimiX models are publicly accessible under Apache 2.0.
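
The abstract describes one interface, query-based conditional prediction, covering classification, regression, and imputation. The sketch below illustrates that idea only and is not the LimiX implementation or API. The names `make_episode`, `predict_masked`, and `MASK` are hypothetical, and a nearest-neighbour lookup stands in for the pretrained model. It assumes an all-numeric toy table in which a masked cell is marked with `None`.

```python
import math
import random

MASK = None  # hypothetical marker for a cell the predictor must fill in


def make_episode(rows, n_context, mask_rate, rng):
    """Split a table into context rows and masked query rows.

    Returns (context, masked_queries, targets), where targets maps
    (query_index, column_index) to the hidden true value.
    """
    shuffled = rows[:]
    rng.shuffle(shuffled)
    context, queries = shuffled[:n_context], shuffled[n_context:]
    masked, targets = [], {}
    for qi, row in enumerate(queries):
        new_row = list(row)
        for ci, value in enumerate(row):
            if rng.random() < mask_rate:
                new_row[ci] = MASK
                targets[(qi, ci)] = value
        masked.append(new_row)
    return context, masked, targets


def predict_masked(context, query_row):
    """Stand-in predictor: copy masked cells from the nearest context row.

    Distance is Euclidean over the columns observed in the query row
    (crude for categorical columns, but enough for a toy example).
    A pretrained model would replace this function; the point here is
    that the call looks the same whichever cell is masked.
    """
    observed = [i for i, v in enumerate(query_row) if v is not MASK]

    def distance(ctx_row):
        return math.sqrt(sum((ctx_row[i] - query_row[i]) ** 2 for i in observed))

    nearest = min(context, key=distance)
    return [nearest[i] if v is MASK else v for i, v in enumerate(query_row)]


# Toy table with columns (x1, x2, label).
table = [[0.0, 0.1, 0], [0.2, 0.0, 0], [5.0, 5.1, 1], [5.2, 4.9, 1]]

# Classification: mask the label column of a query row.
print(predict_masked(table, [5.1, 5.0, MASK]))  # [5.1, 5.0, 1]

# Imputation: mask a feature instead; same call, same predictor.
print(predict_masked(table, [0.05, MASK, 0]))  # [0.05, 0.1, 0]

# Building a masked episode, as one might when sampling training examples.
context, masked_queries, targets = make_episode(table, 2, 0.5, random.Random(0))
```

Which task is being solved depends only on which cells are masked in the query row. That is the sense in which a single conditional predictor can serve several tabular tasks.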

Summary

LimiX is a foundation model for structured data, which the authors place alongside language and physical-world models as a complementary requirement for general intelligence. It treats structured data as a joint distribution over variables and missingness, so a single model can handle tabular tasks as query-based conditional predictions, adapting at inference without further training. Across 10 large structured-data benchmarks, the authors report that LimiX surpasses gradient-boosting trees, deep tabular networks, recent tabular foundation models, and automated ensembles on classification, regression, missing-value imputation, and data generation, using one model and a unified interface.
