Google AI Introduces TabFM: A Hybrid-Attention Tabular Foundation Model for Zero-Shot Classification and Regression
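Summary
Google Research has released TabFM, a foundation model designed specifically for tabular data. TabFM performs classification and regression without training on the specific dataset, and each prediction takes a single forward pass. The model recasts tabular prediction as an in-context learning problem and is now available on Hugging Face and GitHub. In short, TabFM can predict on unseen tables with no training, tuning, or feature engineering: it treats the whole dataset as a prompt and predicts through in-context learning. Its architecture combines TabPFN-style row/column attention with TabICL-style in-context learning, and it was trained on hundreds of millions of synthetic datasets drawn from structural causal models. Google BigQuery will also soon expose TabFM through the AI.PREDICT SQL command.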
Google Research has introduced TabFM, a foundation model built for tabular data. TabFM performs classification and regression without dataset-specific training, and every prediction comes from a single forward pass. The model reframes tabular prediction as an in-context learning problem. It is available now on Hugging Face and GitHub.

TL;DR
- TabFM predicts on unseen tables with no training, tuning, or feature engineering.
- It reads the full dataset as one prompt, then predicts via in-context learning.
- The architecture combines TabPFN-style row/column attention with TabICL-style in-context learning.
- Training used hundreds of millions of synthetic datasets generated from structural causal models.
- Google BigQuery will soon expose TabFM through an AI.PREDICT SQL command.

What is TabFM?
Tabular data forms the backbone of enterprise data infrastructure. Tasks like customer churn prediction and financial fraud detection live in tables. For years, tree-based methods dominated this space: XGBoost, AdaBoost, and random forests offered robust results on structured data.

That reliability carried a cost. Fitting XGBoost to a new dataset is rarely a single .fit() call. Data scientists spend hours on hyperparameter optimization and feature engineering just to extract a reliable signal from raw data. TabFM targets exactly that bottleneck.

TabFM applies the zero-shot logic that large language models made familiar. LLMs learn new tasks from in-context examples without updating any weights, a technique called in-context learning (ICL). TabFM brings the same idea to tables, generating predictions on previously unseen tables in one pass. Google frames TabFM as the tabular counterpart to TimesFM, its zero-shot time-series model.

How It Works
Traditional models update their parameters to fit each dataset's distribution. TabFM skips that step entirely. It takes the whole dataset as a single unified prompt that holds both the labeled training examples and the target test rows.
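The idea of treating labeled rows as a prompt can be illustrated with a toy NumPy sketch in which each query row attends over the context rows and takes a softmax-weighted vote of their labels. This is only a stand-in for intuition (essentially distance-weighted voting with a hand-set temperature), not TabFM's architecture; the function name `in_context_predict` and the `temperature` parameter are invented for this example. What it shows is that nothing is fitted: swap the context rows and the predictions change, with no weight updates.

```python
import numpy as np

def in_context_predict(X_context, y_context, X_query, temperature=1.0):
    """Toy in-context classifier: each query row attends over the context rows."""
    X_context = np.asarray(X_context, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    classes, y_idx = np.unique(y_context, return_inverse=True)

    # Standardize with context statistics so no single feature dominates.
    mean = X_context.mean(axis=0)
    std = X_context.std(axis=0) + 1e-8
    Xc = (X_context - mean) / std
    Xq = (X_query - mean) / std

    # Attention scores: negative squared distance between query and context rows.
    sq_dist = ((Xq[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=-1)
    scores = -sq_dist / temperature
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)

    # Values are one-hot labels, so the attention output is a class distribution.
    one_hot = np.eye(len(classes))[y_idx]
    proba = weights @ one_hot
    return classes[proba.argmax(axis=1)], proba

# Illustrative data: (age, income) rows with a risk label.
X_context = np.array([[25.0, 80000], [45.0, 120000], [35.0, 90000], [50.0, 130000]])
y_context = np.array(["low_risk", "high_risk", "low_risk", "high_risk"])
X_query = np.array([[30.0, 85000], [48.0, 125000]])

labels, proba = in_context_predict(X_context, y_context, X_query)
print(labels)
```

Because the prediction depends only on the set of context rows, shuffling those rows leaves the output unchanged, which is the kind of order-independence a table model needs.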
The model reads column and row relationships at inference time. Tables are not text. They are two-dimensional and inherently orderless. Swapping two rows or two columns does not change their meaning. Standard language models process one-dimensional, ordered sequences instead. To bridge that gap, TabFM synthesizes TabPFN and TabICL into a hybrid design. It relies on three mechanisms:

- Alternating row and column attention: The raw table passes through a multilayer attention module. Following TabPFN, attention alternates across columns (features) and rows (examples). This deep contextualization captures feature interactions and dependencies. It performs work that would otherwise need manual feature crafting.
- Row compression: Each row's cross-attended information compresses into a single dense vector.
- In-context learning: A dedicated Transformer runs over these compressed embeddings. Following TabICL, attending to compressed rows cuts computation cost sharply. Prediction stays efficient even on much larger datasets.

Training on Synthetic Data at Scale
Foundation models need vast, diverse data. High-quality tabular datasets are scarce in the open-source space. Industrial tables carry proprietary schemas and sensitive information. That makes them inaccessible for broad pre-training. Synthetic tables can be generated to be arbitrarily large. Google's research team calls them effectively the only viable option at this scale.

So TabFM trains entirely on hundreds of millions of synthetic datasets. These are generated dynamically using structural causal models (SCMs). Each incorporates a wide variety of random functions. The approach captures distributions and complex feature relationships found in real tables. The research team reports the model generalizes well to unseen real-world data.

Performance and Benchmarking
The research team evaluated TabFM on TabArena. TabArena is a living benchmark that computes Elo scores from head-to-head win rates.
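For intuition about how a head-to-head win rate maps to an Elo gap, the sketch below uses the standard Elo logistic formula, in which a 400-point gap corresponds to 10:1 odds. How TabArena aggregates results across datasets or anchors its scale is not described here, so the function names and the 400-point scale are assumptions for illustration only.

```python
import math

def expected_win_rate(rating_a, rating_b, scale=400.0):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

def elo_gap_from_win_rate(win_rate, scale=400.0):
    """Invert the Elo expectation: rating gap implied by a head-to-head win rate."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return scale * math.log10(win_rate / (1.0 - win_rate))

gap = elo_gap_from_win_rate(0.75)
print(round(gap, 1))                           # 190.8
print(round(expected_win_rate(gap, 0.0), 2))   # 0.75
```

Under this formula, a model that wins three of every four comparisons sits roughly 191 Elo points above its opponent, and an even record implies a gap of zero.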
The evaluation spans 38 classification datasets and 13 regression datasets, with sample sizes ranging from 700 to 150,000. Two configurations were tested:

- TabFM: runs out of the box in a single forward pass. It needs no tuning or cross-validation.
- TabFM-Ensemble: adds cross features and SVD (Singular Value Decomposition) features, then computes optimal weights for a 32-way ensemble using a non-negative least squares (NNLS) solver. For classification, it also adds Platt scaling as a calibration step.

The research team reports that TabFM consistently outperforms heavily tuned, industry-standard supervised algorithms. Full per-fold metrics and head-to-head win rates are available on the GitHub page.

| Aspect | Traditional GBDT (XGBoost) | TabFM | TabFM-Ensemble |
| --- | --- | --- | --- |
| Per-dataset training | Required | None (in-context learning) | None |
| Hyperparameter tuning | Extensive, manual | None | Ensemble weights via NNLS |
| Feature engineering | Manual, domain-specific | Learned by attention | Adds cross + SVD features |
| Prediction | After full training | Single forward pass | 32-way ensemble |
| Calibration | Manual (optional) | N/A | Platt scaling (classification) |

Getting Started: Installation and Code
Installation clones the repository and installs it locally. The base install uses CPU-only JAX; a cuda extra pulls in the CUDA 12 plugin and NVIDIA libraries for GPU runs. The core requirements are specific: Python 3.11 or later, with jax==0.10.1 and flax==0.12.7 pinned and the modern flax.nnx API in use. The pre-trained weights are downloaded automatically from the Hugging Face Hub.
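Before installing, it can help to confirm that the environment matches the requirements quoted above. The sketch below uses only the standard library to check the interpreter version and compare any already-installed jax and flax against those pins. The helper name `check_environment` and the `PINS` dictionary are illustrative and not part of TabFM; the pinned versions are taken from this article and should be confirmed against the repository.

```python
import sys
from importlib import metadata

# Pins as quoted in this article; confirm them against the repository before relying on them.
PINS = {"jax": "0.10.1", "flax": "0.12.7"}
MIN_PYTHON = (3, 11)

def check_environment(python_version=None, installed=None):
    """Return a list of human-readable problems; an empty list means everything matches.

    Both arguments are optional overrides (useful for testing); by default the
    running interpreter and the installed package metadata are inspected.
    """
    python_version = tuple(python_version or sys.version_info[:2])
    problems = []
    if python_version < MIN_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} is older than 3.11"
        )
    for package, wanted in PINS.items():
        if installed is not None:
            found = installed.get(package)
        else:
            try:
                found = metadata.version(package)
            except metadata.PackageNotFoundError:
                found = None
        if found is None:
            problems.append(f"{package} is not installed (expected {wanted})")
        elif found != wanted:
            problems.append(f"{package}=={found} found, expected {wanted}")
    return problems

for problem in check_environment():
    print("WARNING:", problem)
```

Run in a fresh virtual environment, this reports jax and flax as missing, which is expected before the local install pulls them in.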
A minimal classification example:

```python
import numpy as np
import pandas as pd

from tabfm import tabfm_v1_0_0
from tabfm import TabFMClassifier

# Load pre-trained TabFM v1.0.0 (downloads from Hugging Face)
model = tabfm_v1_0_0.load()

# scikit-learn compatible classifier
clf = TabFMClassifier(model=model)

# Labeled context rows
X_train = pd.DataFrame({
    "age": [25.0, 45.0, 35.0, 50.0],
    "job": ["engineer", "manager", "engineer", "manager"],
    "income": [80000, 120000, 90000, 130000]
})
y_train = np.array(["low_risk", "high_risk", "low_risk", "high_risk"])

# Rows to predict
X_test = pd.DataFrame({
    "age": [30.0, 48.0],
    "job": ["engineer", "manager"],
    "income": [85000, 125000]
})

clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
probabilities = clf.predict_proba(X_test)

print("Predictions:", predictions)
print("Class Probabilities:\n", probabilities)
```

Here, fit() prepares ordinal encoders and numerical scalers; it does not train model weights on your data. The regressor mirrors this pattern with TabFMRegressor and reg.predict().

Use Cases With Examples
The API fits common predictive tasks directly:

- Customer churn: the context holds past customers labeled churned or retained, and TabFM scores churn risk for new customers in one pass.
- Credit risk: rows carry age, job, and income features, with labels marking low_risk or high_risk as in the sample code. New applicants are scored without a training cycle.
- Regression: house price prediction is a natural fit. Context rows carry square footage and neighborhood, and TabFM returns a predicted price for unseen listings.
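Since fit() is described as preparing ordinal encoders and numerical scalers rather than training weights, a rough equivalent of that preprocessing step can be written with scikit-learn. This is an assumption about the general idea, not TabFM's internal code; the column names reuse the credit-risk sample, and the "analyst" job value is added here only to show how an unseen category might be handled.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

X_train = pd.DataFrame({
    "age": [25.0, 45.0, 35.0, 50.0],
    "job": ["engineer", "manager", "engineer", "manager"],
    "income": [80000, 120000, 90000, 130000],
})
X_test = pd.DataFrame({
    "age": [30.0, 48.0],
    "job": ["engineer", "analyst"],  # "analyst" never appears in the context rows
    "income": [85000, 125000],
})

categorical = ["job"]
numerical = ["age", "income"]

preprocess = ColumnTransformer([
    # Map each category to an integer; unseen categories become -1 instead of raising.
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), categorical),
    # Center and scale numeric columns using statistics from the context rows only.
    ("num", StandardScaler(), numerical),
])

# Learns category codes and scaling statistics; no model weights are involved.
train_features = preprocess.fit_transform(X_train)
# Reuses the same codes and statistics for the rows to predict.
test_features = preprocess.transform(X_test)

print(train_features.shape, test_features.shape)
```

The output columns follow the transformer order, so the encoded job column comes first and the scaled age and income columns follow.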
Originally published on MarkTechPost as "Google AI Introduces TabFM: A Hybrid-Attention Tabular Foundation Model for Zero-Shot Classification and Regression."