MarkTechPost AI模型更新

Datalab 發布 lift:9B 開源權重視覺模型,透過 Schema 從 PDF 提取結構化 JSON

2026年6月23日 19:35

重點摘要

Datalab 已發布 lift,這是一個 9B 參數的開源權重視覺模型,專為結構化提取而設計。你只需傳入一個 JSON schema,它就會回傳符合該結構的 JSON 物件。該模型可直接讀取 PDF 和圖片,並根據你的 schema 進行解碼。這是 Datalab 首個純粹為提取任務打造的模型。該團隊此前已推出開源 OCR 工具:chandra、marker 和 surya。lift 將這項工作擴展至 schema 驅動的欄位提取。在 Datalab 的 225 份文件基準測試中,lift 的欄位準確率高達 90.2%。研究團隊表示,這是他們測試過最強大的小型可自託管模型。每份文件的中位處理時間為 9.5 秒。什麼是 Datalab lift?lift 是一個 9B 參數的視覺模型,用於結構化提取。它接受標準 JSON Schema 作為輸入,並回傳符合該結構的有效 JSON。

站內 AI 整理稿

Datalab has released lift, a 9B open-weights vision model for structured extraction. You pass it a JSON schema, and it returns a JSON object that matches. The model reads PDFs and images directly, then decodes against your schema. This is Datalab’s first model built purely for extraction. The team already ships open-source OCR tools: chandra, marker, and surya. lift extends that work into schema-driven field extraction. lift scores 90.2% field accuracy on Datalab’s 225-document benchmark. The research team reports it as the strongest small self-hostable model they tested. It runs at a median of 9.5 seconds per document. What is Datalab lift? lift is a 9B-parameter vision model for structured extraction. It accepts standard JSON Schema as input. It returns valid JSON of that shape as output. The model handles multi-page documents in a single pass. It can read values that span across pages. Whole documents go in at once, not page by page. Two inference modes ship with the package. Local inference runs through HuggingFace. Remote inference runs through a vLLM server, which Datalab recommends for production. The code is Apache 2.0. The weights use a modified OpenRAIL-M license. lift enters a small but growing field of open extraction models. Some are purpose-built, like the NuExtract family. Others are general vision-language models pressed into extraction, like Qwen3.5-9B. It pairs a vision-language base with schema-constrained decoding and trained abstention. On Datalab’s benchmark, it leads that open group on field accuracy. Schema-Constrained Decoding: The Core Mechanism The main design choice is schema-constrained decoding. lift decodes its output directly against your schema. The result is always valid JSON of the correct shape. Here is what happens under the hood. lift first turns your JSON Schema into a Pydantic model. It then normalizes that into a strict JSON Schema. The schema is passed to the vLLM server as a response_format constraint. During generation, the server compiles the schema into a grammar. At each step, the model assigns a probability to every possible next token. The grammar defines which tokens are valid continuations. Tokens that would break the schema are masked out. The model can only sample from what remains. This is why the output is always valid JSON of the right shape. The structure is enforced token by token, not checked afterward. There is a sharp limit to this guarantee. Constrained decoding governs structure and types, not meaning. A field typed as number will hold a number. Whether it holds the correct number is a separate question. The model can emit a valid value that is simply wrong. Validity is not correctness. lift also widens every field to allow null. Each scalar leaf in the compiled schema accepts its type or null. So the model can abstain on any field without breaking the structure. Abstention is both trained behavior and a property of the constraint. You write standard JSON Schema. Supported types include string, number, integer, boolean, arrays of those, arrays of objects, and nested objects. A field description guides the model when a name is ambiguous. This is also where a quiet failure mode lives. Some constructs cannot be compiled: enum, anyOf/oneOf, $ref, and additionalProperties. When lift cannot compile your schema, it does not stop. It logs a warning and generates without the constraint. The structural guarantee is gone for that run, with no hard error. Output may then fail to match your schema at all. The practical rule is simple. Keep schemas inside the supported subset. Validate the returned JSON against your schema downstream. Do not assume valid output just because the call returned. Here is a simple invoice schema: Copy CodeCopiedUse a different Browser{ "type": "object", "properties": { "invoice_number": {"type": "string", "description": "Invoice identifier"}, "total": {"type": "number", "description": "Total amount due"}, "line_items": { "type": "array", "items": { "type": "object", "properties": { "description": {"type": "string"}, "amount": {"type": "number"} } } } }, "required": ["invoice_number", "total"] } Abstention by Default Real extraction is hard for a non-obvious reason. Beyond reading fields that exist, the real challenge is not inventing fields that are absent. A model that hallucinates a tax ID is worse than one returning nothing. The error is silent and hard to catch downstream. lift is trained to leave genuinely missing fields null. Mark a field required only when it must appear. Fields absent from a document come back null. This gives you an extractor that can report a value is not present. Benchmark Datalab evaluated lift on a 225-document extraction benchmark. Documents ran 6 to 64 pages each, with roughly 11,000 scored fields. Adversarial cases were planted throughout the set. Those cases include cross-page values and exhaustive lists. They also include fields that must be left null and near-miss distractors. Multi-source aggregation was tested as well. Every model received the same rendered page images. Each extracted every document in a single pass. Scoring was a deterministic exact-match against ground truth, with numeric tolerance and normalized strings. ModelSizeField accuracyFull-document accuracyMedian latency*FeaturesDatalab API—95.9%44.4%30.8sCitations + VerificationGemini Flash 3.5—91.3%40.0%28.1slift9B90.2%20.9%9.5sAzure Content Understanding—83.4%22.2%73.7sCitationsNuExtract34B81.5%8.4%8.3sQwen3.5-9B9B76.32%24.0%16.8s * Per document, 8 concurrent requests. Local models (lift, Qwen3.5-9B, NuExtract3) were served with vLLM on a single GPU. Gemini, Datalab, and Azure ran via API. Latency varies with hardware and load; treat it as relative. Two details matter here. Field accuracy is the fraction of individual fields extracted correctly. Full-document accuracy is the fraction of documents where every field is correct. On field accuracy, lift leads the self-hostable models. It sits ahead of NuExtract3 and the Qwen3.5-9B base. It is also the fastest of the accurate models in the table. At 9.5s median, lift is roughly 3x faster than Gemini Flash 3.5. It stays within about a point of that model’s field accuracy. Full-document accuracy is a harder metric: every field must be correct. Here lift scores 20.9%, ahead of only NuExtract3. The hosted APIs lead, at 44.4% and 40.0%. A note on reading these numbers. This is Datalab’s own benchmark, so treat it as a vendor result. Its adversarial design rewards models tuned to abstain, which lift is. Full-document accuracy is low for every model, topping out at 44.4%. That reflects how hard single-pass extraction is on long documents. The numbers are also a snapshot; models change. This is the reality of single-pass, single-model extraction on hard documents. It tells you where lift fits. It is excellent for field-level extraction that feeds a human-in-the-loop review or aggregate analytics. It is not yet a drop-in for zero-touch, every-field-must-be-perfect automation. For that last mile, Datalab’s hosted API adds per-field verification, citations, and confidence scores on the same approach. A Practitioner Workflow: From Schema to Reviewed Data Three use cases show the shape of the work. Invoice processing: define invoice_number, total, and line_items, and a missing tax_id returns null. Contract review: a two-page agreement carries a value across pages, which single-pass extraction stitches together. Document pipelines: an accounts-payable queue trusts that absent due dates return null, avoiding silent errors. Here is one of them as an end-to-end workflow. The goal is a clean, reviewed dataset, not raw model output. 1. Define the schema. Add a description to any field whose name is not obvious. Mark only truly mandatory fields as required. 2. Run extraction. Pass the schema and the file to lift. Use a dict, a file path, or a saved schema name. 3. Branch on the result. A failed call or a null extraction goes to review. A missin

Related

相關文章

剛剛,物理世界的Anthropic現身,團隊來自中國

這篇消息聚焦「剛剛,物理世界的Anthropic現身,團隊來自中國」。原始導語提到:中國團隊拿下世界模型量產第一 從 AI 情報角度來看,這類內容值得關注其背後的技術進展、產品落地、產業競爭與後續市場影響。

3 小時前
智東西模型更新

101億!又一AI Infra獨角獸拿下鉅額融資

美國AI推理基礎設施獨角獸Baseten宣布完成15億美元(約101.6億人民幣)F輪融資,估值達130億美元,累計融資總額逾20億美元。該公司專注於幫助企業部署和運行開源或自訂AI模型,過去一年營收年增約20倍,客戶包括Cursor、Notion等知名平台。融資將用於擴充算力基礎設施、研發與團隊招聘,預計今年員工總數成長三倍。

12 小時前

企業微信"大圓"內測上線:左滑一下,AI就能幫你盤客戶、寫總結

這篇消息聚焦「企業微信"大圓"內測上線:左滑一下,AI就能幫你盤客戶、寫總結」。原始導語提到:企業微信近期低調開啟AI助理「大圓」的內測,這款新工具主打無縫融入日常工作流程——用戶只要在手機端向左滑動,就能喚醒AI,協助整理客戶資訊、撰寫工作總結,操作體驗就像聊天一樣直覺。不同於過去常見的獨立聊天機器人,「大圓」最大的特色在於它深度結合了企業微信內部的群聊對話、文件檔案與會議記錄等既有數據,能根據不同場景(例如業務討論、專案回顧)提供即時且貼近真實需求的回覆,等於讓AI長在原本的工作... 從 AI 情報角度來看,這類內容值得關注其背後的技術進展、產品落地、產業競爭與後續市場影響。

13 小時前9200