MarkTechPost | AI | AI Agent

Converting Research PDFs into Structured JSON with Lift: Schema-Guided, Field-Level Controlled Evaluation

July 1, 2026, 21:09

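Key Takeaways

This tutorial builds a complete PDF-to-structured-data extraction workflow around Lift, focusing on controlled evaluation rather than a simple demo run. It first prepares a Colab-compatible GPU environment, chooses a precision mode suited to the available hardware, and patches model loading so that the Lift backend runs reliably through 4-bit NF4 quantization even on a constrained 16 GB GPU. It then generates synthetic multi-page research reports with deliberate distractors: validation-versus-test metric ambiguity, baseline-versus-proposed-model comparisons, missing code releases, and boolean state-of-the-art claims. This gives schema-guided extraction a realistic testbed from which the model must recover titles, authors, datasets, and other fields.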

Site AI-compiled article

In this tutorial, we build a complete PDF-to-structured-data extraction workflow around Lift, with a focus on controlled evaluation rather than a simple demo run. We begin by preparing a Colab-compatible GPU environment, selecting the appropriate precision mode for the available hardware, and patching model loading so that the Lift backend runs reliably even on constrained 16 GB GPUs via 4-bit NF4 quantization. From there, we generate synthetic multi-page research reports with deliberately placed distractors, including validation-versus-test metric ambiguity, baseline-versus-proposed-model comparisons, missing code-release cases, and boolean state-of-the-art claims. This provides a realistic testbed for schema-guided extraction, in which the model must recover titles, authors, datasets, metrics, hyperparameters, limitations, and repository links from document layouts rather than plain text.

Configuring Runtime and Dependencies

```python
# Execution knobs for the whole tutorial.
N_DOCS = 3
FORCE_FULL_PRECISION = False
FORCE_4BIT = False
SHOW_FIRST_PAGE = True
RUN_ON_REAL_PDF = False
REAL_PDF_URL = "https://arxiv.org/pdf/1512.03385"
REAL_PDF_PAGES = "0-3"
PIN_PILLOW = True
PILLOW_VERSION = "11.3.0"

import os, sys, subprocess, json, re, time, warnings
warnings.filterwarnings("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def pip(*pkgs, upgrade=False):
    """Install without invoking a shell (so '[hf]' is never glob-expanded)."""
    args = [sys.executable, "-m", "pip", "install", "-q"] + (["-U"] if upgrade else []) + list(pkgs)
    print("   pip install", *pkgs)
    subprocess.run(args, check=False)

print("STEP 1/7 · Installing lift + light dependencies (first run is the slow one)…")
pip("reportlab", "pypdfium2", "pandas", "matplotlib")
pip("lift-pdf[hf]")
pip("bitsandbytes", "accelerate", upgrade=True)
if PIN_PILLOW:
    pip(f"pillow=={PILLOW_VERSION}")
    # If another Pillow version was imported before the pin, the process must restart.
    if "PIL" in sys.modules:
        import PIL
        if getattr(PIL, "__version__", "") != PILLOW_VERSION:
            print(f"   Pinned Pillow {PILLOW_VERSION} on disk, but a stale Pillow "
                  f"({getattr(PIL, '__version__', '?')}) is already loaded in memory.")
            print("   Restarting the runtime now; re-run the cell(s) after it reconnects.")
            os.kill(os.getpid(), 9)  # kill the kernel so Colab restarts it
print("   …install finished.\n")

import torch
```

We configure the tutorial runtime by defining the main execution knobs for corpus size, precision mode, preview rendering, and optional real-PDF extraction. We also install the core dependencies required for PDF generation, rendering, plotting, and Lift's Hugging Face backend. The Pillow pinning logic matters because it prevents a known Colab compatibility issue in which newer Pillow builds can break downstream imports via torchvision and transformers. If a different Pillow version is already loaded in memory, the cell kills the process so that the runtime restarts with the pinned version.

Loading Lift 4-bit Backend

```python
def detect_gpu():
    """Return (name, total VRAM in GB, compute capability) of GPU 0, or exit if none."""
    if not torch.cuda.is_available():
        raise SystemExit(
            "\n✗ No CUDA GPU found. In Colab: Runtime ▸ Change runtime type ▸ GPU "
            "(A100 is best; L4/T4 also work).\n"
        )
    p = torch.cuda.get_device_properties(0)
    cc = torch.cuda.get_device_capability(0)
    return p.name, p.total_memory / 1e9, cc


def enable_4bit(compute_dtype):
    """
    Load lift's weights in 4-bit NF4 no matter which transformers Auto* class it uses
    internally. We inject a quantization_config + on-GPU device_map, and neutralize any
    later model.to()/.cuda() (which is illegal on a bnb-quantized model). This is what
    lets a ~10 B model fit on a 16 GB T4 / 24 GB L4.
    """
    import inspect, functools, transformers
    from transformers import BitsAndBytesConfig

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=compute_dtype,
    )

    def patch(cls):
        # Grab the undecorated from_pretrained so it can be wrapped and re-bound.
        try:
            cm = inspect.getattr_static(cls, "from_pretrained")
            orig = cm.__func__ if isinstance(cm, (classmethod, staticmethod)) else cm
        except Exception:
            return

        @functools.wraps(orig)
        def inner(cls_, *args, **kwargs):
            # Only fill in defaults; explicit caller arguments still win.
            kwargs.setdefault("quantization_config", bnb)
            kwargs.setdefault("device_map", {"": 0})
            model = orig(cls_, *args, **kwargs)
            try:
                # Turn later .to()/.cuda() calls into no-ops.
                model.to = lambda *a, **k: model
                model.cuda = lambda *a, **k: model
            except Exception:
                pass
            return model

        cls.from_pretrained = classmethod(inner)

    for name in ["AutoModelForImageTextToText", "AutoModelForMultimodalLM",
                 "AutoModelForVision2Seq", "AutoModelForCausalLM", "AutoModel"]:
        c = getattr(transformers, name, None)
        if c is not None:
            patch(c)
    try:
        from transformers.modeling_utils import PreTrainedModel
        patch(PreTrainedModel)
    except Exception:
        pass


print("STEP 2/7 · Preparing the model backend…")
gpu_name, vram, cc = detect_gpu()
use_4bit = FORCE_4BIT or (vram < 34 and not FORCE_FULL_PRECISION)
compute_dtype = torch.bfloat16 if cc[0] >= 8 else torch.float16
print(f"   GPU: {gpu_name} | ~{vram:.0f} GB | compute capability {cc[0]}.{cc[1]}")
print(f"   Load mode: {'4-bit NF4' if use_4bit else 'full bf16'} (compute dtype {compute_dtype})")

os.environ.setdefault("TORCH_DEVICE", "cuda:0")
os.environ.setdefault("MODEL_CHECKPOINT", "datalab-to/lift")
if use_4bit:
    enable_4bit(compute_dtype)

from lift import extract
from lift.model import InferenceManager

print("   Loading lift weights (≈20 GB download on first run)…")
_t = time.time()
MODEL = InferenceManager(method="hf")
print(f"   ✓ model ready in {time.time() - _t:.0f}s\n")


def run_lift(pdf_path, schema, page_range=None):
    """Run schema-guided extraction with the shared model; return the extraction or None."""
    kw = {"model": MODEL}
    if page_range:
        kw["page_range"] = page_range
    result = extract(pdf_path, schema, **kw)
    return getattr(result, "extraction", None)
```

We prepare the Lift inference backend by checking for a CUDA GPU, reading its total VRAM and compute capability, and choosing between bf16 and 4-bit NF4 loading. 4-bit loading is used when the GPU reports less than 34 GB and FORCE_FULL_PRECISION is off, or whenever FORCE_4BIT is set. The compute dtype is bfloat16 on compute capability 8.0 or higher and float16 otherwise. The 4-bit patch injects a BitsAndBytes quantization configuration into the Transformers from_pretrained loaders, allowing the model to fit on smaller GPUs such as T4 or L4. We then initialize a single reusable InferenceManager, which avoids reloading the model for each document and makes the extraction pipeline practical for batch processing.

Building the Synthetic Corpus

```python
DOCS = [
    dict(
        title="SolarNet: Efficient Land-Cover Classification from Multispectral Satellite Imagery",
        authors=[("Maya Okafor", "TU Delft"), ("Liang Wei", "TU Delft"),
                 ("Priya Ramachandran", "European Space Research Institute")],
        task="satellite image land-cover classification",
        method="SolarNet",
        datasets=["EuroSAT", "BigEarthNet", "So2Sat"],
        primary_benchmark="EuroSAT",
        metric_name="Top-1 accuracy",
        test_acc=96.4, val_acc=97.1,
        baseline_name="ResNet-50", baseline_val=92.0, baseline_test=91.2,
        params_m=42.7, optimizer="AdamW", lr=0.0003, batch=128, epochs=90,
        beats_sota=True, prior_best=95.1,
        code_url=None,
        funding_note="This work was supported by the Open Earth Initiative. "
                     "The authors do not release source code for the trained models.",
        limitations=["Accuracy degrades on scenes with heavy cloud cover.",
                     "Trained only on imagery at 10 m spatial resolution."],
    ),
    dict(
        title="GraphMoE: Mixture-of-Experts Message Passing for Molecular Property Prediction",
        authors=[("Sofia Álvarez", "ETH Zürich"), ("Daniel Kim", "ETH Zürich"),
                 ("Yara Haddad", "Genentech"), ("Tom Becker", "ETH Zürich")],
        task="molecular property prediction",
        method="GraphMoE",
        datasets=["OGB-MolHIV", "QM9", "ZINC"],
        primary_benchmark="OGB-MolHIV",
        metric_name="ROC-AUC",
        test_acc=0.812, val_acc=0.828,
        baseline_name="GIN", baseline_val=0.784, baseline_test=0.771,
        params_m=8.3, optimizer="Adam", lr=0.001, batch=256, epochs=120,
        beats_sota=True, prior_best=0.799,
        code_url="https://github.com/mol-ai/graphmoe",
        funding_note="Funded by the Swiss NSF. Code and pretrained checkpoints are available "
                     "at https://github.com/mol-ai/graphmoe.",
        limitations=["Expert routing adds ~15% inference latency versus a dense GNN.",
                     "Evaluated only on small-molecule datasets under 50 heavy atoms."],
    ),
    dict(
        title="AcoustiFormer: A Compact Transformer for Environmental Sound Classification",
        authors=[("Noah Fischer", "University of Edinburgh"), ("Aisha Bello", "University of Edinburgh"),
                 ("Kenji Watanabe", "Sony CSL")],
        task="environmenta
```

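The distractors in the synthetic records (a validation score printed next to the test score, a baseline next to the proposed model, a record with no code URL) are only useful if each extracted field is scored separately. The following is a minimal sketch of such field-level scoring, assuming predictions arrive as a flat dict keyed like the ground-truth record. `normalize` and `score_fields` are hypothetical helper names introduced for illustration; they are not part of Lift or of the tutorial's own evaluation code, and the prediction shown is made up to demonstrate one failure mode.

```python
from math import isclose


def normalize(text):
    """Collapse whitespace and ignore case so cosmetic differences do not count as errors."""
    return " ".join(text.split()).casefold()


def score_fields(truth, pred, rel_tol=1e-6):
    """Hypothetical scorer: return {field: True/False} for one document.

    Assumes flat dicts whose values are None, bool, number, str, or list of str.
    """
    report = {}
    for field, expected in truth.items():
        got = pred.get(field)
        if expected is None:
            # Missing-value case (e.g. no code release): any non-null answer is a hallucination.
            ok = got is None
        elif isinstance(expected, bool):
            # Check bool before numbers, because bool is a subclass of int in Python.
            ok = isinstance(got, bool) and got == expected
        elif isinstance(expected, (int, float)):
            ok = (isinstance(got, (int, float)) and not isinstance(got, bool)
                  and isclose(got, expected, rel_tol=rel_tol))
        elif isinstance(expected, list):
            # Order-insensitive comparison of string lists such as dataset names.
            ok = (isinstance(got, list)
                  and sorted(map(normalize, got)) == sorted(map(normalize, expected)))
        else:
            ok = isinstance(got, str) and normalize(got) == normalize(expected)
        report[field] = bool(ok)
    return report


# Ground truth taken from a subset of the SolarNet record's fields.
truth = {
    "method": "SolarNet",
    "datasets": ["EuroSAT", "BigEarthNet", "So2Sat"],
    "test_acc": 96.4,
    "beats_sota": True,
    "code_url": None,
}
# Illustrative (made-up) prediction that reports the validation score as the test score.
pred = {
    "method": " solarnet",
    "datasets": ["BigEarthNet", "So2Sat", "EuroSAT"],
    "test_acc": 97.1,
    "beats_sota": True,
    "code_url": None,
}

report = score_fields(truth, pred)
accuracy = sum(report.values()) / len(report)
print(report)
print(f"field accuracy: {accuracy:.2f}")
```

Scoring per field rather than per document keeps the distractor errors visible: here only `test_acc` is marked wrong, which points directly at the validation-versus-test confusion instead of hiding it inside a single pass/fail result.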