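Designing a Schema-Guided Invoice Intelligence Pipeline with lift-pdf: Accounts-Payable Extraction, Validation, and Ledger Generation
Key Takeaways
In this tutorial, we use lift-pdf to build an end-to-end accounts-payable extraction pipeline, with synthetic invoice PDFs as controlled test documents and a structured JSON schema as the target output format. Rather than treating invoice parsing as a pure OCR task, we frame it as schema-guided document understanding: we generate realistic invoices, define fields such as vendor identity, billing party, PO number, line items, tax, total amount, balance due, and payment status, and then ask the model to extract these values directly from the rendered PDF layout. We also include extraction traps common in real finance workflows, such as distinguishing the bill-to address from the ship-to address, separating the subtotal from the after-tax total, returning null for missing values, and correctly handling partially paid invoices.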
In this tutorial, we build an end-to-end accounts-payable extraction pipeline with lift-pdf, using synthetic invoice PDFs as controlled test documents and a structured JSON schema as the target output format. Instead of treating invoice parsing as a simple OCR task, we frame it as schema-guided document understanding: we generate realistic invoices, define fields such as vendor identity, billing party, PO number, line items, tax, total amount, balance due, and payment status, and then ask the model to extract those values directly from the rendered PDF layout. We also include practical extraction traps that appear in real finance workflows, such as distinguishing bill-to from ship-to, separating subtotal from after-tax total, returning null for absent values, and correctly marking partially paid invoices as unpaid when a balance remains. Through GPU-aware model loading, optional 4-bit quantization, PDF generation and extraction, scoring, and ledger construction, we turn this tutorial into a compact yet realistic demonstration of document intelligence for invoice mining. 
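To make the idea of a schema-guided target concrete, here is a minimal sketch of what such a JSON schema might look like when written as a Python dictionary. The field names, descriptions, and the `nullable` helper are assumptions chosen for illustration; this excerpt does not show the tutorial's actual schema, nor the exact schema dialect that lift-pdf expects, so treat it as a starting point rather than the real definition.

```python
import json

def nullable(json_type):
    """Hypothetical helper: allow a value of the given JSON type or null."""
    return {"type": [json_type, "null"]}

# Illustrative schema only; every field name below is an assumption.
INVOICE_SCHEMA_SKETCH = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string",
                        "description": "Party issuing the invoice"},
        "bill_to_name": {"type": "string",
                         "description": "Party being billed, not the ship-to recipient"},
        # Absent values should come back as null rather than a guess.
        "po_number": {**nullable("string"),
                      "description": "Purchase order number, null if not printed"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description", "quantity", "unit_price"],
            },
        },
        "subtotal": {"type": "number", "description": "Sum of line items before tax"},
        "tax_amount": {"type": "number"},
        "total_amount": {"type": "number", "description": "Total after tax"},
        "balance_due": {"type": "number"},
        "is_paid": {"type": "boolean",
                    "description": "True only when no balance remains"},
    },
    "required": ["vendor_name", "bill_to_name", "line_items", "total_amount"],
}

# The schema is plain data, so it serializes directly to JSON.
SCHEMA_JSON = json.dumps(INVOICE_SCHEMA_SKETCH, indent=2)
```

Putting the traps into the field descriptions (bill-to versus ship-to, subtotal versus total, null for missing values) is one way to steer extraction; whether the model honors such descriptions would need to be checked against the scored results.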
```python
N_DOCS = 3
FORCE_FULL_PRECISION = False
FORCE_4BIT = False
SHOW_FIRST_PAGE = True
RUN_ON_REAL_PDF = False
REAL_PDF_URL = ""
REAL_PDF_PAGES = "0-1"
PIN_PILLOW = True
PILLOW_VERSION = "11.3.0"

import os, sys, subprocess, json, re, time, warnings
warnings.filterwarnings("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def pip(*pkgs, upgrade=False):
    """Install without invoking a shell (so '[hf]' is never glob-expanded)."""
    args = [sys.executable, "-m", "pip", "install", "-q"] + (["-U"] if upgrade else []) + list(pkgs)
    print(" pip install", *pkgs)
    subprocess.run(args, check=False)

print("STEP 1/7 · Installing lift + light dependencies (first run is the slow one)…")
pip("reportlab", "pypdfium2", "pandas", "matplotlib")
pip("lift-pdf[hf]")
pip("bitsandbytes", "accelerate", upgrade=True)

if PIN_PILLOW:
    pip(f"pillow=={PILLOW_VERSION}")
    if "PIL" in sys.modules:
        import PIL
        if getattr(PIL, "__version__", "") != PILLOW_VERSION:
            print(f" Pinned Pillow {PILLOW_VERSION} on disk, but a stale "
                  f"{getattr(PIL, '__version__', '?')} is loaded in memory — restarting runtime.")
            print(" Just re-run the cell(s) after Colab reconnects.")
            os.kill(os.getpid(), 9)

print(" …install finished.\n")

import torch
```

We begin by defining the runtime controls that decide how many invoices we process, whether we use 4-bit loading, whether we preview the generated PDF, and whether we later test a real invoice. We install the core dependencies for PDF generation, rendering, tabular analysis, plotting, and lift-pdf inference. We also pin Pillow to a stable version because the tutorial addresses a known Colab compatibility issue among Pillow, torchvision, and Transformers. This setup gives us a reproducible environment before we load any model or generate any document.

```python
def detect_gpu():
    if not torch.cuda.is_available():
        raise SystemExit(
            "\n✗ No CUDA GPU found. In Colab: Runtime ▸ Change runtime type ▸ GPU "
            "(A100 is best; L4/T4 also work).\n"
        )
    p = torch.cuda.get_device_properties(0)
    cc = torch.cuda.get_device_capability(0)
    return p.name, p.total_memory / 1e9, cc

def enable_4bit(compute_dtype):
    """Load lift's weights in 4-bit NF4 whatever transformers Auto* class it uses internally."""
    import inspect, functools, transformers
    from transformers import BitsAndBytesConfig

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=compute_dtype,
    )

    def patch(cls):
        try:
            cm = inspect.getattr_static(cls, "from_pretrained")
            orig = cm.__func__ if isinstance(cm, (classmethod, staticmethod)) else cm
        except Exception:
            return

        @functools.wraps(orig)
        def inner(cls_, *args, **kwargs):
            kwargs.setdefault("quantization_config", bnb)
            kwargs.setdefault("device_map", {"": 0})
            model = orig(cls_, *args, **kwargs)
            try:
                model.to = lambda *a, **k: model
                model.cuda = lambda *a, **k: model
            except Exception:
                pass
            return model

        cls.from_pretrained = classmethod(inner)

    for name in ["AutoModelForImageTextToText", "AutoModelForMultimodalLM",
                 "AutoModelForVision2Seq", "AutoModelForCausalLM", "AutoModel"]:
        c = getattr(transformers, name, None)
        if c is not None:
            patch(c)
    try:
        from transformers.modeling_utils import PreTrainedModel
        patch(PreTrainedModel)
    except Exception:
        pass

print("STEP 2/7 · Preparing the model backend…")
gpu_name, vram, cc = detect_gpu()
use_4bit = FORCE_4BIT or (vram < 34 and not FORCE_FULL_PRECISION)
compute_dtype = torch.bfloat16 if cc[0] >= 8 else torch.float16
print(f" GPU: {gpu_name} | ~{vram:.0f} GB | compute capability {cc[0]}.{cc[1]}")
print(f" Load mode: {'4-bit NF4' if use_4bit else 'full bf16'} (compute dtype {compute_dtype})")

os.environ.setdefault("TORCH_DEVICE", "cuda:0")
os.environ.setdefault("MODEL_CHECKPOINT", "datalab-to/lift")
if use_4bit:
    enable_4bit(compute_dtype)

from lift import extract
from lift.model import InferenceManager

print(" Loading lift weights (≈20 GB download on first run)…")
_t = time.time()
MODEL = InferenceManager(method="hf")
print(f" ✓ model ready in {time.time() - _t:.0f}s\n")

def run_lift(pdf_path, schema, page_range=None):
    kw = {"model": MODEL}
    if page_range:
        kw["page_range"] = page_range
    result = extract(pdf_path, schema, **kw)
    return getattr(result, "extraction", None)
```

We prepare the GPU-aware inference backend and decide whether the model should run in full precision or 4-bit NF4 quantization based on available VRAM. We patch the Hugging Face model-loading path so lift can transparently load the checkpoint with a BitsAndBytes quantization configuration when needed. We initialize the InferenceManager once and reuse it across all invoices, avoiding repeated model-loading overhead. Finally, we wrap lift.extract() inside a small helper so each PDF can be mined with the same schema and optional page range.

```python
DOCS = [
    dict(
        invoice_number="INV-2026-0412",
        invoice_date="2026-05-04",
        due_date="2026-06-03",
        vendor_name="Cloudworks Inc.",
        vendor_address="500 Market St, Suite 900, San Francisco, CA 94105, USA",
        bill_to_name="Acme Robotics LLC",
        bill_to_address="12 Foundry Rd, Pittsburgh, PA 15222, USA",
        ship_to_name="Acme Robotics — Warehouse 4",
        ship_to_address="88 Dockside Blvd, Newark, NJ 07114, USA",
        po_number=None,
        discount_amount=None,
        currency_code="USD",
        currency_symbol="$",
        tax_rate=0.085,
        amount_paid=0.00,
        line_items=[
            ("Cloud Compute — Standard tier (monthly)", 3, 240.00),
            ("Object Storage — 2 TB", 1, 46.00),
            ("Priority Support add-on", 1, 99.00),
        ],
        notes="Payment due within 30 days. Late payments accrue 1.5% monthly interest.",
    ),
    dict(
        invoice_number="INV-ND-2026-118",
        invoice_date="2026-04-18",
        due_date="2026-05-18",
        vendor_name="Nordic Design Studio Oy",
        vendor_address="Eteläranta 12, 00130 Helsinki, Finland",
        bill_to_name="Helsinki Media Oy",
        bill_to_address="Mannerheimintie 4, 00100 Helsinki, Finland",
        ship_to_name=None,
        ship_to_address=None,
        po_number="PO-HM-5589",
        discount_amount=785.00,
        currency_code="EUR",
        currency_symbol="€",
        tax_rate=0.24,
        amount_paid=8760.60,
        line_items=[
            ("Brand identity design package", 1, 4200.00),
            ("Web UI design — 12 screens", 12, 180.00),
            ("Custom illustration set", 1, 850.00),
            ("Design-system documentation", 1, 640.00),
        ],
        notes="Paid in full — thank you. All amounts in EUR.",
    ),
    dict(
        invoice_number="INV-BR-4471",
        invoice_date="2026-06-01",
        due_date="2026-07-15",
        vendor_name="BuildRight Contractors Inc.",
        vendor_address="740 Industrial Way, Austin, TX 78744, USA",
        bill_to_name="Sunrise Property Group",
        bill_to_address="9 Lakeview Terrace, Austin, TX 78703, USA",
        ship_to_name="Sunrise Property Group — Lot 14 site office",
        ship_to_address="Parcel 14, Mesa Ridge Development, Austin, TX 78737, USA",
        po_number="PO-SPG-2211",
        discount_amount=None,
        currency_code="USD",
        currency_symbol="$",
        tax_rate=0.07,
        amount_paid=15000.00,
        line_items=[
            ("Site preparation and grading", 1, 18500.0
```
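Because each synthetic record stores line items, a discount, a tax rate, and the amount already paid, the expected subtotal, total, and balance can be derived from it before any model output is scored. The sketch below is one way to do that. The `expected_totals` helper, its returned key names, the discount-before-tax order, and half-up rounding to cents are all assumptions for illustration, not the tutorial's own scoring code. The discount-before-tax order does reproduce the second record, where `amount_paid=8760.60` accompanies a "Paid in full" note.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def expected_totals(doc):
    """Derive expected monetary fields from one ground-truth record.

    Assumptions (not confirmed by the tutorial): the discount is applied
    before tax, and tax is rounded half-up to the nearest cent.
    """
    subtotal = sum(
        (Decimal(qty) * Decimal(str(price)) for _, qty, price in doc["line_items"]),
        Decimal("0"),
    )
    discount = Decimal(str(doc.get("discount_amount") or 0))
    taxable = subtotal - discount
    tax = (taxable * Decimal(str(doc["tax_rate"]))).quantize(CENT, rounding=ROUND_HALF_UP)
    total = taxable + tax
    balance = total - Decimal(str(doc["amount_paid"]))
    return {
        "subtotal": subtotal,
        "tax_amount": tax,
        "total_amount": total,
        "balance_due": balance,
        # An invoice counts as paid only when nothing remains due.
        "is_paid": balance <= 0,
    }

# Trimmed copy of the second record above, keeping only the fields used here.
NORDIC = dict(
    discount_amount=785.00,
    tax_rate=0.24,
    amount_paid=8760.60,
    line_items=[
        ("Brand identity design package", 1, 4200.00),
        ("Web UI design — 12 screens", 12, 180.00),
        ("Custom illustration set", 1, 850.00),
        ("Design-system documentation", 1, 640.00),
    ],
)

full = expected_totals(NORDIC)
print(full["total_amount"], full["is_paid"])        # 8760.60 True

# Hypothetical partial payment, to show the "partially paid is still unpaid" trap.
partial = expected_totals(dict(NORDIC, amount_paid=5000.00))
print(partial["balance_due"], partial["is_paid"])   # 3760.60 False
```

Using `Decimal` instead of floats keeps the cent-level comparisons exact, which matters when an extracted total is compared with the ground truth.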