Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts, Structured Feedback, and Held-Out Validation
Summary
In this tutorial, we use GEPA as a reflective prompt-evolution framework to improve how a language model solves arithmetic word problems. We begin with a weak seed prompt, create a small deterministic benchmark, define a structured evaluator, and pass actionable feedback to GEPA so it can understand why a candidate prompt fails. We also use a multi-component prompt setup in which the instruction field and the output-format rules evolve together. At the end, we compare the baseline prompt with the optimized prompt on a held-out validation set and inspect how the evolutionary process improves performance.

Installing GEPA and LiteLLM and Configuring the Task and Reflection Models

```python
# Notebook cell: the leading "!" runs pip in a shell.
!pip install -q gepa litellm

import os, re, json, random, getpass, textwrap
import litellm
import gepa.optimize_anything as oa
from gepa.optimize_anything import (
    optimize_anything, GEPAConfig, EngineConfig, ReflectionConfig,
)

litellm.suppress_debug_info = True

# Prompt for the key only if it is not already set in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

TASK_LM = "openai/gpt-4o-mini"    # model that solves the problems
REFLECTION_LM = "openai/gpt-4.1"  # model that rewrites the prompt
MAX_METRIC_CALLS = 100            # evaluation budget for the optimizer
```

We install GEPA and LiteLLM, then import the libraries needed for prompt optimization and model calls. The OpenAI API key is read from the environment or requested through getpass, so it is never hard-coded or echoed. We define two models: a task model that solves each problem and a reflection model that improves the prompt. We also cap the number of metric calls at 100 to keep the optimization cost bounded.
Building a Deterministic Arithmetic Benchmark Dataset

```python
def make_problems(n, seed=0):
    """Generate n word problems with programmatically computed integer answers."""
    rng = random.Random(seed)  # local RNG, so the same seed gives the same dataset
    out = []
    for _ in range(n):
        t = rng.choice(["discount", "travel", "wallet", "chain"])
        if t == "discount":
            unit = rng.choice([40, 60, 80, 120])
            qty = rng.choice([5, 6, 8, 10])
            disc = rng.choice([10, 20, 25, 50])
            total = unit * qty
            gold = total - total * disc // 100
            q = (f"A shop sells notebooks at {unit} rupees each. You buy {qty} "
                 f"notebooks and get a {disc}% discount on the total bill. "
                 f"How many rupees do you pay in total?")
        elif t == "travel":
            s1, h1 = rng.choice([40, 50, 60]), rng.choice([2, 3])
            s2, h2 = rng.choice([30, 45, 70]), rng.choice([1, 2, 3])
            gold = s1 * h1 + s2 * h2
            q = (f"A car drives at {s1} km/h for {h1} hours, then at {s2} km/h "
                 f"for {h2} hours. What is the total distance travelled, in km?")
        elif t == "wallet":
            tens = rng.choice([3, 5, 7, 9])
            fifties = rng.choice([2, 4, 6])
            spent = rng.choice([50, 80, 110, 150])
            # Can be negative (e.g. 3 tens + 2 fifties - 150 = -20);
            # the answer parser accepts a leading minus sign.
            gold = tens * 10 + fifties * 50 - spent
            q = (f"You have {tens} ten-rupee notes and {fifties} fifty-rupee "
                 f"notes. You spend {spent} rupees. How many rupees are left?")
        else:
            x = rng.choice([6, 9, 12, 15])
            y = rng.choice([4, 7, 10])
            z = rng.choice([3, 8, 11])
            gold = x * 2 - y + z
            q = (f"Start with the number {x}. Double it, then subtract {y}, "
                 f"then add {z}. What number do you end with?")
        out.append({"question": q, "answer": gold})
    return out

all_problems = make_problems(18, seed=42)
random.Random(1).shuffle(all_problems)
trainset = all_problems[:12]
valset = all_problems[12:]
print(f"Dataset: {len(trainset)} train / {len(valset)} val problems\n")
```

We create a small deterministic dataset of arithmetic word problems covering discounts, travel distance, wallet calculations, and chained operations. We generate the correct answer for each problem programmatically, which keeps the benchmark reliable and easy to evaluate.
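Because the gold answers are computed rather than hand-labelled, they can be checked offline. The sketch below is an illustration only: it copies the value pools from the "discount" branch and confirms that the floor division in the discount formula never discards a remainder for those pools. The helper names `discount_gold` and `inexact_discount_cases` are hypothetical and not part of the tutorial.

```python
from itertools import product

# Value pools copied from the "discount" branch of make_problems.
UNITS = [40, 60, 80, 120]
QTYS = [5, 6, 8, 10]
DISCOUNTS = [10, 20, 25, 50]


def discount_gold(unit: int, qty: int, disc: int) -> int:
    """Same formula as the benchmark: total minus the floored discount."""
    total = unit * qty
    return total - total * disc // 100


def inexact_discount_cases() -> list:
    """Return every (unit, qty, disc) combination where flooring drops a remainder."""
    return [
        (u, q, d)
        for u, q, d in product(UNITS, QTYS, DISCOUNTS)
        if (u * q * d) % 100 != 0
    ]


# Worked example: 5 notebooks at 80 rupees with a 25% discount.
example = discount_gold(80, 5, 25)    # total 400, discount 100
bad_cases = inexact_discount_cases()  # combinations with a fractional discount
print(example, len(bad_cases))
```

Every unit price is a multiple of 20 and every discount a multiple of 5, so the discount is always a whole number of rupees and the integer gold answer is exact.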
We then shuffle the examples and split them into a training set for optimization and a validation set for testing generalization.

Defining the Evaluator and Structured Feedback for GEPA

```python
def build_system_prompt(candidate: dict) -> str:
    """Join the two evolving components into one system prompt."""
    return (f"{candidate['instructions']}\n\n"
            f"OUTPUT FORMAT RULES:\n{candidate['format_rules']}")

def call_task_lm(system_prompt: str, question: str) -> str:
    """Call the task model, retrying up to three times before reporting the error."""
    for attempt in range(3):
        try:
            r = litellm.completion(
                model=TASK_LM,
                messages=[{"role": "system", "content": system_prompt},
                          {"role": "user", "content": question}],
                temperature=0, max_tokens=600, timeout=60,
            )
            return r["choices"][0]["message"]["content"] or ""
        except Exception as e:
            if attempt == 2:
                return f"[LM_ERROR] {e}"
    return ""

def parse_answers(text: str):
    """Return (value after the first '####', last integer anywhere in the text)."""
    formatted = re.search(r"####\s*(-?\d+)", text)
    all_nums = re.findall(r"-?\d+", text)
    fmt_val = int(formatted.group(1)) if formatted else None
    last_val = int(all_nums[-1]) if all_nums else None
    return fmt_val, last_val

def evaluate(candidate: dict, example: dict):
    system = build_system_prompt(candidate)
    raw = call_task_lm(system, example["question"])
    gold = example["answer"]
    fmt_val, last_val = parse_answers(raw)

    if fmt_val is not None and fmt_val == gold:
        score, fb = 1.0, "Correct and correctly formatted."
    elif fmt_val is not None and fmt_val != gold:
        score, fb = 0.0, (f"WRONG ANSWER. You output '#### {fmt_val}' but the "
                          f"correct answer is {gold}. Re-check the arithmetic and "
                          f"the order of the steps.")
    elif last_val == gold:
        score, fb = 0.5, (f"Right number ({gold}) but FORMAT VIOLATION: the final "
                          f"line was not exactly '#### {gold}'. Always end with a "
                          f"line of the form '#### <integer>' and nothing else.")
    else:
        score, fb = 0.0, (f"WRONG. Correct answer is {gold}. The model's final "
                          f"number was {last_val}. Likely a multi-step reasoning "
                          f"slip; show each step and verify before answering.")

    oa.log(f"score={score} gold={gold} parsed_fmt={fmt_val} parsed_last={last_val}")
    # Side information that GEPA's reflection step reads alongside the score.
    side_info = {
        "feedback": fb,
        "problem": example["question"],
        "gold_answer": gold,
        "model_output": raw[:500],
    }
    return score, side_info

def eval_set(candidate, dataset, label=""):
    """Score a candidate on a dataset; return (average score, exact-match rate)."""
    scores, exact = [], 0
    for ex in dataset:
        s, info = evaluate(candidate, ex)
        scores.append(s)
        if s == 1.0:
            exact += 1
    acc = exact / len(dataset)
    avg = sum(scores) / len(dataset)
    print(f"  [{label}] avg_score={avg:.3f}  exact_correct+formatted={exact}/{len(dataset)}")
    return avg, acc
```

We define how the candidate prompt is converted into a system prompt and how the task model receives each question. We also create the evaluator that parses the model output, checks whether the final answer follows the required #### <integer> format, and assigns a score of 1.0, 0.5, or 0.0. We return structured feedback as actionable side information so that GEPA can tell whether a failure comes from incorrect reasoning or from a formatting violation.

Configuring GEPA and Running the Prompt Optimization

```python
seed_candidate = {
    "instructions": "Solve the math problem.",
    "format_rules": "Give the answer.",
}

print("=== BASELINE (seed prompt) ===")
print("Train:"); base_train = eval_set(seed_candidate, trainset, "train")
print("Val:  "); base_val = eval_set(seed_candidate, valset, "val")
print()

objective = (
    "Evolve a system prompt (the 'instructions' and 'format_rules' fields) so a "
    "small LLM reliably solves multi-step arithmetic word problems AND always "
    "ends with a line of exactly the form '#### <integer>'. Maximize the score."
)
background = (
    "Scoring: 1.0 = correct number in the exact '#### <int>' format; 0.5 = correct "
    "number but wrong/missing format; 0.0 = wrong number. Common failures are (a) not "
    "emitting the '####' line, and (b) order-of-operations or multi-step slips. The "
    "winning prompt should force explicit step-by-step work, a verification step, and "
    "a strict final-answer line."
)

config = GEPAConfig(
    engine=EngineConfig(
        max_metric_calls=MAX_METRIC_CALLS,
        max_workers=4,
        parallel=True,
        display_progress_bar=True,
        seed=0,
    ),
    reflection=ReflectionConfig(
        reflection_lm=REFLECTION_LM,
    ),
)

print("=== RUNNING GEPA (this calls the LLMs; ~1-4 min) ===")
result = optimize_anything(
    seed_candidate=seed_candidate,
    evaluator=evaluate,
    dataset=trainset,
    valset=valset,
    objective=objective,
    background=background,
    config=config,
)
```

We start with a weak seed prompt and evaluate its baseline performance on both the training and validation sets. We then define the o
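The three-level scoring rule used by the evaluator can be exercised without any model call. The following is a minimal sketch, assuming hand-written strings in place of real model outputs: `parse_answers` is copied from the tutorial, while `score_output` and the sample strings are hypothetical and mirror only the branching in `evaluate`, not its feedback text or logging.

```python
import re


def parse_answers(text: str):
    """Copied from the tutorial: ('####' value or None, last integer or None)."""
    formatted = re.search(r"####\s*(-?\d+)", text)
    all_nums = re.findall(r"-?\d+", text)
    fmt_val = int(formatted.group(1)) if formatted else None
    last_val = int(all_nums[-1]) if all_nums else None
    return fmt_val, last_val


def score_output(raw: str, gold: int) -> float:
    """Hypothetical helper reproducing only the score branches of evaluate()."""
    fmt_val, last_val = parse_answers(raw)
    if fmt_val is not None:
        # A '####' line was found: it is either exactly right or wrong.
        return 1.0 if fmt_val == gold else 0.0
    # No '####' line: half credit if the last number in the text is correct.
    return 0.5 if last_val == gold else 0.0


# Hand-written stand-ins for model outputs; the gold answer is 300.
samples = {
    "formatted_correct": "80 * 5 = 400\n400 - 100 = 300\n#### 300",
    "unformatted_correct": "The total is 300 rupees.",
    "formatted_wrong": "80 * 5 = 400\n#### 250",
    "unformatted_wrong": "I think it is 310.",
}
scores = {name: score_output(text, 300) for name, text in samples.items()}
print(scores)
```

Separating correct-but-unformatted outputs (0.5) from wrong ones (0.0) is what lets the feedback point the reflection model at the format rules rather than at the reasoning instructions.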