GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval
Key Takeaways
In this tutorial, we work with GLM-5.2 through its hosted, OpenAI-compatible API instead of running the full model locally. We begin by setting up multiple provider options, securely loading the API key, and creating a reusable chat wrapper that supports normal chat, thinking mode, streaming, tool calling, and token tracking. We then move beyond a simple chatbot example and test the model in more practical situations: reasoning-effort control, streamed reasoning and answers, function calling, a small tool-using agent, structured JSON output, long-context retrieval, and cost estimation.

Setting Up the GLM-5.2 OpenAI-Compatible Client and Reusable Chat Wrapper

    import sys, subprocess
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U", "openai"], check=False)

    import os, re, json, time, getpass
    from openai import OpenAI

    PROVIDERS = {
        "zai":         {"base_url": "https://api.z.ai/api/paas/v4/",    "model": "glm-5.2",         "env": "ZAI_API_KEY"},
        "openrouter":  {"base_url": "https://openrouter.ai/api/v1",     "model": "z-ai/glm-5.2",    "env": "OPENROUTER_API_KEY"},
        "together":    {"base_url": "https://api.together.xyz/v1",      "model": "zai-org/GLM-5.2", "env": "TOGETHER_API_KEY"},
        "requesty":    {"base_url": "https://router.requesty.ai/v1",    "model": "zai/glm-5.2",     "env": "REQUESTY_API_KEY"},
        "huggingface": {"base_url": "https://router.huggingface.co/v1", "model": "zai-org/GLM-5.2", "env": "HF_TOKEN"},
    }
    PROVIDER = "zai"
    CFG = PROVIDERS[PROVIDER]
    MODEL = CFG["model"]

    def load_api_key(env_name):
        # Try Colab secrets first, then the environment, then an interactive prompt.
        try:
            from google.colab import userdata
            v = userdata.get(env_name)
            if v:
                return v
        except Exception:
            pass
        if os.environ.get(env_name):
            return os.environ[env_name]
        return getpass.getpass(f"Enter your {env_name}: ")

    client = OpenAI(api_key=load_api_key(CFG["env"]), base_url=CFG["base_url"])

    PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40
    _USAGE = {"in": 0, "out": 0, "calls": 0}

    def _track(usage):
        if usage:
            _USAGE["in"] += getattr(usage, "prompt_tokens", 0) or 0
            _USAGE["out"] += getattr(usage, "completion_tokens", 0) or 0
            _USAGE["calls"] += 1

    def get_reasoning(obj):
        """Pull GLM's hidden reasoning trace from a message/delta (a provider-extra field)."""
        val = getattr(obj, "reasoning_content", None)
        if val:
            return val
        extra = getattr(obj, "model_extra", None) or {}
        if extra.get("reasoning_content"):
            return extra["reasoning_content"]
        try:
            return obj.to_dict().get("reasoning_content")
        except Exception:
            return None

    def chat(messages, effort=None, thinking=True, tools=None, tool_choice="auto",
             stream=False, max_tokens=2048, temperature=1.0, tool_stream=False):
        """
        effort:   None | "high" | "max"  (GLM-5.2 thinking-effort level; max is the model default)
        thinking: True -> deep thinking on; False -> off (fast, cheap, low-latency)
        GLM-specific params go through extra_body so any OpenAI client works.
        """
        extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
        if effort and thinking:
            extra["reasoning_effort"] = effort
        if tool_stream:
            extra["tool_stream"] = True
        kwargs = dict(model=MODEL, messages=messages, max_tokens=max_tokens,
                      temperature=temperature, stream=stream, extra_body=extra)
        if tools:
            kwargs.update(tools=tools, tool_choice=tool_choice)
        if stream:
            kwargs["stream_options"] = {"include_usage": True}
        return client.chat.completions.create(**kwargs)

This sets up the complete foundation for using GLM-5.2 through an OpenAI-compatible API. We define multiple provider options, load the API key securely, create the OpenAI client, and set up token-cost tracking for the entire notebook. We also build a reusable chat wrapper so that every subsequent demo can use thinking mode, reasoning effort, streaming, tool calling, and provider-specific parameters cleanly.

Basic Chat, Thinking-Effort Control, and Streamed Reasoning with GLM-5.2

    def demo_basic():
        print("\n=== 1. BASIC CHAT / SANITY CHECK =========================")
        resp = chat([{"role": "system", "content": "You are a concise technical assistant."},
                     {"role": "user", "content": "In one sentence, what is GLM-5.2 best at?"}],
                    thinking=False, max_tokens=200)
        _track(resp.usage)
        print(resp.choices[0].message.content.strip())

    def demo_effort():
        print("\n=== 2. THINKING-EFFORT CONTROL (off / high / max) ========")
        problem = ("Train A leaves city A at 9:00 going 60 km/h toward city B. "
                   "Train B leaves B (420 km away) at 9:30 going 90 km/h toward A. "
                   "At what clock time do they meet? Show the key steps briefly.")
        for label, kw in [("thinking OFF", dict(thinking=False)),
                          ("effort=high", dict(thinking=True, effort="high")),
                          ("effort=max", dict(thinking=True, effort="max"))]:
            t0 = time.time()
            resp = chat([{"role": "user", "content": problem}], max_tokens=2000, **kw)
            dt = time.time() - t0
            _track(resp.usage)
            msg, u = resp.choices[0].message, resp.usage
            print(f"\n--- {label} | {dt:0.1f}s | out_tokens={getattr(u,'completion_tokens',0)} ---")
            r = get_reasoning(msg)
            if r:
                print("  [reasoning, first 220 chars]: " + " ".join(r.split())[:220] + " ...")
            print("  [answer]: " + " ".join((msg.content or '').split())[:350])

    def demo_streaming():
        print("\n=== 3. STREAMING: reasoning channel vs answer channel ====")
        stream = chat([{"role": "user",
                        "content": "Explain why the sky is blue, then give a one-line TL;DR."}],
                      thinking=True, effort="high", stream=True, max_tokens=1200)
        saw_r = saw_a = False
        usage = None
        for chunk in stream:
            # With include_usage, the usage totals arrive on a chunk of their own.
            if getattr(chunk, "usage", None):
                usage = chunk.usage
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta
            r = get_reasoning(delta)
            if r:
                if not saw_r:
                    print("\n[thinking] ", end="", flush=True); saw_r = True
                print(r, end="", flush=True)
            if getattr(delta, "content", None):
                if not saw_a:
                    print("\n\n[answer] ", end="", flush=True); saw_a = True
                print(delta.content, end="", flush=True)
        print()
        _track(usage)

We start testing GLM-5.2 with basic chat, reasoning-effort control, and streaming output. We first run a simple sanity check, then run the same problem in thinking-off, high-effort, and max-effort modes to compare latency and output-token counts. We also stream the model response so we can view the reasoning channel and the final answer separately while the response is being generated.

Function Calling and a Multi-Step Tool-Using GLM-5.2 Agent

    def tool_calculator(expression: str):
        # Whitelist arithmetic characters before eval; builtins are disabled as well.
        if not re.fullmatch(r"[0-9+\-*/(). %]+", expression or ""):
            return {"error": "unsupported characters"}
        try:
            return {"result": eval(expression, {"__builtins__": {}}, {})}
        except Exception as e:
            return {"error": str(e)}

    _CITY_POP = {"tokyo": 37_400_068, "delhi": 32_900_000, "shanghai": 28_500_000,
                 "sao paulo": 22_400_000, "mexico city": 21_800_000}

    def tool_city_population(city: str):
        return {"city": city, "population": _CITY_POP.get((city or "").strip().lower())}

    TOOLS = [
        {"type": "function", "function": {
            "name": "calculator",
            "description": "Evaluate basic arithmetic like '37400068/21800000'.",
            "parameters": {"type": "object",
                           "properties": {"expression": {"type": "string"}},
                           "required": ["expression"]}}},
        {"type": "function", "function": {
            "name": "city_population",
            "description": "Look up the metro population of a city.",
            "parameters": {"type": "object",
                           "properties": {"city": {"type": "string"}},
                           "required": ["city"]}}},
    ]
    TOOL_IMPLS = {"calculator": tool_calculator, "city_population": tool_city_population}

    def run_tool_loop(messages, max_rounds=6, effort="max"):
        """Full loop: model -> tool_calls -> execute -> feed results back -> repeat."""
        for _ in range(max_rounds):
            resp = chat(messages, tools=TOOLS, thinking=True, effort=effort,
                        max_tokens=1500, temperature=0.3)
            _track(resp.usage)
            m = resp.choices[0].message
            if not getattr(m, "tool_calls", None):
                return m.content
            messages.append({
                "role": "assistant", "content": m.content or "",
                "tool_calls": [{"id": tc.id, "type": "function",
                                "function": {"name": tc.function.name,
                                             "arguments": tc.function.arguments}}
                               for tc
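The docstring of run_tool_loop describes the cycle as model -> tool_calls -> execute -> feed results back -> repeat. The following is a minimal offline sketch of that cycle, not the notebook's own implementation. It assumes a hypothetical `fake_model` function standing in for the API call, a hypothetical `SKETCH_TOOLS` registry with a single `add` tool, and plain dicts instead of SDK message objects. The only part taken from the OpenAI-compatible convention is the message shape: an assistant message carrying `tool_calls`, answered by one `role: "tool"` message per call that echoes the `tool_call_id`.

```python
import json

def tool_add(a, b):
    # Hypothetical tool used only for this sketch.
    return {"result": a + b}

SKETCH_TOOLS = {"add": tool_add}  # hypothetical registry, analogous to TOOL_IMPLS

def fake_model(messages):
    """Hypothetical stand-in for the chat API: requests one tool call, then answers."""
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    if not tool_msgs:
        return {"role": "assistant", "content": "",
                "tool_calls": [{"id": "call_1", "type": "function",
                                "function": {"name": "add",
                                             "arguments": json.dumps({"a": 2, "b": 3})}}]}
    result = json.loads(tool_msgs[-1]["content"])["result"]
    return {"role": "assistant", "content": f"The sum is {result}."}

def run_tool_loop_sketch(messages, model=fake_model, max_rounds=6):
    for _ in range(max_rounds):
        m = model(messages)
        calls = m.get("tool_calls")
        if not calls:
            return m["content"]          # no tool requested: this is the final answer
        messages.append(m)               # keep the assistant turn that requested tools
        for tc in calls:
            fn = SKETCH_TOOLS.get(tc["function"]["name"])
            try:
                args = json.loads(tc["function"]["arguments"] or "{}")
                out = fn(**args) if fn else {"error": "unknown tool"}
            except Exception as e:
                out = {"error": str(e)}  # report failures back to the model
            messages.append({"role": "tool", "tool_call_id": tc["id"],
                             "content": json.dumps(out)})
    return None                          # round budget exhausted without an answer

history = [{"role": "user", "content": "What is 2 + 3?"}]
answer = run_tool_loop_sketch(history)
print(answer)
```

Returning tool errors as JSON rather than raising keeps the conversation going, so the model gets a chance to correct a malformed call within the remaining rounds.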