GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval
In this tutorial , we work with GLM-5.2 and use its hosted, OpenAI-compatible API instead of running the full model locally.

In this tutorial , we work with GLM-5.2 and use its hosted, OpenAI-compatible API instead of running the full model locally. We begin by setting up multiple provider options, securely loading the API key, and creating a reusable chat wrapper that supports normal chat, thinking mode, streaming, tool calling, and token tracking. Then we move beyond a simple chatbot example and test the model in more practical situations, including reasoning-effort control, streamed reasoning and answers, function calling, a small tool-using agent, structured JSON output, long-context retrieval, and cost estimation.
We set up the complete foundation for using GLM-5.2 through an OpenAI-compatible API. We define multiple provider options, load the API key securely, create the OpenAI client, and set up token-cost tracking for the entire notebook. We also build a reusable chat wrapper so that every subsequent demo can use thinking mode, reasoning effort, streaming, tool calling, and provider-specific parameters cleanly.
We start testing GLM-5.2 with basic chat, reasoning-effort control, and streaming output. We first run a simple sanity check, then compare the same problem across thinking-off, high-effort, and max-effort modes to observe changes in latency and output tokens. We also stream the model response so we can view the reasoning channel and the final answer separately as the response is being generated.
We connect GLM-5.2 to external tools and build a small tool-using workflow. We define a calculator and a city-population lookup tool, register them in an OpenAI-style tool schema, and create a loop in which the model requests tool calls and receives tool results. We then use this setup for a direct function-calling task and a small multi-step agent that looks up populations, ranks cities, and performs calculations without guessing.
We focus on reliable, structured output and long-context retrieval. We create a JSON extraction helper, ask the model to return a strict JSON object, and retry once if the first response is not valid JSON. We also build a synthetic long document with a hidden “needle” and send it to GLM-5.2 to check whether the model retrieves the exact launch code from the provided context.
We finish the tutorial by collecting usage information and running all demos from top to bottom. We calculate the estimated cost from total input and output tokens, then print a compact summary of calls, token counts, and spend. We also use a driver loop so that a single failed demo does not halt the entire notebook, making the tutorial easier to run, debug, and reuse.
In conclusion, we have a practical and reusable workflow for using GLM-5.2 in Python applications. We learned how to control its reasoning behavior, compare different thinking modes, connect it with tools, validate structured outputs, test long-context inputs, and monitor token usage with estimated cost. It provides us a strong starting point for building more advanced systems such as research assistants, document analysis tools, coding agents, long-context retrieval workflows, or API-based reasoning pipelines. We finished with a setup that is lightweight enough for Colab but still close to how we would build with GLM-5.2 in a real project.
Source: MarkTechPost