By Victor Gerbrands

Managing context windows in production ADK agents

Context window management is one of those problems that doesn't show up in tutorials. You build a multi-agent system, it works in demos, and then in production you notice token usage is 10x what you expected, latency is creeping up, and the model is making mistakes that seem like it's not reading what you sent it.

Most of the time, the problem is that too much is going into the context window. This post covers the specific techniques I use in Wolfy — a production Google Ads analyst agent built on Google's Agent Development Kit — to keep context lean across a multi-agent pipeline. Some patterns are ADK-specific. Others apply to any LLM agent architecture.

Why context bloat is worse in multi-agent systems

In a single-agent setup, context growth is linear and mostly predictable. In a multi-agent system, it compounds. Each agent in the chain can add to the context that downstream agents see. Sub-agent tool calls, intermediate results, "For context:" injection blocks — all of it accumulates, and by the time you reach the final agent in a pipeline, the context window can be carrying a lot of weight that nobody asked for.

ADK makes this worse in one specific way that isn't documented well, which is worth understanding before anything else.

The "For context:" problem in ADK

When a sub-agent finishes and control returns to the parent agent, ADK injects a block into the parent's next model call summarizing what the sub-agent did. It looks like this:

CODE
For context:
[wolfy_gaql_generation] called tool `upsert_gaql_query` with parameters: {"query_id": "query-1", "query": "SELECT campaign.id, metrics.cost_micros FROM campaign WHERE ..."}
[wolfy_gaql_generation] `upsert_gaql_query` tool returned result: {"status": "success", "message": "Query stored."}
[wolfy_gaql_execution] called tool `execute_gaql_query` with parameters: {...}
[wolfy_gaql_execution] `execute_gaql_query` tool returned result: {"status": "success", "rows": [...thousands of rows...]}

ADK is trying to give the parent agent visibility into what sub-agents did. The intention is reasonable. The problem is that these blocks can be enormous — raw execution results with thousands of rows, serialized into every subsequent model call. In Wolfy, the GAQL execution agent can return tens of thousands of tokens' worth of API results, and without intervention all of it lands in the parent agent's context window on every subsequent turn.

The fix is a before_model_callback that strips these blocks before they reach the model:

PYTHON
_FOR_CONTEXT_ADK_PART_RE = re.compile(
    r'^\[[\w]+\] (?:called tool `[\w]+` with parameters:|`[\w]+` tool returned result:)'
)

def _is_for_context_block(content) -> bool:
    if not content or content.role != 'user':
        return False
    parts = getattr(content, 'parts', None)
    if not parts or len(parts) < 2:
        return False
    if getattr(parts[0], 'text', None) != 'For context:':
        return False
    return any(
        getattr(p, 'text', None) and _FOR_CONTEXT_ADK_PART_RE.match(p.text)
        for p in parts[1:]
    )

def strip_subagent_tool_responses_callback(
    callback_context: CallbackContext, llm_request: LlmRequest
) -> None:
    filtered_contents = []
    for content in llm_request.contents:
        if _is_for_context_block(content):
            content.parts = [
                p for p in content.parts
                if not (getattr(p, 'text', None) and _FOR_CONTEXT_ADK_PART_RE.match(p.text))
            ]
            if len(content.parts) <= 1:
                continue
        filtered_contents.append(content)
    llm_request.contents = filtered_contents

The callback identifies "For context:" blocks by their structure: a user-role content item where the first part is the literal string "For context:" and at least one subsequent part matches ADK's known injection format. It strips the raw tool call and response parts, keeping only text parts — which contain whatever the sub-agent actually said to the user. If the block becomes empty after stripping, it's dropped entirely.

The parent agent still knows what the sub-agents accomplished through session state, which is where results are stored explicitly. It doesn't need the raw tool call log in its context window.
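The tool side of that contract can be sketched like this. It's a minimal illustration, not Wolfy's actual tool: the function and field names are hypothetical, the Ads API call is stubbed, and `tool_context` stands in for ADK's `ToolContext`, whose `state` attribute behaves as a mutable mapping. The point is the split: full rows go to session state, the model gets a compact acknowledgement.

```python
def _run_against_ads_api(query: str) -> list[dict]:
    # Stub standing in for the real Google Ads API call.
    return [{"campaign.id": i, "metrics.cost_micros": 1_000_000 * i} for i in range(3)]

def execute_gaql_query(query_id: str, query: str, tool_context) -> dict:
    rows = _run_against_ads_api(query)
    # Full result rows go to session state, keyed for downstream agents.
    results = tool_context.state.setdefault("gaql_execution_results", {})
    results[query_id] = {"rows": rows, "row_count": len(rows)}
    # The model only ever sees this compact acknowledgement.
    return {"status": "success", "query_id": query_id, "row_count": len(rows)}
```

Downstream agents read `gaql_execution_results` from state; the model's transcript only carries the three-field status dict.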

This callback runs on every agent in the pipeline:

PYTHON
before_model_callback=[strip_subagent_tool_responses_callback]

It should probably be one of the first things you add to any ADK multi-agent system.

Excluding conversation history from specialist agents

ADK's default behavior is to include the full conversation history in every model call. For the root conversational agent, that's appropriate. For specialist agents deep in a pipeline, it's wasteful and often counterproductive.

The GAQL execution agent doesn't need to know what the user said three turns ago. The report writer doesn't need to see the full back-and-forth that led to the data retrieval request. These agents have a specific job, and their inputs come from session state rather than conversation history.

ADK supports this with include_contents="none":

PYTHON
gaql_execution_agent = LlmAgent(
    name="wolfy_gaql_execution",
    model=GEMINI_FLASH_MODEL,
    include_contents='none',
    instruction=GAQL_EXECUTION_PROMPT,
    ...
)

With include_contents="none", the agent only sees its system prompt and whatever is injected via state variables into the instruction template. This is the right default for any agent that operates on structured inputs from state rather than conversational context. In Wolfy it's used on the GAQL execution agent, both report writers, and the audit writer.

Compressing state before model calls

Session state is how agents in a pipeline communicate in ADK. But state can grow large, and if you inject large state values into a prompt template, they land in the context window in full.

The GAQL query state is a good example. During data retrieval, Wolfy tracks every query with its full metadata: the query string, description, status, error messages, result row counts, and dependencies. By the time the loop analyst needs to review query state, this dict can be substantial.

The analyst doesn't need the full query strings to do its job — it needs to know which queries succeeded, which failed, and what the errors were. A before_model_callback compresses the state to just that before the model call:

PYTHON
def compact_gaql_queries_callback(callback_context: CallbackContext, llm_request: LlmRequest):
    queries = callback_context.state.get("gaql_queries") or {}
    compact = {}
    for qid, q in queries.items():
        status = q.get("status")
        if status == "success":
            compact[qid] = {"id": qid, "title": q.get("query_title"), "status": status, "row_count": q.get("row_count")}
        elif status == "error":
            compact[qid] = {"id": qid, "title": q.get("query_title"), "status": status,
                            "error_message": q.get("error_message"), "query": q.get("query")}
        else:
            compact[qid] = {"id": qid, "title": q.get("query_title"), "status": status,
                            "depends_on": q.get("depends_on")}
    callback_context.state["gaql_queries_compact"] = compact

The compact version is what gets injected into the analyst's prompt via {gaql_queries_compact}. The full state remains available in gaql_queries for agents that need it. The analyst sees a tight summary; the execution agent sees the full structure it needs to operate.

The pattern generalizes: before any model call, ask what the model actually needs from state versus what's just there. Write a callback that produces a reduced version for the prompt.
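One way to keep that habit cheap is to extract the reduction into a plain function that the callback merely calls — then the reduction is trivially unit-testable without any ADK machinery. A sketch (Wolfy's actual callback inlines this logic; the per-status fields mirror the snippet above):

```python
def compact_query(qid: str, q: dict) -> dict:
    # Keep only what the analyst needs for each status; drop the full
    # query string unless the analyst must see it (i.e. on error).
    status = q.get("status")
    entry = {"id": qid, "title": q.get("query_title"), "status": status}
    if status == "success":
        entry["row_count"] = q.get("row_count")
    elif status == "error":
        entry["error_message"] = q.get("error_message")
        entry["query"] = q.get("query")
    else:
        entry["depends_on"] = q.get("depends_on")
    return entry

queries = {
    "q1": {"query_title": "Spend by campaign", "status": "success",
           "row_count": 42, "query": "SELECT ... (long GAQL string)"},
    "q2": {"query_title": "Conversions", "status": "error",
           "error_message": "Invalid field", "query": "SELECT ..."},
}
compact = {qid: compact_query(qid, q) for qid, q in queries.items()}
```

The callback body then reduces to one dict comprehension plus a state write, and the reduction rules live somewhere you can test them.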

Caching static documentation with static_instruction

Some agents need large reference documents to do their job. In Wolfy, the GAQL generation and analyst agents need the full Google Ads API field documentation: every valid field, its type, its valid enum values, what resources it can be combined with.

Injecting this into the regular instruction means it goes into every model call, and it can't be cached because the instruction changes each turn as state variables are interpolated.

ADK has a static_instruction parameter for exactly this case:

PYTHON
analyst_agent = LlmAgent(
    name="wolfy_analyst",
    static_instruction=STATIC_GAQL_DOCS,
    instruction=ANALYST_PROMPT,
    ...
)

static_instruction is fixed content that doesn't change between calls. Because it's a stable prefix, the model provider can cache it: cached prefix tokens are billed at a steep discount and don't need to be re-processed on each turn. The dynamic instruction (which contains state interpolations like {date}, {gaql_queries_compact}, {analyst_output}) sits alongside it.

For Wolfy, STATIC_GAQL_DOCS contains the full GAQL field reference. It's large, but it only needs to be processed once. Without static_instruction, it would be re-sent and re-processed on every analyst model call in the loop.

Keeping raw data out of LLM context entirely

The most effective context optimization is preventing data from entering the context window in the first place. Two patterns handle this in Wolfy.

The first is the DataSummarizer — a BaseAgent subclass with no model, just Python:

PYTHON
class DataSummarizer(BaseAgent):
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        # Session state holds the raw results written by the execution tool.
        state = ctx.session.state
        summarized: Dict[str, Any] = {}

        gaql_results = state.get("gaql_execution_results")
        if gaql_results and isinstance(gaql_results, dict):
            for qid, result in gaql_results.items():
                summarized[qid] = summarize_gaql_result(qid, result)

        state[SUMMARIZED_RESULTS_KEY] = summarized

        yield Event(
            invocation_id=ctx.invocation_id,
            author=self.name,
            actions=EventActions(skip_summarization=True),
        )

After the GAQL loop finishes, the data summarizer runs before the report writer. It computes grand totals, KPIs (CTR, ROAS, CPA, CVR), and enriched per-row metrics from the raw API results, and stores the computed summary in state. The report writer calls get_summarized_results() and gets a compact, semantically rich payload. It never sees raw cost_micros values or individual API rows.
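The summarize_gaql_result helper isn't shown above. A minimal sketch of what it might compute — the micros-to-currency conversion and metric field names follow Google Ads API conventions, but the exact shape Wolfy produces may differ, and the real helper also does per-row enrichment:

```python
def summarize_gaql_result(qid: str, result: dict) -> dict:
    # Hedged sketch: assumes rows carry Google Ads-style metric fields.
    rows = result.get("rows", [])
    clicks = sum(r.get("metrics.clicks", 0) for r in rows)
    impressions = sum(r.get("metrics.impressions", 0) for r in rows)
    cost = sum(r.get("metrics.cost_micros", 0) for r in rows) / 1_000_000
    conversions = sum(r.get("metrics.conversions", 0) for r in rows)
    conv_value = sum(r.get("metrics.conversions_value", 0) for r in rows)
    return {
        "query_id": qid,
        "row_count": len(rows),
        "totals": {"clicks": clicks, "impressions": impressions,
                   "cost": round(cost, 2), "conversions": conversions},
        "kpis": {
            # Guard divisions: zero denominators are common in sparse data.
            "ctr": clicks / impressions if impressions else 0.0,
            "cpa": cost / conversions if conversions else 0.0,
            "roas": conv_value / cost if cost else 0.0,
            "cvr": conversions / clicks if clicks else 0.0,
        },
    }
```

The report writer receives ratios and currency amounts with names it can reason about, instead of rows of raw micros.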

The second pattern is row capping at the tool boundary. Fixed data retrieval tools can return thousands of rows from the Google Ads API. The full results are stored in state for correct aggregate calculations. But what gets returned to the model is capped:

PYTHON
MAX_FIXED_ROWS_FOR_LLM = 100

def truncate_fixed_result_for_llm(result: Dict[str, Any]) -> Dict[str, Any]:
    out = dict(result)
    row_limits: Dict[str, Dict[str, int]] = {}
    for key, value in list(out.items()):
        if (isinstance(value, list) and len(value) > MAX_FIXED_ROWS_FOR_LLM
                and isinstance(value[0], dict)):
            total = len(value)
            out[key] = value[:MAX_FIXED_ROWS_FOR_LLM]
            row_limits[key] = {"shown": MAX_FIXED_ROWS_FOR_LLM, "total": total}
    if row_limits:
        out["_row_limits"] = row_limits
    return out

The tool returns at most 100 rows to the model, with a _row_limits metadata field so the model knows the full dataset is larger. Totals and aggregates are calculated from the full dataset before truncation, so the model gets accurate numbers alongside a representative sample of rows. The report writer can request full rows from state when it specifically needs them, but in practice the summary is almost always sufficient.
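A quick usage sketch makes the model-facing payload concrete (the function is restated here so the example stands alone; the sample data is illustrative):

```python
from typing import Any, Dict

MAX_FIXED_ROWS_FOR_LLM = 100

def truncate_fixed_result_for_llm(result: Dict[str, Any]) -> Dict[str, Any]:
    # Same logic as above: cap any list-of-dicts field, record what was cut.
    out = dict(result)
    row_limits: Dict[str, Dict[str, int]] = {}
    for key, value in list(out.items()):
        if (isinstance(value, list) and len(value) > MAX_FIXED_ROWS_FOR_LLM
                and isinstance(value[0], dict)):
            row_limits[key] = {"shown": MAX_FIXED_ROWS_FOR_LLM, "total": len(value)}
            out[key] = value[:MAX_FIXED_ROWS_FOR_LLM]
    if row_limits:
        out["_row_limits"] = row_limits
    return out

# Aggregates are computed from the full dataset *before* truncation,
# so the model sees accurate totals next to a capped sample of rows.
rows = [{"campaign_id": i, "cost": 1.0} for i in range(250)]
result = {"rows": rows, "total_cost": sum(r["cost"] for r in rows)}
capped = truncate_fixed_result_for_llm(result)
```

Because `dict(result)` is a shallow copy and the capped list is a new slice, the full 250-row list stored in state is left untouched.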

Forcing compact output with structured schemas

One more ADK-specific pattern worth mentioning: using output_schema to force structured output on agents that would otherwise produce verbose text.

The GAQL execution agent's job is to run queries and report what happened. Without constraints, a model might produce a multi-paragraph summary of each query's results. With an output schema, it produces exactly what's needed:

PYTHON
class GAQLExecutionOutput(BaseModel):
    status: str
    total_queries: int
    successful_queries: int
    failed_queries: int

gaql_execution_agent = LlmAgent(
    name="wolfy_gaql_execution",
    output_schema=GAQLExecutionOutput,
    output_key="gaql_execution_output",
    ...
)

The model is forced to produce a four-field JSON object. That output gets stored in gaql_execution_output in state and never needs to enter another agent's context window. The actual query results are in gaql_execution_results, stored directly by the tool, not via model output.

Putting it together

These techniques stack. In the Wolfy GAQL loop, a single data retrieval cycle applies most of them in combination:

  1. strip_subagent_tool_responses_callback removes "For context:" injection blocks before every model call across all agents
  2. Specialist agents use include_contents="none" so conversation history never enters their context
  3. compact_gaql_queries_callback compresses query state before the analyst model call
  4. static_instruction loads the GAQL field reference once and caches it, separate from the dynamic instruction
  5. DataSummarizer pre-computes everything before the report writer runs, so raw rows never enter a model context
  6. truncate_fixed_result_for_llm caps tool return values at 100 rows while keeping full data available in state
  7. output_schema on the execution agent forces a four-field summary instead of verbose text

None of these are complicated individually. The compounding effect is what matters. Each one reduces the token footprint of a model call, which reduces cost, reduces latency, and reduces the likelihood of the model getting confused by irrelevant content buried in a long context.

The general principle behind all of them: a model should only see what it needs to do its specific job. Everything else should live in state, get computed in Python, or be stripped by a callback before the model ever sees it.