Managing context windows in production ADK agents
Context window management is one of those problems that don't show up in tutorials. You build a multi-agent system, it works in demos, and then in production you notice token usage is 10x what you expected, latency is creeping up, and the model is making mistakes that look like it never read what you sent it.
Most of the time, the problem is that too much is going into the context window. This post covers the specific techniques I use in Wolfy — a production Google Ads analyst agent built on Google's Agent Development Kit — to keep context lean across a multi-agent pipeline. Some patterns are ADK-specific. Others apply to any LLM agent architecture.
Why context bloat is worse in multi-agent systems
In a single-agent setup, context growth is linear and mostly predictable. In a multi-agent system, it compounds. Each agent in the chain can add to the context that downstream agents see. Sub-agent tool calls, intermediate results, "For context:" injection blocks — all of it accumulates, and by the time you reach the final agent in a pipeline, the context window can be carrying a lot of weight that nobody asked for.
ADK makes this worse in one specific way that isn't documented well, which is worth understanding before anything else.
The "For context:" problem in ADK
When a sub-agent finishes and control returns to the parent agent, ADK injects a block into the parent's next model call summarizing what the sub-agent did. It looks like this:
For context:
[wolfy_gaql_generation] called tool `upsert_gaql_query` with parameters: {"query_id": "query-1", "query": "SELECT campaign.id, metrics.cost_micros FROM campaign WHERE ..."}
[wolfy_gaql_generation] `upsert_gaql_query` tool returned result: {"status": "success", "message": "Query stored."}
[wolfy_gaql_execution] called tool `execute_gaql_query` with parameters: {...}
[wolfy_gaql_execution] `execute_gaql_query` tool returned result: {"status": "success", "rows": [...thousands of rows...]}
ADK is trying to give the parent agent visibility into what sub-agents did. The intention is reasonable. The problem is that these blocks can be enormous — raw execution results with thousands of rows, serialized into the context of every subsequent model call. In Wolfy, the GAQL execution agent can return tens of thousands of tokens worth of API results. Without intervention, all of that ends up in the parent agent's context window on every subsequent turn.
The fix is a before_model_callback that strips these blocks before they reach the model:
import re

from google.adk.agents.callback_context import CallbackContext
from google.adk.models import LlmRequest

_FOR_CONTEXT_ADK_PART_RE = re.compile(
    r'^\[[\w]+\] (?:called tool `[\w]+` with parameters:|`[\w]+` tool returned result:)'
)


def _is_for_context_block(content) -> bool:
    if not content or content.role != 'user':
        return False
    parts = getattr(content, 'parts', None)
    if not parts or len(parts) < 2:
        return False
    if getattr(parts[0], 'text', None) != 'For context:':
        return False
    return any(
        getattr(p, 'text', None) and _FOR_CONTEXT_ADK_PART_RE.match(p.text)
        for p in parts[1:]
    )


def strip_subagent_tool_responses_callback(
    callback_context: CallbackContext, llm_request: LlmRequest
) -> None:
    filtered_contents = []
    for content in llm_request.contents:
        if _is_for_context_block(content):
            content.parts = [
                p for p in content.parts
                if not (getattr(p, 'text', None) and _FOR_CONTEXT_ADK_PART_RE.match(p.text))
            ]
            # Nothing left but the "For context:" header: drop the block entirely.
            if len(content.parts) <= 1:
                continue
        filtered_contents.append(content)
    llm_request.contents = filtered_contents
The callback identifies "For context:" blocks by their structure: a user-role content item where the first part is the literal string "For context:" and at least one subsequent part matches ADK's known injection format. It strips the raw tool call and response parts, keeping only text parts — which contain whatever the sub-agent actually said to the user. If the block becomes empty after stripping, it's dropped entirely.
The parent agent still knows what the sub-agents accomplished through session state, which is where results are stored explicitly. It doesn't need the raw tool call log in its context window.
This callback runs on every agent in the pipeline:
before_model_callback=[strip_subagent_tool_responses_callback]
It should probably be one of the first things you add to any ADK multi-agent system.
Excluding conversation history from specialist agents
ADK's default behavior is to include the full conversation history in every model call. For the root conversational agent, that's appropriate. For specialist agents deep in a pipeline, it's wasteful and often counterproductive.
The GAQL execution agent doesn't need to know what the user said three turns ago. The report writer doesn't need to see the full back-and-forth that led to the data retrieval request. These agents have a specific job, and their inputs come from session state rather than conversation history.
ADK supports this with include_contents="none":
gaql_execution_agent = LlmAgent(
    name="wolfy_gaql_execution",
    model=GEMINI_FLASH_MODEL,
    include_contents='none',
    instruction=GAQL_EXECUTION_PROMPT,
    ...
)
With include_contents="none", the agent only sees its system prompt and whatever is injected via state variables into the instruction template. This is the right default for any agent that operates on structured inputs from state rather than conversational context. In Wolfy it's used on the GAQL execution agent, both report writers, and the audit writer.
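ADK resolves `{state_key}` placeholders in the instruction against session state before each model call, which is how a context-free agent still gets its inputs. A minimal stand-in for that interpolation, with an illustrative prompt and state value (not Wolfy's actual ones):

```python
# Stand-in for ADK's instruction templating. ADK itself performs this
# substitution; plain str.format illustrates the mechanism.
GAQL_EXECUTION_PROMPT = (
    "Execute each pending GAQL query below and report per-query status.\n\n"
    "Pending queries:\n{gaql_queries_compact}"
)

session_state = {
    "gaql_queries_compact": '{"query-1": {"status": "pending", "title": "Spend by campaign"}}',
}
rendered = GAQL_EXECUTION_PROMPT.format(**session_state)
```

The rendered instruction is the agent's entire view of the world: no user turns, no sibling-agent chatter, just the job.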
Compressing state before model calls
Session state is how agents in a pipeline communicate in ADK. But state can grow large, and if you inject large state values into a prompt template, they land in the context window in full.
The GAQL query state is a good example. During data retrieval, Wolfy tracks every query with its full metadata: the query string, description, status, error messages, result row counts, and dependencies. By the time the loop analyst needs to review query state, this dict can be substantial.
The analyst doesn't need the full query strings to do its job — it needs to know which queries succeeded, which failed, and what the errors were. A before_model_callback compresses the state to just that before the model call:
def compact_gaql_queries_callback(
    callback_context: CallbackContext, llm_request: LlmRequest
) -> None:
    queries = callback_context.state.get("gaql_queries") or {}
    compact = {}
    for qid, q in queries.items():
        status = q.get("status")
        base = {"id": qid, "title": q.get("query_title"), "status": status}
        if status == "success":
            compact[qid] = {**base, "row_count": q.get("row_count")}
        elif status == "error":
            compact[qid] = {**base, "error_message": q.get("error_message"), "query": q.get("query")}
        else:
            compact[qid] = {**base, "depends_on": q.get("depends_on")}
    callback_context.state["gaql_queries_compact"] = compact
The compact version is what gets injected into the analyst's prompt via {gaql_queries_compact}. The full state remains available in gaql_queries for agents that need it. The analyst sees a tight summary; the execution agent sees the full structure it needs to operate.
The pattern generalizes: before any model call, ask what the model actually needs from state versus what's just there. Write a callback that produces a reduced version for the prompt.
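A generic version of that reducer makes a reasonable starting point. This sketch keeps a whitelist of fields per record; the field names and sample data are illustrative, not Wolfy's actual schema:

```python
from typing import Any, Dict, Tuple


def compact_for_prompt(
    full: Dict[str, Dict[str, Any]], keep: Tuple[str, ...]
) -> Dict[str, Dict[str, Any]]:
    # Keep only the fields the model needs for this call; the full
    # records stay in session state for agents that operate on them.
    return {key: {f: record.get(f) for f in keep} for key, record in full.items()}


queries = {
    "query-1": {
        "status": "success",
        "query_title": "Spend by campaign",
        "query": "SELECT campaign.id, metrics.cost_micros FROM campaign",
        "row_count": 42,
    },
}
compact = compact_for_prompt(queries, keep=("status", "query_title", "row_count"))
```

Here `compact["query-1"]` carries three short fields into the prompt while the full query string never leaves state.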
Caching static documentation with static_instruction
Some agents need large reference documents to do their job. In Wolfy, the GAQL generation and analyst agents need the full Google Ads API field documentation: every valid field, its type, its valid enum values, what resources it can be combined with.
Injecting this into the regular instruction means it goes into every model call, and it can't be cached because the instruction changes each turn as state variables are interpolated.
ADK has a static_instruction parameter for exactly this case:
analyst_agent = LlmAgent(
    name="wolfy_analyst",
    static_instruction=STATIC_GAQL_DOCS,
    instruction=ANALYST_PROMPT,
    ...
)
static_instruction is fixed content that doesn't change between calls. Because it forms a stable prompt prefix, the model provider can serve it from cache: cached tokens are billed at a discounted rate and don't need to be re-processed on each turn. The dynamic instruction (which contains state interpolations like {date}, {gaql_queries_compact}, {analyst_output}) sits alongside it.
For Wolfy, STATIC_GAQL_DOCS contains the full GAQL field reference. It's large, but it only needs to be processed once. Without static_instruction, it would be re-sent and re-processed on every analyst model call in the loop.
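The arithmetic makes the case. With illustrative numbers (assumptions for the sketch, not Wolfy's measured values), the reference document dominates the tokens processed per loop:

```python
# Illustrative numbers, not measured values.
doc_tokens = 40_000      # static GAQL field reference
dynamic_tokens = 2_000   # interpolated instruction + state per call
loop_calls = 8           # analyst model calls in one data-retrieval loop

# Without a cached prefix, the reference is re-processed on every call.
without_caching = (doc_tokens + dynamic_tokens) * loop_calls   # 336_000
# With prefix caching, it is processed once; only the dynamic part repeats.
with_caching = doc_tokens + dynamic_tokens * loop_calls        # 56_000
```

Roughly a 6x reduction in processed tokens for one loop, before counting repeated loops in a long session.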
Keeping raw data out of LLM context entirely
The most effective context optimization is preventing data from entering the context window in the first place. Two patterns handle this in Wolfy.
The first is the DataSummarizer — a BaseAgent subclass with no model, just Python:
class DataSummarizer(BaseAgent):
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        state = ctx.session.state
        summarized: Dict[str, Any] = {}
        gaql_results = state.get("gaql_execution_results")
        if gaql_results and isinstance(gaql_results, dict):
            for qid, result in gaql_results.items():
                summarized[qid] = summarize_gaql_result(qid, result)
        state[SUMMARIZED_RESULTS_KEY] = summarized
        yield Event(
            invocation_id=ctx.invocation_id,
            author=self.name,
            actions=EventActions(skip_summarization=True),
        )
After the GAQL loop finishes, the data summarizer runs before the report writer. It computes grand totals, KPIs (CTR, ROAS, CPA, CVR), and enriched per-row metrics from the raw API results, and stores the computed summary in state. The report writer calls get_summarized_results() and gets a compact, semantically rich payload. It never sees raw cost_micros values or individual API rows.
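summarize_gaql_result itself is plain Python. A hypothetical sketch of the kind of reduction it performs, assuming rows shaped like Google Ads API responses (cost in micros, per-row click/impression/conversion counts); the field names and formulas are mine, not Wolfy's actual code:

```python
from typing import Any, Dict


def summarize_gaql_result(query_id: str, result: Dict[str, Any]) -> Dict[str, Any]:
    # Grand totals and derived KPIs computed in Python, so the model never
    # sees raw micros or individual rows.
    rows = result.get("rows", [])
    cost = sum(r.get("cost_micros", 0) for r in rows) / 1_000_000  # micros -> currency
    clicks = sum(r.get("clicks", 0) for r in rows)
    impressions = sum(r.get("impressions", 0) for r in rows)
    conversions = sum(r.get("conversions", 0.0) for r in rows)
    conv_value = sum(r.get("conversions_value", 0.0) for r in rows)
    return {
        "query_id": query_id,
        "row_count": len(rows),
        "cost": round(cost, 2),
        "ctr": round(clicks / impressions, 4) if impressions else None,
        "cvr": round(conversions / clicks, 4) if clicks else None,
        "cpa": round(cost / conversions, 2) if conversions else None,
        "roas": round(conv_value / cost, 2) if cost else None,
    }


summary = summarize_gaql_result("query-1", {"rows": [
    {"cost_micros": 2_000_000, "clicks": 100, "impressions": 1_000,
     "conversions": 4.0, "conversions_value": 10.0},
]})
# summary: cost 2.0, ctr 0.1, cvr 0.04, cpa 0.5, roas 5.0
```

A handful of pre-computed KPIs replaces thousands of rows, and the guards against division by zero mean a query with no traffic produces None rather than a crash.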
The second pattern is row capping at the tool boundary. Fixed data retrieval tools can return thousands of rows from the Google Ads API. The full results are stored in state for correct aggregate calculations. But what gets returned to the model is capped:
MAX_FIXED_ROWS_FOR_LLM = 100


def truncate_fixed_result_for_llm(result: Dict[str, Any]) -> Dict[str, Any]:
    out = dict(result)
    row_limits: Dict[str, Dict[str, int]] = {}
    for key, value in list(out.items()):
        if isinstance(value, list) and len(value) > MAX_FIXED_ROWS_FOR_LLM and isinstance(value[0], dict):
            total = len(value)
            out[key] = value[:MAX_FIXED_ROWS_FOR_LLM]
            row_limits[key] = {"shown": MAX_FIXED_ROWS_FOR_LLM, "total": total}
    if row_limits:
        out["_row_limits"] = row_limits
    return out
The tool returns at most 100 rows to the model, with a _row_limits metadata field so the model knows the full dataset is larger. Totals and aggregates are calculated from the full dataset before truncation, so the model gets accurate numbers alongside a representative sample of rows. The report writer can request full rows from state when it specifically needs them, but in practice the summary is almost always sufficient.
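To see the shape of what the model actually receives, here's the cap applied to a synthetic result. The function body is repeated from above so the demo runs standalone; the `campaigns` key and row shape are illustrative:

```python
from typing import Any, Dict

MAX_FIXED_ROWS_FOR_LLM = 100


def truncate_fixed_result_for_llm(result: Dict[str, Any]) -> Dict[str, Any]:
    # Same logic as above: cap long lists of row dicts, record the cap.
    out = dict(result)
    row_limits: Dict[str, Dict[str, int]] = {}
    for key, value in list(out.items()):
        if isinstance(value, list) and len(value) > MAX_FIXED_ROWS_FOR_LLM and isinstance(value[0], dict):
            row_limits[key] = {"shown": MAX_FIXED_ROWS_FOR_LLM, "total": len(value)}
            out[key] = value[:MAX_FIXED_ROWS_FOR_LLM]
    if row_limits:
        out["_row_limits"] = row_limits
    return out


raw = {"campaigns": [{"id": i} for i in range(2_500)], "total_cost": 1234.56}
capped = truncate_fixed_result_for_llm(raw)
# capped["campaigns"] has 100 rows; capped["_row_limits"] records shown=100, total=2500;
# scalar fields like total_cost pass through untouched, and raw keeps all 2,500 rows.
```

Note the shallow copy: the full list stays intact in `raw` (and in state), so aggregates computed upstream remain correct.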
Forcing compact output with structured schemas
One more ADK-specific pattern worth mentioning: using output_schema to force structured output on agents that would otherwise produce verbose text.
The GAQL execution agent's job is to run queries and report what happened. Without constraints, a model might produce a multi-paragraph summary of each query's results. With an output schema, it produces exactly what's needed:
class GAQLExecutionOutput(BaseModel):
    status: str
    total_queries: int
    successful_queries: int
    failed_queries: int


gaql_execution_agent = LlmAgent(
    name="wolfy_gaql_execution",
    output_schema=GAQLExecutionOutput,
    output_key="gaql_execution_output",
    ...
)
The model is forced to produce a four-field JSON object. That output gets stored in gaql_execution_output in state and never needs to enter another agent's context window. The actual query results are in gaql_execution_results, stored directly by the tool, not via model output.
Putting it together
These techniques stack. In the Wolfy GAQL loop, a single data retrieval cycle applies most of them in combination:
- strip_subagent_tool_responses_callback removes "For context:" injection blocks before every model call across all agents
- Specialist agents use include_contents="none" so conversation history never enters their context
- compact_gaql_queries_callback compresses query state before the analyst model call
- static_instruction loads the GAQL field reference once and caches it, separate from the dynamic instruction
- DataSummarizer pre-computes everything before the report writer runs, so raw rows never enter a model context
- truncate_fixed_result_for_llm caps tool return values at 100 rows while keeping full data available in state
- output_schema on the execution agent forces a four-field summary instead of verbose text
None of these are complicated individually. The compounding effect is what matters. Each one reduces the token footprint of a model call, which reduces cost, reduces latency, and reduces the likelihood of the model getting confused by irrelevant content buried in a long context.
The general principle behind all of them: a model should only see what it needs to do its specific job. Everything else should live in state, get computed in Python, or be stripped by a callback before the model ever sees it.