feat(plugins-google): add cached_content option for explicit context caching #5661

Closed
kamil-bidus wants to merge 1 commit into livekit:main from kamil-bidus:kamdibus/gemini-cached-content-support

Conversation


kamil-bidus commented May 6, 2026

Motivation

The Gemini plugin's LLM class supports many GenerateContentConfig options (thinking_config, retrieval_config, safety_settings, etc.) but not cached_content. The plugin already reads cached_content_token_count from response usage in LLMStream._parse_part, so cache hits surface in metrics — there's just no way to attach a CachedContent resource to outgoing requests.

Change

Add cached_content: NotGivenOr[str] = NOT_GIVEN to LLM.__init__, propagated through _LLMOptions → chat() → extra["cached_content"] → GenerateContentConfig via **self._extra_kwargs.
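
As a usage sketch (the model name and cache resource ID below are placeholders, not taken from this PR):

from livekit.plugins import google

# The cache resource name is attached at construction time and flows
# through _LLMOptions into each request's GenerateContentConfig.
llm = google.LLM(
    model="gemini-2.0-flash",
    cached_content="cachedContents/abc123",
)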

Request-side suppression

Gemini's API rejects generateContent requests that pass cached_content together with system_instruction, tools, or tool_config — those fields belong inside the CachedContent resource itself, and the API returns a 400 instructing callers to move them.

Without that handling, setting the parameter would produce a 400 for any realistic agent configuration. So LLMStream._run strips system_instruction, tools, and tool_config from the outgoing request whenever cached_content is attached; behaviour is unchanged when cached_content is unset.
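
A minimal sketch of that rule (the helper name is hypothetical; in the PR the logic lives inline in LLMStream._run):

from typing import Any

def strip_cache_conflicts(config_kwargs: dict[str, Any]) -> dict[str, Any]:
    # Gemini rejects generateContent requests that combine cached_content
    # with these fields; they must live inside the CachedContent resource.
    if config_kwargs.get("cached_content"):
        for field in ("system_instruction", "tools", "tool_config"):
            config_kwargs.pop(field, None)
    return config_kwargs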

Cache lifecycle (creation, TTL refresh, deletion) and the choice of what to bake into the cache stay the application's responsibility.
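
For context, the application-side lifecycle might look like this with the google-genai SDK (model, contents, and TTL are illustrative, not prescribed by this PR):

from google import genai
from google.genai import types

client = genai.Client()

# Bake the stable prefix (system prompt, tools) into the cache once...
cache = client.caches.create(
    model="gemini-2.0-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You are a helpful voice agent.",
        contents=["<large, stable context goes here>"],
        ttl="3600s",
    ),
)

# ...then hand only the resource name to the plugin.
print(cache.name)  # e.g. "cachedContents/..."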

Compatibility

Default NOT_GIVEN keeps existing behaviour unchanged — verified by tests covering both the omission case (no key in _extra_kwargs) and the no-cache request path (system_instruction and tools propagate as before).

Works with both Gemini Developer API (cachedContents/{id}) and Vertex AI (projects/{p}/locations/{l}/cachedContents/{id}); the plugin passes the string through unmodified.
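
Both shapes are forwarded verbatim, e.g. (IDs made up):

google.LLM(cached_content="cachedContents/abc123")  # Developer API
google.LLM(cached_content="projects/my-proj/locations/us-central1/cachedContents/abc123")  # Vertex AI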

Tests

tests/test_plugin_google_llm.py — 6 cases:

  • Propagation (3) — cached_content round-trips through _LLMOptions and reaches _extra_kwargs; default NOT_GIVEN produces no key.
  • Suppression (3) — patching client.aio.models.generate_content_stream to capture the GenerateContentConfig, the request omits system_instruction / tools / tool_config when cached_content is set, and includes them when it isn't (backward compat).

Existing google-plugin tests still pass. ruff check / ruff format clean.

Refs

#2359


CLAassistant commented May 6, 2026

CLA assistant check
All committers have signed the CLA.

devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 1 additional finding.


kamil-bidus marked this pull request as draft May 7, 2026 14:35
feat(plugins-google): add cached_content option for explicit context caching

The plugin currently relies on Gemini's implicit cache, which is
heuristic. In voice-agent workloads where the system prompt is large
and stable across calls, implicit caching often misses on turn 1 of
a conversation, paying the full cold-start cost.

Explicit caching is the documented alternative: the application
creates a CachedContent resource via client.caches.create(...) and
references it by name on subsequent generateContent calls. Cached
prefix tokens are billed at a discount and processed in under 100ms.

The plugin already reads cached_content_token_count from response
usage but had no way to set cached_content on requests. This adds
the parameter on LLM.__init__, stores it on _LLMOptions, and
propagates it into GenerateContentConfig via extra_kwargs.

End-to-end usability matters: Gemini rejects generateContent
requests that pass cached_content together with system_instruction,
tools, or tool_config — those fields belong inside the CachedContent
resource. Without handling that, setting cached_content on any LLM
that also has a system prompt or function tools would 400. So
LLMStream._run now suppresses system_instruction, tools, and
tool_config from the outgoing request whenever cached_content is
attached. Cache lifecycle (creation, TTL refresh, deletion) and the
choice of what to bake into the cache stay the application's
responsibility — the plugin only consumes the resource name and
ensures the matching fields are absent from the request.

Behaviour is unchanged for callers that don't pass cached_content:
the gating checks only whether that one option is given. The
docstring documents that the cache must contain whichever of
system_instruction / tools the model needs.

Tests cover propagation, the omitted-when-not-set default, and the
three suppression branches (system_instruction stripped, tools
stripped, tool_config stripped) plus the unchanged-when-no-cache
backward-compat path.

Refs livekit#2359.
kamil-bidus force-pushed the kamdibus/gemini-cached-content-support branch from 853f638 to c57dd80 May 7, 2026 15:06
kamil-bidus closed this May 7, 2026
kamil-bidus deleted the kamdibus/gemini-cached-content-support branch May 7, 2026 15:09
kamil-bidus (Author)

Superseded by #5675.
