fix(core): process all URLs in web_fetch instead of only the first#22212
fix(core): process all URLs in web_fetch instead of only the first#22212
Conversation
The web_fetch tool accepts up to 20 URLs but only processed urls[0] in both execute and fallback paths. Now iterates all URLs for rate-limit checks and private IP validation in execute(), and fetches all URLs in fallback mode via a new executeFallbackForUrl() helper. Each URL receives a fair share of the content budget (MAX_CONTENT_LENGTH / urls.length) rather than the full limit. Abort signal is now propagated to retry logic in fallback mode.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors the web_fetch tool to correctly handle multiple URLs, iterating over all of them for rate limiting and private IP checks, and dividing the content budget. However, a high-severity indirect prompt injection vulnerability was identified in the executeFallback method, where untrusted content from fetched URLs is directly concatenated into a prompt for the fallback LLM, potentially allowing an attacker to manipulate its behavior. This issue aligns with the rule to avoid including untrusted input in LLM content. Additionally, consider improving performance by fetching URLs in parallel instead of sequentially to enhance responsiveness when multiple URLs are provided.
| const fallbackPrompt = `The user requested the following: "${this.params.prompt}". | ||
|
|
||
| I was unable to access the URL directly. Instead, I have fetched the raw content of the page. Please use the following content to answer the request. Do not attempt to access the URL again. | ||
| I was unable to access the URL(s) directly. Instead, I have fetched the raw content. Please use the following content to answer the request. Do not attempt to access the URLs again. | ||
|
|
||
| --- | ||
| ${textContent} | ||
| --- | ||
| `; | ||
| ${contentParts.join('\n\n')} | ||
| ${errors.length > 0 ? `\nNote: Some URLs could not be fetched: ${errors.join('; ')}` : ''}`; | ||
| const result = await geminiClient.generateContent( | ||
| { model: 'web-fetch-fallback' }, | ||
| [{ role: 'user', parts: [{ text: fallbackPrompt }] }], |
There was a problem hiding this comment.
The executeFallback method is vulnerable to indirect prompt injection. It constructs a prompt for the LLM by concatenating untrusted data—specifically the content fetched from external URLs and error messages from failed fetch attempts—directly into the prompt string. An attacker who controls the content of a fetched URL or can manipulate the HTTP response (e.g., the status text) can inject malicious instructions that the LLM might follow. This could lead to the LLM outputting misleading information, performing unauthorized actions if the output is used by other tools, or exfiltrating data.
To remediate this, consider the following:
- Use Structured Delimiters: Wrap untrusted content in clear, hard-to-spoof delimiters and instruct the LLM to treat everything within those delimiters as data, not instructions.
- Sanitize Input: Sanitize the fetched content and error messages to remove or escape potential injection sequences.
- Constrain the LLM: Use a separate, highly-constrained LLM call to summarize or extract information from the untrusted content before including it in the main prompt.
- Escape User Input: Ensure that
this.params.promptis properly escaped when included in thefallbackPromptto prevent direct injection if the user prompt contains quotes.
References
- To prevent prompt injection, avoid including user-provided input in content passed to the LLM (
llmContent). This principle extends to any untrusted external data, which should be handled withreturnDisplayif needed for display, or sanitized/constrained if used in prompts.
| for (const rawUrl of urls) { | ||
| try { | ||
| const textContent = await this.executeFallbackForUrl( | ||
| rawUrl, | ||
| perUrlContentBudget, | ||
| signal, | ||
| ); | ||
| contentParts.push( | ||
| `--- Content from ${rawUrl} ---\n${textContent}\n---`, | ||
| ); | ||
| fetchedUrls.push(rawUrl); | ||
| } catch (e) { | ||
| // eslint-disable-next-line @typescript-eslint/no-unsafe-type-assertion | ||
| const error = e as Error; | ||
| errors.push(`Error fetching ${rawUrl}: ${error.message}`); | ||
| } | ||
| } |
There was a problem hiding this comment.
While the logic to handle multiple URLs in the fallback is correct, fetching them sequentially in a for...of loop can be inefficient and slow, especially when many URLs are provided. To improve performance, these independent network requests should be executed in parallel.
You can use Promise.allSettled to fire off all fetch requests concurrently and then process the results, which aligns well with the existing logic for handling both successful fetches and errors.
const fetchPromises = urls.map((rawUrl) =>
this.executeFallbackForUrl(rawUrl, perUrlContentBudget, signal),
);
const results = await Promise.allSettled(fetchPromises);
results.forEach((result, index) => {
const rawUrl = urls[index];
if (result.status === 'fulfilled') {
contentParts.push(
`--- Content from ${rawUrl} ---\n${result.value}\n---`,
);
fetchedUrls.push(rawUrl);
} else {
// eslint-disable-next-line @typescript-eslint/no-unsafe-type-assertion
const error = result.reason as Error;
errors.push(`Error fetching ${rawUrl}: ${error.message}`);
}
});|
Size Change: +1.26 kB (0%) Total Size: 26.6 MB
ℹ️ View Unchanged
|
Summary
web_fetchtool accepts up to 20 URLs but only processedurls[0]in bothexecute()andexecuteFallback()pathsexecuteFallback()to iterate all valid URLs via a newexecuteFallbackForUrl()helperexecute()to rate-limit-check and validate (private IP) all URLs, not just the firstMAX_CONTENT_LENGTH / urls.length) rather than the full limitChanges
packages/core/src/tools/web-fetch.tsexecuteFallbackForUrl(url, perUrlContentBudget, signal)executeFallback()now iterates all URLs, collects content from each, and sends combined content to the fallback LLMexecute()rate-limit and private-IP checks now iterate all URLsTest plan
npm run preflight) passing