Threat mitigation: Runtime inspection of tool outputs for indirect prompt injection #28
Description
The threat model identifies indirect prompt injection via tool outputs
(content agents read from web, email, APIs) as a risk. Current mitigations
focus on sandboxing and permission scoping, which contain the blast radius
but don't detect the attack itself.
I built mlayer-guard, a runtime detection API that inspects tool outputs
for injection before the agent acts on them. It is available as an OpenClaw
skill and as a REST API.
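To make the "inspect before the agent acts" idea concrete, here is a minimal Python sketch of the pattern. It is not the mlayer-guard API: `Verdict`, `InjectionBlocked`, `guard_tool_output`, and `toy_detector` are hypothetical names, and the keyword matcher is only a stand-in for a real detector such as a call to an external service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Result of inspecting one tool output (hypothetical shape)."""
    is_injection: bool
    reason: str = ""


class InjectionBlocked(Exception):
    """Raised when a tool output is flagged, so it never reaches the agent."""


def guard_tool_output(output: str, detect: Callable[[str], Verdict]) -> str:
    """Run the detector on a tool output before handing it to the agent.

    The detector is injected as a callable, so it can live outside the
    agent (for example behind an HTTP call) and adds nothing to the prompt.
    """
    verdict = detect(output)
    if verdict.is_injection:
        raise InjectionBlocked(verdict.reason)
    return output


def toy_detector(text: str) -> Verdict:
    """Placeholder detector: a naive phrase match, for illustration only."""
    markers = ("ignore previous instructions", "disregard the above")
    lowered = text.lower()
    for marker in markers:
        if marker in lowered:
            return Verdict(True, f"matched marker: {marker!r}")
    return Verdict(False)
```

In use, the agent loop would wrap every tool result, e.g. `safe = guard_tool_output(raw_result, toy_detector)`, and treat `InjectionBlocked` as a signal to drop or quarantine that result rather than act on it.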
Benchmarked on public datasets:
- 98% detection on InjecAgent (ACL 2024, N=300)
- Zero false positives on Deepset (N=343)
- 94.1% on WildGuard (N=971)
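For readers comparing these figures, the sketch below shows one common way such numbers are defined. The issue does not say how each figure was computed, so this is an assumption: "detection" is treated as the share of true injection samples that were flagged, and "false positives" as benign samples that were flagged. The function names and the toy data are illustrative, not benchmark results.

```python
from typing import Sequence


def detection_rate(labels: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Share of true injection samples that the detector flagged."""
    hits = [f for is_attack, f in zip(labels, flagged) if is_attack]
    if not hits:
        raise ValueError("no injection samples in the dataset")
    return sum(hits) / len(hits)


def false_positive_rate(labels: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Share of benign samples that the detector flagged by mistake."""
    alarms = [f for is_attack, f in zip(labels, flagged) if not is_attack]
    if not alarms:
        raise ValueError("no benign samples in the dataset")
    return sum(alarms) / len(alarms)


# Toy data only: four injection samples (one missed), two benign samples.
toy_labels = [True, True, True, True, False, False]
toy_flagged = [True, True, True, False, False, False]
```

On the toy data, `detection_rate` is 3 of 4 and `false_positive_rate` is 0 of 2. Reporting both numbers per dataset makes it easier to tell whether a high detection rate comes at the cost of flagging benign content.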
The skill adds zero tokens to agent context: detection happens externally.
Demo: https://hidylan.ai/demo
OpenClaw skill: https://github.com/dmilstein-match/mlayer-guard-openclaw
Happy to discuss how this maps to specific threat cards in the model,
or how it could complement the existing mitigations.