Threat mitigation: Runtime inspection of tool outputs for indirect prompt injection #28

@dmilstein-match

The threat model identifies indirect prompt injection via tool outputs
(content that agents read from the web, email, and APIs) as a risk. Current
mitigations focus on sandboxing and permission scoping, which contain the
blast radius but do not detect the attack itself.

I built mlayer-guard, a runtime detection API that inspects tool outputs
for injection before the agent acts on them. It is available as an OpenClaw
skill and as a REST API.
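
To make the "inspect before the agent acts" step concrete, here is a minimal sketch of the pattern. It does not use the real mlayer-guard API, whose endpoints and response format are not described in this issue. `Verdict`, `guard_tool`, `InjectionBlocked`, and `keyword_detector` are hypothetical names, and the keyword check is only a toy stand-in for a real detector.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Result of inspecting one tool output (hypothetical shape)."""
    is_injection: bool
    score: float


# Assumed detector interface: tool output text in, Verdict out.
Detector = Callable[[str], Verdict]


class InjectionBlocked(Exception):
    """Raised when a tool output is flagged before reaching the agent."""


def guard_tool(tool: Callable[..., str], detect: Detector) -> Callable[..., str]:
    """Wrap a tool so its output is inspected before the agent sees it."""
    def wrapped(*args, **kwargs) -> str:
        output = tool(*args, **kwargs)
        verdict = detect(output)  # runs outside the agent's context window
        if verdict.is_injection:
            raise InjectionBlocked(f"tool output flagged (score={verdict.score:.2f})")
        return output
    return wrapped


def keyword_detector(text: str) -> Verdict:
    """Toy stand-in detector; NOT how mlayer-guard works."""
    flagged = "ignore previous instructions" in text.lower()
    return Verdict(is_injection=flagged, score=1.0 if flagged else 0.0)
```

In a real deployment the `detect` callable would presumably call the REST API instead of matching keywords. The wrapper only forwards or blocks the tool output, so nothing is appended to the agent's prompt.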

Benchmarked on public datasets:

  • 98% detection on InjecAgent (ACL 2024, N=300)
  • Zero false positives on Deepset (N=343)
  • 94.1% on WildGuard (N=971)
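
The post does not define the metrics behind these figures. One plausible reading is detection rate (share of injected samples flagged) and false positive rate (share of benign samples flagged). The sketch below shows how those two quantities are usually computed from labeled predictions. The function names are hypothetical and the code is not taken from the mlayer-guard benchmark.

```python
from typing import Sequence


def detection_rate(labels: Sequence[bool], preds: Sequence[bool]) -> float:
    """Fraction of true injections (label True) that were flagged."""
    flagged = [p for l, p in zip(labels, preds) if l]
    if not flagged:
        raise ValueError("no injected samples in the dataset")
    return sum(flagged) / len(flagged)


def false_positive_rate(labels: Sequence[bool], preds: Sequence[bool]) -> float:
    """Fraction of benign samples (label False) that were flagged."""
    flagged = [p for l, p in zip(labels, preds) if not l]
    if not flagged:
        raise ValueError("no benign samples in the dataset")
    return sum(flagged) / len(flagged)
```

Reporting both numbers per dataset, together with how many samples are injected versus benign, would make the results easier to compare against other detectors.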

The skill adds zero tokens to the agent's context, because detection happens externally.

Demo: https://hidylan.ai/demo
OpenClaw skill: https://github.com/dmilstein-match/mlayer-guard-openclaw

Happy to discuss how this maps to specific threat cards in the model,
or how it could complement the existing mitigations.
