Best-practice: anonymize secrets in VCR test cassettes (+ minor SDK jinja2 trust-boundary comment) #4150

@elfrost

Hi Traceloop team,

While testing AI PatchLab (an open-source, local-first security scanner) on a few mid-popularity Python AI projects, I scanned openllmetry at approximately commit 72fc45e and wanted to flag one best-practice improvement plus one minor SDK code-clarity note. I'm filing both as a single courtesy issue.

Full curated write-up of the scan (with FP analysis, methodology, and the findings AI PatchLab got wrong): https://elfrost.github.io/ai-patchlab/scans/traceloop-openllmetry.html

1. Anonymize secrets in VCR cassettes before recording

Of the 26 high-severity findings in the scan, 25 are Gitleaks matches in packages/**/tests/cassettes/**.yaml:

  • 11× aws-access-token matches in opentelemetry-instrumentation-anthropic/tests/cassettes/test_bedrock_*/
  • jwt matches in opentelemetry-instrumentation-watsonx/tests/
  • generic-api-key matches (including PostHog phc_… public keys in haystack cassettes)

None of these are credential leaks today. The AWS findings are access key IDs without their corresponding secret keys (the SigV4 signature in the cassette is valid only for that one already-replayed request), the JWTs carry obviously placeholder claims (sub: noone@ibm.com, account.bss: abc123), and the PostHog phc_ keys are public, write-only event-ingestion identifiers by design.

But this is still worth addressing because:

  • Cassettes leak metadata: which AWS account, which Bedrock model, which day, which API surface. For an observability SDK that ships to enterprises, that's worth scrubbing.
  • One bad re-record away from a real secret: if VCR isn't configured to anonymize, the next contributor who records a cassette against a different provider with a real production key can commit it by accident. Unfiltered cassettes are a recurring source of real-world key leaks across Python OSS.

Recommended fix: configure VCR's filter_headers, filter_query_parameters, and before_record_response in the test base (probably in each package's conftest.py or a shared tests/common/):

import vcr

vcr_config = vcr.VCR(
    filter_headers=[
        ('authorization', 'REDACTED'),
        ('x-api-key', 'REDACTED'),
    ],
    filter_query_parameters=[
        ('api_key', 'REDACTED'),
    ],
    # Optional: response body scrub for tokens/JWTs returned from auth endpoints
    before_record_response=lambda response: response,  # add custom redaction if needed
)

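This change, once the filters cover the matched patterns and the existing cassettes are re-recorded or scrubbed (VCR filters apply only at record time), should clear 25 of the 26 high-severity findings on a re-scan and bring the risk of a future re-record leaking a real key close to zero.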
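As a rough sketch of what the "custom redaction" hook could look like, the function below rewrites secret-shaped substrings in a response body before VCR writes it to the cassette. The names scrub_response and _SECRET_PATTERNS are hypothetical, the regexes are my own approximations of the three finding types listed above rather than anything taken from the repo, and the function assumes the response dict shape VCR passes to before_record_response (body bytes under response["body"]["string"]).

```python
import re

# Hypothetical patterns (assumptions), approximating the finding types above.
_SECRET_PATTERNS = [
    re.compile(rb"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),  # AWS access key IDs
    re.compile(rb"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),  # JWT-shaped tokens
    re.compile(rb"phc_[A-Za-z0-9]+"),  # PostHog project keys
]


def scrub_response(response):
    """Redact secret-shaped substrings in a VCR response dict before it is recorded."""
    body = response.get("body", {}).get("string")
    if body is None:
        return response  # nothing to scrub

    # VCR normally hands over bytes; accept str as well to be safe.
    raw = body.encode("utf-8") if isinstance(body, str) else body
    for pattern in _SECRET_PATTERNS:
        raw = pattern.sub(b"REDACTED", raw)

    response["body"]["string"] = raw.decode("utf-8") if isinstance(body, str) else raw
    return response
```

It would be wired in as before_record_response=scrub_response in place of the pass-through lambda. One known limitation: a gzip-compressed body will not match any pattern unless it is decoded first (vcrpy's decode_compressed_response option may help there).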

2. packages/traceloop-sdk/traceloop/sdk/prompts/client.py:44 — a comment on the jinja2.Environment() use

obj._jinja_env = Environment()

A Semgrep rule (direct-use-of-jinja2) flags this because Environment() defaults to autoescape=False, which would be a real concern when rendering to HTML. Here the Environment is used to render LLM prompts, where autoescape=True would actively damage the output (escaping <, >, & etc. that may be intentional in the prompt).

So the current code is correct. I'm only suggesting a one-line comment so that future contributors, and anyone triaging scanner output, can see the choice is deliberate:

# autoescape disabled: rendered output goes to an LLM as a prompt, not to HTML
obj._jinja_env = Environment()
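To make the point concrete, here is a small illustration of how autoescape changes a rendered prompt. The template text and variable names are made up for the example and are not taken from the SDK.

```python
from jinja2 import Environment

# Hypothetical prompt template and value, for illustration only.
template = "Return only rows where {{ condition }}."
values = {"condition": "a < b && c > 0"}

# Default Environment: autoescape is off, so the value is rendered verbatim.
plain = Environment().from_string(template).render(**values)

# autoescape=True: HTML-significant characters in the value are entity-encoded.
escaped = Environment(autoescape=True).from_string(template).render(**values)

print(plain)
print(escaped)
```

With autoescape on, the model would receive &lt;, &gt;, and &amp; entities instead of the operators the prompt author wrote, which is why leaving it off is the right default for this use.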

Both items are low-priority. Happy to open separate PRs if useful. Thanks for openllmetry — the rest of the scan turned up only false positives or by-design patterns (token-count logger calls, plugin-discovery dynamic imports, sample-app calculator with whitelisted eval), which is a good sign about the codebase overall.
