feat(proxy): serve Microsoft-style compressed symbol files (.pd_, .dl_, .ex_) #1951

Open
bosiakov wants to merge 5 commits into getsentry:master from bosiakov:feat/compressed-symbol-proxy

@bosiakov bosiakov commented May 16, 2026

Symbolicator can ingest Microsoft-style compressed symbol files (.pd_, .dl_, .ex_) from upstream symbol servers but cannot serve them on the /proxy endpoint. Tools that speak the symsrv protocol (WinDbg, Visual Studio, symchk) request the underscore form and expect a CAB body. This PR adds optional CAB output behind a new compressed_proxy config flag, which defaults to off so existing deployments are unchanged.

When the flag is enabled, the proxy serves the underscore form in two ways. If the upstream source delivered a CAB, the bytes are preserved byte for byte in a new raw_compressed cache and returned as is. Otherwise a fresh MSZIP CAB is synthesized from the cached object and stored in a new cab_synth cache for reuse.
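
The choice between the two paths can be sketched roughly as below. This is a minimal illustration and not the PR's code: `CompressedSource`, `is_cab`, and `classify_download` are hypothetical names, and the only fact relied on is that a Cabinet file starts with the four ASCII bytes "MSCF".

```rust
/// First four bytes of every Microsoft Cabinet file: ASCII "MSCF".
const CAB_MAGIC: &[u8; 4] = b"MSCF";

/// Which cache would back the underscore form.
/// Hypothetical enum, named after the two caches in the description.
#[derive(Debug, PartialEq, Eq)]
enum CompressedSource {
    /// Upstream already sent a CAB: keep its bytes untouched.
    RawCompressed,
    /// Anything else: build a fresh CAB from the decompressed object.
    CabSynth,
}

/// Returns true when the payload begins with the CAB signature.
fn is_cab(payload: &[u8]) -> bool {
    payload.starts_with(CAB_MAGIC)
}

/// Decides which path a downloaded payload would take (assumed logic).
fn classify_download(payload: &[u8]) -> CompressedSource {
    if is_cab(payload) {
        CompressedSource::RawCompressed
    } else {
        CompressedSource::CabSynth
    }
}
```

Under this sketch a gzip download (magic bytes `1f 8b`) falls through to synthesis, which matches the stated rule that only genuine CAB bytes are preserved.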

Changes

  • Wire up the new request shape in the proxy handler. Detect the .pd_ / .dl_ / .ex_ leaf, rewrite it to the uncompressed form for object lookup, and attach the vnd.ms-cab-compressed content type on the way out. With the flag off, the underscore form returns 404 so we never accidentally serve decompressed bytes under a misleading filename. (endpoints/proxy.rs)

  • Preserve upstream CAB bytes during download. maybe_decompress_file returns a DecompressOutcome that includes the original payload as a sibling tempfile when the source is CAB. The allocation happens only after the CAB magic is matched, so non-CAB downloads pay nothing extra. Other compressed formats (gzip, zstd, zlib, zip) are decompressed as before but not preserved, because their bytes cannot honestly be served as vnd.ms-cab-compressed. (download/compression.rs, download/fetch_file.rs, objects/data_cache.rs)

  • Synthesize CAB on demand for uploads and non-CAB upstreams. The new cab_synth_cache.rs wraps the cached object in a single-folder, single-file MSZIP CAB using the cab crate writer. The compression runs inside tokio::task::spawn_blocking so multi-GB inputs do not stall async workers. Results are cached so the cost is paid once per symbol. (objects/cab_synth_cache.rs)

  • Compose the two paths in ObjectsActor::fetch_compressed. Ensure the underlying object is fetched (which populates the upstream mirror as a side effect), look in raw_compressed first, then fall back to cab_synth. Plumb the new caches and the config flag through ObjectsActor::new and RequestService::fetch_compressed_object. (objects/mod.rs, objects/raw_compressed_cache.rs, service.rs)

  • Extend the cache layer with two primitives needed for the above. Cacher::store_externally persists a tempfile into the on-disk cache from outside the normal compute flow; it is used by the tee path. Cacher::lookup_only checks the cache without ever invoking compute and without caching negatives; it is used by the proxy so that a raw_compressed entry that appears after a first lookup is picked up by the second. (caching/memory.rs)

  • Register the new caches alongside the existing ones (raw_compressed, cab_synth) with their version constants, cleanup wiring, and config plumbing. Add the compressed_proxy: bool field with a default of false. (caching/mod.rs, caching/config.rs, caches/versions.rs, caching/cleanup.rs, config.rs, services.rs)
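
As an illustration of the request handling described in the first bullet, the leaf rewrite and the flag gate could look something like the sketch below. The names `uncompressed_leaf`, `Route`, and `route_leaf` are hypothetical and do not come from the PR, and the case-insensitive extension match is an assumption.

```rust
/// Maps a compressed leaf name to its uncompressed form,
/// e.g. "app.pd_" -> "app.pdb". Returns None for any other leaf.
/// Assumption: the extension is matched case-insensitively.
fn uncompressed_leaf(leaf: &str) -> Option<String> {
    let (stem, ext) = leaf.rsplit_once('.')?;
    if stem.is_empty() {
        return None;
    }
    let full_ext = match ext.to_ascii_lowercase().as_str() {
        "pd_" => "pdb",
        "dl_" => "dll",
        "ex_" => "exe",
        _ => return None,
    };
    Some(format!("{}.{}", stem, full_ext))
}

/// What the handler would do with a requested leaf (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Route {
    /// Serve the object under the requested name, as before.
    Plain(String),
    /// Look up this uncompressed name and answer with a CAB body.
    Compressed(String),
    /// Underscore form requested while the flag is off: 404.
    NotFound,
}

/// Applies the `compressed_proxy` flag to a requested leaf.
fn route_leaf(leaf: &str, compressed_proxy: bool) -> Route {
    match uncompressed_leaf(leaf) {
        Some(name) if compressed_proxy => Route::Compressed(name),
        Some(_) => Route::NotFound,
        None => Route::Plain(leaf.to_owned()),
    }
}
```

With the flag off, `route_leaf("app.pd_", false)` yields `Route::NotFound`, which mirrors the rule that decompressed bytes are never served under an underscore filename.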

Benchmark

CAB synthesis is the only added per-request CPU cost. The other paths are a filesystem rename or an mmap and stay in the microsecond range. The numbers below come from a single-threaded run of the cab 0.6 MSZIP writer on Apple Silicon, release build.

| Input size | PDB realistic content | High entropy (worst case) |
| --- | --- | --- |
| 10 MB | 0.17 s (60 MB/s) | 0.29 s (34 MB/s) |
| 100 MB | 1.70 s (59 MB/s) | 2.91 s (34 MB/s) |
| 500 MB | 9.26 s (54 MB/s) | 13.31 s (38 MB/s) |
| 1 GB (extrapolated) | 19 s | 27 s |
| 5 GB (extrapolated) | 95 s | 135 s |
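
The description does not say how the extrapolated rows were derived. One plausible reading, assuming linear scaling from the 500 MB measurement and 1 GB = 1024 MB, reproduces the 1 GB figures; the helper names below are illustrative only.

```rust
/// Throughput in MB/s for `size_mb` megabytes compressed in `secs` seconds.
fn throughput_mb_s(size_mb: f64, secs: f64) -> f64 {
    size_mb / secs
}

/// Linear extrapolation: seconds needed for `target_mb` at a measured rate.
/// Assumption: throughput stays constant as the input grows.
fn extrapolate_secs(target_mb: f64, rate_mb_s: f64) -> f64 {
    target_mb / rate_mb_s
}
```

For the PDB column this gives 500 / 9.26 ≈ 54 MB/s and 1024 / 54 ≈ 19 s. For the high-entropy column it gives 500 / 13.31 ≈ 38 MB/s and about 27 s. The same assumption yields roughly 136 s for the 5 GB high-entropy case, slightly above the 135 s listed, which would be consistent with scaling the rounded 1 GB value by five.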

Cost is paid once per (scope, symbol) pair, on the first cache miss. Subsequent hits read from the cab_synth mmap in single-digit milliseconds. Concurrent requests for the same symbol deduplicate to one compression via Cacher::compute_memoized. Memory footprint stays bounded: input is a kernel demand-paged mmap, output streams through the writer to a tempfile, and DEFLATE state fits in a few hundred KB. With the flag off, the code path is bit-for-bit identical to before.
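
The deduplication idea can be illustrated with a small standard-library sketch. It is not how Cacher::compute_memoized is implemented (that code is async and backed by disk); `Memo` and `get_or_compute` are hypothetical names that only show how concurrent callers for one key can share a single computation.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// One shared result slot per cache key.
type Slot = Arc<OnceLock<Arc<Vec<u8>>>>;

/// In-memory stand-in for a memoizing cache (illustrative only).
struct Memo {
    slots: Mutex<HashMap<String, Slot>>,
}

impl Memo {
    fn new() -> Self {
        Memo { slots: Mutex::new(HashMap::new()) }
    }

    /// Runs `compute` at most once per key. Concurrent callers for the
    /// same key block until the first caller's result is available.
    fn get_or_compute(&self, key: &str, compute: impl FnOnce() -> Vec<u8>) -> Arc<Vec<u8>> {
        // Hold the map lock only long enough to find or create the slot,
        // so a slow computation for one key does not block other keys.
        let slot = {
            let mut slots = self.slots.lock().unwrap();
            Arc::clone(slots.entry(key.to_owned()).or_default())
        };
        // OnceLock guarantees that only one initializer runs per slot.
        Arc::clone(slot.get_or_init(|| Arc::new(compute())))
    }
}
```

In this sketch the map lock is released before the expensive work starts, so only callers asking for the same key wait on each other.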

@bosiakov bosiakov requested a review from a team as a code owner May 16, 2026 11:24
Comment thread crates/symbolicator-service/src/objects/data_cache.rs
@bosiakov bosiakov force-pushed the feat/compressed-symbol-proxy branch from 107f6e0 to 95205d2 Compare May 16, 2026 11:44
Comment thread crates/symbolicator-service/src/objects/raw_compressed_cache.rs
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 95205d2.

Comment thread crates/symbolicator-service/src/caching/memory.rs
Comment thread crates/symbolicator/src/endpoints/proxy.rs
Member

Out of curiosity, what is your use-case for the proxy endpoint? This is still an experimental feature and there were some talks about possibly removing it again as we didn't get much/any feedback that this is useful or getting used.

Author

@Dav1dde Thanks for asking!

We're a C++ engineering org self-hosting Sentry for native crash collection.
Our PDBs are 8 to 12 GB each.

Engineers on the Windows side would like a single URL that both Visual Studio and WinDbg can point at via _NT_SYMBOL_PATH, so that PDBs flow from CI through sentry-cli upload into the debugger.

The /proxy endpoint handles .pdb URLs but not the .pd_ compressed form that Microsoft symsrv clients ask for first, so we've been running a parallel HTTP mirror just for the compressed form. It works but is more infrastructure than we'd like for something Sentry already holds.

The change is opt-in. This PR would round out the symsrv compatibility story for us.

Thanks for taking a look!

Member

Thanks for explaining your use case; I will be discussing this with the team. We had some discussions around the proxy endpoint, but we never really made the decision to remove it, as it was already there and largely just used existing Symbolicator infrastructure.

The endpoint always had shortcomings: it doesn't work reliably in all cases when translating between different layouts, but it gives the illusion that it should. It also never really integrated well with Sentry itself (pulling symbol source definitions from Sentry, sharing a single authentication mechanism, etc.).

This is quite a large change to the endpoint, so we'll need to revive the discussions and decide whether we want to keep the endpoint at all.

It's really unfortunate that you put in all the effort and this is just my vague response, but I hope at least it gives some context.
