feat: pluggable azure blob cache provider with auto-discovery (experimental) by hasethuraman · Pull Request #139 · run-ai/runai-model-streamer

hasethuraman · 2026-04-12T11:50:26Z

Pluggable Azure Blob Cache Provider with Auto-Discovery

Adds a pluggable cache provider for Azure Blob Storage reads with zero-config auto-discovery. When a compatible cache provider package (e.g., tachyon-client) is installed alongside runai-model-streamer, the streamer automatically discovers and loads it at runtime — no environment variable configuration needed.

How it works

Cache provider package installs a shared library (e.g., libStorageDirect.so) into Python site-packages
At startup, the streamer auto-discovers the library via dladdr (walking from the runai .so location)
The library is loaded via dlopen and the blob_read symbol is resolved
All subsequent Azure blob reads are served through the cache provider
If no cache provider is installed, reads go directly to Azure Blob Storage — no regression

Components

runai_azcache_provider.h: C API contract (blob_read) for cache provider .so plugins
azcache_provider_loader.cc: Singleton that auto-discovers via dladdr, falls back to dlopen/dlsym
AsyncAzureClient integration: Routes reads through cache when provider is discovered
simple_file_cache_test.cc: Example cache implementation for reference and unit testing
RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_ENABLED: Disable switch to turn off cache even when provider is installed
Documentation: env-vars.md and usage.md updated with auto-discovery flow and cache provider authoring guide

Testing

C++ unit tests: provider loading, error handling, fallback behavior
Python e2e test: validates cache intercept with simple file-based provider
AKS validated: cold cache and warm cache

Summary by CodeRabbit

New Features
- Experimental pluggable Azure Blob cache provider: runtime-loadable provider redirects blob reads; clients can query if cache is enabled and async downloads use a single request when caching is active.
- Build support added to package the provider and test helper as shared libraries.
Documentation
- Added env-var reference, usage guide, and debugging instructions for the experimental cache provider.
Tests
- Added unit and end-to-end tests covering reads, errors, security checks, and fallback behavior.

noa-neria · 2026-04-16T13:33:48Z

@hasethuraman few questions about this feature:

Have you considered providing local path to the model streamer instead, if the model files are already cached? If the entire model is cached, why not pass the path to the cached model to the engine?
Is the cache intended only for the safetensors files? Asking because pull_files and list_safetensors still look at the blob stoprage.
There is no fallback to the blob storage if reading from cache had failed. Is that by design?

hasethuraman · 2026-04-22T16:17:33Z

@hasethuraman few questions about this feature:

Have you considered providing local path to the model streamer instead, if the model files are already cached? If the entire model is cached, why not pass the path to the cached model to the engine?

Is the cache intended only for the safetensors files? Asking because pull_files and list_safetensors still look at the blob stoprage.

There is no fallback to the blob storage if reading from cache had failed. Is that by design?

Apologies for the confusion @noa-neria Updated the PR description & relevant code. This is mainly for the integration with distributed cache solution which must handle the CACHEMISS scenario which would further speed up the streaming for cached model in the cluster. Please let me know if this makes sense.

greptile-apps · 2026-04-22T16:21:46Z

Greptile Summary

This PR adds an experimental pluggable cache provider for Azure Blob Storage reads, with zero-config auto-discovery via dladdr. When a compatible package (e.g., tachyon-client) is installed alongside the streamer, its shared library is auto-loaded at runtime; if none is found, all reads continue through the Azure SDK unchanged.

New AzCacheProviderLoader (singleton shared handle, ABI version check, dlopen/dlsym) wired into AsyncAzureClient; cache reads replace the chunked Azure SDK path with a single call per range.
C ABI header (runai_azcache_provider.h) and reference implementation (simple_file_cache_test.cc) define the provider contract with path-traversal guards and an optional shutdown symbol.
Tests and docs cover loading, ABI mismatch, path traversal, shared handle lifecycle, and a Python e2e test; several gaps (env-var false/true disable semantics, test_missing_cache_lib_falls_back stub) were noted in prior review rounds.

Confidence Score: 4/5

Safe to merge for an experimental feature; the new code paths are guarded behind a runtime flag and fall back gracefully when no provider is installed.

The only new finding beyond the previously flagged issues is that shutdown() from the plugin and dlopen during construction are both invoked while holding s_handle_mutex, which can block all other loader threads for the duration of plugin shutdown and is a latent deadlock risk if the plugin ever re-enters loader creation.

azcache_provider_loader.cc (mutex-scope around dlopen and shutdown) and azcache_provider_test.cc (open dlopen handle leaks per call).

Important Files Changed

Filename	Overview
cpp/azure/azcache_provider/azcache_provider_loader.cc	Core loader implementation: singleton shared handle, dladdr auto-discovery, ABI version check, `parse_cache_mode` only recognises `"0"` and `"1"` (previously flagged), and `shutdown()` is called while holding `s_handle_mutex` in the destructor.
cpp/azure/azcache_provider/azcache_provider_loader.h	Header defining `CacheMode`, `CacheProviderConfig`, `CacheLibHandle`, and `AzCacheProviderLoader`; design is clean with move/copy suppressed and clear immutability contracts.
cpp/azure/azcache_provider/runai_azcache_provider.h	C ABI contract header with 8-parameter `blob_read_fn`, ABI versioning, and clear contract documentation; consistent with actual loader usage.
cpp/azure/azcache_provider/simple_file_cache_test.cc	Reference/test cache provider with path-traversal guards (absolute-path fix applied), proper `pread` usage, and optional `shutdown` tracking for unit tests.
cpp/azure/azcache_provider/azcache_provider_test.cc	Unit tests cover ABI versioning, shared handle lifecycle, and path-traversal; `load_cache_fn()` and `load_abi_version()` helpers leak `dlopen` handles on every call (previously flagged).
cpp/azure/client/async_azure_client/async_azure_client.h	Cache integration added to `AsyncAzureClient`; cache-read failure throws `runtime_error` without Azure fallback (previously flagged); `_cache_enabled` is correctly initialised after `_cache_loader` is moved.
cpp/azure/client/client.cc	Cache loader instantiated from env vars and wired into `AsyncAzureClient`; `_account_name` is guaranteed set when the cache path is reached (non-Azurite production builds throw before that point).
tests/azure/test_azurecache.py	`test_stream_from_cache` is a meaningful e2e test; `test_missing_cache_lib_falls_back` only unsets an env var and never exercises the streamer (previously flagged).
docs/src/env-vars.md	Documents `RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_ENABLED`; `false` will not disable the cache (only `"0"` works — previously flagged).
docs/src/usage.md	Usage guide with cache-provider authoring section; `blob_read` signature shown uses 8 parameters matching the actual ABI; disable instruction correctly shows `=0`.

_{Reviews (11): Last reviewed commit: "feat(azure): shared CacheLibHandle singl..." | Re-trigger Greptile}

coderabbitai · 2026-04-22T16:22:02Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Adds a pluggable, runtime-loaded Azure blob cache provider (C ABI + singleton loader) and integrates it into AsyncAzureClient so blob-range reads may be served from a provider .so. Includes BUILD rules, a test provider .so, unit and e2e tests, and documentation for the new env var.

Changes

Cohort / File(s)	Summary
Build & Targets `cpp/azure/azcache_provider/BUILD`, `cpp/azure/client/async_azure_client/BUILD`	Adds Bazel targets: `azcache_provider` cc_library, `libsimple_file_cache_test.so` shared binary, `azcache_provider_test` test; adds `//azure/azcache_provider` to `async_azure_client` deps.
Public C API & Headers `cpp/azure/azcache_provider/runai_azcache_provider.h`, `cpp/azure/azcache_provider/azcache_provider_loader.h`	New public C header declares `blob_read_fn`, `BLOB_READ_SYMBOL`, `RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_ENABLED`, and `RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB`. Declares `AzCacheProviderLoader` singleton with `instance()`, `is_enabled()`, and `read()`; deleted copy ops.
Loader Implementation & Test Provider `cpp/azure/azcache_provider/azcache_provider_loader.cc`, `cpp/azure/azcache_provider/simple_file_cache_test.cc`	Implements runtime `dlopen`/`dlsym` loader, auto-discovery via `dladdr`, atomic enabled flag, and `read()` dispatch. Adds test-only `blob_read` implementation performing file-backed reads, path validation, and malloc'd error strings.
Async Client Integration `cpp/azure/client/async_azure_client/async_azure_client.h`, `cpp/azure/client/async_azure_client/async_azure_client.cc`, `cpp/azure/client/client.cc`	AsyncAzureClient now snapshots `_cache_enabled` (init from loader) and exposes `is_cache_enabled()`. `DownloadBlobRangeAsync` signature adds `account_name` and is no longer inline; when cache enabled it schedules provider reads, otherwise falls back to Azure SDK. `AzureClient::async_read` forces single-chunk reads when cache enabled.
Tests `cpp/azure/azcache_provider/azcache_provider_test.cc`, `tests/azure/test_azurecache.py`	Adds unit tests for loader/provider behaviors (dlopen/dlsym, error cases, path traversal) and an end-to-end pytest that compiles the test .so, populates a cache, runs the streamer subprocess, and verifies streamed tensors from cache.
Docs `docs/src/env-vars.md`, `docs/src/usage.md`	Documents `RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_ENABLED` and `RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB`, provider contract (`blob_read`), auto-discovery/loading behavior, and debug guidance.

Sequence Diagram

sequenceDiagram
    participant Client
    participant AsyncAzureClient
    participant AzCacheProviderLoader
    participant CacheLibrary as "Dynamic Cache Library"
    participant AzureSDK

    Client->>AsyncAzureClient: DownloadBlobRangeAsync(account, container, blob, buf, offset, len)
    AsyncAzureClient->>AsyncAzureClient: check _cache_enabled

    alt cache enabled
        AsyncAzureClient->>AzCacheProviderLoader: read(account, container, blob, buf, offset, len)
        AzCacheProviderLoader->>CacheLibrary: blob_read(account, container, blob, buf, offset, len, &err)
        CacheLibrary-->>AzCacheProviderLoader: bytes_read or error_string
        AzCacheProviderLoader-->>AsyncAzureClient: success/failure
        AsyncAzureClient->>Client: callback or throw on failure
    else cache disabled
        AsyncAzureClient->>AzureSDK: perform SDK download (createDownloadBlobFn)
        AzureSDK-->>AsyncAzureClient: bytes_read or error
        AsyncAzureClient->>Client: callback or error
    end

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I found a tiny .so beneath the logs,
I dlopen, dlsym, then hop through the fog,
A singleton nibble, an error freed,
Cache or Azure — I fetch what you need! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.14% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title clearly describes the main feature: a pluggable Azure blob cache provider with auto-discovery capability, all marked as experimental.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (4)

cpp/azure/client/async_azure_client/async_azure_client.h (1)
89-103: Consider using is_cache_enabled() for consistency.

Line 89 directly calls AzCacheProviderLoader::instance().is_enabled() instead of the newly added is_cache_enabled() method. While functionally equivalent, using the wrapper method would be more consistent.
♻️ Proposed fix
-        if (AzCacheProviderLoader::instance().is_enabled())
+        if (is_cache_enabled())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/client/async_azure_client/async_azure_client.h` around lines 89 -
103, Replace the direct call to AzCacheProviderLoader::instance().is_enabled()
with the wrapper method is_cache_enabled() for consistency: in the block that
creates the DownloadBlobTask (the lambda using
AzCacheProviderLoader::instance().read and the push_task(std::move(task)) call),
change the condition to call
AzCacheProviderLoader::instance().is_cache_enabled() so the code uses the new
consistent API surface.
docs/src/usage.md (1)
239-242: Add language specifier to fenced code block.

The code block showing expected log output should have a language specifier for consistency and to satisfy linting.
📝 Proposed fix
 You should see:
-```
+```text
 AzCacheProvider: loading cache library: /path/to/your_cache_provider.so
 AzCacheProvider: cache provider loaded successfully from /path/to/your_cache_provider.so
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.

In @docs/src/usage.md around lines 239 - 242, The fenced code block showing
AzCacheProvider log output lacks a language specifier; update the markdown in
docs/src/usage.md to add a language tag (e.g., text or console) to the
existing block that contains the AzCacheProvider lines so linting and syntax
highlighting work correctly while preserving the exact log content.
</details>

</blockquote></details>
<details>
<summary>cpp/azure/azcache_provider/azcache_provider_test.cc (1)</summary><blockquote>

`62-78`: **Handle leak: `dlopen` result is not closed.**

Each call to `load_cache_fn()` opens a new handle that is never closed. While the OS reclaims on test exit and the same `.so` will likely return the same handle, it's cleaner to either store the handle in the fixture and close it in `TearDown`, or use a class member.


<details>
<summary>♻️ Store handle in fixture for proper cleanup</summary>

```diff
 class SimpleFileCacheTest : public ::testing::Test
 {
 protected:
     fs::path cache_dir_;
     fs::path so_path_;
+    void* so_handle_ = nullptr;

     void SetUp() override
     {
         // ... existing code ...
     }

     void TearDown() override
     {
+        if (so_handle_)
+        {
+            dlclose(so_handle_);
+            so_handle_ = nullptr;
+        }
         fs::remove_all(cache_dir_);
         unsetenv("RUNAI_CACHE_DIR");
     }

     az_cache_read_fn load_cache_fn()
     {
-        void* handle = dlopen(so_path_.c_str(), RTLD_NOW | RTLD_LOCAL);
-        if (!handle)
+        if (!so_handle_)
+        {
+            so_handle_ = dlopen(so_path_.c_str(), RTLD_NOW | RTLD_LOCAL);
+        }
+        if (!so_handle_)
         {
             ADD_FAILURE() << "dlopen failed: " << dlerror();
             return nullptr;
         }
         auto fn = reinterpret_cast<az_cache_read_fn>(
-            dlsym(handle, AZ_CACHE_READ_SYMBOL));
+            dlsym(so_handle_, AZ_CACHE_READ_SYMBOL));
         // ...
     }
 };
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/azcache_provider/azcache_provider_test.cc` around lines 62 - 78,
load_cache_fn opens the shared library with dlopen but never closes the handle,
leaking it; modify the test fixture to store the void* handle as a member (e.g.,
add a member like lib_handle_), have load_cache_fn assign the dlopen result to
that member instead of a local variable, and in the fixture TearDown (or the
fixture destructor) call dlclose(lib_handle_) if non-null and reset it; keep
using so_path_ and AZ_CACHE_READ_SYMBOL for dlsym and ensure load_cache_fn still
returns the az_cache_read_fn pointer.
```

</details>

</blockquote></details>
<details>
<summary>cpp/azure/azcache_provider/azcache_provider_loader.cc (1)</summary><blockquote>

`79-100`: **Consider defensive cleanup of `error_string` on success path.**

A misbehaving cache provider might allocate `error_string` even on success. While the contract says it should only be set on failure, adding a defensive `free` on the success path would prevent a memory leak from non-conforming providers.


<details>
<summary>🛡️ Optional defensive fix</summary>

```diff
     LOG(SPAM) << "AzCacheProvider: cache read " << length << " bytes from "
               << container << "/" << blob << " offset=" << offset;
+    if (error_string)
+    {
+        free(error_string);  // Defensive: provider should not set on success
+    }
     return true;
 }
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/azcache_provider/azcache_provider_loader.cc` around lines 79 - 100,
The success path currently ignores a non-null error_string which could be
allocated by a misbehaving provider; in the function containing the bytes_read
check (variables: bytes_read, length, container, blob, offset, and error_string)
add a defensive cleanup: before logging the success (LOG(SPAM) ...) and
returning true, check if error_string is non-null and free it to avoid leaking
memory from non-conforming providers (keeping the existing free(error_string) in
the failure/error branch).
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @docs/src/env-vars.md:

Line 123: Update the broken intra-document anchor link on the line containing
"See Azure Blob Cache Provider
(Experimental)..." to point to the
correct document by using a relative cross-document link (e.g.,
"usage.md#azure-blob-cache-provider-experimental"); modify the link text in
docs/src/env-vars.md so it references the usage.md anchor instead of a local
anchor.

In @tests/azure/test_azurecache.py:

Around line 74-79: The test invokes subprocess.run to compile
simple_file_cache_test.cc using "gcc"; change the compiler invocation in the
subprocess.run call inside tests/azure/test_azurecache.py to use "g++" instead
of "gcc" so the .cc C++ source is compiled correctly (update the argument list
where ["gcc", "-shared", "-fPIC", "-o", so_path, src] appears and keep the
existing capture_output/text/error handling intact).

Around line 182-185: The test test_missing_cache_lib_falls_back currently only
unsets RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB but does not assert fallback
behavior; update it to either (A) actually exercise the streaming path and
assert the cache provider is not used (e.g., start a subprocess or invoke the
streaming function and check that the Azure cache provider is disabled/not
initialized or that reads come from Azure instead), referencing the same test
name test_missing_cache_lib_falls_back and any provider init/check function used
in tests (e.g., the cache provider initialization method or is_enabled flag) to
verify provider disabled, or (B) mark the test as a placeholder by adding a
skip/ TODO so it’s clear it’s incomplete; ensure the chosen approach asserts the
expected fallback to Azure when RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB is
unset.

Nitpick comments:
In @cpp/azure/azcache_provider/azcache_provider_loader.cc:

Around line 79-100: The success path currently ignores a non-null error_string
which could be allocated by a misbehaving provider; in the function containing
the bytes_read check (variables: bytes_read, length, container, blob, offset,
and error_string) add a defensive cleanup: before logging the success (LOG(SPAM)
...) and returning true, check if error_string is non-null and free it to avoid
leaking memory from non-conforming providers (keeping the existing
free(error_string) in the failure/error branch).

In @cpp/azure/azcache_provider/azcache_provider_test.cc:

Around line 62-78: load_cache_fn opens the shared library with dlopen but
never closes the handle, leaking it; modify the test fixture to store the void*
handle as a member (e.g., add a member like lib_handle_), have load_cache_fn
assign the dlopen result to that member instead of a local variable, and in the
fixture TearDown (or the fixture destructor) call dlclose(lib_handle_) if
non-null and reset it; keep using so_path_ and AZ_CACHE_READ_SYMBOL for dlsym
and ensure load_cache_fn still returns the az_cache_read_fn pointer.

In @cpp/azure/client/async_azure_client/async_azure_client.h:

Around line 89-103: Replace the direct call to
AzCacheProviderLoader::instance().is_enabled() with the wrapper method
is_cache_enabled() for consistency: in the block that creates the
DownloadBlobTask (the lambda using AzCacheProviderLoader::instance().read and
the push_task(std::move(task)) call), change the condition to call
AzCacheProviderLoader::instance().is_cache_enabled() so the code uses the new
consistent API surface.

In @docs/src/usage.md:

Around line 239-242: The fenced code block showing AzCacheProvider log output
lacks a language specifier; update the markdown in docs/src/usage.md to add a
language tag (e.g., text or console) to the existing block that contains
the AzCacheProvider lines so linting and syntax highlighting work correctly
while preserving the exact log content.
</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ]  Push a commit to this branch (recommended)
- [ ]  Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro Plus

**Run ID**: `73061442-98a8-43ca-af7a-42ed2f0e9003`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 2e26a8cf78012f2c5fb10c9f4c66c0d2114612d4 and aae30930cf428b3da0e7d1be97a3ac8aae5a05e9.

</details>

<details>
<summary>📒 Files selected for processing (12)</summary>

* `cpp/azure/azcache_provider/BUILD`
* `cpp/azure/azcache_provider/azcache_provider_loader.cc`
* `cpp/azure/azcache_provider/azcache_provider_loader.h`
* `cpp/azure/azcache_provider/azcache_provider_test.cc`
* `cpp/azure/azcache_provider/runai_azcache_provider.h`
* `cpp/azure/azcache_provider/simple_file_cache_test.cc`
* `cpp/azure/client/async_azure_client/BUILD`
* `cpp/azure/client/async_azure_client/async_azure_client.h`
* `cpp/azure/client/client.cc`
* `docs/src/env-vars.md`
* `docs/src/usage.md`
* `tests/azure/test_azurecache.py`

</details>

</details>

hasethuraman

Addressed all open review comments in commit 7367e06:

Path traversal: absolute paths (greptile) — Added if (s[0] == '/') return 1; at the top of has_path_traversal() to reject absolute paths. Good catch.
Broken cross-document link (CodeRabbit) — Fixed #azure-blob-cache-provider-experimental → usage.md#azure-blob-cache-provider-experimental.
gcc → g++ (CodeRabbit) — Changed compiler to g++ for the .cc source file in test fixture.
Incomplete fallback test (CodeRabbit) — Added TODO comment explaining that full fallback verification requires a subprocess with Azure credentials due to the singleton design.

Additionally: cached is_cache_enabled() as a const bool _cache_enabled member in AsyncAzureClient, initialized once at construction, to avoid repeated singleton access on every blob range read.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (4)

cpp/azure/client/async_azure_client/async_azure_client.h (1)
86-99: Consider including container/blob in the error message for easier debugging.

When the cache provider read fails, the error message "Cache provider read failed" lacks context about which blob failed. The AzCacheProviderLoader::read() already logs details, but including identifiers in the thrown exception would help correlate errors.
💡 Suggested improvement
             DownloadBlobTask task{
                 [container_name, blob_name, buffer, offset, length]() {
                     bool ok = AzCacheProviderLoader::instance().read(
                         container_name, blob_name, buffer, offset, length);
                     if (!ok)
                     {
-                        throw std::runtime_error("Cache provider read failed");
+                        throw std::runtime_error(
+                            "Cache provider read failed for " + container_name + "/" + blob_name);
                     }
                 },
                 std::move(callback)
             };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/client/async_azure_client/async_azure_client.h` around lines 86 -
99, The thrown exception inside the _cache_enabled branch of DownloadBlobTask
should include the container/blob identifiers to aid debugging: update the
lambda passed to DownloadBlobTask (the one calling
AzCacheProviderLoader::instance().read with container_name and blob_name) so
that when read() returns false it throws a std::runtime_error containing
container_name and blob_name (and optionally offset/length) instead of the
generic "Cache provider read failed", then push_task as before; reference the
DownloadBlobTask lambda, AzCacheProviderLoader::instance().read, container_name
and blob_name.
tests/azure/test_azurecache.py (1)
21-22: Unused import: time.

The time module is imported but never used in this file.
🧹 Proposed fix
 import tempfile
-import time
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/azure/test_azurecache.py` around lines 21 - 22, Remove the unused
import of the time module from tests/azure/test_azurecache.py by deleting the
"import time" statement so there are no unused imports; locate the top-level
import line that references time and remove it (no other changes to functions or
tests are required).
docs/src/usage.md (1)
238-243: Add language identifier to fenced code block.

The code block showing example log output should specify a language (e.g., text or plaintext) for consistency and to satisfy linting rules.
📝 Proposed fix
 You should see:
-```
+```text
 AzCacheProvider: loading cache library: /path/to/your_cache_provider.so
 AzCacheProvider: cache provider loaded successfully from /path/to/your_cache_provider.so
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.

In @docs/src/usage.md around lines 238 - 243, The fenced code block showing
example log output lacks a language identifier; update the block surrounding the
two AzCacheProvider lines to include a language tag (e.g., use ```text or
consistent—locate the example block containing "AzCacheProvider: loading cache
library..." and "AzCacheProvider: cache provider loaded successfully..." and add
the language identifier after the opening backticks.
cpp/azure/azcache_provider/azcache_provider_test.cc (1)
63-78: Minor: dlopen handle is never closed.

The load_cache_fn() helper opens the shared library but never stores or closes the handle. For tests this is acceptable since the process exits, but it's technically a resource leak. Consider storing the handle in the fixture and closing in TearDown() if this pattern is reused.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/azcache_provider/azcache_provider_test.cc` around lines 63 - 78,
load_cache_fn opens the shared object with dlopen but never closes the handle,
leaking a resource; modify the test fixture to store the handle (e.g., add a
void* so_handle_ member initialized to nullptr), set so_handle_ =
dlopen(so_path_.c_str(), ...) inside load_cache_fn (closing any previous handle
if non-null), and return the dlsym result as before; then implement TearDown()
to check if so_handle_ is non-null and call dlclose(so_handle_) and set it to
nullptr, handling error cases from dlopen/dlsym as already done by
ADD_FAILURE().

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/azure/azcache_provider/azcache_provider_loader.h`:
- Around line 3-5: Remove the unused `#include` <mutex> from
azcache_provider_loader.h: the header and its implementation use std::atomic
(e.g., the singleton flag/atomic<bool> usage) and do not reference std::mutex or
any mutex types, so delete the <mutex> include to avoid an unused include and
keep the file minimal.

---

Nitpick comments:
In `@cpp/azure/azcache_provider/azcache_provider_test.cc`:
- Around line 63-78: load_cache_fn opens the shared object with dlopen but never
closes the handle, leaking a resource; modify the test fixture to store the
handle (e.g., add a void* so_handle_ member initialized to nullptr), set
so_handle_ = dlopen(so_path_.c_str(), ...) inside load_cache_fn (closing any
previous handle if non-null), and return the dlsym result as before; then
implement TearDown() to check if so_handle_ is non-null and call
dlclose(so_handle_) and set it to nullptr, handling error cases from
dlopen/dlsym as already done by ADD_FAILURE().

In `@cpp/azure/client/async_azure_client/async_azure_client.h`:
- Around line 86-99: The thrown exception inside the _cache_enabled branch of
DownloadBlobTask should include the container/blob identifiers to aid debugging:
update the lambda passed to DownloadBlobTask (the one calling
AzCacheProviderLoader::instance().read with container_name and blob_name) so
that when read() returns false it throws a std::runtime_error containing
container_name and blob_name (and optionally offset/length) instead of the
generic "Cache provider read failed", then push_task as before; reference the
DownloadBlobTask lambda, AzCacheProviderLoader::instance().read, container_name
and blob_name.

In `@docs/src/usage.md`:
- Around line 238-243: The fenced code block showing example log output lacks a
language identifier; update the block surrounding the two AzCacheProvider lines
to include a language tag (e.g., use ```text or ```plaintext) so the markdown
linter accepts it and syntax highlighting is consistent—locate the example block
containing "AzCacheProvider: loading cache library..." and "AzCacheProvider:
cache provider loaded successfully..." and add the language identifier after the
opening backticks.

In `@tests/azure/test_azurecache.py`:
- Around line 21-22: Remove the unused import of the time module from
tests/azure/test_azurecache.py by deleting the "import time" statement so there
are no unused imports; locate the top-level import line that references time and
remove it (no other changes to functions or tests are required).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: bbf12699-0b23-42a7-8bb6-9b5b11010e8d

📥 Commits

Reviewing files that changed from the base of the PR and between aae3093 and 7367e06.

📒 Files selected for processing (13)

cpp/azure/azcache_provider/BUILD
cpp/azure/azcache_provider/azcache_provider_loader.cc
cpp/azure/azcache_provider/azcache_provider_loader.h
cpp/azure/azcache_provider/azcache_provider_test.cc
cpp/azure/azcache_provider/runai_azcache_provider.h
cpp/azure/azcache_provider/simple_file_cache_test.cc
cpp/azure/client/async_azure_client/BUILD
cpp/azure/client/async_azure_client/async_azure_client.cc
cpp/azure/client/async_azure_client/async_azure_client.h
cpp/azure/client/client.cc
docs/src/env-vars.md
docs/src/usage.md
tests/azure/test_azurecache.py

✅ Files skipped from review due to trivial changes (2)

cpp/azure/client/async_azure_client/BUILD
docs/src/env-vars.md

🚧 Files skipped from review as they are similar to previous changes (1)

cpp/azure/client/client.cc

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

cpp/azure/azcache_provider/azcache_provider_test.cc (1)

62-78: Consider closing the library handle after tests or reusing a single handle.

load_cache_fn() calls dlopen but never dlclose, leaking the handle. While this is acceptable in tests (the OS cleans up on process exit), you could either:

Store the handle in the fixture and call dlclose in TearDown
Reuse a single handle across test cases

♻️ Optional improvement to track and close the handle

 class SimpleFileCacheTest : public ::testing::Test
 {
 protected:
     fs::path cache_dir_;
     fs::path so_path_;
+    void* lib_handle_ = nullptr;

     void SetUp() override
     { ... }

     void TearDown() override
     {
         fs::remove_all(cache_dir_);
         unsetenv("RUNAI_CACHE_DIR");
+        if (lib_handle_) {
+            dlclose(lib_handle_);
+            lib_handle_ = nullptr;
+        }
     }

     az_cache_read_fn load_cache_fn()
     {
-        void* handle = dlopen(so_path_.c_str(), RTLD_NOW | RTLD_LOCAL);
-        if (!handle)
+        if (!lib_handle_) {
+            lib_handle_ = dlopen(so_path_.c_str(), RTLD_NOW | RTLD_LOCAL);
+        }
+        if (!lib_handle_)
         {
             ADD_FAILURE() << "dlopen failed: " << dlerror();
             return nullptr;
         }
         auto fn = reinterpret_cast<az_cache_read_fn>(
-            dlsym(handle, AZ_CACHE_READ_SYMBOL));
+            dlsym(lib_handle_, AZ_CACHE_READ_SYMBOL));
         ...
     }
 };

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/azcache_provider/azcache_provider_test.cc` around lines 62 - 78,
The test currently leaks the dlopen handle in load_cache_fn() by never calling
dlclose; modify the test fixture to store the void* handle (returned by dlopen
using so_path_) as a member (e.g., fixture field) and change load_cache_fn() to
reuse that handle (dlopen once) or to set the handle member when opening; then
implement TearDown() in the test fixture to call dlclose(handle) if handle is
non-null, and update load_cache_fn() to not return early without setting or
reusing the stored handle (referencing load_cache_fn, so_path_, dlopen, dlclose,
and the fixture TearDown).

tests/azure/test_azurecache.py (1)

21-21: Remove unused time import.

The time module is imported but not used in this test file.

🧹 Proposed fix

 import subprocess
 import sys
 import tempfile
-import time

 import numpy as np

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/azure/test_azurecache.py` at line 21, Remove the unused import
statement "import time" from the test file; locate the top-level import line
"import time" in tests/azure/test_azurecache.py and delete it so there are no
unused imports left in the module.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/azure/client/async_azure_client/async_azure_client.h`:
- Around line 86-100: The code throws a generic std::runtime_error("Cache
provider read failed") inside the DownloadBlobTask when
AzCacheProviderLoader::instance().read(...) returns false, losing provider error
details; update the failure path in the _cache_enabled branch (DownloadBlobTask
lambda) to capture and include the provider's error information (e.g., call a
loader method like AzCacheProviderLoader::instance().last_error() or equivalent)
or call the loader's logging before throwing, and construct the thrown exception
message to include that detailed text so push_task receives an exception with
the provider error included.

In `@docs/src/usage.md`:
- Around line 238-243: The fenced code block showing the example log output (the
block containing the two "AzCacheProvider: ..." lines) is missing a language
specifier; update that triple-backtick block to include a language like text or
console (e.g., ```text) so linting and syntax highlighting work correctly and
the example log output is consistent.

---

Nitpick comments:
In `@cpp/azure/azcache_provider/azcache_provider_test.cc`:
- Around line 62-78: The test currently leaks the dlopen handle in
load_cache_fn() by never calling dlclose; modify the test fixture to store the
void* handle (returned by dlopen using so_path_) as a member (e.g., fixture
field) and change load_cache_fn() to reuse that handle (dlopen once) or to set
the handle member when opening; then implement TearDown() in the test fixture to
call dlclose(handle) if handle is non-null, and update load_cache_fn() to not
return early without setting or reusing the stored handle (referencing
load_cache_fn, so_path_, dlopen, dlclose, and the fixture TearDown).

In `@tests/azure/test_azurecache.py`:
- Line 21: Remove the unused import statement "import time" from the test file;
locate the top-level import line "import time" in tests/azure/test_azurecache.py
and delete it so there are no unused imports left in the module.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 1b4dbc49-9dc5-403c-b5e1-7905f51c3fd3

📥 Commits

Reviewing files that changed from the base of the PR and between 7367e06 and 69ced42.

📒 Files selected for processing (13)

cpp/azure/azcache_provider/BUILD
cpp/azure/azcache_provider/azcache_provider_loader.cc
cpp/azure/azcache_provider/azcache_provider_loader.h
cpp/azure/azcache_provider/azcache_provider_test.cc
cpp/azure/azcache_provider/runai_azcache_provider.h
cpp/azure/azcache_provider/simple_file_cache_test.cc
cpp/azure/client/async_azure_client/BUILD
cpp/azure/client/async_azure_client/async_azure_client.cc
cpp/azure/client/async_azure_client/async_azure_client.h
cpp/azure/client/client.cc
docs/src/env-vars.md
docs/src/usage.md
tests/azure/test_azurecache.py

✅ Files skipped from review due to trivial changes (2)

cpp/azure/client/async_azure_client/BUILD
docs/src/env-vars.md

🚧 Files skipped from review as they are similar to previous changes (4)

cpp/azure/client/async_azure_client/async_azure_client.cc
cpp/azure/client/client.cc
cpp/azure/azcache_provider/azcache_provider_loader.cc
cpp/azure/azcache_provider/BUILD

coderabbitai

🧹 Nitpick comments (3)

cpp/azure/azcache_provider/azcache_provider_test.cc (1)
62-78: Minor: dlopen handle is not closed.

Each call to load_cache_fn() opens the shared library but never calls dlclose(). While this is acceptable in test code (process termination cleans up), storing and closing the handle would be cleaner. Since dlopen with the same path returns the same handle (reference counted), this doesn't cause multiple loads, just a minor resource tracking issue.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/azcache_provider/azcache_provider_test.cc` around lines 62 - 78,
The test helper load_cache_fn opens the shared object with dlopen and never
calls dlclose, leaking the handle; modify load_cache_fn to store the dlopen
result in a local variable (handle already present), and after obtaining the
symbol via dlsym call dlclose(handle) before returning (or keep and manage the
handle lifecycle in a static/field if you need the handle alive); ensure you
still return the function pointer (az_cache_read_fn) and report errors via
ADD_FAILURE() when dlopen or dlsym fail (use the existing so_path_, handle,
dlopen, dlsym, dlclose, and AZ_CACHE_READ_SYMBOL identifiers to locate and
update the code).
tests/azure/test_azurecache.py (2)
21-21: Unused import.

time is imported but never used in the file.
🧹 Proposed fix
 import subprocess
 import sys
 import tempfile
-import time
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/azure/test_azurecache.py` at line 21, Remove the unused import "time"
from the import section of tests/azure/test_azurecache.py; locate the line that
reads "import time" and delete it so there are no unused imports remaining (or
replace it with a needed import only if intended to be used by a test in
functions/classes in this file).
145-159: Consider removing redundant env var settings in inline script.

Lines 147-148 set RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB and RUNAI_CACHE_DIR in the script, but these are already passed via env=env at line 167. The subprocess inherits the environment dict, making these assignments redundant.

Not a bug, but could simplify the script slightly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/azure/test_azurecache.py` around lines 145 - 159, The inline Python
script assigned to the variable script sets
RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB and RUNAI_CACHE_DIR via os.environ
but those are already provided through the env dictionary passed to the
subprocess (env), so remove the redundant os.environ assignments from the script
string; keep the imports, uri, SafetensorsStreamer usage and assertions intact,
and rely on the existing env dict used when launching the subprocess to supply
those variables.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@cpp/azure/azcache_provider/azcache_provider_test.cc`:
- Around line 62-78: The test helper load_cache_fn opens the shared object with
dlopen and never calls dlclose, leaking the handle; modify load_cache_fn to
store the dlopen result in a local variable (handle already present), and after
obtaining the symbol via dlsym call dlclose(handle) before returning (or keep
and manage the handle lifecycle in a static/field if you need the handle alive);
ensure you still return the function pointer (az_cache_read_fn) and report
errors via ADD_FAILURE() when dlopen or dlsym fail (use the existing so_path_,
handle, dlopen, dlsym, dlclose, and AZ_CACHE_READ_SYMBOL identifiers to locate
and update the code).

In `@tests/azure/test_azurecache.py`:
- Line 21: Remove the unused import "time" from the import section of
tests/azure/test_azurecache.py; locate the line that reads "import time" and
delete it so there are no unused imports remaining (or replace it with a needed
import only if intended to be used by a test in functions/classes in this file).
- Around line 145-159: The inline Python script assigned to the variable script
sets RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB and RUNAI_CACHE_DIR via
os.environ but those are already provided through the env dictionary passed to
the subprocess (env), so remove the redundant os.environ assignments from the
script string; keep the imports, uri, SafetensorsStreamer usage and assertions
intact, and rely on the existing env dict used when launching the subprocess to
supply those variables.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: b6e9f90f-9cd6-4ab8-aeae-ada341e2b9b5

📥 Commits

Reviewing files that changed from the base of the PR and between 69ced42 and 877d721.

📒 Files selected for processing (13)

cpp/azure/azcache_provider/BUILD
cpp/azure/azcache_provider/azcache_provider_loader.cc
cpp/azure/azcache_provider/azcache_provider_loader.h
cpp/azure/azcache_provider/azcache_provider_test.cc
cpp/azure/azcache_provider/runai_azcache_provider.h
cpp/azure/azcache_provider/simple_file_cache_test.cc
cpp/azure/client/async_azure_client/BUILD
cpp/azure/client/async_azure_client/async_azure_client.cc
cpp/azure/client/async_azure_client/async_azure_client.h
cpp/azure/client/client.cc
docs/src/env-vars.md
docs/src/usage.md
tests/azure/test_azurecache.py

✅ Files skipped from review due to trivial changes (4)

cpp/azure/client/async_azure_client/BUILD
docs/src/env-vars.md
docs/src/usage.md
cpp/azure/azcache_provider/BUILD

🚧 Files skipped from review as they are similar to previous changes (3)

cpp/azure/client/async_azure_client/async_azure_client.cc
cpp/azure/azcache_provider/azcache_provider_loader.cc
cpp/azure/client/async_azure_client/async_azure_client.h

hasethuraman · 2026-04-23T10:32:37Z

@noa-neria I am also getting an internal review FYI. Will update once it is done

…mental) - Define blob_read C API header for custom cache providers - Dynamic loader with dlopen/dlsym for cache provider .so - Auto-discover cache library in Python site-packages via dladdr - RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_ENABLED as a disable switch - Example simple_file_cache_test implementation for reference/testing - Unit tests and integration test infrastructure - Documentation: env-vars.md and usage.md updated Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

tests/azure/test_azurecache.py (1)

27-29: Path resolution may fail in some test environments.

The relative path construction using os.path.dirname(__file__) with multiple .. segments assumes a specific directory structure. If tests are run from an installed package or a different working directory, this path may not resolve correctly.

The existing pytest.skip() in the fixture handles this gracefully, but consider adding a comment explaining the expected directory structure.
📝 Suggested documentation improvement
+# Path to the C++ test provider source file.
+# This assumes the test is run from the repository root with the standard layout:
+#   tests/azure/test_azurecache.py
+#   cpp/azure/azcache_provider/simple_file_cache_test.cc
 SIMPLE_FILE_CACHE_SRC = os.path.join(
     os.path.dirname(__file__), "..", "..", "cpp", "azure", "azcache_provider", "simple_file_cache_test.cc"
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/azure/test_azurecache.py` around lines 27 - 29, SIMPLE_FILE_CACHE_SRC
path resolution can fail in installs or different CWDs; add a concise comment
above the SIMPLE_FILE_CACHE_SRC definition explaining the expected repository
layout (that tests assume the repo source tree with
cpp/azure/azcache_provider/simple_file_cache_test.cc located two levels up) and
note that the test fixture uses pytest.skip() when the file isn't found—this
documents why the relative path is used and when the skip will occur so future
readers understand the assumption and fallback.

cpp/azure/azcache_provider/azcache_provider_test.cc (1)

62-78: Minor: dlopen handle is never closed.

The load_cache_fn() helper calls dlopen() but never stores the handle for later dlclose(). While this is acceptable in tests (process exit cleans up), it could cause issues if tests are run repeatedly or if the test framework reuses processes.

Consider storing the handle in the fixture and calling dlclose() in TearDown().

♻️ Suggested improvement

 class SimpleFileCacheTest : public ::testing::Test
 {
 protected:
     fs::path cache_dir_;
     fs::path so_path_;
+    void* lib_handle_ = nullptr;

     void SetUp() override
     {
         // ... existing code ...
     }

     void TearDown() override
     {
+        if (lib_handle_) {
+            dlclose(lib_handle_);
+            lib_handle_ = nullptr;
+        }
         fs::remove_all(cache_dir_);
         unsetenv("RUNAI_CACHE_DIR");
     }

     blob_read_fn load_cache_fn()
     {
-        void* handle = dlopen(so_path_.c_str(), RTLD_NOW | RTLD_LOCAL);
-        if (!handle)
+        if (!lib_handle_) {
+            lib_handle_ = dlopen(so_path_.c_str(), RTLD_NOW | RTLD_LOCAL);
+        }
+        if (!lib_handle_)
         {
             ADD_FAILURE() << "dlopen failed: " << dlerror();
             return nullptr;
         }
         auto fn = reinterpret_cast<blob_read_fn>(
-            dlsym(handle, BLOB_READ_SYMBOL));
+            dlsym(lib_handle_, BLOB_READ_SYMBOL));
         if (!fn)
         {
             ADD_FAILURE() << "dlsym failed: " << dlerror();
         }
         return fn;
     }
 };

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/azcache_provider/azcache_provider_test.cc` around lines 62 - 78,
load_cache_fn() opens a shared object with dlopen but never saves the returned
handle for dlclose; modify the test fixture to store the void* handle (returned
by dlopen using so_path_ and BLOB_READ_SYMBOL lookup) on the fixture (e.g. a
member like dl_handle_), return the blob_read_fn as before, and implement
TearDown() to call dlclose(dl_handle_) if non-null and clear it to ensure the
handle is closed between tests.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/azure/azcache_provider/azcache_provider_loader.cc`:
- Around line 94-101: The env-var handling currently treats an explicitly-set
empty RUNAI_AZURE_CACHE_LIB_ENV value as valid, which lets dlopen("") bind the
main program; update the logic in the azcache provider loader so that when
utils::try_getenv<std::string>(RUNAI_AZURE_CACHE_LIB_ENV, lib_path) returns true
you also treat lib_path.empty() as a disabled/invalid value (same behavior as
is_disabled_value), log that it was explicitly disabled, and return early before
the dlopen call; apply the same empty-string check where the code later handles
RUNAI_AZURE_CACHE_LIB_ENV/loader path prior to calling dlopen to avoid binding
the main program handle.
- Around line 61-65: The call to ::tolower in is_disabled_value risks undefined
behavior when char is signed; modify the std::transform lambda to cast each
character to unsigned char before calling ::tolower (i.e.,
transform(lower.begin(), lower.end(), lower.begin(), [](unsigned char c){ return
std::tolower(static_cast<unsigned char>(c)); }) or equivalent) so that the
characters in string lower are safely converted to lower-case; update the
transform usage in is_disabled_value accordingly.

---

Nitpick comments:
In `@cpp/azure/azcache_provider/azcache_provider_test.cc`:
- Around line 62-78: load_cache_fn() opens a shared object with dlopen but never
saves the returned handle for dlclose; modify the test fixture to store the
void* handle (returned by dlopen using so_path_ and BLOB_READ_SYMBOL lookup) on
the fixture (e.g. a member like dl_handle_), return the blob_read_fn as before,
and implement TearDown() to call dlclose(dl_handle_) if non-null and clear it to
ensure the handle is closed between tests.

In `@tests/azure/test_azurecache.py`:
- Around line 27-29: SIMPLE_FILE_CACHE_SRC path resolution can fail in installs
or different CWDs; add a concise comment above the SIMPLE_FILE_CACHE_SRC
definition explaining the expected repository layout (that tests assume the repo
source tree with cpp/azure/azcache_provider/simple_file_cache_test.cc located
two levels up) and note that the test fixture uses pytest.skip() when the file
isn't found—this documents why the relative path is used and when the skip will
occur so future readers understand the assumption and fallback.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 3ff28027-e486-4385-bf2e-608adaa0c533

📥 Commits

Reviewing files that changed from the base of the PR and between 877d721 and ccbe45e.

📒 Files selected for processing (13)

cpp/azure/azcache_provider/BUILD
cpp/azure/azcache_provider/azcache_provider_loader.cc
cpp/azure/azcache_provider/azcache_provider_loader.h
cpp/azure/azcache_provider/azcache_provider_test.cc
cpp/azure/azcache_provider/runai_azcache_provider.h
cpp/azure/azcache_provider/simple_file_cache_test.cc
cpp/azure/client/async_azure_client/BUILD
cpp/azure/client/async_azure_client/async_azure_client.cc
cpp/azure/client/async_azure_client/async_azure_client.h
cpp/azure/client/client.cc
docs/src/env-vars.md
docs/src/usage.md
tests/azure/test_azurecache.py

✅ Files skipped from review due to trivial changes (4)

cpp/azure/client/async_azure_client/async_azure_client.cc
cpp/azure/client/async_azure_client/BUILD
cpp/azure/azcache_provider/BUILD
docs/src/usage.md

kyleknap

Just leaving some comments/questions that I had as I work through the PR

kyleknap · 2026-05-01T21:05:18Z

+    }
+
+    // libstreamerazure.so → lib/ → libstreamer/ → runai_model_streamer/ → site-packages/
+    auto site_packages = azure_so.parent_path().parent_path().parent_path().parent_path();


Instead of relying on where the current location of libstreamerazure.so, I wonder if it is possible to leverage Python itself (e.g., using sys.path or site.getsitepackages()) to get the full list of locations Python will look for packages? Maybe if we are able to call one of these functions as part of the bindings to model streamer interface we could inject the value down into configuration to be read as part of this logic?

Mainly with the logic as is, it will probably work for most cases (e.g., a customer installing the package from PyPI) but not sure if it will hold for cases like editable installs or environments where the site packages is searched in multiple different locations (e.g., setting PYTHONPATH)

Common python layer calls the libstreamerazure.so & so we dont have a chance to call azure specific. And we wanted zero touch without user doing any explicit import. Since libstreamerazure.so is expected to be in site-packages, we will support the tachyon installed in same site. WDYT? We can enhance c++ code to look into PYTHONPATH later?

Resolving it for now & track later.

I agree that this should be moved to the python layer.

We can add a new hook prepare_runtime similar to the existing scheme in safetensors streamer:

runai_model_streamer/s3_utils/s3_utils.py is the existing scheme-dispatch layer:

Scheme detectors: is_s3_path, is_gs_path, is_azure_path.

Lazy importers: `get_s3_files_module(), get_gcs_files_module(), get_azure_files_module() — each returns None if the backend package isn't installed.

Thin per-scheme wrappers: s3_pull_files, azure_glob, etc. — they fetch the backend module and delegate.

And safetensors_streamer.py:36-45, 56-62 is the dispatch site:

if is_s3_path(path): files = s3_glob(...) elif is_gs_path(path): files = gcs_glob(...) elif is_azure_path(path): files = azure_glob(...)

Each cloud's logic lives in its own package — runai_model_streamer_azure is the natural place for "before we touch Azure, prepare any Azure runtime state" — including resolving the cache .so.

The resolver code in azure implementation for prepare_runtime can set the environment variable of the package name (unless alreday set)

@hasethuraman @kyleknap If that is ok with you, I can create a quick PR for prepare_runtime which you can use after rebasing

@noa-neria I like this idea of having a hook to do any prep work prior to going into C++ runtime. I think it will be generally useful in the future.

In terms of passing state from the prepare_runtime() hook, will it only be environment variables we'll be able to leverage/set? Or will it be able to set any in-code configuration that will cross the Python layer to the C++ layer (e.g., a config, context object, or even a configuration dictionary)? Main concern is if it just environment variables we can set, it might make sense to have some sort of teardown hook as well to make sure we reset environment variables back if this is only being called when instantiating the SafetensorsStreamer?

I hope we can take this to a follow up PR. As the changes are increasing, can we merge this so that we get soaking time for what we have manwhile.

@hasethuraman I'm good with deferring this improvement to a fast follow up PR. The RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB env var can be leveraged in the scenarios that I was concerned won't work.

…o mode, non-singleton loader - Caller-owned error buffer (stack-allocated, no cross-.so malloc/free) - CacheMode enum: 0=disabled, 1=required, auto=default (matches RUNAI_STREAMER_DIST) - Non-singleton: from_env() factory, shared_ptr ownership per client - Removed disable-via-lib-path; _ENABLED is sole on/off control - Added ec.message() to debug logs for canonical path and auto-discovery - Fixed tolower UB (unsigned char cast) - Updated usage.md blob_read signature to match new API Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

hasethuraman · 2026-05-05T14:36:15Z

@noa-neria addressed most of the internal review comments and as it continues, I would like to request your & other reviewers to go through. Thanks. I may be offline for a week but @kyleknap has a better idea now what this PR's intent is & requesting @kyleknap to take questions meanwhile if works. Will take up further review comments / left over questions once I am back.

hasethuraman · 2026-05-13T09:48:57Z

if there are no comments, can we go ahead for merge?

noa-neria

@hasethuraman thank you for your patiance. We've been quite busy lately.

There are few comments I've added.

Also, have you considered adding versioning for the ABI, if things may change in the future?
Since the cache shared library is not part of the runai-model-streamer package, querying the ABI version after dlopen is a good practice.
For example:

/* runai_azcache_provider.h */
#define RUNAI_CACHE_ABI_VERSION 1
#define RUNAI_CACHE_ABI_VERSION_SYMBOL "runai_cache_abi_version"

/* Provider exports: */
extern "C" uint32_t runai_cache_abi_version(void) { return RUNAI_CACHE_ABI_VERSION; }

Loader flow:

dlopen the library.
dlsym("runai_cache_abi_version"). Missing → reject (pre-versioning build, treat as incompatible). Present → call it.
If the returned version is in the set the streamer accepts → install blob_read. Otherwise log "expected vN, got vM" and either fall back (Auto) or fail (Required).

noa-neria · 2026-05-14T11:29:22Z

+    }
+
+    // libstreamerazure.so → lib/ → libstreamer/ → runai_model_streamer/ → site-packages/
+    auto site_packages = azure_so.parent_path().parent_path().parent_path().parent_path();


I agree that this should be moved to the python layer.

We can add a new hook prepare_runtime similar to the existing scheme in safetensors streamer:

runai_model_streamer/s3_utils/s3_utils.py is the existing scheme-dispatch layer:

Scheme detectors: is_s3_path, is_gs_path, is_azure_path.

Lazy importers: `get_s3_files_module(), get_gcs_files_module(), get_azure_files_module() — each returns None if the backend package isn't installed.

Thin per-scheme wrappers: s3_pull_files, azure_glob, etc. — they fetch the backend module and delegate.

And safetensors_streamer.py:36-45, 56-62 is the dispatch site:

if is_s3_path(path): files = s3_glob(...) elif is_gs_path(path): files = gcs_glob(...) elif is_azure_path(path): files = azure_glob(...)

Each cloud's logic lives in its own package — runai_model_streamer_azure is the natural place for "before we touch Azure, prepare any Azure runtime state" — including resolving the cache .so.

The resolver code in azure implementation for prepare_runtime can set the environment variable of the package name (unless alreday set)

@hasethuraman @kyleknap If that is ok with you, I can create a quick PR for prepare_runtime which you can use after rebasing

hasethuraman · 2026-05-14T16:02:05Z

@noa-neria thank you for review. Will work on the comments. Regarding the prepare_runtime hook, if I understand correctly, users dont need to add any additional import statements or something for this hook to be executed. If it is, I am all for it. We can have a follow up PR to integrate with hook to unblock this one.

noa-neria · 2026-05-14T16:39:33Z

@noa-neria thank you for review. Will work on the comments. Regarding the prepare_runtime hook, if I understand correctly, users dont need to add any additional import statements or something for this hook to be executed. If it is, I am all for it. We can have a follow up PR to integrate with hook to unblock this one.

Correct - prepare_runtime will be called automatically in SafetensorsStreamer

- docs: values accepted are 0, 1, auto (not true/false) - docs: disable example uses =0 instead of =false - tests: replace stale 'singleton' references with accurate ClientMgr pool explanation Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

hasethuraman · 2026-05-15T08:11:09Z

Updated the code to support ABI version

- Header: RUNAI_CACHE_ABI_VERSION=1, runai_cache_abi_version symbol - Loader: checks version after dlopen, rejects missing/mismatched - Test .so: exports runai_cache_abi_version() returning 1 - Tests: 8 new loader tests covering mode × version combinations Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

noa-neria · 2026-05-17T12:44:15Z

@hasethuraman thanks for handling these issues! One more question after reviewing the code again:

Is there a reason each AzureClient independently dlopens the cache library? If not, could we dlopen it once and share a Dylib handle across clients — the same way S3ClientWrapper::manage_backend_handle manages the cloud plugin?

The reason I'm asking: by design dlclose is never called on the cache library, so its resources aren't released when the streamer is closed — they're only freed when the host process exits. For long-running processes (e.g. an inference engine that uses the streamer briefly at startup and then serves for hours), the cache library sits idle for the rest of the process's lifetime, still holding its threads, connections, and memory.

This is unlike our treatment of the CSP plugin libraries, which are dlclose'd at streamer close. Routing the cache library through a shared singleton handle would put it on the same lifecycle: refcount drops to zero when the last client is released at streamer close, and dlclose happens naturally.

- Refactor AzCacheProviderLoader to use a shared CacheLibHandle via weak_ptr singleton, ensuring single dlopen per process - Add shutdown_fn typedef and SHUTDOWN_SYMBOL to ABI header - CacheLibHandle destructor calls shutdown() for graceful provider cleanup before process exit (no dlclose for safety) - Mutex-protected creation and destruction to prevent race conditions - Update tests: shared handle reuse, shutdown_called verification - Update example cache (simple_file_cache_test.cc) to export shutdown() Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

hasethuraman · 2026-05-18T14:14:59Z

@hasethuraman thanks for handling these issues! One more question after reviewing the code again:

Is there a reason each AzureClient independently dlopens the cache library? If not, could we dlopen it once and share a Dylib handle across clients — the same way S3ClientWrapper::manage_backend_handle manages the cloud plugin?

The reason I'm asking: by design dlclose is never called on the cache library, so its resources aren't released when the streamer is closed — they're only freed when the host process exits. For long-running processes (e.g. an inference engine that uses the streamer briefly at startup and then serves for hours), the cache library sits idle for the rest of the process's lifetime, still holding its threads, connections, and memory.

This is unlike our treatment of the CSP plugin libraries, which are dlclose'd at streamer close. Routing the cache library through a shared singleton handle would put it on the same lifecycle: refcount drops to zero when the last client is released at streamer close, and dlclose happens naturally.

@noa-neria agree on the concern. Thanks for direction. We have limitations in dlclose at the moment. However, introduced the shared ptr for lib cache across clients. And on streamer close, calling a new function shutdown() on cache lib to make the cache resources claim possible in future till we get dlclose to work.

greptile-apps · 2026-05-18T14:18:24Z

Want your agent to iterate on Greptile's feedback? Try greploops.

hasethuraman force-pushed the hari/cache-support branch 3 times, most recently from eb11665 to aae3093 Compare April 22, 2026 16:11

hasethuraman changed the title ~~feat: generic azure cache provider interface~~ feat: pluggable azure blob cache provider interface (experimental) Apr 22, 2026

hasethuraman marked this pull request as ready for review April 22, 2026 16:17

greptile-apps Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread cpp/azure/azcache_provider/simple_file_cache_test.cc

coderabbitai Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread docs/src/env-vars.md Outdated

Comment thread tests/azure/test_azurecache.py

Comment thread tests/azure/test_azurecache.py

hasethuraman force-pushed the hari/cache-support branch from aae3093 to 7367e06 Compare April 23, 2026 04:57

hasethuraman commented Apr 23, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread cpp/azure/azcache_provider/azcache_provider_loader.h Outdated

hasethuraman force-pushed the hari/cache-support branch from 7367e06 to 69ced42 Compare April 23, 2026 06:28

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread cpp/azure/client/async_azure_client/async_azure_client.h

Comment thread docs/src/usage.md

hasethuraman force-pushed the hari/cache-support branch from 69ced42 to 877d721 Compare April 23, 2026 10:01

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

hasethuraman force-pushed the hari/cache-support branch from 877d721 to ccbe45e Compare April 29, 2026 09:28

hasethuraman changed the title ~~feat: pluggable azure blob cache provider interface (experimental)~~ feat: pluggable azure blob cache provider with auto-discovery (experimental) Apr 29, 2026

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread cpp/azure/azcache_provider/azcache_provider_loader.cc Outdated

Comment thread cpp/azure/azcache_provider/azcache_provider_loader.cc

jmclong reviewed Apr 29, 2026

View reviewed changes

kyleknap reviewed May 1, 2026

View reviewed changes

greptile-apps Bot reviewed May 5, 2026

View reviewed changes

Comment thread docs/src/usage.md

hasethuraman force-pushed the hari/cache-support branch from 3d1b185 to 49ba365 Compare May 5, 2026 14:14

noa-neria reviewed May 14, 2026

View reviewed changes

hasethuraman force-pushed the hari/cache-support branch from 28395cc to 1e5436f Compare May 15, 2026 11:18

Conversation

hasethuraman commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pluggable Azure Blob Cache Provider with Auto-Discovery

How it works

Components

Testing

Summary by CodeRabbit

Uh oh!

noa-neria commented Apr 16, 2026

Uh oh!

hasethuraman commented Apr 22, 2026

Uh oh!

greptile-apps Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

coderabbitai Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated Code Review Effort

Poem

❌ Failed checks (1 warning)

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hasethuraman left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

hasethuraman commented Apr 23, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kyleknap left a comment

Choose a reason for hiding this comment

Uh oh!

kyleknap May 1, 2026

Choose a reason for hiding this comment

Uh oh!

hasethuraman May 5, 2026

Choose a reason for hiding this comment

Uh oh!

hasethuraman May 13, 2026

Choose a reason for hiding this comment

Uh oh!

noa-neria May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kyleknap May 14, 2026

hasethuraman commented Apr 12, 2026 •

edited

Loading

greptile-apps Bot commented Apr 22, 2026 •

edited

Loading

coderabbitai Bot commented Apr 22, 2026 •

edited

Loading

noa-neria May 14, 2026 •

edited

Loading

kyleknap May 18, 2026 •

edited

Loading

noa-neria left a comment •

edited

Loading

noa-neria May 14, 2026 •

edited

Loading

noa-neria commented May 14, 2026 •

edited

Loading