Skip to content

feat: pluggable azure blob cache provider with auto-discovery (experimental)#139

Open
hasethuraman wants to merge 5 commits into
run-ai:masterfrom
hasethuraman:hari/cache-support
Open

feat: pluggable azure blob cache provider with auto-discovery (experimental)#139
hasethuraman wants to merge 5 commits into
run-ai:masterfrom
hasethuraman:hari/cache-support

Conversation

@hasethuraman
Copy link
Copy Markdown
Contributor

@hasethuraman hasethuraman commented Apr 12, 2026

Pluggable Azure Blob Cache Provider with Auto-Discovery

Adds a pluggable cache provider for Azure Blob Storage reads with zero-config auto-discovery. When a compatible cache provider package (e.g., tachyon-client) is installed alongside runai-model-streamer, the streamer automatically discovers and loads it at runtime — no environment variable configuration needed.

How it works

  1. Cache provider package installs a shared library (e.g., libStorageDirect.so) into Python site-packages
  2. At startup, the streamer auto-discovers the library via dladdr (walking from the runai .so location)
  3. The library is loaded via dlopen and the blob_read symbol is resolved
  4. All subsequent Azure blob reads are served through the cache provider
  5. If no cache provider is installed, reads go directly to Azure Blob Storage — no regression

Components

  • runai_azcache_provider.h: C API contract (blob_read) for cache provider .so plugins
  • azcache_provider_loader.cc: Singleton that auto-discovers via dladdr, falls back to dlopen/dlsym
  • AsyncAzureClient integration: Routes reads through cache when provider is discovered
  • simple_file_cache_test.cc: Example cache implementation for reference and unit testing
  • RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_ENABLED: Disable switch to turn off cache even when provider is installed
  • Documentation: env-vars.md and usage.md updated with auto-discovery flow and cache provider authoring guide

Testing

  • C++ unit tests: provider loading, error handling, fallback behavior
  • Python e2e test: validates cache intercept with simple file-based provider
  • AKS validated: cold cache and warm cache

Summary by CodeRabbit

  • New Features

    • Experimental pluggable Azure Blob cache provider: runtime-loadable provider redirects blob reads; clients can query if cache is enabled and async downloads use a single request when caching is active.
    • Build support added to package the provider and test helper as shared libraries.
  • Documentation

    • Added env-var reference, usage guide, and debugging instructions for the experimental cache provider.
  • Tests

    • Added unit and end-to-end tests covering reads, errors, security checks, and fallback behavior.

@noa-neria
Copy link
Copy Markdown
Collaborator

@hasethuraman few questions about this feature:

  • Have you considered providing local path to the model streamer instead, if the model files are already cached? If the entire model is cached, why not pass the path to the cached model to the engine?

  • Is the cache intended only for the safetensors files? Asking because pull_files and list_safetensors still look at the blob stoprage.

  • There is no fallback to the blob storage if reading from cache had failed. Is that by design?

@hasethuraman hasethuraman force-pushed the hari/cache-support branch 3 times, most recently from eb11665 to aae3093 Compare April 22, 2026 16:11
@hasethuraman hasethuraman changed the title feat: generic azure cache provider interface feat: pluggable azure blob cache provider interface (experimental) Apr 22, 2026
@hasethuraman
Copy link
Copy Markdown
Contributor Author

@hasethuraman few questions about this feature:

  • Have you considered providing local path to the model streamer instead, if the model files are already cached? If the entire model is cached, why not pass the path to the cached model to the engine?
  • Is the cache intended only for the safetensors files? Asking because pull_files and list_safetensors still look at the blob stoprage.
  • There is no fallback to the blob storage if reading from cache had failed. Is that by design?

Apologies for the confusion @noa-neria Updated the PR description & relevant code. This is mainly for the integration with distributed cache solution which must handle the CACHEMISS scenario which would further speed up the streaming for cached model in the cluster. Please let me know if this makes sense.

@hasethuraman hasethuraman marked this pull request as ready for review April 22, 2026 16:17
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR adds an experimental pluggable cache provider for Azure Blob Storage reads, with zero-config auto-discovery via dladdr. When a compatible package (e.g., tachyon-client) is installed alongside the streamer, its shared library is auto-loaded at runtime; if none is found, all reads continue through the Azure SDK unchanged.

  • New AzCacheProviderLoader (singleton shared handle, ABI version check, dlopen/dlsym) wired into AsyncAzureClient; cache reads replace the chunked Azure SDK path with a single call per range.
  • C ABI header (runai_azcache_provider.h) and reference implementation (simple_file_cache_test.cc) define the provider contract with path-traversal guards and an optional shutdown symbol.
  • Tests and docs cover loading, ABI mismatch, path traversal, shared handle lifecycle, and a Python e2e test; several gaps (env-var false/true disable semantics, test_missing_cache_lib_falls_back stub) were noted in prior review rounds.

Confidence Score: 4/5

Safe to merge for an experimental feature; the new code paths are guarded behind a runtime flag and fall back gracefully when no provider is installed.

The only new finding beyond the previously flagged issues is that shutdown() from the plugin and dlopen during construction are both invoked while holding s_handle_mutex, which can block all other loader threads for the duration of plugin shutdown and is a latent deadlock risk if the plugin ever re-enters loader creation.

azcache_provider_loader.cc (mutex-scope around dlopen and shutdown) and azcache_provider_test.cc (open dlopen handle leaks per call).

Important Files Changed

Filename Overview
cpp/azure/azcache_provider/azcache_provider_loader.cc Core loader implementation: singleton shared handle, dladdr auto-discovery, ABI version check, parse_cache_mode only recognises "0" and "1" (previously flagged), and shutdown() is called while holding s_handle_mutex in the destructor.
cpp/azure/azcache_provider/azcache_provider_loader.h Header defining CacheMode, CacheProviderConfig, CacheLibHandle, and AzCacheProviderLoader; design is clean with move/copy suppressed and clear immutability contracts.
cpp/azure/azcache_provider/runai_azcache_provider.h C ABI contract header with 8-parameter blob_read_fn, ABI versioning, and clear contract documentation; consistent with actual loader usage.
cpp/azure/azcache_provider/simple_file_cache_test.cc Reference/test cache provider with path-traversal guards (absolute-path fix applied), proper pread usage, and optional shutdown tracking for unit tests.
cpp/azure/azcache_provider/azcache_provider_test.cc Unit tests cover ABI versioning, shared handle lifecycle, and path-traversal; load_cache_fn() and load_abi_version() helpers leak dlopen handles on every call (previously flagged).
cpp/azure/client/async_azure_client/async_azure_client.h Cache integration added to AsyncAzureClient; cache-read failure throws runtime_error without Azure fallback (previously flagged); _cache_enabled is correctly initialised after _cache_loader is moved.
cpp/azure/client/client.cc Cache loader instantiated from env vars and wired into AsyncAzureClient; _account_name is guaranteed set when the cache path is reached (non-Azurite production builds throw before that point).
tests/azure/test_azurecache.py test_stream_from_cache is a meaningful e2e test; test_missing_cache_lib_falls_back only unsets an env var and never exercises the streamer (previously flagged).
docs/src/env-vars.md Documents RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_ENABLED; false will not disable the cache (only "0" works — previously flagged).
docs/src/usage.md Usage guide with cache-provider authoring section; blob_read signature shown uses 8 parameters matching the actual ABI; disable instruction correctly shows =0.

Reviews (11): Last reviewed commit: "feat(azure): shared CacheLibHandle singl..." | Re-trigger Greptile

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a pluggable, runtime-loaded Azure blob cache provider (C ABI + singleton loader) and integrates it into AsyncAzureClient so blob-range reads may be served from a provider .so. Includes BUILD rules, a test provider .so, unit and e2e tests, and documentation for the new env var.

Changes

Cohort / File(s) Summary
Build & Targets
cpp/azure/azcache_provider/BUILD, cpp/azure/client/async_azure_client/BUILD
Adds Bazel targets: azcache_provider cc_library, libsimple_file_cache_test.so shared binary, azcache_provider_test test; adds //azure/azcache_provider to async_azure_client deps.
Public C API & Headers
cpp/azure/azcache_provider/runai_azcache_provider.h, cpp/azure/azcache_provider/azcache_provider_loader.h
New public C header declares blob_read_fn, BLOB_READ_SYMBOL, RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_ENABLED, and RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB. Declares AzCacheProviderLoader singleton with instance(), is_enabled(), and read(); deleted copy ops.
Loader Implementation & Test Provider
cpp/azure/azcache_provider/azcache_provider_loader.cc, cpp/azure/azcache_provider/simple_file_cache_test.cc
Implements runtime dlopen/dlsym loader, auto-discovery via dladdr, atomic enabled flag, and read() dispatch. Adds test-only blob_read implementation performing file-backed reads, path validation, and malloc'd error strings.
Async Client Integration
cpp/azure/client/async_azure_client/async_azure_client.h, cpp/azure/client/async_azure_client/async_azure_client.cc, cpp/azure/client/client.cc
AsyncAzureClient now snapshots _cache_enabled (init from loader) and exposes is_cache_enabled(). DownloadBlobRangeAsync signature adds account_name and is no longer inline; when cache enabled it schedules provider reads, otherwise falls back to Azure SDK. AzureClient::async_read forces single-chunk reads when cache enabled.
Tests
cpp/azure/azcache_provider/azcache_provider_test.cc, tests/azure/test_azurecache.py
Adds unit tests for loader/provider behaviors (dlopen/dlsym, error cases, path traversal) and an end-to-end pytest that compiles the test .so, populates a cache, runs the streamer subprocess, and verifies streamed tensors from cache.
Docs
docs/src/env-vars.md, docs/src/usage.md
Documents RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_ENABLED and RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB, provider contract (blob_read), auto-discovery/loading behavior, and debug guidance.

Sequence Diagram

sequenceDiagram
    participant Client
    participant AsyncAzureClient
    participant AzCacheProviderLoader
    participant CacheLibrary as "Dynamic Cache Library"
    participant AzureSDK

    Client->>AsyncAzureClient: DownloadBlobRangeAsync(account, container, blob, buf, offset, len)
    AsyncAzureClient->>AsyncAzureClient: check _cache_enabled

    alt cache enabled
        AsyncAzureClient->>AzCacheProviderLoader: read(account, container, blob, buf, offset, len)
        AzCacheProviderLoader->>CacheLibrary: blob_read(account, container, blob, buf, offset, len, &err)
        CacheLibrary-->>AzCacheProviderLoader: bytes_read or error_string
        AzCacheProviderLoader-->>AsyncAzureClient: success/failure
        AsyncAzureClient->>Client: callback or throw on failure
    else cache disabled
        AsyncAzureClient->>AzureSDK: perform SDK download (createDownloadBlobFn)
        AzureSDK-->>AsyncAzureClient: bytes_read or error
        AsyncAzureClient->>Client: callback or error
    end
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I found a tiny .so beneath the logs,
I dlopen, dlsym, then hop through the fog,
A singleton nibble, an error freed,
Cache or Azure — I fetch what you need! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly describes the main feature: a pluggable Azure blob cache provider with auto-discovery capability, all marked as experimental.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread cpp/azure/azcache_provider/simple_file_cache_test.cc
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
cpp/azure/client/async_azure_client/async_azure_client.h (1)

89-103: Consider using is_cache_enabled() for consistency.

Line 89 directly calls AzCacheProviderLoader::instance().is_enabled() instead of the newly added is_cache_enabled() method. While functionally equivalent, using the wrapper method would be more consistent.

♻️ Proposed fix
-        if (AzCacheProviderLoader::instance().is_enabled())
+        if (is_cache_enabled())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/client/async_azure_client/async_azure_client.h` around lines 89 -
103, Replace the direct call to AzCacheProviderLoader::instance().is_enabled()
with the wrapper method is_cache_enabled() for consistency: in the block that
creates the DownloadBlobTask (the lambda using
AzCacheProviderLoader::instance().read and the push_task(std::move(task)) call),
change the condition to call
AzCacheProviderLoader::instance().is_cache_enabled() so the code uses the new
consistent API surface.
docs/src/usage.md (1)

239-242: Add language specifier to fenced code block.

The code block showing expected log output should have a language specifier for consistency and to satisfy linting.

📝 Proposed fix
 You should see:
-```
+```text
 AzCacheProvider: loading cache library: /path/to/your_cache_provider.so
 AzCacheProvider: cache provider loaded successfully from /path/to/your_cache_provider.so
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/src/usage.md around lines 239 - 242, The fenced code block showing
AzCacheProvider log output lacks a language specifier; update the markdown in
docs/src/usage.md to add a language tag (e.g., text or console) to the
existing block that contains the AzCacheProvider lines so linting and syntax
highlighting work correctly while preserving the exact log content.


</details>

</blockquote></details>
<details>
<summary>cpp/azure/azcache_provider/azcache_provider_test.cc (1)</summary><blockquote>

`62-78`: **Handle leak: `dlopen` result is not closed.**

Each call to `load_cache_fn()` opens a new handle that is never closed. While the OS reclaims on test exit and the same `.so` will likely return the same handle, it's cleaner to either store the handle in the fixture and close it in `TearDown`, or use a class member.


<details>
<summary>♻️ Store handle in fixture for proper cleanup</summary>

```diff
 class SimpleFileCacheTest : public ::testing::Test
 {
 protected:
     fs::path cache_dir_;
     fs::path so_path_;
+    void* so_handle_ = nullptr;

     void SetUp() override
     {
         // ... existing code ...
     }

     void TearDown() override
     {
+        if (so_handle_)
+        {
+            dlclose(so_handle_);
+            so_handle_ = nullptr;
+        }
         fs::remove_all(cache_dir_);
         unsetenv("RUNAI_CACHE_DIR");
     }

     az_cache_read_fn load_cache_fn()
     {
-        void* handle = dlopen(so_path_.c_str(), RTLD_NOW | RTLD_LOCAL);
-        if (!handle)
+        if (!so_handle_)
+        {
+            so_handle_ = dlopen(so_path_.c_str(), RTLD_NOW | RTLD_LOCAL);
+        }
+        if (!so_handle_)
         {
             ADD_FAILURE() << "dlopen failed: " << dlerror();
             return nullptr;
         }
         auto fn = reinterpret_cast<az_cache_read_fn>(
-            dlsym(handle, AZ_CACHE_READ_SYMBOL));
+            dlsym(so_handle_, AZ_CACHE_READ_SYMBOL));
         // ...
     }
 };
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/azcache_provider/azcache_provider_test.cc` around lines 62 - 78,
load_cache_fn opens the shared library with dlopen but never closes the handle,
leaking it; modify the test fixture to store the void* handle as a member (e.g.,
add a member like lib_handle_), have load_cache_fn assign the dlopen result to
that member instead of a local variable, and in the fixture TearDown (or the
fixture destructor) call dlclose(lib_handle_) if non-null and reset it; keep
using so_path_ and AZ_CACHE_READ_SYMBOL for dlsym and ensure load_cache_fn still
returns the az_cache_read_fn pointer.
```

</details>

</blockquote></details>
<details>
<summary>cpp/azure/azcache_provider/azcache_provider_loader.cc (1)</summary><blockquote>

`79-100`: **Consider defensive cleanup of `error_string` on success path.**

A misbehaving cache provider might allocate `error_string` even on success. While the contract says it should only be set on failure, adding a defensive `free` on the success path would prevent a memory leak from non-conforming providers.


<details>
<summary>🛡️ Optional defensive fix</summary>

```diff
     LOG(SPAM) << "AzCacheProvider: cache read " << length << " bytes from "
               << container << "/" << blob << " offset=" << offset;
+    if (error_string)
+    {
+        free(error_string);  // Defensive: provider should not set on success
+    }
     return true;
 }
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/azcache_provider/azcache_provider_loader.cc` around lines 79 - 100,
The success path currently ignores a non-null error_string which could be
allocated by a misbehaving provider; in the function containing the bytes_read
check (variables: bytes_read, length, container, blob, offset, and error_string)
add a defensive cleanup: before logging the success (LOG(SPAM) ...) and
returning true, check if error_string is non-null and free it to avoid leaking
memory from non-conforming providers (keeping the existing free(error_string) in
the failure/error branch).
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @docs/src/env-vars.md:

  • Line 123: Update the broken intra-document anchor link on the line containing
    "See Azure Blob Cache Provider
    (Experimental)
    ..." to point to the
    correct document by using a relative cross-document link (e.g.,
    "usage.md#azure-blob-cache-provider-experimental"); modify the link text in
    docs/src/env-vars.md so it references the usage.md anchor instead of a local
    anchor.

In @tests/azure/test_azurecache.py:

  • Around line 74-79: The test invokes subprocess.run to compile
    simple_file_cache_test.cc using "gcc"; change the compiler invocation in the
    subprocess.run call inside tests/azure/test_azurecache.py to use "g++" instead
    of "gcc" so the .cc C++ source is compiled correctly (update the argument list
    where ["gcc", "-shared", "-fPIC", "-o", so_path, src] appears and keep the
    existing capture_output/text/error handling intact).
  • Around line 182-185: The test test_missing_cache_lib_falls_back currently only
    unsets RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB but does not assert fallback
    behavior; update it to either (A) actually exercise the streaming path and
    assert the cache provider is not used (e.g., start a subprocess or invoke the
    streaming function and check that the Azure cache provider is disabled/not
    initialized or that reads come from Azure instead), referencing the same test
    name test_missing_cache_lib_falls_back and any provider init/check function used
    in tests (e.g., the cache provider initialization method or is_enabled flag) to
    verify provider disabled, or (B) mark the test as a placeholder by adding a
    skip/ TODO so it’s clear it’s incomplete; ensure the chosen approach asserts the
    expected fallback to Azure when RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB is
    unset.

Nitpick comments:
In @cpp/azure/azcache_provider/azcache_provider_loader.cc:

  • Around line 79-100: The success path currently ignores a non-null error_string
    which could be allocated by a misbehaving provider; in the function containing
    the bytes_read check (variables: bytes_read, length, container, blob, offset,
    and error_string) add a defensive cleanup: before logging the success (LOG(SPAM)
    ...) and returning true, check if error_string is non-null and free it to avoid
    leaking memory from non-conforming providers (keeping the existing
    free(error_string) in the failure/error branch).

In @cpp/azure/azcache_provider/azcache_provider_test.cc:

  • Around line 62-78: load_cache_fn opens the shared library with dlopen but
    never closes the handle, leaking it; modify the test fixture to store the void*
    handle as a member (e.g., add a member like lib_handle_), have load_cache_fn
    assign the dlopen result to that member instead of a local variable, and in the
    fixture TearDown (or the fixture destructor) call dlclose(lib_handle_) if
    non-null and reset it; keep using so_path_ and AZ_CACHE_READ_SYMBOL for dlsym
    and ensure load_cache_fn still returns the az_cache_read_fn pointer.

In @cpp/azure/client/async_azure_client/async_azure_client.h:

  • Around line 89-103: Replace the direct call to
    AzCacheProviderLoader::instance().is_enabled() with the wrapper method
    is_cache_enabled() for consistency: in the block that creates the
    DownloadBlobTask (the lambda using AzCacheProviderLoader::instance().read and
    the push_task(std::move(task)) call), change the condition to call
    AzCacheProviderLoader::instance().is_cache_enabled() so the code uses the new
    consistent API surface.

In @docs/src/usage.md:

  • Around line 239-242: The fenced code block showing AzCacheProvider log output
    lacks a language specifier; update the markdown in docs/src/usage.md to add a
    language tag (e.g., text or console) to the existing block that contains
    the AzCacheProvider lines so linting and syntax highlighting work correctly
    while preserving the exact log content.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended)
- [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro Plus

**Run ID**: `73061442-98a8-43ca-af7a-42ed2f0e9003`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 2e26a8cf78012f2c5fb10c9f4c66c0d2114612d4 and aae30930cf428b3da0e7d1be97a3ac8aae5a05e9.

</details>

<details>
<summary>📒 Files selected for processing (12)</summary>

* `cpp/azure/azcache_provider/BUILD`
* `cpp/azure/azcache_provider/azcache_provider_loader.cc`
* `cpp/azure/azcache_provider/azcache_provider_loader.h`
* `cpp/azure/azcache_provider/azcache_provider_test.cc`
* `cpp/azure/azcache_provider/runai_azcache_provider.h`
* `cpp/azure/azcache_provider/simple_file_cache_test.cc`
* `cpp/azure/client/async_azure_client/BUILD`
* `cpp/azure/client/async_azure_client/async_azure_client.h`
* `cpp/azure/client/client.cc`
* `docs/src/env-vars.md`
* `docs/src/usage.md`
* `tests/azure/test_azurecache.py`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment thread docs/src/env-vars.md Outdated
Comment thread tests/azure/test_azurecache.py
Comment thread tests/azure/test_azurecache.py
Copy link
Copy Markdown
Contributor Author

@hasethuraman hasethuraman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed all open review comments in commit 7367e06:

  1. Path traversal: absolute paths (greptile) — Added if (s[0] == '/') return 1; at the top of has_path_traversal() to reject absolute paths. Good catch.

  2. Broken cross-document link (CodeRabbit) — Fixed #azure-blob-cache-provider-experimentalusage.md#azure-blob-cache-provider-experimental.

  3. gcc → g++ (CodeRabbit) — Changed compiler to g++ for the .cc source file in test fixture.

  4. Incomplete fallback test (CodeRabbit) — Added TODO comment explaining that full fallback verification requires a subprocess with Azure credentials due to the singleton design.

Additionally: cached is_cache_enabled() as a const bool _cache_enabled member in AsyncAzureClient, initialized once at construction, to avoid repeated singleton access on every blob range read.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
cpp/azure/client/async_azure_client/async_azure_client.h (1)

86-99: Consider including container/blob in the error message for easier debugging.

When the cache provider read fails, the error message "Cache provider read failed" lacks context about which blob failed. The AzCacheProviderLoader::read() already logs details, but including identifiers in the thrown exception would help correlate errors.

💡 Suggested improvement
             DownloadBlobTask task{
                 [container_name, blob_name, buffer, offset, length]() {
                     bool ok = AzCacheProviderLoader::instance().read(
                         container_name, blob_name, buffer, offset, length);
                     if (!ok)
                     {
-                        throw std::runtime_error("Cache provider read failed");
+                        throw std::runtime_error(
+                            "Cache provider read failed for " + container_name + "/" + blob_name);
                     }
                 },
                 std::move(callback)
             };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/client/async_azure_client/async_azure_client.h` around lines 86 -
99, The thrown exception inside the _cache_enabled branch of DownloadBlobTask
should include the container/blob identifiers to aid debugging: update the
lambda passed to DownloadBlobTask (the one calling
AzCacheProviderLoader::instance().read with container_name and blob_name) so
that when read() returns false it throws a std::runtime_error containing
container_name and blob_name (and optionally offset/length) instead of the
generic "Cache provider read failed", then push_task as before; reference the
DownloadBlobTask lambda, AzCacheProviderLoader::instance().read, container_name
and blob_name.
tests/azure/test_azurecache.py (1)

21-22: Unused import: time.

The time module is imported but never used in this file.

🧹 Proposed fix
 import tempfile
-import time
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/azure/test_azurecache.py` around lines 21 - 22, Remove the unused
import of the time module from tests/azure/test_azurecache.py by deleting the
"import time" statement so there are no unused imports; locate the top-level
import line that references time and remove it (no other changes to functions or
tests are required).
docs/src/usage.md (1)

238-243: Add language identifier to fenced code block.

The code block showing example log output should specify a language (e.g., text or plaintext) for consistency and to satisfy linting rules.

📝 Proposed fix
 You should see:
-```
+```text
 AzCacheProvider: loading cache library: /path/to/your_cache_provider.so
 AzCacheProvider: cache provider loaded successfully from /path/to/your_cache_provider.so
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/src/usage.md around lines 238 - 243, The fenced code block showing
example log output lacks a language identifier; update the block surrounding the
two AzCacheProvider lines to include a language tag (e.g., use ```text or

consistent—locate the example block containing "AzCacheProvider: loading cache
library..." and "AzCacheProvider: cache provider loaded successfully..." and add
the language identifier after the opening backticks.
cpp/azure/azcache_provider/azcache_provider_test.cc (1)

63-78: Minor: dlopen handle is never closed.

The load_cache_fn() helper opens the shared library but never stores or closes the handle. For tests this is acceptable since the process exits, but it's technically a resource leak. Consider storing the handle in the fixture and closing in TearDown() if this pattern is reused.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/azcache_provider/azcache_provider_test.cc` around lines 63 - 78,
load_cache_fn opens the shared object with dlopen but never closes the handle,
leaking a resource; modify the test fixture to store the handle (e.g., add a
void* so_handle_ member initialized to nullptr), set so_handle_ =
dlopen(so_path_.c_str(), ...) inside load_cache_fn (closing any previous handle
if non-null), and return the dlsym result as before; then implement TearDown()
to check if so_handle_ is non-null and call dlclose(so_handle_) and set it to
nullptr, handling error cases from dlopen/dlsym as already done by
ADD_FAILURE().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/azure/azcache_provider/azcache_provider_loader.h`:
- Around line 3-5: Remove the unused `#include` <mutex> from
azcache_provider_loader.h: the header and its implementation use std::atomic
(e.g., the singleton flag/atomic<bool> usage) and do not reference std::mutex or
any mutex types, so delete the <mutex> include to avoid an unused include and
keep the file minimal.

---

Nitpick comments:
In `@cpp/azure/azcache_provider/azcache_provider_test.cc`:
- Around line 63-78: load_cache_fn opens the shared object with dlopen but never
closes the handle, leaking a resource; modify the test fixture to store the
handle (e.g., add a void* so_handle_ member initialized to nullptr), set
so_handle_ = dlopen(so_path_.c_str(), ...) inside load_cache_fn (closing any
previous handle if non-null), and return the dlsym result as before; then
implement TearDown() to check if so_handle_ is non-null and call
dlclose(so_handle_) and set it to nullptr, handling error cases from
dlopen/dlsym as already done by ADD_FAILURE().

In `@cpp/azure/client/async_azure_client/async_azure_client.h`:
- Around line 86-99: The thrown exception inside the _cache_enabled branch of
DownloadBlobTask should include the container/blob identifiers to aid debugging:
update the lambda passed to DownloadBlobTask (the one calling
AzCacheProviderLoader::instance().read with container_name and blob_name) so
that when read() returns false it throws a std::runtime_error containing
container_name and blob_name (and optionally offset/length) instead of the
generic "Cache provider read failed", then push_task as before; reference the
DownloadBlobTask lambda, AzCacheProviderLoader::instance().read, container_name
and blob_name.

In `@docs/src/usage.md`:
- Around line 238-243: The fenced code block showing example log output lacks a
language identifier; update the block surrounding the two AzCacheProvider lines
to include a language tag (e.g., use ```text or ```plaintext) so the markdown
linter accepts it and syntax highlighting is consistent—locate the example block
containing "AzCacheProvider: loading cache library..." and "AzCacheProvider:
cache provider loaded successfully..." and add the language identifier after the
opening backticks.

In `@tests/azure/test_azurecache.py`:
- Around line 21-22: Remove the unused import of the time module from
tests/azure/test_azurecache.py by deleting the "import time" statement so there
are no unused imports; locate the top-level import line that references time and
remove it (no other changes to functions or tests are required).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: bbf12699-0b23-42a7-8bb6-9b5b11010e8d

📥 Commits

Reviewing files that changed from the base of the PR and between aae3093 and 7367e06.

📒 Files selected for processing (13)
  • cpp/azure/azcache_provider/BUILD
  • cpp/azure/azcache_provider/azcache_provider_loader.cc
  • cpp/azure/azcache_provider/azcache_provider_loader.h
  • cpp/azure/azcache_provider/azcache_provider_test.cc
  • cpp/azure/azcache_provider/runai_azcache_provider.h
  • cpp/azure/azcache_provider/simple_file_cache_test.cc
  • cpp/azure/client/async_azure_client/BUILD
  • cpp/azure/client/async_azure_client/async_azure_client.cc
  • cpp/azure/client/async_azure_client/async_azure_client.h
  • cpp/azure/client/client.cc
  • docs/src/env-vars.md
  • docs/src/usage.md
  • tests/azure/test_azurecache.py
✅ Files skipped from review due to trivial changes (2)
  • cpp/azure/client/async_azure_client/BUILD
  • docs/src/env-vars.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/azure/client/client.cc

Comment thread cpp/azure/azcache_provider/azcache_provider_loader.h Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
cpp/azure/azcache_provider/azcache_provider_test.cc (1)

62-78: Consider closing the library handle after tests or reusing a single handle.

load_cache_fn() calls dlopen but never dlclose, leaking the handle. While this is acceptable in tests (the OS cleans up on process exit), you could either:

  1. Store the handle in the fixture and call dlclose in TearDown
  2. Reuse a single handle across test cases
♻️ Optional improvement to track and close the handle
 class SimpleFileCacheTest : public ::testing::Test
 {
 protected:
     fs::path cache_dir_;
     fs::path so_path_;
+    void* lib_handle_ = nullptr;

     void SetUp() override
     { ... }

     void TearDown() override
     {
         fs::remove_all(cache_dir_);
         unsetenv("RUNAI_CACHE_DIR");
+        if (lib_handle_) {
+            dlclose(lib_handle_);
+            lib_handle_ = nullptr;
+        }
     }

     az_cache_read_fn load_cache_fn()
     {
-        void* handle = dlopen(so_path_.c_str(), RTLD_NOW | RTLD_LOCAL);
-        if (!handle)
+        if (!lib_handle_) {
+            lib_handle_ = dlopen(so_path_.c_str(), RTLD_NOW | RTLD_LOCAL);
+        }
+        if (!lib_handle_)
         {
             ADD_FAILURE() << "dlopen failed: " << dlerror();
             return nullptr;
         }
         auto fn = reinterpret_cast<az_cache_read_fn>(
-            dlsym(handle, AZ_CACHE_READ_SYMBOL));
+            dlsym(lib_handle_, AZ_CACHE_READ_SYMBOL));
         ...
     }
 };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/azcache_provider/azcache_provider_test.cc` around lines 62 - 78,
The test currently leaks the dlopen handle in load_cache_fn() by never calling
dlclose; modify the test fixture to store the void* handle (returned by dlopen
using so_path_) as a member (e.g., fixture field) and change load_cache_fn() to
reuse that handle (dlopen once) or to set the handle member when opening; then
implement TearDown() in the test fixture to call dlclose(handle) if handle is
non-null, and update load_cache_fn() to not return early without setting or
reusing the stored handle (referencing load_cache_fn, so_path_, dlopen, dlclose,
and the fixture TearDown).
tests/azure/test_azurecache.py (1)

21-21: Remove unused time import.

The time module is imported but not used in this test file.

🧹 Proposed fix
 import subprocess
 import sys
 import tempfile
-import time

 import numpy as np
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/azure/test_azurecache.py` at line 21, Remove the unused import
statement "import time" from the test file; locate the top-level import line
"import time" in tests/azure/test_azurecache.py and delete it so there are no
unused imports left in the module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/azure/client/async_azure_client/async_azure_client.h`:
- Around line 86-100: The code throws a generic std::runtime_error("Cache
provider read failed") inside the DownloadBlobTask when
AzCacheProviderLoader::instance().read(...) returns false, losing provider error
details; update the failure path in the _cache_enabled branch (DownloadBlobTask
lambda) to capture and include the provider's error information (e.g., call a
loader method like AzCacheProviderLoader::instance().last_error() or equivalent)
or call the loader's logging before throwing, and construct the thrown exception
message to include that detailed text so push_task receives an exception with
the provider error included.

In `@docs/src/usage.md`:
- Around line 238-243: The fenced code block showing the example log output (the
block containing the two "AzCacheProvider: ..." lines) is missing a language
specifier; update that triple-backtick block to include a language like text or
console (e.g., ```text) so linting and syntax highlighting work correctly and
the example log output is consistent.

---

Nitpick comments:
In `@cpp/azure/azcache_provider/azcache_provider_test.cc`:
- Around line 62-78: The test currently leaks the dlopen handle in
load_cache_fn() by never calling dlclose; modify the test fixture to store the
void* handle (returned by dlopen using so_path_) as a member (e.g., fixture
field) and change load_cache_fn() to reuse that handle (dlopen once) or to set
the handle member when opening; then implement TearDown() in the test fixture to
call dlclose(handle) if handle is non-null, and update load_cache_fn() to not
return early without setting or reusing the stored handle (referencing
load_cache_fn, so_path_, dlopen, dlclose, and the fixture TearDown).

In `@tests/azure/test_azurecache.py`:
- Line 21: Remove the unused import statement "import time" from the test file;
locate the top-level import line "import time" in tests/azure/test_azurecache.py
and delete it so there are no unused imports left in the module.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 1b4dbc49-9dc5-403c-b5e1-7905f51c3fd3

📥 Commits

Reviewing files that changed from the base of the PR and between 7367e06 and 69ced42.

📒 Files selected for processing (13)
  • cpp/azure/azcache_provider/BUILD
  • cpp/azure/azcache_provider/azcache_provider_loader.cc
  • cpp/azure/azcache_provider/azcache_provider_loader.h
  • cpp/azure/azcache_provider/azcache_provider_test.cc
  • cpp/azure/azcache_provider/runai_azcache_provider.h
  • cpp/azure/azcache_provider/simple_file_cache_test.cc
  • cpp/azure/client/async_azure_client/BUILD
  • cpp/azure/client/async_azure_client/async_azure_client.cc
  • cpp/azure/client/async_azure_client/async_azure_client.h
  • cpp/azure/client/client.cc
  • docs/src/env-vars.md
  • docs/src/usage.md
  • tests/azure/test_azurecache.py
✅ Files skipped from review due to trivial changes (2)
  • cpp/azure/client/async_azure_client/BUILD
  • docs/src/env-vars.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • cpp/azure/client/async_azure_client/async_azure_client.cc
  • cpp/azure/client/client.cc
  • cpp/azure/azcache_provider/azcache_provider_loader.cc
  • cpp/azure/azcache_provider/BUILD

Comment thread cpp/azure/client/async_azure_client/async_azure_client.h
Comment thread docs/src/usage.md
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
cpp/azure/azcache_provider/azcache_provider_test.cc (1)

62-78: Minor: dlopen handle is not closed.

Each call to load_cache_fn() opens the shared library but never calls dlclose(). While this is acceptable in test code (process termination cleans up), storing and closing the handle would be cleaner. Since dlopen with the same path returns the same handle (reference counted), this doesn't cause multiple loads, just a minor resource tracking issue.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/azcache_provider/azcache_provider_test.cc` around lines 62 - 78,
The test helper load_cache_fn opens the shared object with dlopen and never
calls dlclose, leaking the handle; modify load_cache_fn to store the dlopen
result in a local variable (handle already present), and after obtaining the
symbol via dlsym call dlclose(handle) before returning (or keep and manage the
handle lifecycle in a static/field if you need the handle alive); ensure you
still return the function pointer (az_cache_read_fn) and report errors via
ADD_FAILURE() when dlopen or dlsym fail (use the existing so_path_, handle,
dlopen, dlsym, dlclose, and AZ_CACHE_READ_SYMBOL identifiers to locate and
update the code).
tests/azure/test_azurecache.py (2)

21-21: Unused import.

time is imported but never used in the file.

🧹 Proposed fix
 import subprocess
 import sys
 import tempfile
-import time
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/azure/test_azurecache.py` at line 21, Remove the unused import "time"
from the import section of tests/azure/test_azurecache.py; locate the line that
reads "import time" and delete it so there are no unused imports remaining (or
replace it with a needed import only if intended to be used by a test in
functions/classes in this file).

145-159: Consider removing redundant env var settings in inline script.

Lines 147-148 set RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB and RUNAI_CACHE_DIR in the script, but these are already passed via env=env at line 167. The subprocess inherits the environment dict, making these assignments redundant.

Not a bug, but could simplify the script slightly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/azure/test_azurecache.py` around lines 145 - 159, The inline Python
script assigned to the variable script sets
RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB and RUNAI_CACHE_DIR via os.environ
but those are already provided through the env dictionary passed to the
subprocess (env), so remove the redundant os.environ assignments from the script
string; keep the imports, uri, SafetensorsStreamer usage and assertions intact,
and rely on the existing env dict used when launching the subprocess to supply
those variables.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@cpp/azure/azcache_provider/azcache_provider_test.cc`:
- Around line 62-78: The test helper load_cache_fn opens the shared object with
dlopen and never calls dlclose, leaking the handle; modify load_cache_fn to
store the dlopen result in a local variable (handle already present), and after
obtaining the symbol via dlsym call dlclose(handle) before returning (or keep
and manage the handle lifecycle in a static/field if you need the handle alive);
ensure you still return the function pointer (az_cache_read_fn) and report
errors via ADD_FAILURE() when dlopen or dlsym fail (use the existing so_path_,
handle, dlopen, dlsym, dlclose, and AZ_CACHE_READ_SYMBOL identifiers to locate
and update the code).

In `@tests/azure/test_azurecache.py`:
- Line 21: Remove the unused import "time" from the import section of
tests/azure/test_azurecache.py; locate the line that reads "import time" and
delete it so there are no unused imports remaining (or replace it with a needed
import only if intended to be used by a test in functions/classes in this file).
- Around line 145-159: The inline Python script assigned to the variable script
sets RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB and RUNAI_CACHE_DIR via
os.environ but those are already provided through the env dictionary passed to
the subprocess (env), so remove the redundant os.environ assignments from the
script string; keep the imports, uri, SafetensorsStreamer usage and assertions
intact, and rely on the existing env dict used when launching the subprocess to
supply those variables.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: b6e9f90f-9cd6-4ab8-aeae-ada341e2b9b5

📥 Commits

Reviewing files that changed from the base of the PR and between 69ced42 and 877d721.

📒 Files selected for processing (13)
  • cpp/azure/azcache_provider/BUILD
  • cpp/azure/azcache_provider/azcache_provider_loader.cc
  • cpp/azure/azcache_provider/azcache_provider_loader.h
  • cpp/azure/azcache_provider/azcache_provider_test.cc
  • cpp/azure/azcache_provider/runai_azcache_provider.h
  • cpp/azure/azcache_provider/simple_file_cache_test.cc
  • cpp/azure/client/async_azure_client/BUILD
  • cpp/azure/client/async_azure_client/async_azure_client.cc
  • cpp/azure/client/async_azure_client/async_azure_client.h
  • cpp/azure/client/client.cc
  • docs/src/env-vars.md
  • docs/src/usage.md
  • tests/azure/test_azurecache.py
✅ Files skipped from review due to trivial changes (4)
  • cpp/azure/client/async_azure_client/BUILD
  • docs/src/env-vars.md
  • docs/src/usage.md
  • cpp/azure/azcache_provider/BUILD
🚧 Files skipped from review as they are similar to previous changes (3)
  • cpp/azure/client/async_azure_client/async_azure_client.cc
  • cpp/azure/azcache_provider/azcache_provider_loader.cc
  • cpp/azure/client/async_azure_client/async_azure_client.h

@hasethuraman
Copy link
Copy Markdown
Contributor Author

@noa-neria I am also getting an internal review FYI. Will update once it is done

…mental)

- Define blob_read C API header for custom cache providers
- Dynamic loader with dlopen/dlsym for cache provider .so
- Auto-discover cache library in Python site-packages via dladdr
- RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_ENABLED as a disable switch
- Example simple_file_cache_test implementation for reference/testing
- Unit tests and integration test infrastructure
- Documentation: env-vars.md and usage.md updated

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@hasethuraman hasethuraman changed the title feat: pluggable azure blob cache provider interface (experimental) feat: pluggable azure blob cache provider with auto-discovery (experimental) Apr 29, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/azure/test_azurecache.py (1)

27-29: Path resolution may fail in some test environments.

The relative path construction using os.path.dirname(__file__) with multiple .. segments assumes a specific directory structure. If tests are run from an installed package or a different working directory, this path may not resolve correctly.

The existing pytest.skip() in the fixture handles this gracefully, but consider adding a comment explaining the expected directory structure.

📝 Suggested documentation improvement
+# Path to the C++ test provider source file.
+# This assumes the test is run from the repository root with the standard layout:
+#   tests/azure/test_azurecache.py
+#   cpp/azure/azcache_provider/simple_file_cache_test.cc
 SIMPLE_FILE_CACHE_SRC = os.path.join(
     os.path.dirname(__file__), "..", "..", "cpp", "azure", "azcache_provider", "simple_file_cache_test.cc"
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/azure/test_azurecache.py` around lines 27 - 29, SIMPLE_FILE_CACHE_SRC
path resolution can fail in installs or different CWDs; add a concise comment
above the SIMPLE_FILE_CACHE_SRC definition explaining the expected repository
layout (that tests assume the repo source tree with
cpp/azure/azcache_provider/simple_file_cache_test.cc located two levels up) and
note that the test fixture uses pytest.skip() when the file isn't found—this
documents why the relative path is used and when the skip will occur so future
readers understand the assumption and fallback.
cpp/azure/azcache_provider/azcache_provider_test.cc (1)

62-78: Minor: dlopen handle is never closed.

The load_cache_fn() helper calls dlopen() but never stores the handle for later dlclose(). While this is acceptable in tests (process exit cleans up), it could cause issues if tests are run repeatedly or if the test framework reuses processes.

Consider storing the handle in the fixture and calling dlclose() in TearDown().

♻️ Suggested improvement
 class SimpleFileCacheTest : public ::testing::Test
 {
 protected:
     fs::path cache_dir_;
     fs::path so_path_;
+    void* lib_handle_ = nullptr;

     void SetUp() override
     {
         // ... existing code ...
     }

     void TearDown() override
     {
+        if (lib_handle_) {
+            dlclose(lib_handle_);
+            lib_handle_ = nullptr;
+        }
         fs::remove_all(cache_dir_);
         unsetenv("RUNAI_CACHE_DIR");
     }

     blob_read_fn load_cache_fn()
     {
-        void* handle = dlopen(so_path_.c_str(), RTLD_NOW | RTLD_LOCAL);
-        if (!handle)
+        if (!lib_handle_) {
+            lib_handle_ = dlopen(so_path_.c_str(), RTLD_NOW | RTLD_LOCAL);
+        }
+        if (!lib_handle_)
         {
             ADD_FAILURE() << "dlopen failed: " << dlerror();
             return nullptr;
         }
         auto fn = reinterpret_cast<blob_read_fn>(
-            dlsym(handle, BLOB_READ_SYMBOL));
+            dlsym(lib_handle_, BLOB_READ_SYMBOL));
         if (!fn)
         {
             ADD_FAILURE() << "dlsym failed: " << dlerror();
         }
         return fn;
     }
 };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/azure/azcache_provider/azcache_provider_test.cc` around lines 62 - 78,
load_cache_fn() opens a shared object with dlopen but never saves the returned
handle for dlclose; modify the test fixture to store the void* handle (returned
by dlopen using so_path_ and BLOB_READ_SYMBOL lookup) on the fixture (e.g. a
member like dl_handle_), return the blob_read_fn as before, and implement
TearDown() to call dlclose(dl_handle_) if non-null and clear it to ensure the
handle is closed between tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/azure/azcache_provider/azcache_provider_loader.cc`:
- Around line 94-101: The env-var handling currently treats an explicitly-set
empty RUNAI_AZURE_CACHE_LIB_ENV value as valid, which lets dlopen("") bind the
main program; update the logic in the azcache provider loader so that when
utils::try_getenv<std::string>(RUNAI_AZURE_CACHE_LIB_ENV, lib_path) returns true
you also treat lib_path.empty() as a disabled/invalid value (same behavior as
is_disabled_value), log that it was explicitly disabled, and return early before
the dlopen call; apply the same empty-string check where the code later handles
RUNAI_AZURE_CACHE_LIB_ENV/loader path prior to calling dlopen to avoid binding
the main program handle.
- Around line 61-65: The call to ::tolower in is_disabled_value risks undefined
behavior when char is signed; modify the std::transform lambda to cast each
character to unsigned char before calling ::tolower (i.e.,
transform(lower.begin(), lower.end(), lower.begin(), [](unsigned char c){ return
std::tolower(static_cast<unsigned char>(c)); }) or equivalent) so that the
characters in string lower are safely converted to lower-case; update the
transform usage in is_disabled_value accordingly.

---

Nitpick comments:
In `@cpp/azure/azcache_provider/azcache_provider_test.cc`:
- Around line 62-78: load_cache_fn() opens a shared object with dlopen but never
saves the returned handle for dlclose; modify the test fixture to store the
void* handle (returned by dlopen using so_path_ and BLOB_READ_SYMBOL lookup) on
the fixture (e.g. a member like dl_handle_), return the blob_read_fn as before,
and implement TearDown() to call dlclose(dl_handle_) if non-null and clear it to
ensure the handle is closed between tests.

In `@tests/azure/test_azurecache.py`:
- Around line 27-29: SIMPLE_FILE_CACHE_SRC path resolution can fail in installs
or different CWDs; add a concise comment above the SIMPLE_FILE_CACHE_SRC
definition explaining the expected repository layout (that tests assume the repo
source tree with cpp/azure/azcache_provider/simple_file_cache_test.cc located
two levels up) and note that the test fixture uses pytest.skip() when the file
isn't found—this documents why the relative path is used and when the skip will
occur so future readers understand the assumption and fallback.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 3ff28027-e486-4385-bf2e-608adaa0c533

📥 Commits

Reviewing files that changed from the base of the PR and between 877d721 and ccbe45e.

📒 Files selected for processing (13)
  • cpp/azure/azcache_provider/BUILD
  • cpp/azure/azcache_provider/azcache_provider_loader.cc
  • cpp/azure/azcache_provider/azcache_provider_loader.h
  • cpp/azure/azcache_provider/azcache_provider_test.cc
  • cpp/azure/azcache_provider/runai_azcache_provider.h
  • cpp/azure/azcache_provider/simple_file_cache_test.cc
  • cpp/azure/client/async_azure_client/BUILD
  • cpp/azure/client/async_azure_client/async_azure_client.cc
  • cpp/azure/client/async_azure_client/async_azure_client.h
  • cpp/azure/client/client.cc
  • docs/src/env-vars.md
  • docs/src/usage.md
  • tests/azure/test_azurecache.py
✅ Files skipped from review due to trivial changes (4)
  • cpp/azure/client/async_azure_client/async_azure_client.cc
  • cpp/azure/client/async_azure_client/BUILD
  • cpp/azure/azcache_provider/BUILD
  • docs/src/usage.md

Comment thread cpp/azure/azcache_provider/azcache_provider_loader.cc Outdated
Comment thread cpp/azure/azcache_provider/azcache_provider_loader.cc
Comment thread cpp/azure/azcache_provider/azcache_provider_loader.cc Outdated
Comment thread cpp/azure/azcache_provider/azcache_provider_loader.cc Outdated
Comment thread cpp/azure/azcache_provider/azcache_provider_loader.cc Outdated
Comment thread cpp/azure/azcache_provider/azcache_provider_loader.cc Outdated
Comment thread cpp/azure/azcache_provider/azcache_provider_loader.cc Outdated
Copy link
Copy Markdown
Contributor

@kyleknap kyleknap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just leaving some comments/questions that I had as I work through the PR

}

// libstreamerazure.so → lib/ → libstreamer/ → runai_model_streamer/ → site-packages/
auto site_packages = azure_so.parent_path().parent_path().parent_path().parent_path();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of relying on where the current location of libstreamerazure.so, I wonder if it is possible to leverage Python itself (e.g., using sys.path or site.getsitepackages()) to get the full list of locations Python will look for packages? Maybe if we are able to call one of these functions as part of the bindings to model streamer interface we could inject the value down into configuration to be read as part of this logic?

Mainly with the logic as is, it will probably work for most cases (e.g., a customer installing the package from PyPI) but not sure if it will hold for cases like editable installs or environments where the site packages is searched in multiple different locations (e.g., setting PYTHONPATH)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Common python layer calls the libstreamerazure.so & so we dont have a chance to call azure specific. And we wanted zero touch without user doing any explicit import. Since libstreamerazure.so is expected to be in site-packages, we will support the tachyon installed in same site. WDYT? We can enhance c++ code to look into PYTHONPATH later?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving it for now & track later.

Copy link
Copy Markdown
Collaborator

@noa-neria noa-neria May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this should be moved to the python layer.

We can add a new hook prepare_runtime similar to the existing scheme in safetensors streamer:

runai_model_streamer/s3_utils/s3_utils.py is the existing scheme-dispatch layer:

  • Scheme detectors: is_s3_path, is_gs_path, is_azure_path.
  • Lazy importers: `get_s3_files_module(), get_gcs_files_module(), get_azure_files_module() — each returns None if the backend package isn't installed.
  • Thin per-scheme wrappers: s3_pull_files, azure_glob, etc. — they fetch the backend module and delegate.

And safetensors_streamer.py:36-45, 56-62 is the dispatch site:

if is_s3_path(path):       files = s3_glob(...)
elif is_gs_path(path):     files = gcs_glob(...)
elif is_azure_path(path):  files = azure_glob(...)

Each cloud's logic lives in its own package — runai_model_streamer_azure is the natural place for "before we touch Azure, prepare any Azure runtime state" — including resolving the cache .so.

The resolver code in azure implementation for prepare_runtime can set the environment variable of the package name (unless alreday set)

@hasethuraman @kyleknap If that is ok with you, I can create a quick PR for prepare_runtime which you can use after rebasing

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noa-neria I like this idea of having a hook to do any prep work prior to going into C++ runtime. I think it will be generally useful in the future.

In terms of passing state from the prepare_runtime() hook, will it only be environment variables we'll be able to leverage/set? Or will it be able to set any in-code configuration that will cross the Python layer to the C++ layer (e.g., a config, context object, or even a configuration dictionary)? Main concern is if it just environment variables we can set, it might make sense to have some sort of teardown hook as well to make sure we reset environment variables back if this is only being called when instantiating the SafetensorsStreamer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope we can take this to a follow up PR. As the changes are increasing, can we merge this so that we get soaking time for what we have manwhile.

Copy link
Copy Markdown
Contributor

@kyleknap kyleknap May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hasethuraman I'm good with deferring this improvement to a fast follow up PR. The RUNAI_STREAMER_EXPERIMENTAL_AZURE_CACHE_LIB env var can be leveraged in the scenarios that I was concerned won't work.

Comment thread cpp/azure/azcache_provider/azcache_provider_loader.cc Outdated
Comment thread cpp/azure/azcache_provider/azcache_provider_loader.cc Outdated
Comment thread tests/azure/test_azurecache.py
Comment thread tests/azure/test_azurecache.py Outdated
Comment thread docs/src/usage.md
…o mode, non-singleton loader

- Caller-owned error buffer (stack-allocated, no cross-.so malloc/free)
- CacheMode enum: 0=disabled, 1=required, auto=default (matches RUNAI_STREAMER_DIST)
- Non-singleton: from_env() factory, shared_ptr ownership per client
- Removed disable-via-lib-path; _ENABLED is sole on/off control
- Added ec.message() to debug logs for canonical path and auto-discovery
- Fixed tolower UB (unsigned char cast)
- Updated usage.md blob_read signature to match new API

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@hasethuraman hasethuraman force-pushed the hari/cache-support branch from 3d1b185 to 49ba365 Compare May 5, 2026 14:14
@hasethuraman
Copy link
Copy Markdown
Contributor Author

@noa-neria addressed most of the internal review comments and as it continues, I would like to request your & other reviewers to go through. Thanks. I may be offline for a week but @kyleknap has a better idea now what this PR's intent is & requesting @kyleknap to take questions meanwhile if works. Will take up further review comments / left over questions once I am back.

@hasethuraman
Copy link
Copy Markdown
Contributor Author

if there are no comments, can we go ahead for merge?

Copy link
Copy Markdown
Collaborator

@noa-neria noa-neria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hasethuraman thank you for your patiance. We've been quite busy lately.

There are few comments I've added.

Also, have you considered adding versioning for the ABI, if things may change in the future?
Since the cache shared library is not part of the runai-model-streamer package, querying the ABI version after dlopen is a good practice.
For example:

/* runai_azcache_provider.h */
#define RUNAI_CACHE_ABI_VERSION 1
#define RUNAI_CACHE_ABI_VERSION_SYMBOL "runai_cache_abi_version"

/* Provider exports: */
extern "C" uint32_t runai_cache_abi_version(void) { return RUNAI_CACHE_ABI_VERSION; }

Loader flow:

  1. dlopen the library.
  2. dlsym("runai_cache_abi_version"). Missing → reject (pre-versioning build, treat as incompatible). Present → call it.
  3. If the returned version is in the set the streamer accepts → install blob_read. Otherwise log "expected vN, got vM" and either fall back (Auto) or fail (Required).

Comment thread cpp/azure/azcache_provider/azcache_provider_loader.cc
Comment thread docs/src/env-vars.md Outdated
Comment thread tests/azure/test_azurecache.py Outdated
}

// libstreamerazure.so → lib/ → libstreamer/ → runai_model_streamer/ → site-packages/
auto site_packages = azure_so.parent_path().parent_path().parent_path().parent_path();
Copy link
Copy Markdown
Collaborator

@noa-neria noa-neria May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this should be moved to the python layer.

We can add a new hook prepare_runtime similar to the existing scheme in safetensors streamer:

runai_model_streamer/s3_utils/s3_utils.py is the existing scheme-dispatch layer:

  • Scheme detectors: is_s3_path, is_gs_path, is_azure_path.
  • Lazy importers: `get_s3_files_module(), get_gcs_files_module(), get_azure_files_module() — each returns None if the backend package isn't installed.
  • Thin per-scheme wrappers: s3_pull_files, azure_glob, etc. — they fetch the backend module and delegate.

And safetensors_streamer.py:36-45, 56-62 is the dispatch site:

if is_s3_path(path):       files = s3_glob(...)
elif is_gs_path(path):     files = gcs_glob(...)
elif is_azure_path(path):  files = azure_glob(...)

Each cloud's logic lives in its own package — runai_model_streamer_azure is the natural place for "before we touch Azure, prepare any Azure runtime state" — including resolving the cache .so.

The resolver code in azure implementation for prepare_runtime can set the environment variable of the package name (unless alreday set)

@hasethuraman @kyleknap If that is ok with you, I can create a quick PR for prepare_runtime which you can use after rebasing

@hasethuraman
Copy link
Copy Markdown
Contributor Author

@noa-neria thank you for review. Will work on the comments. Regarding the prepare_runtime hook, if I understand correctly, users dont need to add any additional import statements or something for this hook to be executed. If it is, I am all for it. We can have a follow up PR to integrate with hook to unblock this one.

@noa-neria
Copy link
Copy Markdown
Collaborator

noa-neria commented May 14, 2026

@noa-neria thank you for review. Will work on the comments. Regarding the prepare_runtime hook, if I understand correctly, users dont need to add any additional import statements or something for this hook to be executed. If it is, I am all for it. We can have a follow up PR to integrate with hook to unblock this one.

Correct - prepare_runtime will be called automatically in SafetensorsStreamer

- docs: values accepted are 0, 1, auto (not true/false)
- docs: disable example uses =0 instead of =false
- tests: replace stale 'singleton' references with accurate
  ClientMgr pool explanation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@hasethuraman
Copy link
Copy Markdown
Contributor Author

Updated the code to support ABI version

- Header: RUNAI_CACHE_ABI_VERSION=1, runai_cache_abi_version symbol
- Loader: checks version after dlopen, rejects missing/mismatched
- Test .so: exports runai_cache_abi_version() returning 1
- Tests: 8 new loader tests covering mode × version combinations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@noa-neria
Copy link
Copy Markdown
Collaborator

@hasethuraman thanks for handling these issues! One more question after reviewing the code again:

Is there a reason each AzureClient independently dlopens the cache library? If not, could we dlopen it once and share a Dylib handle across clients — the same way S3ClientWrapper::manage_backend_handle manages the cloud plugin?

The reason I'm asking: by design dlclose is never called on the cache library, so its resources aren't released when the streamer is closed — they're only freed when the host process exits. For long-running processes (e.g. an inference engine that uses the streamer briefly at startup and then serves for hours), the cache library sits idle for the rest of the process's lifetime, still holding its threads, connections, and memory.

This is unlike our treatment of the CSP plugin libraries, which are dlclose'd at streamer close. Routing the cache library through a shared singleton handle would put it on the same lifecycle: refcount drops to zero when the last client is released at streamer close, and dlclose happens naturally.

- Refactor AzCacheProviderLoader to use a shared CacheLibHandle via
  weak_ptr singleton, ensuring single dlopen per process
- Add shutdown_fn typedef and SHUTDOWN_SYMBOL to ABI header
- CacheLibHandle destructor calls shutdown() for graceful provider
  cleanup before process exit (no dlclose for safety)
- Mutex-protected creation and destruction to prevent race conditions
- Update tests: shared handle reuse, shutdown_called verification
- Update example cache (simple_file_cache_test.cc) to export shutdown()

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@hasethuraman
Copy link
Copy Markdown
Contributor Author

@hasethuraman thanks for handling these issues! One more question after reviewing the code again:

Is there a reason each AzureClient independently dlopens the cache library? If not, could we dlopen it once and share a Dylib handle across clients — the same way S3ClientWrapper::manage_backend_handle manages the cloud plugin?

The reason I'm asking: by design dlclose is never called on the cache library, so its resources aren't released when the streamer is closed — they're only freed when the host process exits. For long-running processes (e.g. an inference engine that uses the streamer briefly at startup and then serves for hours), the cache library sits idle for the rest of the process's lifetime, still holding its threads, connections, and memory.

This is unlike our treatment of the CSP plugin libraries, which are dlclose'd at streamer close. Routing the cache library through a shared singleton handle would put it on the same lifecycle: refcount drops to zero when the last client is released at streamer close, and dlclose happens naturally.

@noa-neria agree on the concern. Thanks for direction. We have limitations in dlclose at the moment. However, introduced the shared ptr for lib cache across clients. And on streamer close, calling a new function shutdown() on cache lib to make the cache resources claim possible in future till we get dlclose to work.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 18, 2026

Want your agent to iterate on Greptile's feedback? Try greploops.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants